Power Analysis for the Wald, LR, Score, and Gradient Tests in a Marginal Maximum Likelihood Framework: Applications in IRT

The Wald, likelihood ratio, score, and the recently proposed gradient statistics can be used to assess a broad range of hypotheses in item response theory models, for instance, to check the overall model fit or to detect differential item functioning. We introduce new methods for power analysis and...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Psychometrika 2023-12, Vol.88 (4), p.1249-1298
Hauptverfasser:	Zimmer, Felix, Draxler, Clemens, Debelak, Rudolf
Format:	Artikel
Sprache:	eng
Schlagworte:	Assessment Behavioral Science and Psychology Computer Simulation Educational Assessment Educational Measurement Humanities Hypotheses Item Response Theory Law Likelihood Functions Maximum Likelihood Statistics Psychology Psychometrics Psychometrics - methods Sample Size Statistical analysis Statistical Theory and Methods Statistics for Social Sciences Testing and Evaluation Theory and Methods
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	The Wald, likelihood ratio, score, and the recently proposed gradient statistics can be used to assess a broad range of hypotheses in item response theory models, for instance, to check the overall model fit or to detect differential item functioning. We introduce new methods for power analysis and sample size planning that can be applied when marginal maximum likelihood estimation is used. This allows the application to a variety of IRT models, which are commonly used in practice, e.g., in large-scale educational assessments. An analytical method utilizes the asymptotic distributions of the statistics under alternative hypotheses. We also provide a sampling-based approach for applications where the analytical approach is computationally infeasible. This can be the case with 20 or more items, since the computational load increases exponentially with the number of items. We performed extensive simulation studies in three practically relevant settings, i.e., testing a Rasch model against a 2PL model, testing for differential item functioning, and testing a partial credit model against a generalized partial credit model. The observed distributions of the test statistics and the power of the tests agreed well with the predictions by the proposed methods in sufficiently large samples. We provide an openly accessible R package that implements the methods for user-supplied hypotheses.
ISSN:	0033-3123 1860-0980 1860-0980
DOI:	10.1007/s11336-022-09883-5