Supervised Latent Dirichlet Allocation With Covariates: A Bayesian Structural and Measurement Model of Text and Covariates

Bibliographic details
Published in: Psychological Methods 2023-10, Vol.28 (5), p.1178-1206
Main authors: Wilcox, Kenneth Tyler; Jacobucci, Ross; Zhang, Zhiyong; Ammerman, Brooke A.
Format: Article
Language: English
Description
Abstract: Text is a burgeoning data source for psychological researchers, but little methodological research has focused on adapting popular modeling approaches for text to the context of psychological research. One popular measurement model for text, topic modeling, uses a latent mixture model to represent topics underlying a body of documents. Recently, psychologists have studied relationships between these topics and other psychological measures by using estimates of the topics as regression predictors along with other manifest variables. While similar two-stage approaches involving estimated latent variables are known to yield biased estimates and incorrect standard errors, two-stage topic modeling approaches have received limited statistical study and, as we show, are subject to the same problems. To address these problems, we proposed a novel statistical model, supervised latent Dirichlet allocation with covariates (SLDAX), that jointly incorporates a latent variable measurement model of text and a structural regression model to allow the latent topics and other manifest variables to serve as predictors of an outcome. Using a simulation study with data characteristics consistent with psychological text data, we found that SLDAX estimates were generally more accurate and more efficient. To illustrate the application of SLDAX and a two-stage approach, we provide an empirical clinical application to compare the application of both the two-stage and SLDAX approaches. Finally, we implemented the SLDAX model in an open-source R package to facilitate its use and further study.

Translational Abstract: Text data is an increasingly popular data source in psychological research that can be analyzed with a variety of models and algorithms. Topic models are a popular measurement model that use latent variables to represent constructs underlying a set of documents (e.g., clinical interviews, survey open responses, written or spoken educational assessments). Recent applications have used estimates of these "topics" as predictors of other variables in a regression model, but the statistical behavior of this approach has not been well studied. Similar approaches with other latent variable models are known to yield incorrect regression coefficient estimates and incorrect inferences. We showed that the use of topic estimates as regression predictors is also prone to these problems. As a solution, we proposed a model that jointly estimates the topic model and regression model-superv
ISSN: 1082-989X, 1939-1463
DOI: 10.1037/met0000541