Direct Bayesian linear regression for distribution-valued covariates
In this manuscript, we study scalar-on-distribution regression; that is, instances where subject-specific distributions or densities are the covariates, related to a scalar outcome via a regression model. In practice, only repeated measures are observed from those covariate distributions and common...
Gespeichert in:
Veröffentlicht in: | Electronic journal of statistics 2024, Vol.18 (2), p.3327-3375 |
---|---|
Hauptverfasser: | , , , , |
Format: | Artikel |
Sprache: | eng |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | In this manuscript, we study scalar-on-distribution regression; that is, instances where subject-specific distributions or densities are the covariates, related to a scalar outcome via a regression model. In practice, only repeated measures are observed from those covariate distributions and common approaches first use these to estimate subject-specific density functions, which are then used as covariates in standard scalar-on-function regression. We propose a simple and direct method for linear scalar-on-distribution regression that circumvents the intermediate step of estimating subject-specific covariate densities. We show that one can directly use the observed repeated measures as covariates and endow the regression function with a Gaussian process prior to obtain a closed form or conjugate Bayesian inference. Our method subsumes the standard Bayesian non-parametric regression using Gaussian processes as a special case, corresponding to covariates being Dirac-distributions. The model is also invariant to any transformation or ordering of the repeated measures. Theoretically, we show that, despite only using the observed repeated measures from the true density-valued covariates that generated the data, the method can achieve an optimal estimation error bound of the regression function. The theory extends beyond i.i.d. settings to accommodate certain forms of within-subject dependence among the repeated measures. To our knowledge, this is the first theoretical study on Bayesian regression using distribution-valued covariates. We propose numerous extensions including a scalable implementation using low-rank Gaussian processes and a generalization to non-linear scalar-on-distribution regression. Through simulation studies, we demonstrate that our method performs substantially better than approaches that require an intermediate density estimation step especially with a small number of repeated measures per subject. We apply our method to study association of age with activity counts. |
---|---|
ISSN: | 1935-7524 1935-7524 |
DOI: | 10.1214/24-EJS2275 |