Optimal subsampling for generalized additive models on large-scale datasets
Published in: Statistics and Computing, 2025, Vol. 35 (1)
Main authors:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: In the age of big data, the efficient analysis of vast datasets is paramount, yet hindered by computational limitations such as memory constraints and processing time. To tackle these obstacles, conventional approaches often resort to parallel and distributed computing methodologies. In this study, we present an innovative statistical approach that exploits an optimized subsampling technique tailored for generalized additive models (GAM). Our approach harnesses the versatile modeling capabilities of GAM while alleviating computational burdens and enhancing the precision of parameter estimation. Through simulations and a practical application, we illustrate the efficacy of our method. Furthermore, we provide theoretical support by establishing convergence assurances and elucidating the asymptotic properties of our estimators. Our findings indicate that our approach surpasses uniform sampling in accuracy, while significantly reducing computational time in comparison to utilizing complete large-scale datasets.
ISSN: 0960-3174, 1573-1375
DOI: 10.1007/s11222-024-10546-x
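To make the abstract's two-step subsampling idea concrete, the sketch below illustrates a generic "pilot then informative subsample" workflow for an additive model on a large simulated dataset. It is only a rough illustration under stated assumptions: a Gaussian response, a truncated-power spline basis standing in for GAM smooth terms, and residual-based sampling scores as a heuristic for the optimal probabilities. None of these choices, names, or formulas are taken from the paper itself.

```python
# Hypothetical sketch of two-step subsampling for an additive model.
# The basis, the sampling scores, and all tuning constants are illustrative
# assumptions, not the authors' algorithm.
import numpy as np

rng = np.random.default_rng(0)

def spline_basis(x, knots):
    """Truncated-power cubic basis: a simple stand-in for a GAM smooth term."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def wls(X, y, w):
    """Weighted least squares with a tiny ridge term for numerical stability."""
    A = X.T @ (w[:, None] * X) + 1e-6 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ (w * y))

# Simulated large dataset: y = f(x) + noise with a nonlinear f.
n = 1_000_000
x = rng.uniform(-3, 3, n)
y = np.sin(2 * x) + 0.5 * x + rng.normal(scale=0.5, size=n)
knots = np.quantile(x, np.linspace(0.1, 0.9, 8))
X = spline_basis(x, knots)              # full design matrix of basis functions

# Step 1: small uniform pilot subsample -> pilot coefficient estimate.
r0 = 2_000
pilot_idx = rng.choice(n, r0, replace=False)
beta_pilot = wls(X[pilot_idx], y[pilot_idx], np.ones(r0))

# Step 2: subsampling probabilities proportional to residual-weighted row norms
# (an assumed gradient-norm-style heuristic, used here purely for illustration).
resid = y - X @ beta_pilot
scores = np.abs(resid) * np.linalg.norm(X, axis=1)
probs = scores / scores.sum()

# Step 3: draw the informative subsample and refit with inverse-probability
# weights so the subsample estimator targets the full-data fit.
r = 10_000
idx = rng.choice(n, r, replace=True, p=probs)
beta_sub = wls(X[idx], y[idx], 1.0 / (r * probs[idx]))

print("pilot estimate (first 3 coefs):    ", beta_pilot[:3])
print("subsample estimate (first 3 coefs):", beta_sub[:3])
```

The design point the abstract emphasizes is visible here: only the pilot fit and the final weighted fit require model fitting, while the expensive full-data pass is reduced to computing sampling scores, which is why informative subsampling can beat uniform sampling in accuracy at a comparable computational budget.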