Statistical Modeling of a Ligand Knowledge Base

A range of different statistical models has been fitted to experimental data for the Tolman electronic parameter (TEP) based on a large set of calculated descriptors in a prototype ligand knowledge base (LKB) of phosphorus(III) donor ligands. The models have been fitted by ordinary least squares usi...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Journal of chemical information and modeling 2006-11, Vol.46 (6), p.2591-2600
Hauptverfasser: Mansson, Ralph A, Welsh, Alan H, Fey, Natalie, Orpen, A. Guy
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:A range of different statistical models has been fitted to experimental data for the Tolman electronic parameter (TEP) based on a large set of calculated descriptors in a prototype ligand knowledge base (LKB) of phosphorus(III) donor ligands. The models have been fitted by ordinary least squares using subsets of descriptors, principal component regression, and partial least squares which use variables derived from the complete set of descriptors, least angle regression, and the least absolute shrinkage and selection operator. None of these methods is robust against outliers, so we also applied a robust estimation procedure to the linear regression model. Criteria for model evaluation and comparison have been discussed, highlighting the importance of resampling methods for assessing the robustness of models and the scope for making predictions in chemically intuitive models. For the ligands covered by this LKB, ordinary least squares models of descriptor subsets provide a good representation of the data, while partial least squares, principal component regression, and least angle regression models are less suitable for our dual aims of prediction and interpretation. A linear regression model with robustly fitted parameters achieves the best model performance over all classes of models fitted to TEP data, and the weightings assigned to ligands during the robust estimation procedure are chemically intuitive. The increased model complexity when compared to the ordinary least squares linear model is justified by the reduced influence of individual ligands on the model parameters and predictions of new ligands. Robust linear regression models therefore represent the best compromise for achieving statistical robustness in simple, chemically meaningful models.
ISSN:1549-9596
1549-960X
DOI:10.1021/ci600212t