Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning

The first step to develop a quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from the published studies or measured experimentally. A key challenge in this process is how to determine which che...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	ACS ES&T engineering 2022-07, Vol.2 (7), p.1211-1220
Hauptverfasser:	Zhong, Shifa, Lambeth, Dylan R, Igou, Thomas K, Chen, Yongsheng
Format:	Artikel
Sprache:	eng
Schlagworte:	data collection normal distribution prediction quantitative structure-activity relationships solubility standard deviation uncertainty
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	The first step to develop a quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from the published studies or measured experimentally. A key challenge in this process is how to determine which chemicals are used to train a QSAR model, and, of those chemicals, which should be prioritized in experimental trials to ensure that the obtained models have large applicability domains (ADs). In this study, we employ uncertainty-based active learning (AC) to address this challenge. We use the Gaussian process (GP) to develop QSAR models for three public datasets, Koc, solubility, and k •OH, each with a number of chemicals represented by molecular descriptors, in which the GP can offer prediction uncertainty (by means of standard deviation) for the model’s prediction. The training chemicals of each dataset are selected in two different ways: (1) random splitting (RS) and (2) uncertainty-based AC. Uncertainty-based AC iteratively identifies chemicals with the highest uncertainty and selects them for model training. We demonstrate that the chemicals selected by AC are more diverse than those selected by RS and that AC-based QSAR models have better generalizability than those derived from RS. We then use these two types of models to predict the properties of chemicals in the REACH dataset (>300,000 chemicals) and assess their ADs using five different AD determination methods. We demonstrate that the AD of AC-based QSAR models for all AD methods is significantly larger than those of RS-based models (up to 24 times larger). This study provides a novel method to enlarge the AD of QSAR models, which can guide model development and improve the property prediction reliability for more REACH dataset chemicals while minimizing the development cost and time.
ISSN:	2690-0645 2690-0645
DOI:	10.1021/acsestengg.1c00434