Coverage Score: A Model Agnostic Method to Efficiently Explore Chemical Space

Selecting the most appropriate compounds to synthesize and test is a vital aspect of drug discovery. Methods like clustering and diversity present weaknesses in selecting the optimal sets for information gain. Active learning techniques often rely on an initial model and computationally expensive se...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Journal of chemical information and modeling 2022-09, Vol.62 (18), p.4391-4402
Hauptverfasser: Woodward, Daniel J., Bradley, Anthony R., van Hoorn, Willem P.
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Selecting the most appropriate compounds to synthesize and test is a vital aspect of drug discovery. Methods like clustering and diversity present weaknesses in selecting the optimal sets for information gain. Active learning techniques often rely on an initial model and computationally expensive semi-supervised batch selection. Herein, we describe a new subset-based selection method, Coverage Score, that combines Bayesian statistics and information entropy to balance representation and diversity to select a maximally informative subset. Coverage Score can be influenced by prior selections and desirable properties. In this paper, subsets selected through Coverage Score are compared against subsets selected through model-independent and model-dependent techniques for several datasets. In drug-like chemical space, Coverage Score consistently selects subsets that lead to more accurate predictions compared to other selection methods. Subsets selected through Coverage Score produced Random Forest models that have a root-mean-square-error up to 12.8% lower than subsets selected at random and can retain up to 99% of the structural dissimilarity of a diversity selection.
ISSN:1549-9596
1549-960X
DOI:10.1021/acs.jcim.2c00258