A Modification of the Jaccard-Tanimoto Similarity Index for Diverse Selection of Chemical Compounds Using Binary Strings

Bibliographic details
Published in: Technometrics, 2002-05, Vol. 44 (2), pp. 110-119
Main authors: Fligner, Michael A.; Verducci, Joseph S.; Blower, Paul E.
Format: Article
Language: English
Description
Summary: Determination of molecular similarity plays an important role in analyzing large compound databases in chemical and pharmaceutical research. When molecules are described by binary vectors with bits corresponding to the presence or absence of structural features, the Tanimoto association coefficient is the most commonly used measure of similarity or chemical distance between two compounds. However, when used to select compounds for an optimal spread design, the Tanimoto coefficient produces an intrinsic bias toward smaller compounds. We have developed a new association coefficient that overcomes this bias. This article gives details of the new coefficient and contrasts the two coefficients for selecting diverse sets of compounds from a large collection. When the Tanimoto coefficient is modified as suggested to select a diverse set in the National Cancer Institute and Registry of Toxic Effects of Chemical Substances databases, the average number of features among the selected compounds increases by more than 50%.
ISSN: 0040-1706, 1537-2723
DOI: 10.1198/004017002317375064
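
Illustration: the summary refers to the standard Tanimoto coefficient on binary feature vectors, which is the number of features present in both compounds divided by the number present in at least one. The sketch below computes only that standard coefficient. It does not implement the article's modified coefficient, whose formula is not given in this record. The function name, the toy fingerprints, and the convention for two all-zero vectors are assumptions made for illustration, not data or definitions from the article.

```python
def tanimoto(a, b):
    """Standard Tanimoto (Jaccard) similarity of two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    # Features present in both compounds.
    both = sum(1 for x, y in zip(a, b) if x and y)
    # Features present in at least one compound.
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two vectors with no features set count as identical.
    return both / either if either else 1.0


# Hypothetical 8-bit fingerprints (toy data, not taken from the article).
small_1 = [1, 0, 0, 0, 0, 0, 0, 0]
small_2 = [0, 1, 0, 0, 0, 0, 0, 0]
large_1 = [1, 1, 1, 1, 1, 1, 0, 0]
large_2 = [0, 0, 1, 1, 1, 1, 1, 1]

print(tanimoto(small_1, small_2))  # 0.0: no shared features, distance 1 - T = 1
print(tanimoto(large_1, large_2))  # 0.5: 4 shared features out of 8 present
```

In this toy case the two small compounds differ in only two bit positions yet receive the maximum distance, while the two large compounds differ in four positions and are scored as closer. That is one way to picture why a spread design built on this distance might favor compounds with few features, as the summary describes.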