Application of the mol2vec Technology to Large‐size Data Visualization and Analysis

Bibliographic details
Published in: Molecular informatics 2020-06, Vol. 39 (6), p. e1900170-n/a
Main authors: Shibayama, Shojiro; Marcou, Gilles; Horvath, Dragos; Baskin, Igor I.; Funatsu, Kimito; Varnek, Alexandre
Format: Article
Language: English
Description
Summary: Generative Topographic Mapping (GTM) is a dimensionality reduction method widely used for both data visualization and structure‐activity modeling. A high-dimensional initial data space may require significant computational resources and slow down GTM construction. It may therefore be worthwhile to reduce the number of descriptors used to encode molecular structures. Principal Component Analysis (PCA), a standard preprocessing tool, suffers from information loss upon dimensionality reduction. As an alternative, we propose to use the substructure vector embedding provided by the mol2vec technique. In addition to reducing data dimensionality, this technology also accounts for the proximity of substructures in molecular graphs. In this study, the dimensionality of large descriptor spaces of ISIDA fragment descriptors or Morgan fingerprints was reduced using either PCA or the mol2vec method. The latter significantly speeds up GTM training without compromising its predictive power in bioactivity classification tasks.
ISSN: 1868-1743, 1868-1751
DOI: 10.1002/minf.201900170
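
Illustrative note: the two preprocessing routes compared in the summary can be sketched on toy data as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline. The fragment count matrix is random, the embedding table is a random placeholder standing in for substructure vectors that mol2vec would learn from a molecular corpus, and the molecule vector is assumed to be the count-weighted sum of its substructure vectors. The function names `pca_reduce` and `embedding_reduce` are hypothetical, and GTM training itself is not shown.

```python
import numpy as np


def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each descriptor column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # shape: (n_molecules, k)


def embedding_reduce(counts, embeddings):
    """mol2vec-style molecule vectors: count-weighted sum of substructure vectors."""
    return counts @ embeddings                   # shape: (n_molecules, k)


rng = np.random.default_rng(0)
n_molecules, n_fragments, k = 50, 2000, 10

# Toy stand-in for a sparse fragment-count descriptor matrix (assumption).
counts = rng.poisson(0.01, size=(n_molecules, n_fragments)).astype(float)

# Placeholder for trained substructure embeddings (assumption: random here,
# learned by a word2vec-type model in a real mol2vec workflow).
embeddings = rng.normal(size=(n_fragments, k))

X_pca = pca_reduce(counts, k)
X_emb = embedding_reduce(counts, embeddings)
print(X_pca.shape, X_emb.shape)
```

Either reduced matrix could then be passed to a GTM implementation in place of the full descriptor matrix. In this sketch the embedding route costs a single matrix product, whereas the PCA route requires a decomposition of the centered data.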