A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence

Bibliographic Details
Published in: Journal of Cheminformatics 2024-06, Vol. 16 (1), p. 71-9, Article 71
Main authors: Zheng, Xiaofan; Tomiura, Yoichi
Format: Article
Language: English
Description
Abstract: Among the many molecular properties and their combinations, obtaining the desired properties through theory or experiment is a costly process. Using machine learning to analyze molecular structural features and to predict molecular properties is a potentially efficient alternative for accelerating such predictions. In this study, we analyze molecular properties through the molecular structure from the perspective of machine learning. We use SMILES sequences as inputs to an artificial neural network for extracting molecular structural features and predicting molecular properties. A SMILES sequence comprises symbols representing the molecular structure. To address the fact that a SMILES sequence differs from actual molecular structural data, we propose a pretraining model for SMILES sequences based on the BERT model, which is widely used in natural language processing, so that the model learns to extract the molecular structural information contained in a SMILES sequence. In our experiments, we first pretrain the proposed model with 100,000 SMILES sequences and then use the pretrained model to predict molecular properties on 22 data sets and the odor characteristics of molecules (98 types of odor descriptors). The experimental results show that the proposed pretraining model effectively improves the performance of molecular property prediction. Scientific contribution: The 2-encoder pretraining is proposed based on two observations: symbols in a SMILES sequence depend less on their contextual environment than words in a natural-language sentence do, and one compound corresponds to multiple SMILES sequences. The model pretrained with the 2-encoder shows higher robustness in molecular property prediction tasks than BERT, which is designed for natural language.
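The record does not include any code; the following is a purely illustrative sketch (not the authors' implementation) of the "one compound, many SMILES" redundancy that the 2-encoder pretraining exploits, using RDKit's doRandom option to enumerate alternative SMILES for the same molecule. The function name, parameters, and example molecule are assumptions introduced here for illustration.

    # Illustrative sketch, not the paper's code: generate several SMILES strings
    # that all describe the same compound. Requires RDKit.
    from rdkit import Chem

    def random_smiles(smiles: str, n: int = 5) -> list[str]:
        """Return up to n distinct, non-canonical SMILES for one compound."""
        mol = Chem.MolFromSmiles(smiles)
        # doRandom=True produces a randomly rooted/ordered traversal each call,
        # so repeated calls yield different strings for the same structure.
        variants = {
            Chem.MolToSmiles(mol, canonical=False, doRandom=True)
            for _ in range(n * 10)
        }
        return sorted(variants)[:n]

    if __name__ == "__main__":
        # Toluene as an arbitrary example molecule (assumption, not from the paper).
        for s in random_smiles("Cc1ccccc1"):
            print(s)  # e.g. "c1ccc(C)cc1", "c1ccccc1C", ...

Such enumerated SMILES variants of a single compound are one way to construct the paired inputs that a 2-encoder pretraining scheme could contrast or align; the paper's exact data preparation should be taken from the article itself.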
ISSN:1758-2946
DOI:10.1186/s13321-024-00848-7