Transfer learning for small molecule retention predictions


Detailed description

Bibliographic details
Published in: Journal of Chromatography A 2021-05, Vol. 1644, p. 462119, Article 462119
Main authors: Osipenko, Sergey; Botashev, Kazii; Nikolaev, Eugene; Kostyukevich, Yury
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Summary:
•A transfer learning approach to predict retention time in HPLC is developed.
•Simplified Molecular Input Line Entry System (SMILES) strings are used as input for the deep learning model.
•Self-supervised pre-training on a large data set was performed.
•The approach provides accuracy comparable with traditional machine learning methods.
•The approach can be applied to data with a limited number of training examples.

Small molecule retention time prediction is a challenging task because the wide variety of separation techniques leaves only fragmented data available for training machine learning models. Predictions are typically made with traditional machine learning methods such as support vector machines, random forests, or gradient boosting. Another approach is to train on large data sets and then project the predictions onto the target separation system. Here we evaluate the applicability of transfer learning for small molecule retention prediction as a new way to deal with small retention data sets. Transfer learning is a state-of-the-art technique for natural language processing (NLP) tasks. We propose using text-based molecular representations (SMILES), widely used in cheminformatics, for NLP-like modeling of molecules. We use self-supervised pre-training to capture relevant features from a large corpus of one million molecules, followed by fine-tuning on task-specific data. The mean absolute error (MAE) of the predictions was in the range of 88-248 s for the tested reversed-phase data sets and 66 s for the HILIC data set, which is comparable with the MAE reported for traditional machine learning models based on descriptors or for projection approaches on the same data.
ISSN:0021-9673
1873-3778
DOI:10.1016/j.chroma.2021.462119
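
The abstract above outlines the general recipe: self-supervised pre-training on a large unlabeled SMILES corpus, followed by fine-tuning a regression head on a small retention-time data set. The sketch below illustrates that recipe only; it uses a hypothetical character-level LSTM encoder in PyTorch, and the tokenization, vocabulary, architecture, data, and hyperparameters are placeholders rather than those used in the paper.

```python
# Rough illustration of SMILES pre-training followed by retention-time fine-tuning.
# NOT the authors' model: encoder, vocabulary, and data are placeholder choices.
import torch
import torch.nn as nn

VOCAB = {ch: i for i, ch in enumerate("^$#()+-=123456789BCFHINOPSclnos[]@")}  # toy SMILES alphabet
PAD = len(VOCAB)
VOCAB_SIZE = len(VOCAB) + 1

def encode(smiles, max_len=64):
    """Turn a SMILES string into a padded tensor of character indices."""
    ids = [VOCAB.get(ch, PAD) for ch in smiles[:max_len]]
    return torch.tensor(ids + [PAD] * (max_len - len(ids)))

class SmilesEncoder(nn.Module):
    """Shared encoder, reused between pre-training and fine-tuning."""
    def __init__(self, d_model=128):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(self.emb(x))
        return out  # (batch, seq_len, d_model)

class LMHead(nn.Module):
    """Self-supervised pre-training objective: predict the next SMILES character."""
    def __init__(self, encoder, d_model=128):
        super().__init__()
        self.encoder = encoder
        self.out = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, x):
        return self.out(self.encoder(x))

class RetentionHead(nn.Module):
    """Fine-tuning objective: pool encoder states and regress retention time in seconds."""
    def __init__(self, encoder, d_model=128):
        super().__init__()
        self.encoder = encoder
        self.reg = nn.Linear(d_model, 1)

    def forward(self, x):
        return self.reg(self.encoder(x).mean(dim=1)).squeeze(-1)

# Step 1: pre-train on a large unlabeled SMILES corpus (two toy molecules shown here).
encoder = SmilesEncoder()
lm = LMHead(encoder)
smiles_batch = torch.stack([encode("CCO"), encode("c1ccccc1O")])
lm_loss = nn.CrossEntropyLoss()(
    lm(smiles_batch[:, :-1]).reshape(-1, VOCAB_SIZE),
    smiles_batch[:, 1:].reshape(-1),
)

# Step 2: fine-tune the same (now pre-trained) encoder on a small retention data set.
model = RetentionHead(encoder)
rt_seconds = torch.tensor([150.0, 420.0])  # placeholder retention times
reg_loss = nn.L1Loss()(model(smiles_batch), rt_seconds)  # MAE, the metric reported in the abstract
```

In practice the encoder weights would be saved after pre-training on the full one-million-molecule corpus and reloaded before fine-tuning on each small, technique-specific retention data set, which is what lets the approach work with a limited number of training examples.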