Explainable Graph Neural Networks with Data Augmentation for Predicting pKa of C–H Acids


Bibliographic details
Published in: Journal of chemical information and modeling, 2024-04, Vol. 64 (7), p. 2383-2392
Main authors: An, Hongle; Liu, Xuyang; Cai, Wensheng; Shao, Xueguang
Format: Article
Language: English
Online access: Full text
Description
Summary: The pKa of C–H acids is an important parameter in the fields of organic synthesis, drug discovery, and materials science. However, the prediction of pKa is still a great challenge due to the limited experimental data and the lack of chemical insight. Here, a new model for predicting the pKa values of C–H acids is proposed on the basis of graph neural networks (GNNs) and data augmentation. A message passing unit (MPU) was used to extract the topological and target-related information from the molecular graph data, and a readout layer was utilized to retrieve the information on the ionization site C atom. The retrieved information was then used to predict pKa by a fully connected network. Furthermore, to increase the diversity of the training data, a knowledge-infused data augmentation technique was established by replacing the H atoms in a molecule with substituents exhibiting different electronic effects. The MPU was pretrained with the augmented data. The efficacy of data augmentation was confirmed by visualizing the distribution of compounds with different substituents and by classifying compounds. The explainability of the model was studied by examining the change of pKa values when a specific atom was masked. This explainability was used to identify the key substituents for pKa. The model was evaluated on two data sets from the iBonD database. Dataset1 includes the experimental pKa values of C–H acids measured in DMSO, while dataset2 comprises the pKa values measured in water. The results show that the knowledge-infused data augmentation technique greatly improves the predictive accuracy of the model, especially when the number of samples is small.
ISSN: 1549-9596, 1549-960X
DOI:10.1021/acs.jcim.3c00958
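
The summary above describes a pipeline of message passing, a readout restricted to the ionization-site C atom, a prediction head, and an explainability check based on masking single atoms. The toy sketch below illustrates that flow in plain Python. It is not the authors' implementation: the scalar atom features, the mean-of-neighbors update, the linear head with its `weight` and `bias`, and the three-atom graph are all assumptions chosen only to make the idea concrete.

```python
# Toy sketch only: features, weights, and graph are illustrative assumptions,
# not the architecture or parameters of the published model.

def message_passing(features, neighbors, steps=2):
    """Each step adds the mean of the neighbor features to every atom's feature."""
    h = list(features)
    for _ in range(steps):
        # The comprehension reads the previous h and builds the updated list.
        h = [
            h[i] + sum(h[j] for j in neighbors[i]) / len(neighbors[i])
            for i in range(len(h))
        ]
    return h


def predict_pka(features, neighbors, site, weight=-2.0, bias=40.0):
    """Read out only the ionization-site atom, then apply a linear head.

    The linear head stands in for the fully connected network (assumption).
    """
    h = message_passing(features, neighbors)
    return weight * h[site] + bias


def masking_effects(features, neighbors, site):
    """Change in the predicted value when each atom's feature is zeroed in turn."""
    baseline = predict_pka(features, neighbors, site)
    effects = {}
    for atom in range(len(features)):
        masked = list(features)
        masked[atom] = 0.0  # "mask" the atom by removing its feature
        effects[atom] = predict_pka(masked, neighbors, site) - baseline
    return effects


# Hypothetical chain of three atoms: site carbon (0), attached atom (1), outer atom (2).
# The numbers play the role of made-up "electron-withdrawing" scores.
features = [0.0, 1.0, 2.0]
neighbors = [[1], [0, 2], [1]]
site = 0

baseline = predict_pka(features, neighbors, site)
effects = masking_effects(features, neighbors, site)
key_atom = max(effects, key=lambda a: abs(effects[a]))
```

In this toy setting, the atom whose masking shifts the prediction most is reported as `key_atom`, which mirrors how the summary describes identifying key substituents. A real model would use learned vector embeddings and trained weights in place of the hand-set numbers here.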