An attention-driven long short-term memory network for high throughput virtual screening of organic photovoltaic candidate molecules


Bibliographic details
Published in: Solar Energy 2021-08, Vol. 224, p. 43-50
Main authors: Richards, Ryan J., Paul, Arindam
Format: Article
Language: English
Description
Summary:
•An Attention-LSTM network is introduced for predicting molecular properties.
•A data augmentation routine is used to further enhance predictive accuracy.
•State-of-the-art results are achieved on the NREL OPV and Harvard CEP datasets.
•Generalizability is shown on the ZINC-250k dataset for predicting drug-like properties.

Organic photovoltaic (OPV) solar cells are a rapidly developing technology with promising capabilities relative to leading renewable energy sources. Screening methods for identifying promising donor and acceptor molecules that raise the efficiency of such cells can be substantially accelerated through deep learning. Textual descriptors, specifically Simplified Molecular Input Line Entry System (SMILES) strings, are used as network inputs, while quantum-chemical calculations based on Density Functional Theory (DFT) provide chemically accurate targets for training and testing. We present a Long Short-Term Memory (LSTM) based network that uses a self-attention mechanism and a robust data augmentation routine to predict several OPV optoelectronic properties (e.g. the energies of the highest occupied molecular orbital and the lowest unoccupied molecular orbital). The LSTM cells, coupled with self-attention, learn the successive ordering and pairing of SMILES characters while attending to certain salient constituents of the molecule, which produces a robust understanding of the molecular graph. The Harvard Clean Energy Project (CEP) and National Renewable Energy Laboratory (NREL) OPV datasets are used for this study. The portion of the CEP dataset that we use contains ~1.2E6 candidate donor molecules with their respective DFT-computed properties, whereas the NREL OPV dataset contains ~9.1E4 samples. Compared with contemporary graph-based models, our network reduces the mean absolute error (MAE) over all considered optoelectronic properties on the CEP and NREL OPV datasets by an average of 21.23% and 10.06%, respectively. Furthermore, we demonstrate that our model generalizes well to the ZINC-250k dataset, which is focused on pharmaceutical drug discovery, reducing the MAE across all properties by an average of 28.2% relative to the current state-of-the-art model.
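The summary describes SMILES characters being read in sequence while attention weights single out salient parts of the molecule. The following is a minimal sketch of that idea, not the authors' implementation: the tokenizer regex, the function names, the toy per-token vectors standing in for LSTM outputs, and the dot-product scoring are all assumptions made for illustration, since the abstract does not specify the exact attention form.

```python
import math
import re

# Hypothetical tokenizer: bracket atoms first, then two-letter halogens,
# then any single character. Not a full SMILES grammar.
_TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return _TOKEN_RE.findall(smiles)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Score each per-token vector against a query, then return the
    softmax weights and the weighted sum of the vectors."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(len(query))
    ]
    return weights, context


tokens = tokenize_smiles("c1ccsc1Cl")
# Stand-in for LSTM outputs: one toy 2-d vector per token (assumed, not learned).
hidden = [[float(i), 1.0] for i, _ in enumerate(tokens)]
weights, context = attention_pool(hidden, query=[0.5, 0.0])
```

In a trained network the per-token vectors and the query would be learned, and the pooled context vector would feed a regression head that predicts a property such as the HOMO energy.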
ISSN: 0038-092X, 1471-1257
DOI: 10.1016/j.solener.2021.05.064