USPTO Dataset for: Fast Chemical Reaction Condition Suggestion via Rule-Based Classification and Similarity Search
Main authors: | , , , |
---|---|
Format: | Dataset |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | USPTO database that is analyzed with Rxn-INSIGHT (https://github.com/mrodobbe/Rxn-INSIGHT).
This gzip file contains a very large pandas DataFrame, stored in Parquet format, that can be loaded via pd.read_parquet('uspto_rxn_insight.gzip'). Because of the large size of the data, PyArrow version 13.0 must be used.
To use Parquet in pandas, install PyArrow and fastparquet using pip:
pip install pyarrow==13.0
pip install fastparquet
Fast Chemical Reaction Condition Suggestion via Rule-Based Classification and Similarity Search
The challenge of devising pathways for organic synthesis remains a central issue in the field of medicinal chemistry. Over the span of six decades, computer-aided synthesis planning has given rise to a plethora of potent tools for formulating synthetic routes. Nevertheless, a significant expert task still looms: determining the appropriate solvent, catalyst, and reagents when provided with a set of reactants to achieve and optimize the desired product for a specific step in the synthesis process. Typically, chemists identify key functional groups and rings that exert crucial influences at the reaction center, classify reactions into categories, and may assign them names. This research introduces Rxn-INSIGHT, an open-source algorithm based on the bond-electron matrix approach, with the purpose of automating this endeavor. Rxn-INSIGHT not only streamlines the process but also facilitates extensive querying of reaction databases, effectively replicating the thought processes of an organic chemist. The core functions of the algorithm encompass the classification and naming of reactions, extraction of functional groups, rings, and scaffolds from the involved chemical entities, and the provision of reaction condition recommendations based on the similarity and prevalence of reactions. The performance of our rule-based model has been rigorously assessed against a carefully curated benchmark dataset, exhibiting an accuracy rate exceeding 90% in reaction classification and surpassing 95% in reaction naming. Notably, it has been discerned that a pivotal factor in selecting analogous reactions lies in the analysis of ring structures participating in the reactions. An examination of ring structures within the USPTO chemical reaction database reveals that with just 35 unique rings, a remarkable 75% of all rings found in nearly 1 million products can be encompassed. Furthermore, Rxn-INSIGHT is proficient in suggesting appropriate choices for solvents, catalysts, and reagents in |
---|---|
DOI: | 10.5281/zenodo.10171744 |
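The loading step described in the summary can be sketched as follows. This is a minimal illustration, not code from the Rxn-INSIGHT repository: the helper name `load_uspto` and its `columns` parameter are assumptions introduced here, and the file is assumed to sit in the working directory under the name given in the summary. Only `pd.read_parquet` and its `engine` and `columns` arguments come from pandas itself.

```python
from pathlib import Path


def load_uspto(path="uspto_rxn_insight.gzip", columns=None):
    """Load the Rxn-INSIGHT USPTO DataFrame from its Parquet file.

    `load_uspto` is a hypothetical helper for illustration only;
    it is not part of Rxn-INSIGHT.
    """
    path = Path(path)
    # Fail early with a clear message before importing pandas / PyArrow.
    if not path.is_file():
        raise FileNotFoundError(f"Dataset file not found: {path}")

    # Imported lazily; engine="pyarrow" additionally requires PyArrow
    # (the summary asks for version 13.0).
    import pandas as pd

    # columns=None reads every column; passing a list of column names
    # reads only that subset, which lowers memory use on a large file.
    return pd.read_parquet(path, engine="pyarrow", columns=columns)
```

Once the file has been downloaded, calling `df = load_uspto()` and then inspecting `df.shape` and `df.columns` is a reasonable first check. The column names are not listed in this record, so they should be read from the loaded DataFrame instead of being assumed.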