White paper:

Synergy Between Expert and Machine-Learning Approaches Allows for Improved Retrosynthetic Planning

Adapted from
T. Badowski, E. P. Gajewska, K. Molga, B. A. Grzybowski, Angew. Chem. Int. Ed. 2020, 59, 725. https://onlinelibrary.wiley.com/doi/10.1002/anie.201912083
Published with the courtesy of Wiley.

Grzybowski and colleagues demonstrate that computer-designed multistep synthetic plans achieve higher synthetic accuracy when artificial intelligence (AI) software combines expert knowledge with machine-extracted information from large repositories of reaction types.

Introduction

Artificial intelligence (AI) platforms for computer-designed synthetic plans seek commercially available precursor materials, assess individual synthetic steps, and evaluate the vast synthetic possibilities offered by their resources and training materials. Scoring functions (SFs), which guide the development of the plans, are integral components of these platforms. Historically, such platforms have built synthetic plans either from expert synthesis knowledge or from synthetic pathways reported in the literature, for example in chemical repositories. Each data source has advantages and limitations.

Expert synthetic knowledge, although heuristic, usually reflects chemists’ intuition about successful synthetic plans. Chemists’ preferences include central disconnections, reductions in the number of rings and stereocenters, and, often, multiple steps that mask and unmask pertinent reactive groups.

In comparison, machine-learning scoring functions based on the literature focus on popular reaction types with sufficient references, and the AI uses neural network (NN) algorithms to identify one or more synthetic plans. NN-based SFs compile information about reactions and final products from a specific database, such as that of the USPTO (US Patent and Trademark Office). Their output gives the probability of specific reactions (identified by reaction IDs), but it may be dominated by popular reactions and miss more efficient reactions known to chemists.
 

Characteristics of AI training materials for combining machine learning from experts and NN

The NN is trained on analogous product and substrate data from both sources: reactions from the literature and high-quality reaction rules from experts. All analyses used approx. 1.6 million reactions reported to synthesize approx. 1.4 million unique products (from simple chemicals to complex natural products). Protection and deprotection reactions from either source were excluded to avoid their overuse in the synthetic plans. Grzybowski and colleagues required that each reaction included from the literature agree with the expert reaction rule(s) of at least one of the 75,000 procedures in Chematica. The SF-based output may include a synthetic plan involving alternative reaction rules from Chematica, which is now called Synthia™ and is commercially available.

The analyses provided an average of approx. 60 conflict-free, product-fitting reactions per product. In total, Grzybowski and colleagues considered approx. 85 million conflict-free reactions of high chemical quality in developing synthetic plans for the 1.4 million products. The product set was randomly divided into 70% for training, 10% for validation, and 20% for testing.
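The random 70/10/20 division of the product set can be illustrated with a short Python sketch. The function name `split_products`, the fixed seed, and the use of integer product IDs are assumptions made for illustration only; the source does not describe how the authors implemented the split.

```python
import random

def split_products(product_ids, seed=0, train_pct=70, valid_pct=10):
    """Shuffle product IDs and cut them into train/validation/test lists.

    Hypothetical helper: the remaining percentage (here 20%) becomes the test set.
    Integer percentages avoid floating-point rounding at the cut points.
    """
    ids = list(product_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = len(ids) * train_pct // 100
    n_valid = len(ids) * valid_pct // 100
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test

# Toy example with 1,000 placeholder product IDs instead of 1.4 million.
train_ids, valid_ids, test_ids = split_products(range(1000))
print(len(train_ids), len(valid_ids), len(test_ids))  # 700 100 200
```

Splitting by product, as the text describes, keeps every reaction that leads to a given product inside a single subset.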

The authors’ program (ICHO) has an NN-based scoring function with four layers: three hidden layers that provide possible reactions for producing product 1 (P1), P2, and P3, and an output layer (Fig. 1, left panel). The enhanced program (ICHO+) augments the ICHO NN architecture with expert knowledge of chemically intuitive reactions, including the number of rings created or destroyed, the number of stereocenters installed or removed, the selectivity of the reaction, and the relative sizes of the breakdown products (similar vs. very disparate). ICHO+ thus balances how often specific reactions are used for a particular product in the literature against how often they appear in expert synthetic plans. During training, ICHO and ICHO+ assign larger probabilities to specific reactions found in both the literature and expert synthetic plans. Conversely, the program lowers the probability of a very popular expert rule that is rarely used to synthesize a particular product, since this rarity suggests that the reaction may be tricky, challenging to execute, or inefficient.
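One way to picture how the expert descriptors listed above could be supplied to a network is to append them to a structural fingerprint as extra input values. The sketch below is an assumption-laden illustration, not the ICHO+ implementation: the class `ExpertFeatures`, its field names, the `size_balance` ratio, and the `build_input` helper are all hypothetical, and the source does not specify how the features are encoded.

```python
from dataclasses import dataclass

@dataclass
class ExpertFeatures:
    """Hand-crafted descriptors of one candidate reaction (hypothetical encoding)."""
    rings_changed: int          # rings created (+) or destroyed (-)
    stereocenters_changed: int  # stereocenters installed (+) or removed (-)
    is_selective: bool          # whether the reaction is considered selective
    substrate_sizes: tuple      # e.g. heavy-atom counts of the breakdown products

    def size_balance(self) -> float:
        """1.0 for equally sized substrates, close to 0 for very disparate ones."""
        if not self.substrate_sizes:
            return 0.0
        return min(self.substrate_sizes) / max(self.substrate_sizes)

    def as_vector(self):
        return [
            float(self.rings_changed),
            float(self.stereocenters_changed),
            1.0 if self.is_selective else 0.0,
            self.size_balance(),
        ]

def build_input(fingerprint_bits, features):
    """Concatenate a fingerprint (assumed bit list) with the expert descriptors."""
    return [float(b) for b in fingerprint_bits] + features.as_vector()

# Toy 4-bit fingerprint plus descriptors for a reaction that forms one ring,
# installs two stereocenters, is selective, and gives fragments of 12 and 8 atoms.
x = build_input([1, 0, 1, 1], ExpertFeatures(1, 2, True, (12, 8)))
print(len(x))  # 8
```

In this arrangement the network sees both what the molecule looks like and how a chemist would judge the disconnection, which is the general idea the paragraph above attributes to ICHO+.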
 

Performance of AI platforms

Figure 1 compares the NN architectures of ICHO/ICHO+ and of the NN-based program by Segler and Waller, denoted SW [1,2]. The SW platform and the other NN-based synthesis-planning platforms published by 2019 learn only from literature precedents. Most AI programs, including ICHO and SW, use a popular machine-learning activation function called the exponential linear unit (ELU), which accelerates training and improves the performance of the program. ICHO+ was also compared with an updated version of a heuristic scoring scheme, originally called SMILES, that assesses the simplicity of the synthesis plan. The updated scheme, called SMALLER, favors central disconnections, which mimics the synthetic intuition and practice of organic chemists. One advantage of SMALLER is that the frequency of reactions in the literature has minimal influence on the ultimate proposed route.
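For readers unfamiliar with the ELU activation mentioned above, its standard definition is ELU(x) = x for x > 0 and alpha * (exp(x) - 1) otherwise. The minimal Python sketch below uses the common default alpha = 1.0; the value of alpha used in ICHO or SW is not stated in the source, so it is an assumption here.

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential linear unit: identity for positive inputs,
    smooth saturation toward -alpha for negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

print([round(elu(v), 4) for v in (-2.0, 0.0, 1.5)])  # [-0.8647, 0.0, 1.5]
```

Unlike a plain rectifier, ELU returns small negative values for negative inputs instead of zero, which keeps gradients flowing through those units during training.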

 

Within the ICHO and SW programs, adding the heuristic expert chemical rules (ICHO+, SW+) only marginally improved the efficiency of the synthetic plans. Limiting the SW programs to product-fitting reactions (SW2, SW2+) improved their performance. However, ICHO+ remained the highest-ranked program, likely due to its additional knowledge of substrates.
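The effect of limiting a scoring function to product-fitting reactions can be sketched as follows: raw network scores are kept only for the reaction rules that can actually produce the product, and the probabilities are normalized over that subset. This is a generic masked-softmax illustration under stated assumptions, not the SW2 implementation; the function `fitting_probabilities`, the rule names, and the scores are hypothetical.

```python
import math

def fitting_probabilities(scores, fits_product):
    """Softmax restricted to reaction rules flagged as product-fitting.

    scores: dict mapping rule ID -> raw network score (hypothetical values)
    fits_product: dict mapping rule ID -> True if the rule matches the product
    """
    allowed = {rid: s for rid, s in scores.items() if fits_product.get(rid, False)}
    if not allowed:
        return {}
    top = max(allowed.values())  # subtract the maximum for numerical stability
    weights = {rid: math.exp(s - top) for rid, s in allowed.items()}
    total = sum(weights.values())
    return {rid: w / total for rid, w in weights.items()}

scores = {"rule_A": 2.0, "rule_B": 1.0, "rule_C": 3.0}
fits = {"rule_A": True, "rule_B": True, "rule_C": False}

probs = fitting_probabilities(scores, fits)
ranked = sorted(probs, key=probs.get, reverse=True)
print(ranked)  # ['rule_A', 'rule_B']
```

Here `rule_C` has the highest raw score but cannot produce the product, so it receives no probability and the remaining rules are re-ranked among themselves.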

The three types of programs were evaluated on the design of synthetic pathways, covering both experimentally established reactions and relatively advanced syntheses. Synthetic plans for four complex products developed by the ICHO+, SW2+, and SMALLER programs are compared in Figure 2. ICHO+ ranked highest for all four products: the BRD 7/9 inhibitor, the serotonin–norepinephrine reuptake inhibitor (+)-synosutine, the natural product seimatopolide A, and the prostaglandin analogue bimatoprost.

Summary

Grzybowski and colleagues compared their NN-based ICHO+ scoring function, which combines chemical AI with expert knowledge including reaction rules, with other NN-based scoring programs for designing syntheses of complex molecules. Their examples demonstrate a major advantage of combining chemical AI with expert knowledge: the program’s ability to propose synthetically powerful reactions that appear only sparsely in the literature. Chematica has since been updated and is now called Synthia™. It provides AI retrosynthesis software that can also use a custom inventory or database (e.g., an in-house database of confidential reactions) in addition to several publicly available databases.

References

[1] Segler, M.H.S. et al. (2018). Planning chemical syntheses with deep neural networks and symbolic AI. Nature. DOI:10.1038/nature25978.

[2] Segler, M.H.S. and Waller, M.P. (2017). Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. Chemistry - A European Journal. DOI:10.1002/chem.201605499.