White paper:
Computational Analysis of Synthetic Planning: Past and Future

Adapted from
Wang, Z., Zhang, W. and Liu, B. (2021), Computational Analysis of Synthetic Planning: Past and Future. Chin. J. Chem., 39: 3127-3143. https://doi.org/10.1002/cjoc.202100273
Published with the courtesy of Wiley.
Computer-aided synthesis planning (CASP) can play a significant role in organizing and leveraging the flood of novel chemical reactions and expert reaction rules for planning novel and highly efficient syntheses of natural products and drug candidates. This review describes the progress in computational analysis of synthetic planning, from early rule-based programs to machine learning and the combination of the two.
Introduction
Chemists use retrosynthetic analysis to design a synthetic strategy for a target compound. Briefly, they draw on their experience to break chemical bonds in the target compound and in each subsequent precursor in an iterative manner.
Various standardized tools (e.g., CML, SMILES, SMARTS, InChI and ECFP) translate chemical reactions and molecules into machine-readable information. More advanced algorithms (e.g., neural networks, reinforcement learning) expand the data processing of chemical reactions.
This review covers three categories of CASP. Two categories use logical deduction from chemists’ intuitions and experiences: CASP algorithms based on hand-coded rules or on automatically extracted rules. The third CASP category uses chemical reaction databases for training of machine learning (ML) algorithm(s).
General Structure of CASP System
A typical CASP system has four modules. The reaction template database stores known reactions with rules of bond breaking. The retrosynthetic module aligns known reactions in the template database with structures of input molecules and provides the closest match to commercially available precursors in an iterative manner. The tree guide and evaluation module assesses the fit of the candidate precursors to the synthetic routes. The commercially available compound database provides the stopping criterion for the retrosynthetic module.
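The interplay of the four modules can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in: molecules are plain labels rather than structures, `TEMPLATES` plays the role of the reaction template database, `PURCHASABLE` plays the role of the commercially available compound database, and `score` is a toy evaluation function. It is not taken from any of the CASP programs discussed in this review.

```python
from typing import Dict, List, Optional

# Hypothetical reaction template database: product -> candidate precursor sets.
TEMPLATES: Dict[str, List[List[str]]] = {
    "target": [["intermediate_A", "reagent_B"], ["intermediate_C"]],
    "intermediate_A": [["reagent_D", "reagent_E"]],
    "intermediate_C": [["unavailable_X"]],
}

# Hypothetical commercially available compound database (stopping criterion).
PURCHASABLE = {"reagent_B", "reagent_D", "reagent_E"}


def score(precursors: List[str]) -> int:
    """Toy evaluation module: fewer non-purchasable precursors is better."""
    return sum(1 for p in precursors if p not in PURCHASABLE)


def plan(molecule: str, depth: int = 0, max_depth: int = 5) -> Optional[dict]:
    """Toy retrosynthetic module: return a route tree, or None if none is found."""
    if molecule in PURCHASABLE:
        return {"molecule": molecule, "buy": True}
    if depth >= max_depth:
        return None
    # Try the best-scoring disconnections first.
    for precursors in sorted(TEMPLATES.get(molecule, []), key=score):
        children = [plan(p, depth + 1, max_depth) for p in precursors]
        if all(child is not None for child in children):
            return {"molecule": molecule, "buy": False, "precursors": children}
    return None


route = plan("target")
print([child["molecule"] for child in route["precursors"]])
```

In this toy setup the recursion stops as soon as every leaf is a purchasable compound, which is the role the text assigns to the commercially available compound database.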
Hand-coded Rules Combined with Logical Algorithm
Representative CASP systems include LHASA, SECS, IGOR, CHIRON, and Chematica/Synthia™. Both the LHASA and SECS systems included a communication module: an interfaced writing pad that let chemists evaluate and select the best route from the synthetic tree.
IGOR (Intermediate Generation of Organic Reactions) did not restrict retrosynthetic analysis to empirically derived heuristic rules. IGOR includes all molecules participating in a reaction, requires extensive calculations, and can simulate only simple retrosynthetic transformations.
CHIRON can decode complex stereochemistry and functionality, which it can correlate to commercially available stereochemistry-enriched precursors. It searches for precursors whose skeletons, stereocenters, and functional groups are closely related to those of the target molecule.
Chematica (now called Synthia™) has expanded the Network of Organic Chemistry (NOC) to approx. 10 million compounds and manually added compatibility and context information (e.g., canonical conditions, intolerance of functional groups, regio- and stereoselectivity of specific reactions) using the SMILES/SMARTS coding method. Its hand-coded reaction rules increased to >100,000 in 2021. The intelligent search function and chemical scoring functions embedded in Chematica/Synthia™ allow globally optimal outcomes (e.g., a chiral precursor for asymmetric synthesis).
Chematica/Synthia™ presents the synthetic tree in a dendritic way: each node denotes the retrosynthetic transformation and its associated substrate set (Fig. 1a). Chematica/Synthia™ accelerates the analytic process with a priority queue that expands the lowest-scoring nodes first in the search algorithm (Fig. 1b).
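A priority queue over search nodes can be sketched as a generic best-first search. The network `RETRO`, the step costs, and the stock set `STOCK` below are invented for illustration, and the additive cost is only a placeholder for a real scoring function; this is not the search or scoring scheme of the actual program.

```python
import heapq
from itertools import count

# Hypothetical network: product -> list of (step_cost, precursors).
RETRO = {
    "T": [(3.0, ("A", "B")), (1.0, ("C",))],
    "A": [(1.0, ("S1",))],
    "C": [(5.0, ("S2",)), (1.0, ("D",))],
    "D": [(1.0, ("S1", "S2"))],
}
STOCK = {"B", "S1", "S2"}  # hypothetical purchasable compounds


def best_first(target):
    """Always expand the lowest-scoring node; return (cost, steps) or None."""
    tie = count()  # tie-breaker so the heap never compares the payloads
    queue = [(0.0, next(tie), (target,), [])]
    seen = set()
    while queue:
        cost, _, substrates, steps = heapq.heappop(queue)
        # Each node carries its substrate set; purchasable ones are resolved.
        substrates = tuple(m for m in substrates if m not in STOCK)
        if not substrates:
            return cost, steps
        key = frozenset(substrates)
        if key in seen:
            continue
        seen.add(key)
        mol, rest = substrates[0], substrates[1:]
        for step_cost, precursors in RETRO.get(mol, []):
            node = (cost + step_cost, next(tie), rest + precursors,
                    steps + [(mol, precursors)])
            heapq.heappush(queue, node)
    return None


print(best_first("T"))
```

Because the cheapest open node is always popped first, the first fully resolved node returned in this toy example is also the lowest-cost route.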
Chematica/Synthia™ includes various quantum mechanics and machine learning (ML) methods to optimize the searching algorithm, scoring functions, and stereoselective transformations. Chematica/Synthia™ designed synthetic routes for eight drug-related molecules and several complex natural products, and these syntheses were accomplished experimentally. The Synthia™ program also designed a more efficient synthetic route for OICR-9429 (Fig. 2). The literature route gave a 1% yield of OICR-9429, whereas the Synthia™ route gave 60%. Furthermore, the Synthia™-designed route simplified purification from four chromatographic procedures to one recrystallization. Thus, Grzybowski and coworkers demonstrated that Chematica/Synthia™ can solve complex problems in synthetic chemistry.
Manual extraction of reaction templates can broaden the context information of chemical reactions and enhance retrosynthetic analyses. The choice between automatic and manual extraction depends on the consistent description of variables and the desired applications.

Automatically Extracted Rules Combined with Logical Algorithm
Daily auto-extraction of new chemical reactions and templates can maintain databases efficiently, but it may miss adjacent functional groups and atoms.
SYNCHEM2 allows both backward and forward synthetic transformations with alternate coding. RETROSYN abstracts the reaction center and builds atomic correlation between products and reactants with a special graph-difference algorithm. RETROSYN searches for matches and ranks them by degree of matching from high to low, but it ignores stereochemistry.
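The general idea of locating a reaction center by a graph difference can be sketched as follows. This is a generic illustration, not RETROSYN's actual algorithm: it assumes both sides of the reaction are already atom-mapped and given as hypothetical bond tables of the form `{(atom_i, atom_j): bond_order}`.

```python
def bond_changes(reactant_bonds, product_bonds):
    """Compare two atom-mapped bond tables and report the reaction center."""
    def normalize(bonds):
        # Store each bond with its atom indices in ascending order.
        return {tuple(sorted(pair)): order for pair, order in bonds.items()}

    r, p = normalize(reactant_bonds), normalize(product_bonds)
    broken = sorted(b for b in r if b not in p)
    formed = sorted(b for b in p if b not in r)
    changed = sorted(b for b in r if b in p and r[b] != p[b])
    center = sorted({atom for b in broken + formed + changed for atom in b})
    return {"broken": broken, "formed": formed,
            "changed": changed, "center": center}


# Toy substitution: C(1)-Br(2) is broken and C(1)-O(3) is formed.
print(bond_changes({(1, 2): 1}, {(3, 1): 1}))
```

The atoms listed under `center` are those touching any broken, formed, or order-changed bond, which is the part of the reaction a template would need to capture.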
KOSP (Knowledge-base-Oriented System for Synthesis Planning) automatically extracts reaction templates, including activating groups/atoms within three bond distances, to populate the Reaction Knowledge Base. The new KOSP version enables regio- and stereoselective retrosynthesis analysis, and updates have expanded the reaction contents by 10-fold.
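Collecting the atoms that lie within a fixed number of bonds of a reaction center amounts to a bounded breadth-first search over the molecular graph. The sketch below assumes a hypothetical adjacency-list representation (`{atom_index: [neighbor_indices]}`) and a default radius of three bonds; it is an illustration of the idea, not KOSP's implementation.

```python
from collections import deque


def atoms_within(adjacency, center_atoms, max_bonds=3):
    """Return {atom: bond distance} for atoms within max_bonds of the center."""
    distance = {atom: 0 for atom in center_atoms}
    queue = deque(center_atoms)
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_bonds:
            continue  # do not expand beyond the allowed radius
        for neighbor in adjacency.get(atom, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return distance


# Toy linear chain of six atoms: 1-2-3-4-5-6, reaction center at atom 1.
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
print(atoms_within(chain, [1]))  # {1: 0, 2: 1, 3: 2, 4: 3}
```

Atoms 5 and 6 are excluded because they sit four and five bonds from the center, so they would not be part of the extracted template in this toy case.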
ChemPlanner, the successor of ARChem, has an exclusive cooperation with the American Chemical Abstracts Service and Wiley for SciFinder, a highly accessible database of scientist-curated reaction content. The new ChemPlanner version enables regio- and stereoselective retrosynthesis analysis.

ICSYNTH represents its reaction knowledge database in graph-based form. Users can include in-house chemical rules from their own confidential reaction databases and adapt ICSYNTH for various application scenarios by selecting and editing chemical rules.
ASKCOS calculates the similarity of reaction products with the target molecule to develop a retrosynthetic plan in a stepwise manner. Modules of ASKCOS include One-Step Retrosynthesis, Interactive Path Planning, Tree Builder, and Context Recommendation.
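A similarity ranking of this kind can be sketched with the Tanimoto coefficient over fingerprint bit sets. Both the choice of Tanimoto and the toy fingerprints are assumptions made for illustration; the text above does not specify which similarity measure is used.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def rank_precedents(target_fp, precedent_fps):
    """Sort known reaction products by similarity to the target, best first."""
    scored = [(tanimoto(target_fp, fp), name)
              for name, fp in precedent_fps.items()]
    return sorted(scored, reverse=True)


# Hypothetical fingerprints (sets of bit indices), not computed from structures.
target = {1, 2, 3, 4}
precedents = {"product_1": {1, 2, 3}, "product_2": {7, 8}}
print(rank_precedents(target, precedents))
# [(0.75, 'product_1'), (0.0, 'product_2')]
```

The highest-ranked precedent would then suggest which known transformation to try first on the target in a stepwise plan.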
Automatically Extracted Rules Combined with Machine Learning Algorithm
ML algorithms are trained with chemical reaction databases including reactants. Reinforcement learning algorithms continuously interact with the environment which teaches them the optimal strategy via a penalty-reward approach.
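The penalty-reward idea can be made concrete with a minimal tabular Q-learning sketch. The tiny network, the rewards (negative values act as cost penalties, the positive value rewards reaching purchasable precursors), and all hyperparameters are invented for illustration and do not correspond to any program described here.

```python
import random

# Hypothetical environment: state -> {action: (next_state, reward)}.
NETWORK = {
    "target": {"via_A": ("A", -1.0), "via_C": ("C", -4.0)},
    "A": {"buy_precursors": ("done", 10.0)},
    "C": {"buy_precursors": ("done", 10.0)},
}


def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    """Learn action values by repeated interaction with the toy environment."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in acts} for s, acts in NETWORK.items()}
    for _ in range(episodes):
        state = "target"
        while state != "done":
            actions = list(q[state])
            if rng.random() < epsilon:
                action = rng.choice(actions)             # explore
            else:
                action = max(actions, key=q[state].get)  # exploit
            next_state, reward = NETWORK[state][action]
            future = 0.0 if next_state == "done" else max(q[next_state].values())
            # Move the estimate toward reward plus discounted future value.
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q


q = q_learning()
print(max(q["target"], key=q["target"].get))  # via_A
```

After training, the greedy choice at the target is the cheaper disconnection, because its accumulated penalty is smaller for the same final reward.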
The Bishop program combines rule-based retrosynthetic analysis and reinforcement learning. The Chemical Reaction Network compiles the intermediates, connects reactants and products, and has a reinforcement learning module to map flexibly defined, optimal reaction pathway(s) with potential filters for cost, overall efficiency, and/or environmental impacts.
3N-MCTS (Monte Carlo Tree Search algorithm) uses artificial neural networks (ANNs) trained on digital sequences of products and relevant precursors from the literature. The ANN-based CASP system reorganizes the specific learned reaction rules, which simplifies the calculation process. Each MCTS round consists of Selection, Expansion, Rollout, and Update. Improvements are needed to predict stereoselectivities.
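The four phases of an MCTS round can be sketched on a toy retrosynthesis problem. This is a simplification under stated assumptions: it uses a standard UCB selection rule and random rollouts where 3N-MCTS uses neural networks for guidance, and the network `RETRO` and stock set `STOCK` are hypothetical.

```python
import math
import random

RETRO = {"T": [("A", "B"), ("C",)], "A": [("S1",)], "C": [("X",)]}  # X: dead end
STOCK = {"B", "S1"}


def unresolved(state):
    return tuple(m for m in state if m not in STOCK)


def successors(state):
    mols = unresolved(state)
    return [mols[1:] + p for p in RETRO.get(mols[0], [])] if mols else []


class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = unresolved(state), parent
        self.children, self.untried = [], successors(self.state)
        self.visits, self.value = 0, 0.0


def ucb(child, parent_visits, c=1.4):
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return child.value / child.visits + explore


def rollout(state, rng, max_depth=10):
    """Random playout: 1.0 if everything becomes purchasable, else 0.0."""
    for _ in range(max_depth):
        if not unresolved(state):
            return 1.0
        options = successors(state)
        if not options:
            return 0.0
        state = rng.choice(options)
    return 1.0 if not unresolved(state) else 0.0


def mcts(target, rounds=200, seed=0):
    rng = random.Random(seed)
    root = Node((target,))
    for _ in range(rounds):
        node = root
        # Selection: descend through fully expanded nodes by UCB.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
        # Expansion: add one untried child, if any.
        if node.untried:
            child = Node(node.untried.pop(), parent=node)
            node.children.append(child)
            node = child
        # Rollout: estimate the value of the new node.
        reward = rollout(node.state, rng)
        # Update: propagate the result back to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return best.state, best.value / best.visits


print(mcts("T"))  # (('A',), 1.0)
```

The most visited child of the root corresponds to the disconnection whose rollouts most often end in purchasable compounds, here the route through A rather than the dead end through C.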
A Seq2Seq model with Simplified Molecular Input Line-Entry System (SMILES) translation can process massive datasets and simulate a reaction with a globally optimal output. AutoSynRoute evaluates synthetic pathways by applying the MCTS algorithm with Chematica/Synthia™-inspired heuristic scoring functions. RXN uses two retrosynthetic ML models trained on two databases. RXN can also predict suitable reaction conditions for the proposed synthetic route.
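Treating retrosynthesis as SMILES translation requires splitting each SMILES string into tokens before it is fed to a sequence model. The regular expression below is a simplified pattern written for this illustration; it does not cover the full SMILES grammar and is not the tokenizer of any specific program named above.

```python
import re

# Simplified token pattern (an assumption): bracket atoms, two-letter halogens,
# two-digit ring closures, single-letter atoms, ring digits, bonds and branches.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+()/\\.@:~*$>]"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is skipped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return tokens


def to_sequence(smiles):
    """Space-separated token sequence, a common text input for Seq2Seq models."""
    return " ".join(tokenize(smiles))


print(tokenize("CC(=O)Cl"))       # ['C', 'C', '(', '=', 'O', ')', 'Cl']
print(to_sequence("c1ccccc1Br"))  # c 1 c c c c c 1 Br
```

In a translation setting, the tokenized product would serve as the source sequence and the tokenized precursors as the target sequence.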
Conclusions
Several CASP programs (e.g., Chematica/Synthia™) apply heuristic reaction rules and reaction rules from the literature in their algorithms for retrosynthetic chemistry, with or without scoring functions and ML. Other CASP programs rely on ML or on the combination of ML with heuristic reaction rules and/or literature-based chemical rules. These algorithms have already provided novel synthetic routes that improved yield for complicated molecules. Further improvements can provide novel synthetic routes for complex compounds with additional constraints such as lower cost, lower environmental footprint, and fewer hazardous reagents or solvents.