Pistachio "Fantastic reactions and how to use them" - John Mayfield, Ingvar Lagerstedt and Roger Sayle - NextMove Software
Pistachio “Fantastic reactions and how to use them” John Mayfield, Ingvar Lagerstedt and Roger Sayle NextMove Software NIH Virtual Workshop on Reaction Informatics, May 2021
What is Pistachio?
• A document-centric database of 13.3 million reactions
• Automatically extracted from U.S., European and WIPO patents
• JSON and SMILES provided for bulk analysis/model building
• Containerised WebApp for exploring and querying the data
• Aim is to extract reactions as described in the original document, warts and all
History
• Daniel’s PhD Thesis (2012): Daniel Mark Lowe, Pembroke College, Department of Chemistry. Extraction of chemical structures and reactions from the literature. Dissertation submitted for the degree of Doctor of Philosophy, June 2012. repository.cam.ac.uk/handle/1810/244727
• Original Open-Source Project: dan2097/patent-reaction-extraction
• USPTO CC-Zero Subset (3.7 million): Chemical_reactions_from_US_patents_1976-Sep2016_/5104873
• Pistachio (13.3 million): nextmovesoftware.com/pistachio
We use an internal fork built using LeadMine instead of OSCAR4. This primarily improves chemical entity and physical quantity recognition, spelling correction, etc.
Data Impact
• Christos Nicolaou et al. The Proximal Lilly Collection: Mapping, Exploring and Exploiting Feasible Chemical Space. J. Chem. Inf. Model., 2016, 56 (7), pp 1253–1266
• Nadine Schneider et al. Big Data from Pharmaceutical Patents: A Computational Analysis of Medicinal Chemists’ Bread and Butter. J. Med. Chem., 2016, 59 (9), pp 4385–4402
• Bowen Liu et al. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent. Sci., 2017, 3 (10), pp 1103–1113
• Philippe Schwaller et al. Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci., 2019, 5 (9), pp 1572–1583
• Connor Coley et al. Prediction of Organic Reaction Outcomes Using Machine Learning. ACS Cent. Sci., 2017, 3 (5), pp 434–443
• Philippe Schwaller et al. Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. Sci. Adv., 2021, 7 (15)
• Alessandra Toniato et al. Unassisted noise reduction of chemical reaction datasets. Nat. Mach. Intell. 2021
• Amol Thakkar et al. Artificial intelligence and automation in computer aided synthesis planning. React. Chem. Eng., 2021, 6
Important: The same reaction will occur in application/grant, related patents, sketches/text and different authorities (WIPO/EPO/USPTO). Using RInChI without any role normalisation, there are ~4.2 million distinct reactions. Duplicates are often identical but not always: the description, yield or actions can differ.
Amy Fried and Robert Wilkening, Merck Sharp & Dohme. Estrogen receptor modulators. US 7151196 B2 [0236] (19-Dec-2006), Example 2, Step 2
A solution of 2-(2-hydroxyethyl)-5-methoxy-1-indanone (105 mg, 0.51 mmol) in methanol (2.0 mL) at room temperature was treated with ethyl vinyl ketone (EVK, 0.102 mL) and 0.5M sodium methoxide in methanol (0.204 mL, 0.1 mmol). The mixture was stirred in a capped flask and heated in an oil bath at 60° C. for 8 hours. After cooling, the reaction mixture was diluted with EtOAc (25 mL), washed with 0.2N HCl (15 mL), water (15 mL), and brine (15 mL), dried over MgSO4, filtered, and evaporated under vacuum to afford 2-(2-hydroxyethyl)-5-methoxy-2-(3-oxopentyl)-1-indanone as an oil.

Dann Parker, Ronald Ratcliffe, Kenneth Wildonger and Robert Wilkening, Merck Sharp & Dohme. Estrogen Receptor Modulators. EP 1257264 B1 [0261] (14-Sep-2011), EXAMPLE 34, Step 2
A solution of 2-(2-hydroxyethyl)-5-methoxy-1-indanone (105 mg, 0.51 mmol) in methanol (2.0 mL) at room temperature was treated with ethyl vinyl ketone (EVK, 0.102 mL) and 0.5M sodium methoxide in methanol (0.204 mL, 0.1 mmol). The mixture was stirred in a capped flask and heated in an oil bath at 60°C for 8 hours. After cooling, the reaction mixture was diluted with EtOAc (25 mL), washed with 0.2N HCl (15 mL), water (15 mL), and brine (15 mL), dried over MgSO4, filtered, and evaporated under vacuum to afford 2-(2-hydroxyethyl)-5-methoxy-2-(3-oxopentyl)-1-indanone (138 mg, 93% yield) as an oil.
USPTO vs Pistachio Data

Source               Reactions    Updated
U.S. Grant Text      3,366,399    2021-05-18
U.S. Appl. Text      3,629,411    2021-05-13
WIPO PCT Text        1,520,596    2021-05-06
Euro. Grant Text     1,074,590    2021-05-12
Euro. Appl. Text       702,035    2021-05-12
U.S. Grant Sketch    1,211,521    2021-05-18
U.S. Appl. Sketch    1,834,132    2021-05-13

• Updated quarterly
• EPO, WIPO patents
• USPTO sketches
• NameRxn Classification/AAM
– Improved role assignment
– NameRxn 71.5% coverage
• Example/Step Labels
• Solvent Mixtures
• Solvent associations
• Document Assignees, Targets and Diseases
• Continual tweaks based on feedback
USPTO vs Pistachio Data
Pistachio is a “super-set” of USPTO but not strictly so…
• NameRxn filtering/mapping
• Improved/changed name-to-structure, roles, sectioning
• Structure normalisation differences
• Whack-a-mole/pachinko machine
– Obvious sensible change can have unforeseen consequences
Figure: NextMove’s pachinko machine
Regression Testing
USPTO vs Pistachio Data
• Reactions from USPTO Application Text, 2001 to 22nd Sep 2016
– 1,939,253 CC-Zero Subset
– 2,568,513 Pistachio
• Reactions in common, by matching key (only in CC-Zero, only in Pistachio)
– by SMILES: 458,995 common (-1,480,258, +2,109,518)
– by ~RInChI: 1,386,306 common (-552,947, +1,182,207)
– by norm SMILES: 1,465,946 common (-473,307, +1,102,567)
– by paragraph Id: 1,866,314 common (-72,939, +702,199)
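The overlap counts depend heavily on the key used to decide that two reactions are "the same". The sketch below illustrates the idea with Python sets. The `order_insensitive_key` function is a toy stand-in (it only sorts components within each role) and is not the normalisation or ~RInChI key used for the slide figures; the example reaction strings are made up for illustration.

```python
def order_insensitive_key(rxn_smiles: str) -> str:
    """Toy key: sort the '.'-separated components within each role."""
    reactants, agents, products = rxn_smiles.split(">")
    return ">".join(
        ".".join(sorted(part.split("."))) if part else ""
        for part in (reactants, agents, products)
    )


def overlap(left, right, key=lambda s: s):
    """Return (common, only_left, only_right) counts under the given key."""
    a = {key(s) for s in left}
    b = {key(s) for s in right}
    return len(a & b), len(a - b), len(b - a)


# Illustrative reaction SMILES only, not taken from either dataset.
cc_zero = ["CCO.CC(=O)O>>CC(=O)OCC", "CBr.[OH-]>>CO"]
pistachio = ["CC(=O)O.CCO>>CC(=O)OCC", "CBr.[OH-]>>CO", "CI.[OH-]>>CO"]

raw = overlap(cc_zero, pistachio)
normalised = overlap(cc_zero, pistachio, key=order_insensitive_key)
print("raw SMILES:", raw)
print("order-insensitive:", normalised)
```

The same reaction written with its reactants in a different order counts as a mismatch under the raw string key and as a match under the looser key, which is the pattern seen in the figures above: the looser the key, the larger the common set.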
Overview of extraction
Pipeline stages: Text Extraction, Sectioning, Tagging/Tokenization, Parsing, Action Phrases, Reaction Assembly
• Example from US20020133011A1 [0070]
• Missed break and extra break: examples from WO 2020/239862 A1 (PatentScope OCR)
• Tags shown: UnitType.Percent, UnitType.Mass, UnitType.Percent, QuantityType.Purity, QuantityType.Yield
Action Phrases
Type: Add
– Compounds
• ethyl cyanoacetate (mass=13.56 g)
• ethyl 4-fluorocinnamate (mass=19.4 g)
• sodium ethoxide (mass=2.3 g, vol=50 ml)
– Conditions
• 2-3 minutes
• 60° C.
Type: Heat
– Conditions
• 1 hour
Type: Cool
– Conditions
• 5° C.
Type: Yield
– Compounds
• 2-cyano-3-(flurophenyl)-glutarate (mass=23 g, yield=74%, purity=98%)
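One way to hold action phrases like these in code is a small typed record per action. The sketch below is an assumption for illustration: the class and field names (`Action`, `Compound`, `yield_percent`, and so on) are hypothetical and do not reflect the schema of Pistachio's JSON.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Compound:
    # Hypothetical fields; quantities kept as the text found in the patent.
    name: str
    mass: Optional[str] = None
    volume: Optional[str] = None
    yield_percent: Optional[float] = None
    purity_percent: Optional[float] = None


@dataclass
class Action:
    type: str
    compounds: List[Compound] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)


actions = [
    Action(
        "Add",
        compounds=[
            Compound("ethyl cyanoacetate", mass="13.56 g"),
            Compound("ethyl 4-fluorocinnamate", mass="19.4 g"),
            Compound("sodium ethoxide", mass="2.3 g", volume="50 ml"),
        ],
        conditions=["2-3 minutes", "60° C."],
    ),
    Action("Heat", conditions=["1 hour"]),
    Action("Cool", conditions=["5° C."]),
    Action(
        "Yield",
        compounds=[
            # Name kept exactly as extracted, spelling included.
            Compound(
                "2-cyano-3-(flurophenyl)-glutarate",
                mass="23 g",
                yield_percent=74.0,
                purity_percent=98.0,
            )
        ],
    ),
]

# The product of the reaction is whatever the "Yield" action reports.
product = next(c for a in actions if a.type == "Yield" for c in a.compounds)
print(product.name, product.yield_percent)
```

Keeping the actions as an ordered list preserves the sequence described in the paragraph, which is what the later reaction assembly step needs.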
Preliminary role assignment is based on action, surrounding context and dictionaries (common solvents/catalysts).
ChEMU 2020 Evaluation Lab

                             Exact matching                 Relaxed matching
Run                          F1-score  Precision  Recall    F1-score  Precision  Recall
Task 1                       0.8983    0.9042     0.8924    0.9240    0.9301     0.9181
Task 2                       0.8977    0.9441     0.8556    n/a       n/a        n/a
end-2-end                    0.8026    0.8492     0.7609    0.8196    0.8663     0.7777
end-2-end (after deadline)   0.8255    0.8746     0.7816    0.8420    0.8909     0.7983

Daniel Lowe and John Mayfield. Extraction of reactions from patents using grammars. 2020. http://ceur-ws.org/Vol-2696/paper_221.pdf
Nguyen D.Q. et al. (2020) ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In: Jose J. et al. (eds) Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, vol 12036. Springer, Cham. https://doi.org/10.1007/978-3-030-45442-5_74
Sketch Extraction
Example 26, US 9718816 B2: Step 1, Step 2, Step 3, Step 4, etc.
NextMove’s Praline
John May, et al. Sketchy Sketches: Hiding Chemistry in Plain Sight. Seventh Joint Sheffield Conference on Cheminformatics. 2016
Overview of Filtering/Mapping
Reaction Filtering - Text Rebond 1 >= Precursors = Products = Num Precursors
Reaction Filtering - Sketch Specific Reaction 1 >= Precursors Reject 1 >= Products Sane NameRxn AAM Mapped Fix Roles
ROLE FIX
1) Move all agents to reactants
2) Atom-Atom Mapping - Michael addition (3.11.92)
3) Move unmapped reactants back to agents
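The three steps can be sketched on reaction SMILES as below. This is a minimal illustration under stated assumptions: `fix_roles` and `canned_mapper` are hypothetical names, the mapper is passed in as a plain callable, and the canned lookup merely stands in for a real atom-atom mapper such as NameRxn. The example reaction is invented, not the Michael addition from the slide.

```python
import re

# An atom carries a map number if it looks like "[...:12]".
ATOM_MAP = re.compile(r":\d+\]")


def fix_roles(rxn_smiles, mapper):
    """Reassign roles using atom maps; `mapper` is any callable that
    takes an unmapped 'reactants>>products' string and returns it mapped."""
    reactants, agents, products = rxn_smiles.split(">")
    # 1) Move all agents to reactants.
    candidates = [s for s in reactants.split(".") + agents.split(".") if s]
    # 2) Atom-atom map the reaction.
    mapped = mapper(".".join(candidates) + ">>" + products)
    mapped_reactants, _, mapped_products = mapped.split(">")
    # 3) Move reactants with no mapped atoms back to agents.
    kept, returned = [], []
    for component in mapped_reactants.split("."):
        (kept if ATOM_MAP.search(component) else returned).append(component)
    return ".".join(kept) + ">" + ".".join(returned) + ">" + mapped_products


# Hypothetical stand-in for a real mapper: a canned answer for one input.
CANNED = {
    "CC(=O)Cl.OCC.CCN(CC)CC>>CC(=O)OCC":
        "[CH3:1][C:2](=[O:3])Cl.[OH:4][CH2:5][CH3:6].CCN(CC)CC"
        ">>[CH3:1][C:2](=[O:3])[O:4][CH2:5][CH3:6]",
}


def canned_mapper(unmapped):
    return CANNED[unmapped]


# Ethanol starts out wrongly listed as an agent next to triethylamine.
fixed = fix_roles("CC(=O)Cl>OCC.CCN(CC)CC>CC(=O)OCC", canned_mapper)
print(fixed)
```

After the fix, ethanol sits with the reactants because its atoms appear in the product, while triethylamine, which contributes no atoms, returns to the agents.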
Why NameRxn?
• 1,543 rule based classes - easy to update a mapping disagreement
• Higher precision/lower recall
• Originally for pharmaceutical ELNs ~80%
• Pistachio coverage is ~71.5%
– >77% USPTO appl. text.
• Fast ~380 reactions per second per core
– A few hours to remap entire database
– Speed depends on backend
Example class shown: 4.1.6 Cyclic Beckmann rearrangement
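As a rough check on the "few hours" figure, the quoted throughput can be applied to the 13.3 million reactions in the database. The core counts below are assumptions chosen for illustration; the talk does not say how many cores are used.

```python
REACTIONS = 13_300_000   # database size quoted in the talk
RATE_PER_CORE = 380      # reactions per second per core, as quoted


def remap_hours(reactions, rate_per_core, cores):
    """Wall-clock hours to remap, assuming perfect scaling across cores."""
    return reactions / (rate_per_core * cores) / 3600


single = remap_hours(REACTIONS, RATE_PER_CORE, cores=1)
quad = remap_hours(REACTIONS, RATE_PER_CORE, cores=4)  # assumed core count
print(f"1 core: {single:.1f} h, 4 cores: {quad:.1f} h")
```

On one core this comes to a little under ten hours, so a handful of cores brings a full remap down to a few hours, consistent with the bullet above.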
NameRxn - Magic functional groups
NameRxn was originally written as a classification tool; AAM is a by-product
• For us no answer is better than a wrong answer
• Lowest number of wrong answers (disagreement with gold-standard)
• Yellow bar is so-called “magic group additions” where a product atom is unmapped:
– We didn’t know where a group came from
– Where the group came from was missing
– Stoichiometry (multiple groups from one reactant)
• Aim to indicate this better in bulk data
AMAP bench: Arkadii Lin et al. Atom-to-atom mapping: a benchmarking study of popular mapping algorithms and consensus strategies https://chemrxiv.org/articles/preprint/Atom-to-Atom_Mapping_A_Benchmarking_Study_of_Popular_Mapping_Algorithms_and_Consensus_Strategies/13012679/1
It’s a kind of magic…
Bromo Grignard + nitrile ketone synthesis (3.7.10), EP0200736B1 [0072] Example 1, Step 1
Mappings shown: RxnMapper/Indigo, RxnMapper
Water comes from the quenching: “The reaction mixture is slowly poured into ice cold 10% hydrochloric acid”, “quenched slowly with 2N aq. HCl” (different paragraph)
Symmetry/Stoichiometry NameRxn - 8.2.2 Sulfanyl to sulfonyl RxnMapper US20010000511A1 [0357]
Symmetry/Stoichiometry Handle by reusing atom-maps in the reactant US20010000511A1 [0357]
Symmetry/Stoichiometry Indigo RxnMapper US 03674855 A
AMAP BENCH
US20200071310A1 [0487] Example 34
• Indigo, AMAP bench: Changed: 23, Broken: 13, C-C Broken: 7
• RxnMapper, AMAP bench: Changed: 5, Broken: 3, C-C Broken: 0
Daniel Lowe, Roger Sayle. Evaluating the Quality and Performance of Automatic Atom Mapping Algorithms. 244th ACS National Meeting & Exposition. Aug 2012
Ambiguous Names
8-(3,5-Bis-trifluoromethyl-benzoyl)-3-furan-2-yl-methyl-1-o-tolyl-1,3,8-triaza-spiro[4.5]decane-2,4-dione
Indigo/RxnMapper AMAP bench: Changed: 4, Broken: 2, C-C Broken: 1
Ambiguous Names
8-(3,5-Bis-trifluoromethyl-benzoyl)-3-furan-2-ylmethyl-1-o-tolyl-1,3,8-triaza-spiro[4.5]decane-2,4-dione
1.2.9 Alcohol + amine condensation
NameRxn/Indigo/RxnMapper AMAP bench: Changed: 2, Broken: 1, C-C Broken: 0
Case Study: US 2020/0087299 A1 (Examples 1 to 33)
• NameRxn 127/154 82.4%, Indigo 15/154 9.7%, Reject 12/154 7.7%
• NameRxn 132/154 85.7%, Indigo 10/154 6.4%, Reject 12/154 7.7%
Example 27 Typo: “tert-butyl” Typo: “tert-butyl”
Example 12 Step 1 Small Product (8 heavy atoms) Typo: “methyl”
Data STORAGE TIPS
Hierarchical Data
Hierarchical data is indexed in the WebApp: NameRxn Tags, Assignees, Diseases (MeSH), Targets (ChEMBL), IPC Codes
A simple way to store and search the data is to use a nested identifier string, e.g. LIKE '11.%' pulls back all AstraZeneca and related companies:
11 AstraZeneca
11.5 Imperial Chemical Industries
11.7 MedImmune
...
NameRxn is handled slightly differently, we pack the three level number into an integer (lvl1
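The nested identifier scheme can be tried out with an in-memory SQLite table. This is a sketch under assumptions: the table and column names are hypothetical, and the row with id 110 is invented to show that the dot in the pattern keeps unrelated identifiers out. Note that `LIKE '11.%'` on its own returns only the children, so the query below also matches the root id exactly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table layout for illustration only.
conn.execute("CREATE TABLE assignee (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO assignee VALUES (?, ?)",
    [
        ("11", "AstraZeneca"),
        ("11.5", "Imperial Chemical Industries"),
        ("11.7", "MedImmune"),
        ("110", "Hypothetical unrelated company"),  # invented row
    ],
)


def subtree(conn, root):
    """Names of the root entry and everything nested beneath it."""
    rows = conn.execute(
        "SELECT name FROM assignee WHERE id = ? OR id LIKE ? ORDER BY id",
        (root, root + ".%"),
    )
    return [name for (name,) in rows]


names = subtree(conn, "11")
print(names)
```

Because the pattern includes the separator, id 110 is not pulled back even though it shares the prefix "11".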
NameRxn concepts and rxno
1 Heteroatom alkylation and arylation
  .7 O-substitution
    .1 Chan-Lam ether coupling
    .2 Diazomethane esterification
    .3 Ethyl esterification
    .4 Hydroxy to methoxy
    .5 Hydroxy to triflyloxy
    .6 Methyl esterification
    .n
2 Acylation and related processes
  .6 O-acylation to ester
    .1 Ester Schotten-Baumann
    .2 Esterification (generic)
    .3 Fischer-Speier esterification
    .4 Baeyer-Villiger oxidation
    .5 Yamaguchi esterification
    .6 Hydroxy to imidazolecarbonyloxy
    .7 Imidazolecarbonyl to ester
    .8 Hydroxy to acetoxy
    .9 Steglich esterification
    .n
CINF 13, ACS Fall 2017, Washington, D.C.
NameRxn concepts and rxno
RXNO concepts shown against the same NameRxn hierarchy: Esterification (7), Chan-Lam coupling (3), Schotten-Baumann Reaction (9)
RXNO: http://github.com/rsc-ontologies/rxno
Summary
We always welcome feedback if you spot a mistake!
• It’s a long tail but many things are simple changes that are fixed when rerun
• Lots of people are “cleaning” the data; we’d rather know what was wrong and whether we can fix it
Plans
• Reaction sketch compound numbers
• Better quality indication
• Integrate RxnMapper, AMAP bench indicators, Boot-strapping sequences
• Handle reactions from non-English patents
• General procedures/example references, currently only resolve compounds
Acknowledgements
Daniel Lowe (MineSoft)
Richard Gowers (NextMove Software)