Pistachio "Fantastic reactions and how to use them" - John Mayfield, Ingvar Lagerstedt and Roger Sayle - NextMove Software

Pistachio
     “Fantastic reactions and how to use them”
John Mayfield, Ingvar Lagerstedt and Roger Sayle
                  NextMove Software

         NIH Virtual Workshop on Reaction Informatics, May 2021
What is Pistachio?
A document-centric database of 13.3 million reactions

 Automatically extracted from U.S., European and WIPO patents

 JSON and SMILES provided for bulk analysis/model building

 Containerised WebApp for exploring and querying the data

 The aim is to extract reactions as described in the original document,
 warts and all

History

Daniel’s PhD Thesis (2012)
   “Extraction of chemical structures and reactions from the literature”
   Daniel Mark Lowe, Pembroke College, Department of Chemistry
   This dissertation is submitted for the degree of Doctor of Philosophy, June 2012
   repository.cam.ac.uk/handle/1810/244727

Original Open-Source Project
   dan2097/patent-reaction-extraction

USPTO CC-Zero Subset (3.7 million)
   Chemical_reactions_from_US_patents_1976-Sep2016_/5104873

Pistachio (13.3 million)
   nextmovesoftware.com/pistachio

We use an internal fork built using LeadMine instead of OSCAR4.
Primarily improves chemical entity and physical quantity recognition,
spelling correction, etc.

Data Impact
     Christos Nicolaou et al. The Proximal Lilly Collection: Mapping, Exploring and Exploiting
     Feasible Chemical Space. J. Chem. Inf. Model., 2016, 56 (7), pp 1253–1266

     Nadine Schneider et al. Big Data from Pharmaceutical Patents: A Computational Analysis of
     Medicinal Chemists’ Bread and Butter. J. Med. Chem., 2016, 59 (9), pp 4385–4402

     Bowen Liu et al. Retrosynthetic reaction prediction using neural sequence-to-sequence
     models. ACS Cent. Sci., 2017, 3 (10), pp 1103–1113

     Philippe Schwaller et al. Molecular transformer: A model for uncertainty-calibrated chemical
     reaction prediction. ACS Cent. Sci., 2019, 5 (9), pp 1572–1583

     Connor Coley et al. Prediction of Organic Reaction Outcomes Using Machine Learning. ACS
     Cent. Sci., 2017, 3 (5), pp 434–443

     Philippe Schwaller et al. Extraction of organic chemistry grammar from unsupervised learning
     of chemical reactions. Sci. Adv., 2021, 7 (15)

     Alessandra Toniato et al. Unassisted noise reduction of chemical reaction datasets. Nat. Mach.
     Intell. 2021

     Amol Thakkar et al. Artificial intelligence and automation in computer aided synthesis
     planning. React. Chem. Eng., 2021, 6
Important: the same reaction can occur in both application and grant,
in related patents, in both sketches and text, and under different
authorities (WIPO/EPO/USPTO). Counting unique reactions by RInChI,
without any role normalisation, gives ~4.2 million.

Duplicates are often identical but not always: the description, yield
or actions can differ.
Amy Fried and Robert Wilkening
     Merck Sharp & Dohme
     Estrogen receptor modulators.
     US 7151196 B2 [0236] (19-Dec-2006)
     Example 2, Step 2

     A solution of 2-(2-hydroxyethyl)-5-methoxy-1-indanone (105 mg, 0.51 mmol) in methanol (2.0 mL) at room temperature was treated with
     ethyl vinyl ketone (EVK, 0.102 mL) and 0.5M sodium methoxide in methanol (0.204 mL, 0.1 mmol). The mixture was stirred in a capped
     flask and heated in an oil bath at 60° C. for 8 hours. After cooling, the reaction mixture was diluted with EtOAc (25 mL), washed with 0.2N
     HCl (15 mL), water (15 mL), and brine (15 mL), dried over MgSO4, filtered, and evaporated under vacuum to afford 2-(2-hydroxyethyl)-5-
     methoxy-2-(3-oxopentyl)-1-indanone as an oil.

     Dann Parker, Ronald Ratcliffe, Kenneth Wildonger and Robert Wilkening
     Merck Sharp & Dohme
     Estrogen Receptor Modulators
     EP 1257264 B1 [0261] (14-Sep-2011)
     EXAMPLE 34, Step 2

     A solution of 2-(2-hydroxyethyl)-5-methoxy-1-indanone (105 mg, 0.51 mmol) in methanol (2.0 mL) at room temperature was treated with
     ethyl vinyl ketone (EVK, 0.102 mL) and 0.5M sodium methoxide in methanol (0.204 mL, 0.1 mmol). The mixture was stirred in a capped
     flask and heated in an oil bath at 60°C for 8 hours. After cooling, the reaction mixture was diluted with EtOAc (25 mL), washed with 0.2N
     HCl (15 mL), water (15 mL), and brine (15 mL), dried over MgSO4, filtered, and evaporated under vacuum to afford 2-(2-hydroxyethyl)-5-
     methoxy-2-(3-oxopentyl)-1-indanone (138 mg, 93% yield) as an oil.
USPTO vs Pistachio Data
• Updated quarterly
• EPO, WIPO patents
• USPTO sketches
• NameRxn Classification/AAM
   – Improved role assignment
   – NameRxn 71.5% coverage
• Example/Step Labels
• Solvent Mixtures
• Solvent associations
• Document Assignees, Targets and Diseases
• Continual tweaks based on feedback

Source               Reactions    Date
U.S. Grant Text      3,366,399    2021-05-18
U.S. Appl. Text      3,629,411    2021-05-13
WIPO PCT Text        1,520,596    2021-05-06
Euro. Grant Text     1,074,590    2021-05-12
Euro. Appl. Text       702,035    2021-05-12
U.S. Grant Sketch    1,211,521    2021-05-18
U.S. Appl. Sketch    1,834,132    2021-05-13
USPTO vs Pistachio Data
Pistachio is a “super-set” of USPTO, but not strictly so…
• NameRxn filtering/mapping
• Improved/changed name-to-structure, roles, sectioning
• Structure normalisation differences
• Whack-a-mole/pachinko machine
   – An obviously sensible change can have unforeseen consequences

[Figure: NextMove’s pachinko machine]
Regression Testing
USPTO vs Pistachio Data
• Reactions from USPTO Application Text, 2001 to 22nd Sep 2016
   – 1,939,253 CC-Zero Subset
   – 2,568,513 Pistachio
• Reactions in common (-CC-Zero only, +Pistachio only), by matching key
   – 458,995 common (-1,480,258, +2,109,518) by SMILES
   – 1,386,306 common (-552,947, +1,182,207) by ~RInChI
   – 1,465,946 common (-473,307, +1,102,567) by norm SMILES
   – 1,866,314 common (-72,939, +702,199) by paragraph Id
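The “common (-, +)” figures above are plain set arithmetic over whichever key is used to identify a reaction. A minimal sketch, assuming each dataset has already been reduced to a set of unique keys; the `compare` helper and the toy keys are hypothetical, and only the totals in the final loop come from the slides:

```python
def compare(cc_zero: set, pistachio: set) -> tuple:
    """Return (common, only_in_cc_zero, only_in_pistachio) counts."""
    common = cc_zero & pistachio
    return len(common), len(cc_zero - pistachio), len(pistachio - cc_zero)


# Toy keys (hypothetical). In practice a key would be a reaction SMILES,
# a RInChI-like string, a normalised SMILES or a paragraph id.
cc_zero = {"rxn-a", "rxn-b", "rxn-c"}
pistachio = {"rxn-b", "rxn-c", "rxn-d", "rxn-e"}
print(compare(cc_zero, pistachio))

# The reported figures follow the same arithmetic: total minus common.
total_cc_zero, total_pistachio = 1_939_253, 2_568_513
reported_common = {
    "SMILES": 458_995,
    "~RInChI": 1_386_306,
    "norm SMILES": 1_465_946,
    "paragraph Id": 1_866_314,
}
for key, common in reported_common.items():
    print(key, total_cc_zero - common, total_pistachio - common)
```

The looser the key (exact SMILES, then normalised forms, then paragraph id), the larger the overlap, which is the trend the four figures show.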
Overview of extraction
Text Extraction → Sectioning → Tagging/Tokenization → Parsing → Action Phrases → Reaction Assembly

Example from US20020133011A1 [0070]
[Figure: OCR text with a “Missed break” and an “Extra break” marked]

Examples from WO 2020/239862 A1 PatentScope OCR
[Figure: example text with tagged quantities: UnitType.Mass; UnitType.Percent, QuantityType.Yield; UnitType.Percent, QuantityType.Purity]
Action phrases extracted from the example:

Type Add
Compounds
• ethyl cyanoacetate (mass=13.56 g)
• ethyl 4-fluorocinnamate (mass=19.4 g)
• sodium ethoxide (mass=2.3 g, vol=50 ml)
Conditions
• 2-3 minutes
• 60° C.

Type Heat
Conditions
• 1 hour

Type Cool
Conditions
• 5° C.

Type Yield
Compounds
• 2-cyano-3-(flurophenyl)-glutarate (mass=23 g, yield=74%, purity=98%)
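One way to hold the action phrases shown above in code is a small typed record per action. This is only an illustrative sketch: the `Action` and `Compound` classes and their field names are hypothetical and are not Pistachio's JSON schema; the values are copied from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Compound:
    name: str
    quantities: dict = field(default_factory=dict)  # e.g. {"mass": "23 g"}


@dataclass
class Action:
    type: str                                        # Add, Heat, Cool, Yield
    compounds: list = field(default_factory=list)
    conditions: list = field(default_factory=list)


actions = [
    Action("Add",
           [Compound("ethyl cyanoacetate", {"mass": "13.56 g"}),
            Compound("ethyl 4-fluorocinnamate", {"mass": "19.4 g"}),
            Compound("sodium ethoxide", {"mass": "2.3 g", "vol": "50 ml"})],
           ["2-3 minutes", "60° C."]),
    Action("Heat", conditions=["1 hour"]),
    Action("Cool", conditions=["5° C."]),
    Action("Yield",
           [Compound("2-cyano-3-(flurophenyl)-glutarate",
                     {"mass": "23 g", "yield": "74%", "purity": "98%"})]),
]

# Print one line per action: type, compound names, conditions.
for action in actions:
    names = ", ".join(c.name for c in action.compounds) or "-"
    conditions = "; ".join(action.conditions) or "-"
    print(f"{action.type}: {names} | {conditions}")
```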
Preliminary role assignment is based on the action, the surrounding
context and dictionaries (common solvents/catalysts).
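A dictionary-driven first pass at role assignment might look like the sketch below. The word lists, the `preliminary_role` function and its precedence order are all assumptions made for illustration; the use of surrounding context mentioned above is not modelled here.

```python
# Hypothetical dictionaries; a real system would use much larger curated lists.
COMMON_SOLVENTS = {"methanol", "ethanol", "water", "etoac", "thf", "dmf"}
COMMON_CATALYSTS = {"pd/c", "pd(pph3)4", "dmap"}


def preliminary_role(name: str, action_type: str) -> str:
    """Guess a role from the action type first, then from the dictionaries."""
    key = name.lower()
    if action_type == "Yield":
        return "product"
    if key in COMMON_CATALYSTS:
        return "catalyst"
    if key in COMMON_SOLVENTS:
        return "solvent"
    return "reactant"  # default; later atom mapping may revise this


for name, action_type in [("ethyl cyanoacetate", "Add"),
                          ("Methanol", "Add"),
                          ("Pd/C", "Add"),
                          ("2-cyano-3-(flurophenyl)-glutarate", "Yield")]:
    print(name, "->", preliminary_role(name, action_type))
```

Because the assignment is only preliminary, anything that falls through to the default is best treated as provisional until the mapping step has run.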
ChEMU 2020 Evaluation Lab
                                      Exact matching                    Relaxed matching
 Run                          F1-score  Precision  Recall       F1-score  Precision  Recall
 Task 1                       0.8983    0.9042     0.8924       0.9240    0.9301     0.9181
 Task 2                       0.8977    0.9441     0.8556       n/a       n/a        n/a
 end-2-end                    0.8026    0.8492     0.7609       0.8196    0.8663     0.7777
 end-2-end (after deadline)   0.8255    0.8746     0.7816       0.8420    0.8909     0.7983
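The F1-scores in the table are the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short check of the exact-matching rows, using the values copied from the table; because the published precision and recall are already rounded, the recomputed figure can differ in the last decimal place:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (reported F1, precision, recall) from the exact-matching columns.
exact = {
    "Task 1": (0.8983, 0.9042, 0.8924),
    "Task 2": (0.8977, 0.9441, 0.8556),
    "end-2-end": (0.8026, 0.8492, 0.7609),
    "end-2-end (after deadline)": (0.8255, 0.8746, 0.7816),
}

for run, (reported, p, r) in exact.items():
    print(f"{run}: reported {reported:.4f}, recomputed {f1(p, r):.4f}")
```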

 Daniel Lowe and John Mayfield. Extraction of reactions from patents using grammars. 2020
 http://ceur-ws.org/Vol-2696/paper_221.pdf

 Nguyen D.Q. et al. (2020) ChEMU: Named Entity Recognition and Event Extraction of
 Chemical Reactions from Patents. In: Jose J. et al. (eds) Advances in Information
 Retrieval. ECIR 2020. Lecture Notes in Computer Science, vol 12036. Springer, Cham.
 https://doi.org/10.1007/978-3-030-45442-5_74
Sketch Extraction
[Figure: reaction sketch from Example 26, US 9718816 B2, processed by NextMove’s Praline into Step 1, Step 2, Step 3, Step 4, etc.]

John May, et al. Sketchy Sketches: Hiding Chemistry in Plain Sight. Seventh Joint Sheffield Conference on
Cheminformatics. 2016
Overview of Filtering/Mapping
[Flowchart; recoverable box labels in reading order]
Reaction Filtering - Text
   Rebond
   1 >= Precursors = Products = Num Precursors
Reaction Filtering - Sketch
   Specific Reaction
   1 >= Precursors
   1 >= Products
Reject
Sane
   NameRxn AAM
Mapped
   Fix Roles
ROLE FIX

1) Move all agents to reactants

2) Atom-Atom Mapping - Michael addition (3.11.92)

3) Move unmapped reactants back to agents
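
The three steps above can be sketched on a reaction SMILES of the form `reactants>agents>products`. This is a simplified illustration, not the NameRxn implementation: the atom mapper is passed in as a callable, the `toy_mapper` below is a hypothetical stand-in that returns a hard-coded mapping, and every dot-separated component is treated as its own molecule (so salts would be split).

```python
import re

MAPPED_ATOM = re.compile(r":\d+\]")  # e.g. the ":1]" in "[CH3:1]"


def fix_roles(rxn_smiles: str, atom_map) -> str:
    reactants, agents, products = rxn_smiles.split(">")
    # 1) Move all agents to reactants.
    pooled = [m for m in (reactants + "." + agents).split(".") if m]
    # 2) Atom-atom map the pooled reaction (mapper supplied by the caller).
    mapped = atom_map(".".join(pooled) + ">>" + products)
    mapped_reactants, _, mapped_products = mapped.split(">")
    # 3) Move reactants that received no atom maps back to agents.
    molecules = mapped_reactants.split(".")
    new_reactants = [m for m in molecules if MAPPED_ATOM.search(m)]
    new_agents = [m for m in molecules if not MAPPED_ATOM.search(m)]
    return ".".join(new_reactants) + ">" + ".".join(new_agents) + ">" + mapped_products


def toy_mapper(unmapped: str) -> str:
    """Hypothetical mapper: returns a fixed mapping for the demo reaction."""
    return ("[CH3:1][C:2](=[O:3])Cl.[OH:4][CH3:5].CCN(CC)CC.ClCCl"
            ">>[CH3:1][C:2](=[O:3])[O:4][CH3:5]")


# Methanol (OC) starts out wrongly listed as an agent.
fixed = fix_roles("CC(=O)Cl>OC.CCN(CC)CC.ClCCl>CC(=O)OC", toy_mapper)
print(fixed)
```

In the demo, methanol contributes mapped atoms to the product and so ends up as a reactant, while the base and solvent return to the agents.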

Why NameRxn?
•   1,543 rule-based classes: easy to update when there is a mapping disagreement

                      [Figure: 4.1.6 Cyclic Beckmann rearrangement]

•   Higher precision/lower recall
•   Originally developed for pharmaceutical ELNs (~80% coverage)
•   Pistachio coverage is ~71.5%
     – >77% for USPTO application text
•   Fast: ~380 reactions per second per core
     – A few hours to remap the entire database
     – Speed depends on backend
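As a rough check on the "few hours" figure, the quoted rate and the database size give the total core time directly. The core counts below are assumptions for illustration, and perfect scaling across cores is assumed:

```python
REACTIONS = 13_300_000   # database size quoted in the slides
RATE_PER_CORE = 380      # reactions per second per core, quoted in the slides

core_seconds = REACTIONS / RATE_PER_CORE
core_hours = core_seconds / 3600
print(f"{core_hours:.1f} core-hours")

# Wall-clock estimate for a few assumed core counts (ideal scaling).
for cores in (1, 4, 8):
    print(f"{cores} cores: {core_hours / cores:.1f} hours")
```

On these assumptions a single core needs just under ten hours, so a handful of cores brings a full remap down to a few hours.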
NameRxn - Magic functional groups
NameRxn was originally written as a classification tool; AAM is a by-product.

• For us, no answer is better than a wrong answer
    • Lowest number of wrong answers (disagreement with gold standard)

• The yellow bar is the so-called “magic group additions”, where a product atom is unmapped:
    • We didn’t know where a group came from
    • The reactant the group came from was missing
    • Stoichiometry (multiple groups from one reactant)

• Aim to indicate this better in bulk data
    • AMAP bench

Arkadii Lin et al. Atom-to-atom mapping: a benchmarking study of popular mapping algorithms and consensus strategies
https://chemrxiv.org/articles/preprint/Atom-to-Atom_Mapping_A_Benchmarking_Study_of_Popular_Mapping_Algorithms_and_Consensus_Strategies/13012679/1
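Unmapped product atoms of the kind described above can be spotted in an atom-mapped reaction SMILES with a simple scan. This is a rough sketch, not how NameRxn reports them: the regular expression is a simplified atom tokenizer for common SMILES, and the example reaction is an illustrative Grignard plus nitrile mapping written for this sketch rather than one taken from a patent.

```python
import re

# Simplified atom tokenizer: bracket atoms, two-letter halogens,
# then single-letter organic-subset and aromatic atoms.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
HAS_MAP = re.compile(r":\d+\]$")  # bracket atom ending in ":<n>]"


def unmapped_product_atoms(mapped_rxn: str) -> list:
    """Return the product atom tokens that carry no atom-map number."""
    product = mapped_rxn.split(">")[-1]
    return [tok for tok in ATOM.findall(product) if not HAS_MAP.search(tok)]


rxn = ("[CH3:1][Mg]Br.N#[C:2][c:3]1[cH:4][cH:5][cH:6][cH:7][cH:8]1"
       ">>[CH3:1][C:2](=O)[c:3]1[cH:4][cH:5][cH:6][cH:7][cH:8]1")
print(unmapped_product_atoms(rxn))
```

Here the ketone oxygen has no source among the listed reactants, which is the situation the slide calls a magic group addition.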
It’s a kind of magic…

Bromo Grignard + nitrile ketone synthesis (3.7.10)
EP0200736B1 [0072] Example 1, Step 1

[Figure: atom maps labelled “RxnMapper/Indigo” and “RxnMapper”]

Water comes from the quenching:
  “The reaction mixture is slowly poured into ice cold 10% hydrochloric acid”
  “quenched slowly with 2N aq. HCl” (different paragraph)
Symmetry/Stoichiometry

[Figure: NameRxn (8.2.2 Sulfanyl to sulfonyl) and RxnMapper atom maps for US20010000511A1 [0357]]

Handle by reusing atom-maps in the reactant

[Figure: Indigo and RxnMapper atom maps for US 03674855 A]
AMAP BENCH

Indigo:    AMAP bench: Changed: 23, Broken: 13, C-C Broken: 7
RxnMapper: AMAP bench: Changed: 5, Broken: 3, C-C Broken: 0

US20200071310A1 [0487] Example 34

Daniel Lowe, Roger Sayle. Evaluating the Quality and Performance of Automatic Atom Mapping
Algorithms. 244th ACS National Meeting & Exposition. Aug 2012
Ambiguous Names

8-(3,5-Bis-trifluoromethyl-benzoyl)-3-furan-2-yl-methyl-1-o-tolyl-1,3,8-triaza-spiro[4.5]decane-2,4-dione

Indigo/RxnMapper
AMAP bench: Changed: 4, Broken: 2, C-C Broken: 1

8-(3,5-Bis-trifluoromethyl-benzoyl)-3-furan-2-ylmethyl-1-o-tolyl-1,3,8-triaza-spiro[4.5]decane-2,4-dione

1.2.9 Alcohol + amine condensation
NameRxn/Indigo/RxnMapper
AMAP bench: Changed: 2, Broken: 1, C-C Broken: 0
Case Study
US 2020/0087299 A1, Examples 1 to 33

Mapping results shown on two consecutive slides:

            NameRxn          Indigo          Reject
First       127/154 82.4%    15/154 9.7%     12/154 7.7%
Second      132/154 85.7%    10/154 6.4%     12/154 7.7%
Example 27

Typo: “tert-butyl”

Typo: “tert-butyl”

Example 12

                 Step 1 Small Product (8 heavy atoms)

Typo: “methyl”

Data STORAGE TIPS
Hierarchical Data
Hierarchical data is indexed in the WebApp: NameRxn Tags, Assignees, Diseases (MeSH),
Targets (ChEMBL), IPC Codes

A simple way of storing and searching the data is to use a nested identifier string, e.g.
LIKE '11.%' pulls back all AstraZeneca and related companies:
                         11           AstraZeneca
                         11.5         Imperial Chemical Industries
                         11.7         MedImmune
                         ...
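
The nested-identifier lookup can be tried with SQLite from the Python standard library. This is a sketch under stated assumptions: the table and column names are hypothetical, and the `110` row is invented to show why the dot matters in the pattern. With identifiers stored exactly as listed above, `LIKE '11.%'` alone would not match the bare `11` row, so the query below also matches the parent id explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assignee (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO assignee VALUES (?, ?)",
    [("11", "AstraZeneca"),
     ("11.5", "Imperial Chemical Industries"),
     ("11.7", "MedImmune"),
     ("110", "Hypothetical unrelated company")],
)

# Parent row plus every descendant; '11.%' does not match '110'.
rows = conn.execute(
    "SELECT name FROM assignee WHERE id = ? OR id LIKE ? ORDER BY id",
    ("11", "11.%"),
).fetchall()
names = [name for (name,) in rows]
print(names)
```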

NameRxn is handled slightly differently: we pack the three-level number into an integer
                         (lvl1
NameRxn concepts and rxno
1 Heteroatom alkylation and arylation
 .7 O-substitution
   .1 Chan-Lam ether coupling
   .2 Diazomethane esterification
   .3 Ethyl esterification
   .4 Hydroxy to methoxy
   .5 Hydroxy to triflyloxy
   .6 Methyl esterification
   .n
2 Acylation and related processes
 .6 O-acylation to ester
   .1 Ester Schotten-Baumann
   .2 Esterification (generic)
   .3 Fischer-Speier esterification
   .4 Baeyer-Villiger oxidation
   .5 Yamaguchi esterification
   .6 Hydroxy to imidazolecarbonyloxy
   .7 Imidazolecarbonyl to ester
   .8 Hydroxy to acetoxy
   .9 Steglich esterification
   .n

RXNO terms shown alongside the hierarchy:
   Esterification (7)
   Chan-Lam coupling (3)
   Schotten-Baumann Reaction (9)

RXNO: http://github.com/rsc-ontologies/rxno
                            CINF 13, ACS Fall 2017, Washington, D.C.
Summary
We always welcome feedback if you spot a mistake!
   • It’s a long tail, but many things are simple changes that are fixed when rerun
   • Lots of people are “cleaning” the data; we’d rather know what was wrong and whether
     we can fix it
Plans
   • Reaction sketch compound numbers
   • Better quality indication
        • Integrate RxnMapper, AMAP bench indicators, boot-strapping
          sequences
   • Handle reactions from non-English patents
   • General procedures/example references; currently we only resolve
     compounds
Acknowledgements
 Daniel Lowe (MineSoft)
 Richard Gowers (NextMove Software)