2018 Predictive Analytics Symposium - Session 33: Commercializing a Data Science Model as Application Programming Interface (API) or Batch Service

2018 Predictive Analytics Symposium

Session 33: Commercializing a Data Science Model as Application Programming Interface (API) or Batch Service

SOA Antitrust Compliance Guidelines
SOA Presentation Disclaimer
Commercializing a Data
Science Model as API or
Batch Service

Jeffrey Heaton, Ph.D. and Ed Deuser

September 2018
Agenda

 Intro

 Operational Readiness

 Model Methodology

 Partnerships

 Example

Intro

Presenters

 Jeffrey Heaton, Ph.D. – Lead Data Scientist – RGA
   Jeff Heaton is a lead data scientist at Reinsurance Group of America (RGA), an adjunct instructor for the Sever Institute at Washington University, and the author of several books about artificial intelligence. Jeff holds a Master of Information Management (MIM) from Washington University and a Ph.D. in computer science from Nova Southeastern University. Over twenty years of experience in all aspects of software development allows Jeff to bridge the gap between complex data science problems and proven software development. Working primarily in the Python, R, Java/C#, and JavaScript programming languages, he leverages frameworks such as TensorFlow, scikit-learn, NumPy, and Theano to implement deep learning, random forests, gradient boosting machines, support vector machines, t-SNE, and generalized linear models (GLMs). Jeff holds numerous certifications and credentials, including the Johns Hopkins Data Science certification, Fellow of the Life Management Institute (FLMI), ACM Upsilon Pi Epsilon (UPE), and senior membership in IEEE. He has published his research in peer-reviewed papers with the Journal of Machine Learning Research and IEEE.

 Ed Deuser – Technical Architect and Developer – RGA
   Ed Deuser is a Technical Architect with RGA Reinsurance Company. In this role, Ed is responsible for technical solutions that support RGA’s global business units, including Valuation, Financial Solutions, Underwriting, and Global Research, Development and Analytics. He also served as the technical lead for B3i, the Blockchain Insurance Industry Initiative, and guides other digital initiatives for RGA. In addition to his experience in the insurance sector, Ed has worked in financial services, government, and law enforcement. Accomplished in the emerging field of distributed ledger technology, Ed has participated in RGA-sponsored hackathons as a coach and was part of the winning team at the Office of the National Coordinator (ONC) for Health Information Technology’s first-ever hackathon.
   Ed received his Bachelor of Science in Information Systems from the University of Missouri–St. Louis. His article “From R Studio to Real-Time Operations,” which he co-authored with RGA Lead Data Scientist Jeff Heaton, was published in the December 2017 issue of the Society of Actuaries’ Predictive Analytics and Futurism Section newsletter.

RGA Reinsurance Company
The security of experience. The power of innovation. www.rgare.com
Science is good, but how do my customers use it?

Operational Readiness

Operational Readiness

[Diagram: project lifecycle quadrants – Project Inception, Project Execution, Workload Reality, Project at Risk, Project Failure]

 Readiness occurs throughout the project, most importantly when it starts.
 End User Journey – Contract and Service Level Agreement (SLA)
 Security is the first and last thing we think of.
 Agreed-on patterns of use
  • Batch
  • Real Time
  • Web
Contract Management
The end user’s journey to a delivered service level agreement (SLA)

 Clear expectation management in contractual terms

 End user journey and expectations

 A standard service level agreement as the basis

Security in Depth
Security should be the first and last thing we think of
 Threat Modeling
  • How could it be compromised?
  • How do we protect compromised sections?

 Logging, Monitoring and Alerting
  • Forensic logging of the item to be protected and where it is housed.
  • Monitor and alert on suspicious activities and logs.

 Pen Testing
  • Contract with someone to ensure the item is protected.

“According to Microsoft, the potential cost of cyber-crime to the global community is a mind-boggling $500 billion, and a data breach will cost the average company about $3.8 million.”
API in English Please
What is an API? API stands for Application Programming Interface.

[Diagram: a client sends a cohort of 100 records (id, gender, conditions) to the API; a Compute Score service scores each record; the API returns the cohort with a score appended (id, gender, conditions, score).]
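The flow in the diagram can be sketched in a few lines of Python. The scoring rule and field names below are invented for illustration, not RGA's actual model; the point is the shape of the exchange: records in, the same records with a score appended out.

```python
# A minimal sketch of the diagram's scoring flow: a client submits a cohort
# of records, the API applies a compute-score function to each one, and
# returns the records with a "score" field appended. The scoring rule is a
# made-up illustration, not an actual mortality or severity model.

def compute_score(record):
    """Hypothetical scoring rule: more listed conditions -> higher score."""
    base = 0.50 if record["gender"] == "F" else 0.55
    return round(base + 0.05 * len(record["conditions"]), 2)

def score_cohort(cohort):
    """What the API does for the caller: each record plus its score."""
    return [dict(rec, score=compute_score(rec)) for rec in cohort]

cohort = [
    {"id": 1, "gender": "F", "conditions": ["E11"]},         # one condition
    {"id": 2, "gender": "M", "conditions": ["I10", "E78"]},  # two conditions
]
scored = score_cohort(cohort)
```

The caller never sees `compute_score`; the API boundary exposes only the request and response shapes, which is what makes the model commercializable.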
Model Development Methodology

Model Development Methodology

 Model Scoping and Business Understanding
 Data Understanding
 Data Discovery and Enrichment
 Model Fitting / Validation
 Model Deployment
Input Format for Model
For an API, input data must be strictly standardized

 Clients tend to vary the format of input data during model development.

 Columns provided might change.

 Column names might change.

 Date formats may not be consistent.

 For an automated API, this format must become consistent.

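The standardization step above can be sketched as a small normalization layer. The column aliases and date formats below are illustrative assumptions, not a complete production mapping:

```python
# A sketch of the input normalization an automated API needs: map the
# column-name variants clients send onto one canonical schema, and parse the
# date formats we have previously seen into ISO 8601. Alias and format lists
# here are invented examples.
from datetime import datetime

COLUMN_ALIASES = {"dob": "birth_date", "date_of_birth": "birth_date",
                  "sex": "gender", "member_id": "id"}
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]

def canonical_columns(row):
    """Rename known column variants to canonical names (case-insensitive)."""
    return {COLUMN_ALIASES.get(k.lower(), k.lower()): v for k, v in row.items()}

def parse_date(text):
    """Try each known date format; fail loudly on anything unrecognized."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError(f"Unrecognized date format: {text!r}")

row = canonical_columns({"member_id": "A17", "Gender": "F", "dob": "03/09/1981"})
row["birth_date"] = parse_date(row["birth_date"])
```

Failing loudly on an unknown format is deliberate: silently guessing a date layout is exactly the inconsistency the API is supposed to eliminate.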
Use Excel as a Tool, Not a Format
For tabular data, we prefer CSV (UTF-8)

 Excel is a powerful data exploration tool for rapid analysis.

 However, Excel can be a problematic data exchange format.
  • Inability to specify export encoding (UTF-8, Unicode, etc.).
  • Excel often mangles input by inferring data types, such as treating SNOMED codes as numbers.
  • Different tools generate Excel files differently.
  • Excel offers many more ways to confuse automated imports than CSV does.

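The SNOMED point above is easy to demonstrate: when you read CSV yourself, you control the encoding and nothing re-infers the types. The two-row file content below is a made-up example:

```python
# Why we prefer CSV (UTF-8) for exchange: csv.DictReader returns every value
# as a string, so an identifier such as a SNOMED code is never coerced to a
# number the way a spreadsheet import might coerce it. The data is invented.
import csv
import io

data = "id,snomed_code\n1,44054006\n2,73211009\n"
# Against a real file you would write:
#   open("cohort.csv", encoding="utf-8", newline="")
rows = list(csv.DictReader(io.StringIO(data)))
codes = [r["snomed_code"] for r in rows]  # still strings; leading zeros safe
```

Any numeric parsing then happens explicitly, per column, in code you control rather than in a spreadsheet's import heuristics.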
Input Format for Model
JSON, CSV, or XML?

 Input from the client is usually in JSON, XML, or CSV format.

 For real-time APIs, we prefer JSON/XML
  • JSON and XML provide a hierarchical view of data.
  • JSON and XML do not always easily fit into Excel.

 For batch, we generally prefer CSV (sometimes Excel)
  • CSV and Excel both store data in tabular format.

The XML Format
Verbose and Hierarchical

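The example shown on this slide did not survive extraction; a small illustrative snippet of the verbose, hierarchical style (element names invented for this example):

```xml
<cohort>
  <member id="1">
    <gender>F</gender>
    <conditions>
      <condition code="E11">Type 2 diabetes</condition>
    </conditions>
  </member>
</cohort>
```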
The JSON Format
Concise and JavaScript-like

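The example shown on this slide did not survive extraction; the same hierarchical cohort as the XML slide, in JSON's more concise notation (field names invented for this example):

```json
{
  "cohort": [
    {
      "id": 1,
      "gender": "F",
      "conditions": [{"code": "E11", "name": "Type 2 diabetes"}]
    }
  ]
}
```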
Data Discovery and Enrichment
Augmenting the input data with additional data sources

 Client input data usually will not contain all the information the model needs.
  • If the identity of the individual is known (PII), we might augment with:
     o Third-party marketing data on the individual.
     o Third-party credit data on the individual.
  • If the identity of the individual is unknown (PII-less):
     o RGA severity scores for drugs or medical diagnoses.
     o RGA mortality tables.

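The PII-less case above is essentially a keyed lookup against an internal reference table. A minimal sketch, with an invented severity table standing in for the actual RGA scores:

```python
# A sketch of PII-less enrichment: augment each client record with a severity
# value looked up by diagnosis code. The table and its values are invented
# placeholders, not actual RGA severity scores.
SEVERITY = {"E11": 2.1, "I10": 1.4, "C34": 8.7}  # hypothetical code -> score

def enrich(record):
    """Return a copy of the record with the worst-case severity appended."""
    enriched = dict(record)
    enriched["max_severity"] = max(
        (SEVERITY.get(code, 0.0) for code in record["conditions"]),
        default=0.0,  # no conditions listed -> severity 0.0
    )
    return enriched

enriched = enrich({"id": 1, "conditions": ["E11", "I10"]})
```

Unknown codes fall back to 0.0 here; a production service would more likely flag them for review than silently treat them as benign.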
Model Fitting
Teaching a model from data

 Model fitting is where a data scientist trains a model based on data.

 Fitting is usually a very manual process that can go on for days, weeks, or
  months.

 The final output from fitting is a model that can be deployed for client use.

Model Deployment
Making your model available to clients

 How will your model be used?
  •   Will the model be used directly by individual human users?
  •   Will the model be integrated into a system developed by client’s IT?
  •   Will the model be used as part of a client’s mobile application?
  •   Will end users upload files that the client then submits for batch scoring?

 Manual steps from fitting must be automated.

 Input data must be checked for errors.

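The error checking mentioned above is one of the manual fitting-time habits that must become code at deployment. A sketch, where the required fields and allowed values are illustrative assumptions:

```python
# At deployment, the checks a data scientist would do by eye must be
# automated: reject bad records before they reach the model. The schema
# below (required fields, allowed gender codes) is an invented example.
REQUIRED = {"id", "gender", "conditions"}

def validate(record):
    """Return a list of problems; an empty list means the record is scoreable."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED - record.keys())]
    if record.get("gender") not in {"F", "M"}:
        errors.append(f"bad gender: {record.get('gender')!r}")
    return errors

good = validate({"id": 1, "gender": "F", "conditions": []})
bad = validate({"id": 2, "gender": "X"})
```

Returning a list of all problems, rather than raising on the first, lets a batch response tell the client everything wrong with a record in one pass.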
Personally Identifiable Information (PII)
and Data Retention
What data should we retain? (and where)

 Some input data contains PII; some does not.

 Some clients request us to retain no data.

 We prefer to keep some data.

 We usually do not store PII data on the model side.

Ongoing Model Validation
Keeping the model relevant

 Client data distributions can change over time.

 Baseline truth can change.

 Models must be evaluated over time to ensure they remain relevant.

 Calibration is an ongoing process.

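One common way to watch for the distribution drift described above is the population stability index (PSI), which compares the share of scores falling in each bin now against the shares at deployment. The slides do not prescribe PSI specifically; this is a sketch of one standard technique, with invented bin shares and the common 0.25 rule-of-thumb threshold:

```python
# A sketch of ongoing model validation via the population stability index:
# compare the score distribution at deployment with the one observed now.
# Bin shares below are invented; 0.25 is a commonly cited drift threshold.
import math

def psi(expected, actual):
    """PSI over matching proportion bins: sum((a - e) * ln(a / e))."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]  # score-bin shares at deployment
current  = [0.05, 0.15, 0.30, 0.50]  # shares observed this period
drifted = psi(baseline, current) > 0.25
```

A PSI near zero means the population scoring today looks like the one the model was calibrated on; values well above the threshold are the trigger to revalidate or recalibrate.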
Partnerships

Know your strengths
Partnerships in place to ensure success

Questions to ask:
• Do you have data scientists in your organization?
• Are you experienced in cloud deployments?
• Can you sustain the DevOps practice?
• Do you understand where your attack vectors are?

Types of partnerships:
• Internal – partnering with different parts of your organization
• External – e.g., staff augmentation or a client partner such as RGA
Example Commercialization

Commercialization example
Example models

 SwaggerHub – create the API first: what’s on the menu?

 Upload the API to the API Gateway on AWS.

 Pre-templated Node.js Lambda to compute scores on a cohort.

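The pipeline above is: API definition (SwaggerHub) fronting an API Gateway route whose integration is a Lambda that scores the posted cohort. The slide's template was Node.js; the same handler shape is sketched here in Python for consistency with the earlier examples, with an invented scoring rule and payload schema:

```python
# A sketch of the Lambda behind the API Gateway route (the slide's actual
# template was Node.js). API Gateway's proxy integration delivers the HTTP
# body as event["body"]; the handler returns statusCode plus a JSON body.
# The scoring rule and payload fields are illustrative assumptions.
import json

def lambda_handler(event, context):
    """Score each cohort member in the posted JSON and return the result."""
    cohort = json.loads(event["body"])["cohort"]
    for member in cohort:
        member["score"] = round(0.5 + 0.05 * len(member.get("conditions", [])), 2)
    return {"statusCode": 200, "body": json.dumps({"cohort": cohort})}

# Simulate the event API Gateway would construct from a client POST.
event = {"body": json.dumps({"cohort": [{"id": 1, "conditions": ["E11"]}]})}
response = lambda_handler(event, None)
```

Because the function only ever sees a request body and returns a response body, the same model code serves both the real-time API and, wrapped in a loop over file rows, the batch service.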
Questions

Appendix

Resources to use for creating your own API

Disclaimer:
The resources provided are intended for educational purposes only and do not replace independent professional judgment.
Statements of fact and opinions expressed are those of the participants individually and, unless expressly stated to the contrary, are
not the opinion or position of Reinsurance Group of America, its cosponsors, or its committees. Reinsurance Group of America does
not endorse or approve, and assumes no responsibility for, the content, accuracy, or completeness of the information presented. The
above resources do not include all recommended security measures; where appropriate security measures are not provided, use
them at your own risk.

 https://github.com/eddeuser2017/commercialize_api
