Data Science 101 Arik Pelkey Pentaho Senior Director - Product Marketing, Hitachi Vantara Scott Cooley Pentaho Data Scientist, Hitachi Vantara

Page created by Gladys Erickson
 
Data Science 101
Arik Pelkey
Pentaho Senior Director – Product Marketing, Hitachi Vantara
Scott Cooley
Pentaho Data Scientist, Hitachi Vantara
Agenda
This session will provide an introduction to data science fundamentals.
• What is Data Science?
• Common Use Cases and Algorithms
• The Data Science Process
• Building a Data Science Team
• The Future
AI, Machine Learning, and Deep Learning

  • AI: Getting machines to do what humans are good at
  • Machine Learning: Feeding an algorithm data to learn and predict something
  • Deep Learning: A type of machine learning

Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/.
Data Science: Solving Problems with Data

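Venn diagram of three skill areas:
• Hacking skills: computer science, data engineering and wrangling, coding
• Math and statistics knowledge: algorithms and numerical techniques to derive insights
• Substantive experience: domain knowledge, business acumen, experience, value to the business

Overlaps:
• Hacking skills + math and statistics knowledge: Machine Learning
• Hacking skills + substantive experience: Danger Zone!
• Math and statistics knowledge + substantive experience: Traditional Research
• All three: Data Science

Annotation beside the diagram: understanding of the underlying assumptions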

Diagram from Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.
What’s all the fuss?
These techniques were created many years ago:

• Bayes' Theorem: Thomas Bayes, mid-1700s
• Regression: Legendre and Gauss, early 1800s (least squares); Galton, late 1800s
• Neural Networks: McCulloch and Pitts, early 1940s

Think about All Our Data and Compute
• It is still GROWING!
• SKA (Square Kilometer Array Telescope), 2020: will generate as much data in a day as the entire PLANET does in a year!
https://www.computerworld.com.au/article/392735/ska_telescope_generate_more_data_than_entire_internet_2020/.
Types of Machine Learning

• Regression: Looking for a statistical relationship across variables that may give us an estimate of a particular outcome.
• Classification: Similar to regression but looking for separations in the data given predefined classes. (Supervised)
• Clustering: Do not have predefined classes but trying to find groups or sets based upon data at hand. (Unsupervised)
• Anomaly Detection: Identification of outliers based upon expected ranges of data.
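
As a rough illustration of anomaly detection based on expected ranges, the sketch below flags values that fall outside a range derived from the data itself. The range rule (Tukey's 1.5 x IQR fences) and the sensor readings are assumptions chosen for the example; the slides do not prescribe a specific method.

```python
from statistics import quantiles


def expected_range(values, k=1.5):
    """Return (low, high) bounds using Tukey's IQR rule (an assumed choice)."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return every value that lies outside the expected range."""
    low, high = expected_range(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings; one value is far from the rest.
readings = [20.1, 19.8, 20.4, 20.0, 19.9, 20.3, 35.0, 20.2]
print(find_outliers(readings))
```

A wider or narrower range can be tried by changing `k`; a domain expert would normally decide what counts as "expected".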

Labelled vs Unlabelled
Let's say we want to classify houses by size.

Supervised Learning
Use the labels to build a model. The model is then used to classify a new house's size based ONLY on the known feature set.

       Given features (feature set)                 Label
       FullBath  HalfBath  Bedrooms  Home Age       Size
          1         0         2         56           M
          1         1         3         59           L
          2         1         3         20           M
          2         1         3         19           S

Unsupervised
SIZE is missing! We need to look for similarities in the data and group them into clusters.
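
The contrast can be sketched with the four rows from the table. The supervised half uses a 1-nearest-neighbour classifier and the unsupervised half uses a small k-means loop; both algorithms, the `new_house` row, and the choice of two clusters are assumptions made for illustration, not methods named in the slides.

```python
from math import dist

# Rows from the slide: (FullBath, HalfBath, Bedrooms, Home Age)
features = [(1, 0, 2, 56), (1, 1, 3, 59), (2, 1, 3, 20), (2, 1, 3, 19)]
labels = ["M", "L", "M", "S"]  # the Size column


def classify_1nn(train_x, train_y, new_x):
    """Supervised: return the label of the closest labelled row."""
    nearest = min(range(len(train_x)), key=lambda i: dist(train_x[i], new_x))
    return train_y[nearest]


def kmeans(points, k, iterations=10):
    """Unsupervised: group rows into k clusters without using any labels."""
    centroids = [tuple(map(float, p)) for p in points[:k]]  # naive init: first k rows
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each row to its nearest centroid.
        assignment = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


new_house = (2, 1, 3, 21)  # hypothetical unlabelled house
print(classify_1nn(features, labels, new_house))  # uses the Size labels
print(kmeans(features, k=2))                      # ignores the Size labels
```

The features are left unscaled here, so Home Age dominates the distance; in practice the columns would usually be normalised first.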
More on Machine Learning
Machine learning is a methodology for creating a model from sample data and then using that model to make a prediction or choose a strategy algorithmically.

SUPERVISED LEARNING MODEL
• Start with historical records that contain square feet, number of bathrooms, zip code, and so on.
• Add records that contain the price the house sold for.
• Iterate the algorithm over the combined data to train the model.
• Use the trained model to predict the outcome on new records.
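
The train-then-predict steps can be sketched with a one-feature linear regression. This is a minimal example under stated assumptions: the records are invented, only square feet is used as a feature, and a closed-form least squares fit stands in for the iterative training the slide describes.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply the trained (slope, intercept) pair to a new record."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical records: square feet and the price each house sold for.
square_feet = [1000, 1500, 2000, 2500]
sale_price = [200_000, 290_000, 410_000, 500_000]

model = fit_line(square_feet, sale_price)  # train on features plus known outcomes
print(round(predict(model, 1800)))         # predict the outcome for a new record
```

A real model would use more features (bathrooms, zip code) and hold back some records to check how well the predictions generalise.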
The Data Science Process: Getting from Raw Data to Outcomes
• Formal framework: CRISP-DM (Cross Industry Standard Process for Data Mining)
• The Data Science Workflow: created by Joe Blitzstein and Hanspeter Pfister for the Harvard data science course
Specialist Traditional Data Science Team
   Data Scientist (DS)
     – Prepares data and engineers features. Most valuable skill: training models.

   Data Engineer (DE)
     – Focuses on data acquisition and builds data pipelines. A 5:1 DE:DS ratio is not uncommon.

   Data Analyst (DA)
     – Assists the DS with data prep.

   Application Architect (AA)
     – Designs the complete solution; deploys and maintains models in production.
Mythical Creatures
Trends
• Automation
• Tools for Citizen Data Scientists
• Pre-trained models in the cloud

Hiring Guidance

Defining Success
• Easy for the tangible
     – Search order optimization
     – Recommendation engine or click-through rate (CTR)
• Hard for others
     – Lead scoring
     – Attrition

• Try to measure direct outcomes
• Rarely a silver bullet
• Think ROI
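
One way to make "measure direct outcomes" and "think ROI" concrete is a small calculation like the one below. All numbers are hypothetical, and the helper names `ctr` and `roi` are introduced only for this sketch.

```python
def ctr(clicks, impressions):
    """Click-through rate: the share of impressions that led to a click."""
    return clicks / impressions


def roi(gain, cost):
    """Return on investment as a fraction: (gain - cost) / cost."""
    return (gain - cost) / cost


# Hypothetical A/B test of a recommendation engine.
baseline = ctr(clicks=200, impressions=10_000)
with_model = ctr(clicks=260, impressions=10_000)
lift = (with_model - baseline) / baseline  # relative improvement in CTR

# Hypothetical project economics.
project_roi = roi(gain=150_000, cost=100_000)

print(f"CTR lift: {lift:.0%}, ROI: {project_roi:.0%}")
```

The tangible cases on this slide fit this pattern well; for lead scoring or attrition, the "gain" term is much harder to attribute to the model.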

Typical Data Science Project

Stage                                   Roles involved
Understand business objectives          DS, AA
ID and procure training data            DS, DE, AA
Prepare data and build new features     DS, DA
Train model                             DS
Deploy models                           AA
Update models                           DS
Preventive Maintenance:
Caterpillar
Marine Asset Intelligence

Data sources:
• Fleet data via satellite
• Local equipment sensor and server data
• Cross-department operations data (scheduling/ERP)

Pipeline: data integration stages load the data marts.

Consumers:
• Data scientist: data mining and predictive maintenance
• Business user (COO): reporting on operations and efficiency
• Dashboards and reports on machine performance (onboard and onshore)
The Future
• Scaling up / enabling more data scientists
• Model management
• Improved productivity
• Support for containerized applications

Pentaho ML Orchestration

• Makes data science teams more productive
• Broad support for open source libraries in various languages
Summary
• What is Data Science?
• Common Use Cases and Algorithms
• The Data Science Process
• Building a Data Science Team
• The Future
Next Steps
Want to learn more?
• Schedule a Meet the Expert
• Read Mark Hall’s Machine Learning with Pentaho Blog