INTEGRATE MACHINE LEARNING MODELS WITH PYTHON AND MICROSTRATEGY

 
Thank you for participating in a workshop at
 MicroStrategy World 2019. If you missed or did
 not finish an exercise and want to complete it
 after the conference, use this workbook and
 access the supporting files at
 microstrategy.com/world-2019-workshops

*Workshop files will expire 4/30/2019
Integrate Machine Learning Models with Python and MicroStrategy

                    MicroStrategy offers features, including out-of-the-box functions and basic
                    algorithms, that enable analysts and data scientists to use machine learning to
                    extract meaningful insights from their data for a variety of use cases and
                    problems. However, data scientists often rely on a wide range of tools, especially
                    open-source coding languages like R and Python. To support those tools, data
                    scientists can now code in the language of their choice and continue to use
                    MicroStrategy via open-source packages.

                    The purpose of this session is to show you how MicroStrategy and Python can
                    work together to produce machine learning results within the context of business
                    intelligence. In this workshop, you will:

                    •   Learn what actually happens when building a machine learning model and
                        explore a framework for thinking about the model building life cycle.

                    •   Train a deep learning network to predict flight delays in Python.

                    •   Learn why a BI system is a core piece of the technology stack that enables
                        data science teams to be successful.

Machine Learning 101

 Broad definition
                    Machine learning (ML) can be loosely defined as statistical and mathematical
                    techniques that allow computer systems to learn from data.

© 2018 MicroStrategy, Inc.
                Machine learning implies that the performance of a specific task is progressively
                improved. To achieve this, different algorithms can be exposed to historical data
                to create a trained model, and then tested on unseen data to evaluate how well
                the model performs.
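                This train-on-historical-data, evaluate-on-unseen-data loop can be sketched with
                scikit-learn and the iris data used in the next exercise. This is a minimal
                illustration, not the workshop's own model:

```python
# A minimal sketch of the train-then-evaluate loop (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Expose the algorithm to historical data (the training split)...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# ...then test the trained model on unseen data (the test split).
accuracy = model.score(X_test, y_test)
print(f"Accuracy on unseen data: {accuracy:.2f}")
```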

Common examples
                Examples of ML are more common than you think. Some you may be aware of:

                •   Selecting the next song in a playlist on a streaming service.
                •   Granting or denying a loan when you apply.
                •   Curating the news feed on a social media site.
                •   Product placement ads in your browser.

Exercise 1: Training with iris data
                To get our feet wet with machine learning, let’s look at an example with a dataset
                often used to introduce data science techniques: the iris dataset. This data, shown
                below, contains the dimensions of the sepals and petals of a flower, and the
                species these sets of measurements belong to within the iris genus.

                     The dimensions of an iris can be used to learn how to classify it into the
                     species it belongs to. Let’s use an interactive site to explore and observe the
                     data points.

                    1 Open a browser and navigate to:
                       https://plot.ly/~AmenRadix/128.embed
                             The page displays a graph similar to the image below. You can
                             intuitively identify three clusters.

                             Without having to create any algorithms, your brain trained itself to
                             view the clusters. Our brains are good at abstracting up to three
                             dimensions when we have a visual representation of the model. But
                             what happens when you have five dimensions—or ten? It becomes
                             much harder. Now imagine a weather system where thousands of
                             factors are taken into account.

                              This is why it is important to have a framework to manage, train, and
                              evaluate models. Among those frameworks, many data scientists
                              operate using CRISP-DM, discussed below.


CRISP-DM framework
                Companies around the world use machine learning to create insights into their
                businesses. But how does one create a machine learning system?

                Many data science teams use a process framework, CRISP-DM, to guide their work.
                CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It lays out
                the fundamental steps common to nearly every machine learning project: business
                understanding, data understanding, data preparation, modeling, evaluation, and
                deployment.

                The process is a cycle, and starts with business understanding.

Business understanding
                Business understanding is about trying to identify both the core drivers of and
                problems with your business. In this part of the process, it is vital to spend time
                examining the business problems that might be good candidates to approach with
                machine learning.


                    You may find it helpful to imagine what it would be like to solve the business
                    problem you’re working on with machine learning. What would it mean to use
                    algorithms to help find a solution? How would they be adapted internally?

 Data understanding
                    The next step is data understanding. Once we’ve identified our problem, we need
                    to take inventory of the data that might be useful for analysis. We need to seek
                     out high-quality, reliable, and reproducible sources of data. We also need to
                     spend a lot of time understanding what the data contains, and more importantly,
                     what it doesn’t contain.

                    Sometimes at this stage, you need to go back to the business understanding step
                    and re-examine the problem in light of the available data. You’ll see that this back
                    and forth behavior is common in the CRISP-DM framework and in data science
                    projects in the real world.

 Data preparation
                    Many data scientists use a rule of thumb that you should expect to spend about
                    80% of a data science project on data preparation. This includes data clean-up,
                    creating new variables, writing code to extract data from databases, and
                     reorganizing the data into the structure that machine learning algorithms
                     require.

                    This is a critical step of the project because you’re building the “plumbing” that
                    every subsequent step in the machine learning process relies on. It is not
                    uncommon to get to the modeling step of a project only to realize that something
                    critical was missed during data preparation. For this reason, many data scientists
                    build automated pipelines to manage data preparation.
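                     The flight data below is made up for illustration, but a minimal pandas sketch
                     of these prep steps (clean-up, a new variable, and reshaping for an algorithm)
                     might look like:

```python
import pandas as pd

# Hypothetical raw flight data with missing values.
flights = pd.DataFrame({
    'airline': ['AA', 'DL', None, 'AA'],
    'departure_delay': [5.0, None, 12.0, 0.0],
})

# Clean up: fill missing values.
flights['airline'] = flights['airline'].fillna('UNKNOWN')
flights['departure_delay'] = flights['departure_delay'].fillna(
    flights['departure_delay'].median())

# Create a new variable (feature) from existing data.
flights['was_delayed'] = (flights['departure_delay'] > 0).astype(int)

# Reorganize into the numeric matrix most algorithms expect.
X = pd.get_dummies(flights[['airline', 'departure_delay']])
print(X.shape)  # one row per flight, one column per numeric feature
```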

 Modeling
                    In the modeling phase, we choose from among hundreds of algorithms that might
                    work for our problem. For example, in a time-series forecasting problem, you
                    might want to use a moving-average based model or something that takes
                    seasonality and time dependence into account, such as ARIMA.


                It’s common to decide on a basket of modeling approaches, rather than relying on
                just one, and to quantitatively evaluate which model is best.
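                As a sketch of this basket approach, scikit-learn's cross-validation can score
                several candidate models the same way so they can be compared quantitatively.
                The iris data stands in for a real problem here:

```python
# Evaluate a basket of candidate models with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
basket = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(random_state=0),
    'k_nearest_neighbors': KNeighborsClassifier(),
}

# Score each candidate identically, then pick the best performer.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in basket.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```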

Evaluation
                In the evaluation stage, you assess how each of the algorithms performed using an
                objective scoring approach. You may have heard of r-squared or mean squared
                error, for example. These are evaluation metrics used to help data scientists
                understand whether the algorithm has been successful in generalizing the
                problem.

                Another important part of this step is checking the quality of the model in
                business terms. This means that we revisit our business problem and ask ourselves
                if the model will be helpful in our business context.
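                The metrics named above are available directly in scikit-learn. The actual and
                predicted values below are made up to show the calls:

```python
# Computing common evaluation metrics with scikit-learn.
from sklearn.metrics import mean_squared_error, r2_score

y_actual = [3.0, 5.0, 7.5, 10.0]     # illustrative ground truth
y_predicted = [2.8, 5.4, 7.0, 9.6]   # illustrative model output

mse = mean_squared_error(y_actual, y_predicted)
r2 = r2_score(y_actual, y_predicted)
print(f"MSE: {mse:.3f}, R-squared: {r2:.3f}")
```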

Deployment
                Finally, we have to deploy our model! The real return on investment from
                machine learning comes from organizations successfully deploying their models
                into production and integrating them in the decision-making fabric of their
                organization.

                Now that we are familiar with the CRISP-DM framework, let’s put it in action for a
                specific problem: predicting flight delays.


   Walkthrough: Business understanding

                     Let’s assume we are members of the analytics team at a major US airport. We have
                     data on every inbound and outbound flight for an entire year—over 5 million flights.
                      We also have the outcome of each flight: whether it was canceled, delayed, or
                      left on time.

                     Our primary goal is to train a model that predicts the probability that each flight
                     will be canceled, delayed, or leave on time. Our secondary goal is to use those
                     predictions to display this information to passengers, so they are proactively
                     informed about their flight’s status.
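                      The deep learning model comes later in the workshop, but the shape of the
                      prediction target can be sketched with a classical classifier on synthetic
                      stand-in data: one probability per outcome, per flight:

```python
# Sketch of multiclass probability output (synthetic data, not the real flights).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))        # stand-in flight features
y = rng.integers(0, 3, size=300)     # 0=canceled, 1=delayed, 2=on time

model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X[:1])   # one row: probability of each outcome
print(proba.round(3))                # three probabilities summing to 1
```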

                     For this workshop, we will be using four datasets. We will also pull external region
                     data into MicroStrategy to create a new Intelligent Cube.


                •   Flights: This dataset contains data on over 5 million individual flights from
                    2015. The data contains dates and times for each flight, flight destination,
                    flight departure airport, the airline, and other core data.

                •   Airlines: This dataset contains 14 airline names and their corresponding
                    airline codes.

                •   Airports: This dataset contains data on 322 airports, along with their state,
                    city, latitude, and longitude.

                •   US States: This dataset contains information about each of the 50 US states
                    and territories.

                At this step of the process, it’s good practice to start making a list of potential
                features—new variables—to add that are not available in the raw data. Later in
                the data prep process, we’ll add that data to the raw data using Python functions.

                A good initial hypothesis is that weather—especially winter weather—has a lot to
                do with flight cancellations. Another hypothesis is that there are groups of states
                in regions like the northeast that experience intense winter weather, and
                therefore might have a higher chance of experiencing flight delays. We will use
                region codes to help our algorithm learn that some states share similar weather
                patterns.
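                The region-code feature can be sketched in pandas. The state-to-region mapping
                below is illustrative, not the workshop's actual mapping:

```python
import pandas as pd

# Hypothetical mapping from state to region (illustrative values only).
region_map = {'NY': 'Northeast', 'MA': 'Northeast',
              'FL': 'South', 'CA': 'West'}

flights = pd.DataFrame({'origin_state': ['NY', 'FL', 'MA', 'CA']})

# Add the new feature so the algorithm can learn shared weather patterns.
flights['origin_region'] = flights['origin_state'].map(region_map)
print(flights)
```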

Walk-through: Data preparation
                We’ll spend a fair amount of time in the data preparation stage. There are a few
                sub-steps in the data prep phase.

                •   Getting data
                •   Data preparation
                •   Data splitting

Get data
                The first step in our process is to physically get data to train our model with. Let’s
                discuss how this is done when MicroStrategy is the data source for training the
                model.


                      Using MicroStrategy as the data source
                     We’ll need a way to connect our machine learning server with our BI system and
                     extract the data. Using the MicroStrategy REST API, we’ll extract the data in our
                     MicroStrategy cubes and make a copy of the data locally to train our model.

                     We’ll use Python and a few popular machine learning packages throughout this
                     workshop.

                     We’ll use scikit-learn and its helper functions to help structure our data and
                     evaluate the accuracy of our final models. We will also use Keras, which is a neural
                     network library that interfaces with Google’s TensorFlow library, to create and
                     train our model.

  Exercise 1: Get your own ML environment
                      To start, we need to connect to a MicroStrategy on AWS environment that’s
                      been pre-configured with the ML tools and data needed to complete the
                      exercises below.

                      Access the provisioning console

                      1    Navigate to the MicroStrategy on AWS provisioning console at:
                     https://provision.customer.cloud.microstrategy.com/
                      2    On the provisioning console login page, enter the credentials provided
                           below:


                              •   Username: Cloudmicrostrategy@gmail.com

                              •   Password: workshopmstr!

                   3   Find the environment with the number that your instructor provided you
                       with at the beginning of the workshop.

                    4   In the Actions section, select the ellipsis icon and click Edit Contact.

                   5 In the Edit Contact Information window, replace the information in the
                     boxes with your first name, last name, and email address. Then click
                     Apply. You will receive an email with your environment credentials.


                              Make sure to use an email address that you can access
                              immediately, as you will be sent an email with your environment
                              credentials.
                      6   From the MicroStrategy on AWS email, select Access MicroStrategy
                          Platform. Log in with your MicroStrategy Badge or enter your credentials.

                     7   On the landing page, scroll down and hover your cursor over
                         Remote Desktop Gateway, then click the Launch icon that is
                         displayed.

                     8 In the Remote Desktop Connection window, in the Username and
                       Password boxes, type the user name and password listed in your
                       Welcome to MicroStrategy on AWS email.

                     9 Click Login.

                     10 On the web page, under All Connections, click Developer Instance RDP.


                                Your remote desktop session opens. Complete the rest of this
                                workshop in this environment.

Exercise 2: Review datasets and start analysis
                Review the datasets in MicroStrategy

                1     Log into MicroStrategy Web using the credentials you received in
                      your MicroStrategy on AWS email:

                      https://env-XXXXX.trial.cloud.microstrategy.com/MicroStrategy/servlet/mstrWeb

                      The XXXXX above represents the environment number you received in your
                      Welcome to MicroStrategy on AWS email.

                2        Click MicroStrategy Tutorial.

                3        Select Go to MicroStrategy Web in the right corner of the screen.

                4        Follow these steps to create three MicroStrategy cubes from .csv files and
                         retrieve their IDs. These steps must be performed for each of the three
                         datasets (Airlines, Airports, and States) we want to import into
                         MicroStrategy. We will use Airlines as an example.

                      a       Click Create, then Add External Data.


                       b       Click File from Disk.

                      c        Click Choose Files.

                     d         Navigate to C:\Users\mstr\Desktop\Demo\Data\Raw, and click
                               airlines.csv. Click Open.
                      e        For each of the MicroStrategy cubes (Airlines, Airports, and States), do
                               the following:

                            a     Save in Shared Reports under the name airlines, airports, or
                                  states. Pay attention to the name, as casing is important: save
                                  the dataset name in lowercase for all three of them.

                           b     Review the data at the bottom of the screen.

                           c     Close the window.

                           d     Right-click the cube and select Properties.

                           e     Copy the ID and click OK.

                           f     Search for Notepad++ in the search bar. Open a Notepad++
                                 document, paste the ID and write the name of the cube beside it.
                                 You’ll use this information later.

                     Your Notepad++ document should resemble the following:


                We will conduct our analysis via a Windows server in the environment we
                previously configured, using a Jupyter Notebook (installed through the Anaconda
                distribution) to run the Python scripts in a web browser. Jupyter Notebooks are
                interactive, showing code output in real time and making troubleshooting easier.

                1 Click the Start menu and start typing the word Anaconda.

                2 When you see Anaconda Prompt, right-click it and select Run as
                  administrator.

                3 Click Yes to accept the message that opens.

                4 To ensure that we’re working in the correct directory, type cd C:\Users\mstr\
                  and press Enter.

                5 Type jupyter notebook, then press Enter.


                     If this is your first time using a Jupyter Notebook, here is a short introduction:

                     •     Each cell contains a snippet of Python code to run.
                     •     To run a cell, click to select it and then either press Shift+Enter on the
                           keyboard or click Run in the toolbar.

                     •     You can use the + in the toolbar to add a cell to the notebook for your own
                           code.

                     •     You can use the Up and Down arrows to move a cell in the notebook.
                      •     When a cell completes its processing, a number appears in the brackets [ ]
                            on the left. While it’s processing, you will see a star (*). You can also
                            look at the circle at the top right of the page next to the name Python 3:
                            an empty circle means the kernel is ready, while a full circle means it’s
                            busy processing.


                      •   If there is output from the code in a cell, it appears below the cell.

                      •   If you want to clear the output of a cell, use the Cell menu under Current
                          output or All output.

                      •   Use the Kernel menu to reset the notebook to its initial state.

                6 Click Desktop, then click Demo, then Code, then Notebooks.
                     This folder contains the Python scripts we’ll use in our analysis.

                7 Click 01_prep_raw_cubes.ipynb.
                     The notebook will open.


                     Import the cubes from MicroStrategy

                     Let’s examine the code as we execute the pieces of the script.

                     1 Locate the cell containing:

                                   import warnings
                                   warnings.simplefilter('ignore')

                                   import pandas as pd
                                   import os
                                   import sys
                                   import time

                                   from mstrio import microstrategy

                                   home_dir = 'C:/Users/mstr/Desktop/Demo'

                     The first two lines disable warnings so we won’t be distracted during this
                     workshop.

                     The next few lines initialize some of the libraries needed to create the
                     connection between the Intelligence Server and our server.

                     The next line loads the MicroStrategy mstrio library, which uses
                     MicroStrategy’s REST API to connect Python and the Intelligence Server.

                     The last line defines our home directory, the location we will run our files
                     from.

                2 Execute the cell by pressing Shift+Enter, or click Run.

                3 Locate the next cell containing:

                                   # API / server params
                                   username = "mstr"
                                   password = "password"
                                   base_url = "https://env-XXXXX.customer.cloud.microstrategy.com/MicroStrategyLibrary/api"
                                   project_name = "MicroStrategy Tutorial"  # Tutorial Project
                              Best Practice: This cell contains a series of variables used to connect
                              to your Intelligence Server, including your user credentials. It is a
                              good practice to define variables this way, as changing them will allow
                              you to adapt the code to another environment quickly.


                     4 Replace the password value with the password you received in your
                       MicroStrategy on AWS email.

                     5 Replace XXXXX with your environment number.

                     6 Execute the cell.

                     7 Locate the cell containing:

                                  #conn = microstrategy.Connection(base_url=base_url, username=username, password=password, project_name=project_name)
                                  conn = microstrategy.Connection(base_url, username, password, project_name)
                                  conn.connect()

                        This sends a request to the MicroStrategy Intelligence Server establishing a
                        REST API connection between our machine learning server and the Intelligence
                        Server. The mstrio library takes care of managing the authentication token and
                        cookies needed to access the REST API server.

                     8 Execute the cell.

                     9 Locate the cell containing:

                                  # Cubes to download
                                  cube_names = ['airlines', 'airports', 'states']
                                  cube_ids = ['C983680C11E8D236B87F0080EF35FE86',
                                              'EADC795811E8D236B83F0080EF15BE86',
                                              '5946FA1211E8D237BB800080EFB5FF89']

                        This cell builds lists of the cube names and IDs that we need to download from
                        the MicroStrategy environment to our machine learning server. The two lists
                        must be in the same order: because the first value in cube_names is
                        "airlines," the first value of cube_ids must be the ID of the
                        airlines cube, and so on.
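                        The ordering requirement can be illustrated with a minimal sketch (the IDs
                        below are placeholders, not the workshop's real cube IDs):

```python
# Minimal sketch of how zip() pairs names with IDs positionally.
# The IDs below are placeholders, not real MicroStrategy cube IDs.
cube_names = ['airlines', 'airports', 'states']
cube_ids = ['ID_A', 'ID_B', 'ID_C']

# Each tuple is (id, name); position in the two lists determines the pairing.
pairs = list(zip(cube_ids, cube_names))
print(pairs)  # [('ID_A', 'airlines'), ('ID_B', 'airports'), ('ID_C', 'states')]
```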

                     10 Execute the cell.

                     11 Locate the cell containing:


                #Persist cubes on disk
                for cube_id, cube_name in zip(cube_ids, cube_names):
                    print("Fetching the " + cube_name + " cube from the Intelligence Server..." + "\n")
                    cube = conn.get_cube(cube_id=cube_id)
                    cube.drop(labels=list(cube.filter(like='Row Count')), inplace=True, axis=1)
                    cube.columns = cube.columns.str.lower()
                    cube.columns = cube.columns.str.replace(' ','_')
                    print("Preview of the data:")
                    print(cube.head())
                    print("\n")
                    time.sleep(3)

                    #Adjust for missing data
                    if cube_name=='airports':
                        cube.ix[cube.iata_code == 'ECP', ['latitude', 'longitude']] = "30.3416666667", "-85.7972222222"
                        cube.ix[cube.iata_code == 'UST', ['latitude', 'longitude']] = "29.95861111", "-81.33888888"
                        cube.ix[cube.iata_code == 'PBG', ['latitude', 'longitude']] = "44.65083", "-73.46806"

                    print("Saving " + cube_name + " data locally..." + "\n")
                    with pd.HDFStore(os.path.join(home_dir,'Data\\clean.h5')) as hdf:
                        hdf.append(key=cube_name, value=cube)

                # Close MicroStrategy connection
                conn.close()


                      This cell is long, so let's split the explanation. Focus first on the following lines:

                #Persist cubes on disk
                for cube_id, cube_name in zip(cube_ids, cube_names):
                    print("Fetching the " + cube_name + " cube from the Intelligence Server..." + "\n")
                    cube = conn.get_cube(cube_id=cube_id)
                    cube.drop(labels=list(cube.filter(like='Row Count')), inplace=True, axis=1)
                    cube.columns = cube.columns.str.lower()
                    cube.columns = cube.columns.str.replace(' ','_')
                    print("Preview of the data:")
                    print(cube.head())
                    print("\n")
                    time.sleep(3)

                      This is a loop that iterates through the paired lists of cube names and IDs.
                      A line is printed in the notebook to tell us which cube is being retrieved.

                      The next line uses our conn object (our connection to MicroStrategy through
                      REST) to extract the cube data from the MicroStrategy Server into Python,
                      where it is stored in a pandas dataframe. The code then drops any Row Count
                      columns from the cube.

                      The next two lines convert every column name in the cube to lowercase and
                      then replace spaces with underscores.
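                      As a minimal sketch of this renaming step (the column names below are
                      invented for illustration):

```python
import pandas as pd

# Hypothetical cube with column names like those returned by MicroStrategy.
cube = pd.DataFrame(columns=['IATA Code', 'Latitude', 'Longitude'])

# Lower-case every column name, then replace spaces with underscores,
# mirroring the two str operations in the notebook cell.
cube.columns = cube.columns.str.lower()
cube.columns = cube.columns.str.replace(' ', '_')

print(list(cube.columns))  # ['iata_code', 'latitude', 'longitude']
```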

                      Then, three print statements offer us some feedback inside the notebook
                      itself. Note the use of the time.sleep() function. The sleep command waits
                      three seconds before moving on to the next cube. This is done deliberately to
                      allow us slow humans to read the output on screen. Focusing on the next
                      portion of this cell:

                    #Adjust for missing data
                    if cube_name=='airports':
                        cube.ix[cube.iata_code == 'ECP', ['latitude', 'longitude']] = "30.3416666667", "-85.7972222222"
                        cube.ix[cube.iata_code == 'UST', ['latitude', 'longitude']] = "29.95861111", "-81.33888888"
                        cube.ix[cube.iata_code == 'PBG', ['latitude', 'longitude']] = "44.65083", "-73.46806"

                     This code executes within the for loop. For the airports cube, it fills in the
                     latitude and longitude of three airports to correct some data quality errors.
                     This is performed here instead of at the source because you may not always
                     have access to the source data.

                     The next few lines in the cell are as follows:

                    print("Saving " + cube_name + " data locally..." + "\n")
                    with pd.HDFStore(os.path.join(home_dir,'Data\\clean.h5')) as hdf:
                        hdf.append(key=cube_name, value=cube)

                     This section tells the user that we are about to save the data from the
                     current cube to our server.

                     Finally, we write to a file called clean.h5, joining the home directory path
                     defined earlier to build the full path. The file uses HDF5, a file-based
                     format well suited to storing large tabular datasets.

                     The last lines of the cell are:

                    # Close MicroStrategy connection
                    conn.close()


                     These lines terminate the API session after the data has been moved
                     successfully from the Intelligence Server to our machine learning server.

                 12 Execute the cell. Keep an eye on the output as each cube is fetched,
                    previewed, and saved.


                13 In File Explorer, navigate to C:\Users\mstr\Desktop\Demo\Data to locate the
                   file clean.h5.

                Notice the small size of the file—the data from our three cubes was not very large.
                This will change with the flight data we are about to load.

                Import the local flights data

                Our next notebook will walk us through loading a local file containing our flight data.

                1 Return to the main Jupyter notebook tab.

                2 Click 02_prep_raw_flights.ipynb.
                     The notebook opens.


                        Let’s walk through the cells and execute them together.

                     3 Locate the cell containing the following code:

                                 import pandas as pd
                                 import numpy as np
                                 import os
                                 import sys

                                 home_dir='C:/Users/mstr/Desktop/Demo'

                        As before, this cell imports a few necessary libraries, including Pandas and
                        numpy, to be used in our code, as well as setting our home directory. We are
                        not loading mstrio here, as this data is not located on an Intelligence Server.

                     4 Execute the cell.


                5 Locate the cell containing this code:

                              flights = pd.read_csv(home_dir+'/Data/Raw/flights.csv')
                              flights.columns = flights.columns.str.lower()
                              flights.columns = flights.columns.str.replace(' ','_')

                      The first line reads the flight data from a CSV file stored locally.

                      The next two lines make every column name lowercase and replace spaces
                      with underscores. A warning is displayed; you can ignore it.

                6 Execute the cell.

                7 Locate the cell containing:

                              flights.head()

                     This line displays a few rows from the flights dataset in the notebook.

                8 Execute the line.
                      The output below the cell shows the first few rows of the flights dataset.

                9 Locate the cell containing this code:


                            flights.origin_airport = flights.origin_airport.astype(str)
                            flights.destination_airport = flights.destination_airport.astype(str)

                            with pd.HDFStore(os.path.join(home_dir, 'Data\\clean.h5')) as hdf:
                                airports = hdf.get(key="airports")

                            # Drop flights to/from airports that are not in the airports list
                            flights = flights.ix[np.isin(flights.origin_airport, airports.iata_code), :]
                            flights = flights.ix[np.isin(flights.destination_airport, airports.iata_code), :]

                            # Drop (6) flights without scheduled time
                            flights = flights.ix[np.isnan(flights.scheduled_time)==False, :]

                            # Add a unique ID for each flight
                            flights['FL_ID'] = ["FL_" + str(x) for x in np.arange(0, len(flights))]

                            # Delete critical future leak. Do not give model info it is looking for.
                            drop_cols = ['departure_time', 'taxi_out', 'wheels_off', 'elapsed_time',
                                         'air_time', 'wheels_on', 'taxi_in', 'arrival_time',
                                         'arrival_delay', 'diverted', 'cancellation_reason',
                                         'air_system_delay', 'security_delay', 'airline_delay',
                                         'late_aircraft_delay', 'weather_delay', 'year', 'tail_number']
                            flights.drop(drop_cols, inplace=True, axis=1)

                     Let’s digest this long cell in smaller chunks. Focus on the following lines:

                              flights.origin_airport = flights.origin_airport.astype(str)
                              flights.destination_airport = flights.destination_airport.astype(str)

                      These lines change the data type of the origin and destination airport
                      columns to text, known as strings, so every code keeps its original form.
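                      A minimal sketch of why the cast matters, using hypothetical codes (some
                      raw airport identifiers look numeric):

```python
import pandas as pd

# Hypothetical mix of alphabetic and numeric-looking airport codes;
# casting to str gives every value a uniform text type for comparisons.
codes = pd.Series(['JFK', 'LAX', 10397])
codes = codes.astype(str)

print(codes.tolist())  # ['JFK', 'LAX', '10397']
```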

                              with pd.HDFStore(os.path.join(home_dir, 'Data\\clean.h5')) as hdf:
                                  airports = hdf.get(key="airports")

                     These lines retrieve the airports table from the clean.h5 file we saved in the
                     previous script.

                              # Drop flights to/from airports that are not in the airports list
                              flights = flights.ix[np.isin(flights.origin_airport, airports.iata_code), :]
                              flights = flights.ix[np.isin(flights.destination_airport, airports.iata_code), :]

                      These lines use the airport codes to keep only the flights whose origin and
                      destination appear in the airports table.
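                      A minimal sketch of this filtering pattern, using tiny hypothetical tables
                      (note it uses .loc, since the .ix accessor in the notebook was removed in
                      later pandas releases):

```python
import numpy as np
import pandas as pd

# Hypothetical flights and airport reference data.
flights = pd.DataFrame({'origin_airport': ['JFK', 'XXX', 'LAX'],
                        'scheduled_time': [330.0, 95.0, 360.0]})
airports = pd.DataFrame({'iata_code': ['JFK', 'LAX']})

# Keep only rows whose origin appears in the airports table.
# np.isin returns a boolean mask aligned with the flights rows.
mask = np.isin(flights.origin_airport, airports.iata_code)
flights = flights.loc[mask, :]

print(flights.origin_airport.tolist())  # ['JFK', 'LAX']
```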

                              # Drop (6) flights without scheduled time
                              flights = flights.ix[np.isnan(flights.scheduled_time)==False, :]


                        This line removes flights that do not have a scheduled time listed, so our dataset
                        doesn’t have incomplete data.

                                 flights['FL_ID'] = ["FL_" + str(x) for x in np.arange(0, len(flights))]

                        This line iterates over the flights table and adds a new identifier for each flight
                        in the FL_ID column.
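                        The ID-generation pattern can be sketched on a hypothetical table size:

```python
import numpy as np

# Sketch of the ID-generation pattern; n_flights stands in for len(flights).
n_flights = 4
fl_ids = ["FL_" + str(x) for x in np.arange(0, n_flights)]

# One unique, sequential identifier per row.
print(fl_ids)  # ['FL_0', 'FL_1', 'FL_2', 'FL_3']
```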

                                 # Delete critical future leak. Do not give model info it is looking for.
                                 drop_cols = ['departure_time', 'taxi_out', 'wheels_off', 'elapsed_time',
                                              'air_time', 'wheels_on', 'taxi_in', 'arrival_time',
                                              'arrival_delay', 'diverted', 'cancellation_reason',
                                              'air_system_delay', 'security_delay', 'airline_delay',
                                              'late_aircraft_delay', 'weather_delay', 'year', 'tail_number']
                                 flights.drop(drop_cols, inplace=True, axis=1)

                        These lines delete the listed columns from the flight dataset. These columns
                        contain information that would not be available at prediction time and would
                        therefore bias the model; such variables are often called "future leaks."

                     10 Execute the cell. It may take a few minutes to run.

                     11 Locate the cell containing:

                                 with pd.HDFStore(os.path.join(home_dir, 'Data\\clean.h5')) as hdf:
                                     hdf.append(key="flights", value=flights)

                        This cell commits the new flights table and appends it to the clean.h5 file we
                        have locally.


                12 Execute the cell. Saving may take a few moments because the data is large.

                13 Click Kernel, then click Shutdown. Confirm the shutdown.

                14 Close the browser tab.

                15 In File Explorer, locate the clean.h5 file.

Data preparation
                When we train a model, typically we use a single table that contains all of the data
                we want the model to use. In that table, we add features—new data that wasn’t
                present in the raw data—that reflect our knowledge of the business problem or
                our hypotheses about what is correlated with the business problem.

                This is done through joins, transformations, merges, and lookups using the
                available data sources.

                As an example, our Flights table had the origin and destination airports, and our
                Airports data contained the latitude and longitude for each airport. We want to
                use the latitude and longitude in our model, in case there’s a relationship between
                those variables and the outcome for each flight.

                To do that, we have to join airports with flights.

                After we do this a couple times, we end up with a very large table. In this case, we
                have our 5 million flights, accompanied by 190 columns! In total these columns
                reflect all of our ideas, hypotheses, and thoughts we want the model to learn from
                to understand what causes flights to be delayed.


                Prepare the data

                            1   On the main Jupyter Notebook page, click 03_prep_training.
                                The notebook opens.

                            Rather than running each cell individually, we will run the entire script to
                            save time. Please see the Appendix for individual steps.

                            2   In the Cell menu, select Run All.

                            3   Once completed, click Kernel, then click Shutdown. Confirm the
                                shutdown.
                            4   Close the browser tab.

Data splitting
                This is the last step in data preparation.


                We divide our data into different sets:

                •   We need to create a partition of the data to use for training the model. This is
                    the data the machine learning algorithm uses to fit the model.

                •   We also need a test set. The test set is used to evaluate the quality of the
                    model after it has been trained with an objective metric like r-squared or root
                    mean squared error.

                •   The final set is the validation set. The validation set serves a similar purpose to
                    the test set. It is used to provide a confirmation that the error rate from both
                    the validation set and the test set are similar.

                The purpose of the training-test split is to estimate the reliability of our model
                when using unseen data. We do this to get a sense of how well our model will
                perform when it’s in production.

                Another way of doing this is through cross-validation, where this train-test-
                validation splitting process is repeated numerous times, and produces a distribution
                of error estimates instead of a single error estimate.

                 In this case, we'll do a random split of the data. About 65% of the data is used for
                 training, and the remaining data is split evenly between the test and validation sets.
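                 The two-stage split can be sketched in plain NumPy (the notebook itself uses
                 scikit-learn's train_test_split; the row count here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                      # stand-in for the number of flights
idx = rng.permutation(n)      # shuffle row indices for a random split

# First split: 65% train, 35% held out.
n_train = int(n * 0.65)
train, rest = idx[:n_train], idx[n_train:]

# Second split: the 35% remainder is divided evenly into test and validation.
half = len(rest) // 2
test, validation = rest[:half], rest[half:]

print(len(train), len(test), len(validation))  # 650 175 175
```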

                    Due to time limitations for this workshop, we will use smaller sets to train and
                    evaluate the data. Normally the model would use the entirety of the data.

                Split the data

                1 Locate the following cell:


                                 # ########################### #
                                 # Train-Test-Validation Split #
                                 # ########################### #
                                 prod = flights[np.logical_and(flights.month == 12, flights.day == 31)]
                                 flights = flights[~np.logical_and(flights.month == 12, flights.day == 31)]
                                 train, test = train_test_split(flights, test_size=0.35)
                                 test, oos = train_test_split(test, test_size=0.5)

                        This cell creates our train, test, and validation splits. You see the creation of
                        the train and test sets, followed by a split of the test dataset to create the
                        validation set. Additionally, a data frame called prod is created, which will
                        represent the data our model will not see: these are all the flights for
                        December 31st. We will use this data later in the workshop as an
                        out-of-sample test.

                     2 Execute this cell. This may take a few moments.

                     3 Locate the cell containing the following:

                                 # ############# #
                                 # Store in HDF5 #
                                 # ############# #
                                 with pd.HDFStore(os.path.join(home_dir, 'Data\\ready.h5')) as hdf:
                                     hdf.append(key="train", value=train)
                                     hdf.append(key="test", value=test)
                                     hdf.append(key="oos", value=oos)
                                     hdf.append(key="prod", value=prod)

                        Now that we have some data prepared, we store it in our home directory so
                        it can be reloaded later if needed.

                     4 Execute this cell. Saving may take a few moments.

                     5 Click Kernel, then click Shutdown. Confirm the shutdown.


                6 Close the browser tab.

                7 In the File Explorer, locate the ready.h5 file and notice the file is now two and
                  a half gigabytes.

Walkthrough: Modeling
                 In the modeling phase, we select an algorithm or collection of algorithms suitable
                 for our problem. Thanks to our earlier work, the data is structured for use in our
                 machine learning algorithms, but we might not know in advance which algorithm
                 will work best.

                Classes of machine learning problems
                 There are three main classes of machine learning algorithms:


                      •   Unsupervised learning algorithms are used most often for pattern discovery,
                          for example, when you have data but aren't sure exactly what question
                          to ask of it.

                     •   Supervised learning algorithms are used when we want to infer a relationship
                         between input and output pairs. They comprise many of the applications
                         you’ve probably read about. For example, supervised learning tasks include
                         regression analysis and data classification. Unlike unsupervised learning, these
                         tasks require labeled data. In other words, we must know the true outcome
                         for each record in our dataset.

                     •   Reinforcement learning algorithms are intended to optimize performance
                         outcomes. Unlike supervised learning algorithms, they do not require the
                         correct input and output pairs. They are used in many industrial and software
                         applications, including in manufacturing and automation.

                     Neural networks
                      Since we are trying to predict whether a flight is likely to be on time, delayed, or
                      canceled, and we have labeled data, we are working on a supervised classification
                      problem. To solve it, we will use a neural network.

                      Neural networks are among the most important machine learning techniques.
                      A neural network consists of three parts:
                         1. An input layer, composed of the labeled real-world observations from our
                            dataset, such as origin airport and departure time.

                         2. An output layer, which contains our predictions regarding the probable
                            status of a flight given its input characteristics.

                         3. Multiple hidden layers, composed of sequential algorithms that analyze
                            and process data from the input layer and previous layers in order to
                            generate the outputs. Because we are using multiple hidden layers, this is a
                            deep learning model.


                To use our neural network, we must first “train” it. We do so by feeding input
                layer data into an activation function. The hidden layers of the neural network will
                then automatically perform a series of calculations to tune the weights of each
                node in the network, ensuring that the output layer most closely matches the true
                output that we observed in our data—in other words, making sure that our model
                produces the most accurate predictions possible.
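
                 As a toy illustration of what one hidden layer computes, here is a single
                 forward pass with hypothetical weights (the workshop model itself is built
                 with Keras, not by hand like this):

```python
import numpy as np

def relu(z):
    # Common hidden-layer activation: max(0, z) element-wise.
    return np.maximum(0.0, z)

def softmax(z):
    # Turns the output layer into probabilities over the three flight statuses.
    e = np.exp(z - z.max())
    return e / e.sum()

# One input record with 4 hypothetical features.
x = np.array([0.2, -1.0, 0.5, 0.3])

# Hypothetical weights: one hidden layer of 3 nodes and an output layer
# of 3 classes (cancelled, delayed, on_time). Training tunes these values.
W1 = np.ones((3, 4)) * 0.1
b1 = np.zeros(3)
W2 = np.ones((3, 3)) * 0.1
b2 = np.zeros(3)

hidden = relu(W1 @ x + b1)          # hidden layer output
probs = softmax(W2 @ hidden + b2)   # one probability per flight status

# The three output probabilities sum to 1.
print(probs.sum())
```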

                Train the model

                 To train our model, we start in the 04_train.ipynb notebook.

                1 On the Jupyter notebook main page, click 04_train.

                     The notebook opens.

                2 Locate the cell at the top containing the following code:


                # package imports
                import sys
                import os
                import gc
                import numpy as np
                import pandas as pd
                import pickle

                from keras.models import Sequential
                from keras.layers.core import Dense, Dropout
                from keras.layers.normalization import BatchNormalization
                from keras.callbacks import EarlyStopping
                from keras import regularizers

                from sklearn.preprocessing import MaxAbsScaler
                        This cell loads the packages that we need in order to train our model. You
                        may see some warnings from TensorFlow, but you can safely ignore them.

                     3 Execute this cell.

                     4 Locate the cell containing the following code:

                # helper function for returning the target
                # variables from the training data
                def get_targets(df):
                    targets = ['cancelled', 'delayed', 'on_time']
                    return df.filter(items=targets, axis=1)
                        This contains a support function we will use later. It returns the target
                        variables when we run the model.

                     5 Execute this cell.


                6 Locate the cell containing the following code:

                # helper function for dropping columns that
                # we do not wish to train the model on
                def drop_cols(df):
                    drop = ['month', 'day', 'day_of_week', 'airline',
                            'flight_number', 'iata_code_orig', 'state_orig',
                            'iata_code_dest', 'state_dest', 'origin_airport',
                            'destination_airport', 'scheduled_departure',
                            'departure_delay', 'scheduled_time', 'distance',
                            'scheduled_arrival', 'FL_ID',
                            'cancelled', 'delayed', 'on_time']
                    return df.drop(drop, axis=1)
                     This cell also contains a support function. It removes, or “drops,” some
                     columns from a data frame that we do not need.
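                     To see what these two helpers do, here is a toy data frame run through the
                     same filter/drop pattern. The columns here are invented for illustration and
                     are not the real flight schema.

```python
import pandas as pd

df = pd.DataFrame({'airline': ['AA', 'DL'],
                   'dep_hour': [9, 17],
                   'cancelled': [0, 1]})

# keep only the listed columns (what get_targets does for the target variables)
targets = df.filter(items=['cancelled'], axis=1)

# remove the listed columns (what drop_cols does for unused features)
features = df.drop(['airline', 'cancelled'], axis=1)
```

                     Both calls return a new data frame; the original df is left unchanged.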

                7 Execute this cell.

                8 Locate the cell containing the following code:

                # set home directory
                home_dir = "C:\\Users\\mstr\\Desktop\\Demo"

                # set seed for reproducibility
                np.random.seed(91919)

                     In this cell, we define the path to our files and set a fixed seed number that
                     will allow us to reproduce the same results every time. This is because the
                     network's weights are initialized with pseudo-random values, and seeding makes
                     those values identical on every run.
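                     You can verify this reproducibility yourself: re-seeding with the same number
                     makes NumPy produce an identical sequence of pseudo-random draws.

```python
import numpy as np

np.random.seed(91919)
first_draw = np.random.rand(3)

np.random.seed(91919)   # same seed -> same sequence
second_draw = np.random.rand(3)

# the two draws are element-for-element identical
same = np.array_equal(first_draw, second_draw)
```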

                9 Execute this cell.

                10 Locate the cell containing the following code:


                # ################ #
                # Load in the data #
                # ################ #
                with pd.HDFStore(os.path.join(home_dir, 'Data\\ready.h5')) as hdf:
                    train = hdf.get(key="train")
                    test = hdf.get(key="test")
                        This cell loads in our training data as well as our test data.

                     11 Execute this cell. Note that these datasets are quite large, so you should not be
                        concerned if they take time to load.

                     12 Locate the cell containing the following:

                # ######### #
                # Data prep #
                # ######### #
                if True:
                    # To speed up model training, we're taking a subset of the data
                    # If you want to train the full model, set the previous line to False
                    train, test = [df.sample(n=25000) for df in [train, test]]

                # x-vars and y-vars
                x_train, x_test = [np.array(drop_cols(df=df)) for df in [train, test]]
                y_train, y_test = [np.array(get_targets(df=df)) for df in [train, test]]

                        *Note that while this condition is set to True, each dataset is limited to a
                         25,000-row sample. To train on the complete data frames, set the condition
                         to False.


                     We’ll use a subset of our data to train and test our model. Depending on the
                     application and the dataset, neural networks can exhibit a lot of variation in
                     the time they take to fit a model. By only using a sample of the data, we can
                     ensure that this process will only take a few minutes.

                     We also want to manage the program’s memory profile as it runs. The more
                     data we want to analyze, the more system resources the program must
                     consume.

                13 Execute this cell. By doing so, we split our data into x and y sets. The x set
                   contains the input observations that the model will use to learn from. The y set
                   contains the outcome observations.

                14 Locate the cell containing the following code:

                # ####### #
                # Scaling #
                # ####### #
                scaler = MaxAbsScaler()
                x_train = scaler.fit_transform(x_train)
                x_test = scaler.transform(x_test)

                # clean-up
                del train, test
                gc.collect()

                     This cell scales our feature values so that each column falls between -1
                     and 1, which keeps the inputs on comparable scales and helps the network
                     train more efficiently.

                     Note that gc.collect is a garbage collection command used here to help reduce
                     memory usage.
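                     Under the hood, MaxAbsScaler divides each column by the maximum absolute
                     value seen in the training data. A NumPy sketch of the same idea, using toy
                     numbers rather than our flight features:

```python
import numpy as np

x_tr = np.array([[10.0, -2.0],
                 [ 5.0,  4.0]])

# per-column maximum absolute value, learned from the training data
max_abs = np.abs(x_tr).max(axis=0)     # array([10., 4.])
x_tr_scaled = x_tr / max_abs           # training columns now lie in [-1, 1]

# new data reuses the *training* maxima, just like scaler.transform does
x_new = np.array([[20.0, 2.0]])
x_new_scaled = x_new / max_abs         # unseen values can fall outside [-1, 1]
```

                     Note that transform reuses the training maxima on purpose: the test data must
                     be scaled exactly the way the model was trained.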

                15 Execute this cell.

                16 Locate the cell containing the following code:

                # ############################ #
                # Configure the neural network #
                # ############################ #
                dnn = Sequential()
                dnn.add(Dense(1024, input_dim=190,
                              activation='relu',
                              kernel_initializer='uniform',
                              bias_initializer='normal',
                              kernel_regularizer=regularizers.l2(0.001),
                              activity_regularizer=regularizers.l2(0.001)))
                dnn.add(Dropout(0.2))

                dnn.add(Dense(512,
                              activation='tanh',
                              kernel_initializer='normal',
                              bias_initializer='uniform',
                              kernel_regularizer=regularizers.l2(0.001),
                              activity_regularizer=regularizers.l2(0.01)))
                dnn.add(BatchNormalization())
                dnn.add(Dropout(0.2))

                dnn.add(Dense(128,
                              activation='relu',
                              kernel_initializer='zeros',
                              bias_initializer='uniform',
                              kernel_regularizer=regularizers.l2(0.001),
                              activity_regularizer=regularizers.l2(0.1)))
                dnn.add(BatchNormalization())

                dnn.add(Dense(3,
                              activation='softmax',
                              kernel_initializer='normal',
                              bias_initializer='ones'))
                dnn.compile(loss='categorical_crossentropy',
                            optimizer='adagrad',
                            metrics=['categorical_accuracy'])


                                  earlystopping = EarlyStopping(monitor='val_loss',
                                  min_delta=0, patience=0, verbose=0, mode='auto')

                         This cell configures the parameters of our neural network, which we must do
                         in order to initialize its structure.
                         Notice the calls beginning with dnn.add(Dense. These set the number of
                         nodes in each layer: from 1,024, to 512, to 128, and ultimately down to the 3
                         output nodes. The selection of this “network topology” is a common area of
                         debate. Data scientists typically spend a long time selecting and fine-tuning
                         these kinds of parameters.
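                         One reason topology choices matter is the number of trainable parameters
                         they imply: a fully connected (Dense) layer has (inputs + 1) × units
                         weights, the extra 1 being the bias term. A quick sketch for the layer
                         sizes used above (ignoring the BatchNormalization layers, which add a few
                         parameters of their own):

```python
def dense_params(n_inputs, n_units):
    # each unit has one weight per input plus one bias term
    return (n_inputs + 1) * n_units

# layer sizes from the network above: 190 inputs -> 1024 -> 512 -> 128 -> 3
first_layer = dense_params(190, 1024)
total = (dense_params(190, 1024) + dense_params(1024, 512)
         + dense_params(512, 128) + dense_params(128, 3))
```

                         Wider or deeper topologies can fit more complex patterns, but every extra
                         parameter costs training time and raises the risk of overfitting.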

                     17 Execute this cell.

                     18 Locate the cell containing the following:

                # ######## #
                # Training #
                # ######## #
                dnn.fit(x=x_train, y=y_train,
                        batch_size=2500, epochs=10, verbose=True,
                        validation_data=(x_test, y_test),
                        callbacks=[earlystopping])

                         This cell will initiate the training of our neural network. You should see
                         notifications in the console informing you of what is happening as the model
                         attempts to optimize and achieve the lowest loss statistic.

                         Note that as each training “epoch” is completed, the loss statistic declines.
                         This means our model is becoming a more accurate predictor compared to
                         previous iterations.
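                         The earlystopping callback configured earlier implements essentially this
                         logic: with patience=0 and min_delta=0, training halts the first time the
                         validation loss fails to improve. A sketch with a hypothetical sequence of
                         epoch losses (not real workshop output; Keras's actual implementation is
                         more involved):

```python
# hypothetical validation losses reported at the end of each epoch
val_losses = [0.92, 0.81, 0.77, 0.78, 0.70]

best = float('inf')
completed = 0
for loss in val_losses:
    if loss < best:          # improvement: keep training
        best = loss
        completed += 1
    else:                    # no improvement and patience=0: stop early
        break

# training stops once 0.78 fails to improve on 0.77,
# so the later 0.70 epoch is never reached
```

                         A larger patience value would tolerate a few non-improving epochs before
                         stopping, at the cost of extra training time.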

                     19 Open the Task Manager (right click the taskbar and click Task Manager).


                20 Keep an eye on the Python process in the Processes tab.

                     This provides a preview of the CPU and memory resources that our
                     calculations are consuming. In general, this demand will scale with the
                     amount of data we are trying to process. More demanding tasks, in other
                     words, require more system resources.

                     Note that training the model will require significant CPU resources. Indeed,
                     you should not be surprised to see Python frequently utilizing close to 100% of
                     those available.

                21 Execute the cell from Step 18. This will take a few minutes.

                22 Locate the cell containing the following code:

                # ############ #
                # Save to disk #
                # ############ #

                # save the model and pre-processing scaler to disk
                dnn.save(os.path.join(home_dir, 'Data\Model\dnn_weights.h5'))
                pickle.dump(scaler,
                            open(os.path.join(home_dir, 'Data\Model\dnn_scaler.pkl'), 'wb'))
                        Once the model is trained, we must save it somewhere. This cell saves it to
                        the Model subfolder of the Data folder in your home directory.

                     23 Execute this cell. Saving may take a few moments.

                     24 Click Kernel, then click Shutdown. Confirm the shutdown.

                     25 Close the browser tab.

                     26 In File Explorer, locate the Model folder. You should see two saved model files,
                        as shown below:

                            •   “dnn_scaler.pkl” contains the scaler, a pre-processing utility used to
                                structure the data for the neural network.
                            •   “dnn_weights.h5” contains the weights of the neural network.
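                            Saving the scaler with pickle works because pickle serializes any
                            Python object to bytes and restores an equivalent object on load. A
                            minimal round trip with a stand-in object (using a temporary file
                            rather than the workshop's Model folder):

```python
import os
import pickle
import tempfile

# stand-in for the fitted scaler; any picklable object works the same way
fake_scaler = {'max_abs': [10.0, 4.0]}

path = os.path.join(tempfile.gettempdir(), 'demo_scaler.pkl')
with open(path, 'wb') as f:
    pickle.dump(fake_scaler, f)   # serialize to disk

with open(path, 'rb') as f:
    restored = pickle.load(f)     # restored equals the original object
```

                            The Keras model itself uses dnn.save instead, because the HDF5 format
                            stores both the network architecture and its weights.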

 Walk-through: Evaluation
                     Now that we have trained our model, we can evaluate how well it predicts flight
                     departure status.

                     To help with the discussion, we have calculated the Area Under the Curve (AUC)
                     statistic for the model on each of the outcomes. This statistic summarizes the
                     Receiver Operating Characteristic (ROC) curve, which plots the rate of true
                     positives (correctly predicted examples) against the rate of false positives
                     (incorrectly predicted examples). An excellent model produces a sharp curve
                     approaching the upper left-hand corner of the graph. The dashed line is a
                     reference point for a model that randomly guesses the outcome.

                The first ROC graph visualizes the AUC for cancelled flights. The model’s AUC
                score of 0.71 means it did much better than random guessing. A more accurate
                model would have an AUC score even closer to 1 and a curve more sharply
                sloped toward the upper left-hand corner of the graph.

                Next, we have the AUC for delayed flights, which came in at 0.64.

                And finally, we have the AUC for on-time flights which was also 0.64.
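                For intuition, AUC equals the probability that a randomly chosen positive
                example receives a higher predicted score than a randomly chosen negative one
                (0.5 is random guessing, 1.0 is perfect ranking). A small from-scratch
                computation on hypothetical scores, purely for illustration rather than the
                workshop's actual figures:

```python
def auc(scores_pos, scores_neg):
    # fraction of (positive, negative) pairs ranked correctly; ties count half
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# hypothetical model scores for delayed (positive) vs. on-time (negative) flights
value = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])   # 8 of 9 pairs ranked correctly
```

                This pairwise-ranking view is why AUC is insensitive to any particular
                classification threshold.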

                     Before putting this model into production, we would want to spend more time
                     comprehensively assessing its performance under a number of different
                     assumptions.

                     Let’s assume for this workshop that we have successfully tested this model.

                     Congratulations! You just trained a model in Python.

                     Let’s now use this model to predict departure statuses for flights that we have not
                     yet observed.

                     Predict flight status

                     We’ll use our previously trained neural network to predict the departure status of
                     unobserved flights. To do this, we’ll open our 05_predict.ipynb script. This script
                     acquires the new data, applies the neural network to the new data, and then creates
                     a cube containing predictions inside of MicroStrategy, which we can use inside a
                     dossier or dashboard.

                     1 In the Jupyter notebook main page, click 05_predict.

                         The notebook opens.

                     2 Locate the cell at the top containing the following code:

                import pandas as pd
                import numpy as np
                import os
                import sys
                import gc
                import pickle
                from keras.models import load_model

                from mstrio import microstrategy

                     This cell will import the libraries we need for the notebook. Notice the return of
                     mstrio, as we will be interacting with MicroStrategy towards the end of this
                     phase.

                3 Execute this cell.

                4 Locate the cell containing the following:

                # helper function for returning the target
                # variables from the training data
                def get_targets(df):
                    targets = ['cancelled', 'delayed', 'on_time']
                    return df.filter(items=targets, axis=1)

                     This helper selects the dependent variables (“cancelled,” “delayed,” and
                     “on_time”) from a dataset.

                5 Execute this cell.

                6 Locate the cell containing the following code:

                # helper function for dropping columns that
                # we do not wish to train the model on
                def drop_cols(df):
                    drop = ['month', 'day', 'day_of_week', 'airline',
                            'flight_number', 'iata_code_orig', 'state_orig',
                            'iata_code_dest', 'state_dest', 'origin_airport',
                            'destination_airport', 'scheduled_departure',
                            'departure_delay', 'scheduled_time', 'distance',
                            'scheduled_arrival', 'FL_ID',
                            'cancelled', 'delayed', 'on_time']
                    return df.drop(drop, axis=1)

                     This cell will drop columns that we do not wish to use when training our
                     model.

                     7 Execute this cell.

                     8 Locate the cell containing the following code:

                # set home directory
                home_dir = "C:\\Users\\mstr\\Desktop\\Demo"

                # set seed for reproducibility
                np.random.seed(91919)

                         In this cell, we set the path to our files and the initializing pseudo-random
                         seed. Note that we are using the same number as before.

                     9 Execute this cell.

                     10 Locate the cell containing the following code:

                # ######################################## #
                # Load in the network and data preprocessor #
                # ######################################## #
                dnn = load_model(filepath=os.path.join(home_dir,
                                 'Data\Model\dnn_weights.h5'))
                scaler = pickle.load(open(os.path.join(home_dir,
                                     'Data\Model\dnn_scaler.pkl'), 'rb'))
                         This cell will load in the model weights and the pre-processor scaler.

                     11 Execute this cell.

                     12 Locate the cell containing the following code:
