ML Inference Serving
Preview for USENIX ATC / OSDI 2021

Arpan Gujarati, University of British Columbia
Background
Machine Learning as a Service

• Providers
  - Azure Machine Learning
  - Machine Learning on AWS
  - IBM Watson Machine Learning
  - Google Cloud AI

[Diagram: application developers / end users send data to models hosted in the cloud and receive predictions back, e.g., pictures → tags, music → recommendations, sensor data → health report.]

• Training Phase: Dataset + Untrained model = Trained model
  - Long-running batch operations
  - Searching and fine-tuning model weights
  - No completion deadlines

• Inference / Prediction: Query + Trained model = Answer (both phases are sketched in code after this list)

• ML Inference Serving: computing predictions and responding to prediction requests from different users and for different models in real time.

• ML models include linear regression, cluster analysis, collaborative filtering, Bayesian inference, and deep neural network (DNN) inference. Focus: DNN prediction.
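To make the two phases concrete, here is a minimal sketch in Keras (one of the developer-friendly representations mentioned later in this preview). The dataset, layer sizes, and class count are made-up placeholders, not taken from any of the papers.

import numpy as np
import tensorflow as tf

# Training phase: long-running, batch-oriented, no per-request deadline.
x_train = np.random.rand(60_000, 32).astype("float32")    # placeholder dataset
y_train = np.random.randint(0, 10, size=60_000)            # placeholder labels

model = tf.keras.Sequential([                               # untrained model
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=5, batch_size=256)       # dataset + untrained model = trained model

# Inference / prediction: one query in, one answer out, ideally in real time.
query = np.random.rand(1, 32).astype("float32")
answer = model.predict(query)                               # query + trained model = answer
print(int(answer.argmax()))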
Background
Inference Serving at the Cloud Scale is Difficult

• 1000s of trained models of different types and resource requirements
• Requests arrive at different rates and regularity: periodic, bursty, sustained at a high rate, or arbitrary
• Each request has an inherent deadline: latency SLOs (e.g., 100 ms)
• Heterogeneous backends: CPU, GPU, TPU, FPGA (a back-of-the-envelope sketch follows this list)

  ResNet-50   Latency   Throughput   Cost
  CPU         175 ms    6 req/s      $
  GPU         2.8 ms    350 req/s    $$$
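To see why the backend choice matters, the sketch below provisions for a hypothetical aggregate load using the ResNet-50 numbers above; the relative costs "$" and "$$$" are stand-ins that I set to 1 and 3, and the load and SLO values are made up.

import math

backends = {
    # per-request latency (ms), throughput (req/s), relative cost per instance
    "CPU": {"latency_ms": 175.0, "throughput": 6.0,   "cost": 1.0},
    "GPU": {"latency_ms": 2.8,   "throughput": 350.0, "cost": 3.0},
}

target_rps = 1000.0   # hypothetical aggregate request rate
slo_ms = 100.0        # per-request latency SLO

for name, b in backends.items():
    instances = math.ceil(target_rps / b["throughput"])
    meets_slo = b["latency_ms"] <= slo_ms
    print(f"{name}: {instances} instances, relative cost {instances * b['cost']:.0f}, "
          f"fits the {slo_ms:.0f} ms SLO before any queueing: {meets_slo}")

# CPU: 167 instances, and the 175 ms execution latency already misses the SLO.
# GPU: 3 instances at a higher per-instance cost, but well within the deadline.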
Overview
Papers at ATC ’21 and OSDI ’21

• PET @ OSDI ’21
  - How can DNN executions be optimized for specific backend types, minimizing their computation costs?

• INFaaS @ ATC ’21
  - How can cloud providers efficiently schedule resources while meeting different types of SLOs for different sets of users?

• Palleon and JumpStarter @ ATC ’21
  - How can prediction accuracy and runtime performance be improved for specific applications like video processing and anomaly detection?
PET — Optimizing Tensor Programs
Partially Equivalent Transformations and Automated Corrections

Goal: Optimize DNN executions for specific backends and reduce the execution costs.

• Apache TVM Compiler: takes developer-friendly ML representations like TensorFlow and Keras and produces executables optimized for different backends, e.g., NVIDIA TensorRT

• Existing frameworks: P_initial → P1 → P2 → P3 → P_optimized, where every transformation is fully equivalent

• (Step 1) PET: P_initial → P1 → P2 → P3 → P_optimized, where the transformations are only partially equivalent
  - More efficient!
  - May not be equal to P_initial ➡ accuracy loss

• (Step 2) PET: P_optimized → automatic correction → P_optimized-and-correct (a toy sketch follows this list)
  - Efficient and equivalent to P_initial
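The snippet below is my own toy illustration of the "partially equivalent transformation plus automated correction" idea, not PET's actual algorithm or API: a batched 1D convolution is rewritten as one long convolution over the concatenated batch, standing in for a cheaper fused kernel. The rewrite is wrong only near the seams between sequences, and a correction step recomputes exactly those few outputs so the final result matches the original program.

import numpy as np

def conv_same(x, k):
    """Reference 'same' cross-correlation with zero padding (the original program)."""
    r = len(k) // 2
    xp = np.pad(x, r)
    return np.array([np.dot(xp[i:i + len(k)], k) for i in range(len(x))])

def batched_conv_reference(batch, k):
    return np.stack([conv_same(x, k) for x in batch])

def batched_conv_pet_style(batch, k):
    b, n = batch.shape
    r = len(k) // 2
    # Step 1: partially equivalent transformation. One long convolution over
    # the concatenated batch; outputs within r of a seam mix neighbouring rows.
    fused = conv_same(batch.reshape(-1), k).reshape(b, n)
    # Step 2: automated correction. Only the first/last r outputs of each row
    # can disagree with the reference, so recompute just those positions.
    for i in range(b):
        xp = np.pad(batch[i], r)
        for j in list(range(r)) + list(range(n - r, n)):
            fused[i, j] = np.dot(xp[j:j + len(k)], k)
    return fused

batch = np.random.rand(4, 32)
kernel = np.array([0.25, 0.5, 0.25])
assert np.allclose(batched_conv_pet_style(batch, kernel),
                   batched_conv_reference(batch, kernel))

PET itself searches for such transformations automatically and derives where corrections are required; this snippet only mirrors the transform-then-correct structure.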
INFaaS
Automated Model-less Inference Serving

[Diagram: 1000s of users with varying SLOs send requests to a cloud scheduler, which dispatches them to heterogeneous cloud backends (CPU, GPU, TPU, FPGA) whose models are held in RAM and GPU memory.]

Decisions the serving system must take:
• How should the scheduler prioritize these requests?
• Which models should be cached in RAM and in GPU memory?
• If heterogeneous backends are available, which one should be used?
• If there are different variants of a model, optimized for a single input, for a batch of 8 inputs, and for a batch of 16 inputs, which one should be used for inference? Should we wait for more user requests to arrive? (A toy sketch follows this list.)
  - Frameworks like TVM and PET can optimize a model for specific scenarios

INFaaS takes these decisions at runtime
  - based on the individual request SLOs
  - Autoscaling: increase / decrease the number and type of backends based on the workload
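As a toy illustration of the variant decision (my own sketch, not INFaaS's actual policy or API), the snippet below picks the cheapest-per-request model variant whose expected waiting plus execution time still fits a request's SLO; the variant names, latencies, and costs are hypothetical, loosely inspired by the ResNet-50 table earlier.

from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    backend: str        # e.g. "CPU" or "GPU"
    batch_size: int     # batch the variant was optimized for
    latency_ms: float   # execution latency for one full batch
    cost: float         # relative cost per executed batch

VARIANTS = [
    Variant("resnet50-cpu-b1",  "CPU",  1, 175.0, 1.0),
    Variant("resnet50-gpu-b1",  "GPU",  1,   2.8, 3.0),
    Variant("resnet50-gpu-b8",  "GPU",  8,  12.0, 3.5),
    Variant("resnet50-gpu-b16", "GPU", 16,  20.0, 4.0),
]

def choose_variant(slo_ms, queued_requests):
    """Cheapest-per-request variant that still meets the SLO.

    Waiting to fill a larger batch amortizes cost, but the extra queueing
    delay (crudely approximated as one batch worth of latency) must still
    leave room before the deadline.
    """
    feasible = []
    for v in VARIANTS:
        wait_ms = 0.0 if queued_requests >= v.batch_size else v.latency_ms
        if wait_ms + v.latency_ms <= slo_ms:
            feasible.append((v.cost / v.batch_size, v))
    return min(feasible, key=lambda p: p[0])[1] if feasible else None

print(choose_variant(slo_ms=10.0, queued_requests=1))    # only the single-input GPU variant fits
print(choose_variant(slo_ms=100.0, queued_requests=16))  # the 16-batch variant is cheapest per request

A real policy would also account for queueing across users, the time to load a variant into GPU memory, and when to scale the pool of backends up or down.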
Palleon
Runtime System for Efficient Video Processing

Focus: Cloud-backed mobile platforms

• ImageNet dataset → 1,200,000 images from 1,000 classes
• Generic models that cover all 1,000 classes have a large memory and power footprint
  - Prohibitive for mobile and edge platforms
  - For video processing, latency constraints are extremely tight
• Smaller models offer relatively lower accuracy!

Key idea (a toy sketch follows below)
  - Video frames have temporal locality
  - Classification output is skewed in favour of a small number of classes (unlike the training dataset with 1,000 classes)
  - If the class skew is known, a more compact model can be used instead of a generic model

Palleon: Detects the class skew in videos and dynamically adapts the ML model
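A toy sketch of the class-skew idea (my own illustration, not Palleon's actual algorithm): track the classes predicted by the full model over a recent window of frames, and if a handful of classes covers almost all predictions, report that set so a compact model specialized to it can take over.

from collections import Counter, deque

WINDOW = 100          # recent frames to inspect
SKEW_CLASSES = 10     # "skewed" means this many classes cover...
SKEW_COVERAGE = 0.95  # ...at least 95% of recent predictions

recent = deque(maxlen=WINDOW)

def update_and_check_skew(predicted_class):
    """Record one frame's prediction; return the dominant classes if the
    recent output distribution is skewed enough to justify switching to a
    compact model, else None (keep the generic 1,000-class model)."""
    recent.append(predicted_class)
    if len(recent) < WINDOW:
        return None
    top = Counter(recent).most_common(SKEW_CLASSES)
    covered = sum(count for _, count in top)
    if covered / len(recent) >= SKEW_COVERAGE:
        return [cls for cls, _ in top]
    return None

A runtime in Palleon's spirit would then swap the compact model in (or fall back to the generic one) as the detected skew changes from scene to scene.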
Program
Wednesday, July 14

• INFaaS, Palleon, JumpStarter, FTPipe @ ATC ’21
  - Session 3, Track 2: I'm Old But I Learned a New Trick: Machine Learning
  - Time: 12:15 pm - 1:45 pm PDT
  - JumpStarter: like Palleon, it focuses on a specific application (anomaly detection), but proposes to use signal processing instead of ML!
  - FTPipe: for training giant models on multiple GPUs in a pipelined fashion
• PET @ OSDI ’21
  - Session 1: Optimizations and Scheduling for Machine Learning
  - Time: 8:45 am - 10:00 am PDT