Accelerating Microsoft's AI Ambitions

Page created by Brent Sherman
 
[Figure: Azure Cognitive Services portfolio]
- Decision: Anomaly Detector, Content Moderator, Personalizer
- Language: Text Analytics, Translator Text, Language Understanding, QnA Maker, Bing Spell Check
- Vision: Computer Vision, Face, Custom Vision, Video Indexer, Form Recognizer, Ink Recognizer, Content Moderator
- Speech: Speech transcription, Conversation transcription capability, Custom Speech, Text-to-Speech, Neural Text-to-Speech
- Search: Bing Web Search, Bing Custom Search, Bing Entity Search, Bing Video Search, Bing News Search, Bing Image Search, Bing Visual Search, Bing Autosuggest, Local Business Search
[Figure: evolution of model architectures, from classic ML to deep CNNs, attention/Transformers, and graph convolutional networks; see figure sources below]
Figure sources:
1. Han et al., "Pre-Trained AlexNet Architecture with Pyramid Pooling and Supervision for High Spatial Resolution Remote Sensing Image Scene Classification"
2. Vaswani et al., "Attention Is All You Need"
3. https://tkipf.github.io/graph-convolutional-networks/
[Figure: DNN model growth, 2010–2020. Left: model size in millions of parameters (AlexNet, ResNet-50, GNMT, BERT-L, GPT-2, Megatron), with Megatron ~325x the size of ResNet-50. Right: compute in billions of ops (AlexNet, ResNet-50, BERT-L, GPT-2, Megatron), with Megatron needing ~2200x the ops of ResNet-50.]
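The headline ratio in this chart can be reproduced from commonly cited model sizes; the specific parameter counts below are approximate public figures assumed for illustration, not values from the slide:

```python
# Approximate, commonly cited parameter counts (assumptions, for illustration).
params_millions = {
    "AlexNet": 61,
    "ResNet-50": 25.6,
    "BERT-Large": 340,
    "GPT-2": 1500,
    "Megatron-LM": 8300,
}

# Ratio of the largest model on the chart to ResNet-50.
ratio = params_millions["Megatron-LM"] / params_millions["ResNet-50"]
print(f"Megatron-LM is ~{ratio:.0f}x the size of ResNet-50")
```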
[Figure: spectrum of compute hardware, from general-purpose to specialized: CPUs (registers, control unit, arithmetic logic unit), GPUs, FPGAs, NPUs, and ASICs]
Cloud DNN training and batched inferencing on NVIDIA GPUs (CUDA, PyTorch, TensorFlow)

Cloud and heavy edge inferencing performed on Intel CPUs (ONNX) and MS-NPUs (FPGA)

Light edge inferencing on commodity and custom silicon (e.g., HoloLens)
Inside Bing’s AI Inference Supercomputer:
Project Brainwave
Project Catapult timeline (Field-Programmable Gate Arrays):
2011: Project Catapult launched
2013: Bing pilot runs decision trees 40X faster
2015: Bing ranking throughput increased 2X
2016: Azure Accelerated Networking delivers industry-leading cloud performance
2017: Over 1M servers deployed with FPGAs at hyperscale
2017: Hardware Microservices harness FPGAs for distributed computing
2017: FPGAs enable real-time AI with ultra-low-latency inferencing without batching; Bing launches first FPGA-accelerated Deep Neural Network
2018: Project Brainwave launched in Azure Machine Learning
[Figure: Bing compute servers and Bing FPGA appliances. Each compute server pairs dual-socket CPUs with an FPGA on PCIe Gen3 x16 and a 50G NIC; FPGA appliances pack racks of network-attached FPGAs. Both connect through TOR, T1, and T2 switch tiers over 50G links, with 9x50G uplinks per TOR.]
[Figure: the hardware acceleration plane. Each FPGA sits between its server's NIC and the TOR; round-trip latency is ~3μs to the TOR, ~8μs within a T1 pod, and ~22μs within T2. Interconnected FPGAs host NLP (RNN) models, image detection (CNN), text-to-speech, and web ranking above the traditional software (CPU) server plane running the Bing serving stack.]

1. FPGAs are network-connected, used and managed independently from the CPU.
2. Interconnected FPGAs form a separate plane of computation built on Hardware as a Service (HaaS).
3. Direct FPGA-to-FPGA communication uses the Lightweight Transport Layer (LTL) at ultra-low latencies.
Brainwave v1 (2016): low-latency LSTM inference
Brainwave v2 (2017): narrow-precision breakthrough
Brainwave v3 (2018): convolution optimizations
Brainwave v4 (2019): generalized ISA, Transformers
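Brainwave v1's target, low-latency LSTM inference at batch 1, comes down to two matrix-vector products per time step. A minimal NumPy sketch (dimensions are illustrative, not Bing's production models):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; real models differ.
hidden, inputs = 1024, 1024

# One LSTM cell: fused weights for the input, forget, cell, and output gates.
W = rng.standard_normal((4 * hidden, inputs)) * 0.01   # input-to-hidden
U = rng.standard_normal((4 * hidden, hidden)) * 0.01   # hidden-to-hidden
b = np.zeros(4 * hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One batch-1 LSTM time step: two matrix-vector products.

    At batch 1 each weight element is read once per step for a single
    multiply-accumulate, so latency is bound by weight bandwidth -- the
    property Brainwave addresses by pinning weights in on-chip memory.
    """
    z = W @ x + U @ h + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

h = np.zeros(hidden)
c = np.zeros(hidden)
x = rng.standard_normal(inputs)
h, c = lstm_step(x, h, c)
```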
[Figure: relative multiplier area and energy for the msfp8, int8, float16, and float32 datatypes]
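msfp8 is Microsoft's narrow-precision floating-point format: a block of values shares a single exponent, so each multiplier only handles a small fixed-point mantissa. A rough sketch of that idea (the block size and bit widths here are assumptions, not the production format):

```python
import numpy as np

def quantize_block_fp(x, mantissa_bits=4):
    """Quantize a block of values with one shared exponent (msfp-style sketch).

    Every value in the block shares the exponent of the largest magnitude,
    so hardware multipliers only need narrow fixed-point mantissas -- the
    source of the area/energy savings in the figure above.
    """
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()
    shared_exp = np.floor(np.log2(max_abs))                # one exponent per block
    scale = 2.0 ** (shared_exp + 1 - (mantissa_bits - 1))  # mantissa step size
    mantissas = np.clip(np.round(x / scale),
                        -(2 ** (mantissa_bits - 1)),
                        2 ** (mantissa_bits - 1) - 1)      # narrow signed mantissa
    return mantissas * scale

x = np.array([0.9, -0.31, 0.05, 0.002])
xq = quantize_block_fp(x)  # coarse, but cheap to multiply in hardware
```

Small values in a block dominated by a large one lose precision, which is why narrow-precision formats needed the accuracy "breakthrough" work noted for Brainwave v2.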
Sub-millisecond FPGA compute latencies at batch 1
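Batch-1 latency matters because batching improves throughput only by making every request wait for the whole batch. A toy CPU measurement of that tradeoff (purely illustrative; these are not FPGA numbers):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 2048)).astype(np.float32)  # stand-in model layer

def run(batch):
    """Time one dense layer over `batch` requests; returns total ms."""
    x = rng.standard_normal((batch, 2048)).astype(np.float32)
    t0 = time.perf_counter()
    _ = x @ W.T
    return (time.perf_counter() - t0) * 1e3

for batch in (1, 64):
    ms = run(batch)
    # Larger batches improve ms-per-request (throughput) but every request
    # in the batch experiences the full batch latency.
    print(f"batch={batch:3d}: {ms:.2f} ms total, {ms / batch:.4f} ms per request")
```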
https://www.microsoft.com/en-us/research/uploads/prod/2018/03/mi0218_Chung-2018Mar25.pdf

https://blogs.bing.com/search/2017-12/search-2017-12-december-ai-update
Hardware for Future AI
Must solve real customer problems – solutions including non-AI pieces, not just AI components

Must be differentiated end-to-end (E2E), including system overheads

Want durable and “horizontally-capable” architectures with long shelf lives (3-5 years)

Compatible and friendly to deploy in diverse environments (SKUs, datacenters, etc.)

Must be easy to develop software/models for and integrate seamlessly with AI tools ecosystem

Improved cost of ownership at system-scale vs general-purpose commodity hardware

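A recurring answer to these requirements is the systolic array (H.T. Kung, "Why Systolic Arrays?", 1982, cited below): a grid of simple processing elements with purely local, regular communication. A toy output-stationary simulation of C = A @ B, where the skewed index models the wavefront timing:

```python
import numpy as np

def systolic_matmul(A, B):
    """Output-stationary systolic array simulation of C = A @ B.

    Each processing element (r, c) holds one accumulator; A streams in
    from the left (skewed by row) and B from the top (skewed by column),
    so data moves only between neighboring PEs each cycle -- the regular,
    local communication Kung's paper argues for.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    # Cycle t: PE (r, c) consumes A[r, t - r - c] and B[t - r - c, c].
    for t in range(n + m + k - 2):
        for r in range(n):
            for c in range(m):
                s = t - r - c
                if 0 <= s < k:
                    C[r, c] += A[r, s] * B[s, c]
    return C

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Because every PE does the same thing every cycle with fixed neighbors, the design scales "horizontally" and ages well, which is one reason systolic structures keep reappearing in AI accelerators.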
Figure sources:
1. H.T. Kung, "Why Systolic Arrays?", 1982
2. https://datascience.stackexchange.com/questions/49522/what-is-gelu-activation
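The second figure source above concerns the GELU activation used throughout Transformer models. Its exact (non-approximated) form is x·Φ(x), with Φ the standard normal CDF:

```python
import math

def gelu(x: float) -> float:
    """Gaussian Error Linear Unit: x * Phi(x), where Phi is the
    standard normal CDF, expressed via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# GELU is near 0 for large negative x and near the identity for large x.
print(gelu(-3.0), gelu(0.0), gelu(3.0))
```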
Closing thoughts and predictions
Q/A & Discussion

erchung@microsoft.com