"AUTOML FOR TINYML WITH ONCE-FOR-ALL NETWORK" - SONG HAN - MIT APRIL 28, 2020 - TINYML FOUNDATION

“AutoML for TinyML with Once-for-All Network”

                 Song Han – MIT
                  April 28, 2020
"AUTOML FOR TINYML WITH ONCE-FOR-ALL NETWORK" - SONG HAN - MIT APRIL 28, 2020 - TINYML FOUNDATION
tinyML Talks Sponsors

Additional Sponsorships available – contact Bette@tinyML.org for info
"AUTOML FOR TINYML WITH ONCE-FOR-ALL NETWORK" - SONG HAN - MIT APRIL 28, 2020 - TINYML FOUNDATION
Next tinyML  Talk
                       Next tinyML Talk
Date Date Presenter
                Presenter
                                                    Topic  / Title
                                                    Topic / Title
Thursday    Hans
     Thursday     Reyserhove
                Hans Reyserhove                     Embedded      Computer
                                                    Embedded Computer         Vision
                                                                           Vision     Hardware
                                                                                  Hardware
May 14
     May 14 Postdoctoral
                PostdoctoralResearch
                              Research              through   theEyes
                                                    through the    Eyes
                                                                      of of
                                                                         an an
                                                                            AR AR  Glass
                                                                                Glass
            Scientist, Facebook
                Scientist, FacebookReality
                                     Reality
            LabsLabs
                Jamie  Campbell
            Jamie Campbell                          Using TensorFlow   Lite for Microcontrollers
                                                    Using TensorFlow Lite for Microcontrollers
                Software   Engineering
            Software Engineering                    for High-Efficiency NN  Inference on Ultra-
                                                    for High-Efficiency NN Inference on Ultra-
                Manager, Synopsys                   Low Power Processors
            Manager, Synopsys                       Low Power Processors
                              Webcast start time is 8 am Pacific time
                           Webcast     startis approximately
                           Each presentation   time is 8 am      Pacific
                                                             30 minutes     time
                                                                        in length
                         Each presentation is approximately 30 minutes in length
               Please contact talks@tinyml.org if you are interested in presenting
           Please contact talks@tinyml.org if you are interested in presenting
"AUTOML FOR TINYML WITH ONCE-FOR-ALL NETWORK" - SONG HAN - MIT APRIL 28, 2020 - TINYML FOUNDATION
Song Han
Song Han is an assistant professor in MIT's Department of Electrical Engineering and Computer Science. His research focuses on efficient deep learning computing. He proposed "deep compression," which can reduce neural network size by an order of magnitude, and the hardware implementation "Efficient Inference Engine," which first exploited model compression and weight sparsity in deep learning accelerators. He received best paper awards at ICLR'16 and FPGA'17. He is a recipient of the NSF CAREER Award and MIT Technology Review's Innovators Under 35. Many of his pruning, compression, and acceleration techniques have been integrated into commercial AI chips. He was the co-founder and chief scientist of DeePhi Tech, which was acquired by Xilinx. He earned a PhD in electrical engineering from Stanford University.
"AUTOML FOR TINYML WITH ONCE-FOR-ALL NETWORK" - SONG HAN - MIT APRIL 28, 2020 - TINYML FOUNDATION
AutoML for TinyML
with Once-for-All Network

                Song Han
  Massachusetts Institute of Technology

            Once-for-All, ICLR’20
"AUTOML FOR TINYML WITH ONCE-FOR-ALL NETWORK" - SONG HAN - MIT APRIL 28, 2020 - TINYML FOUNDATION
AutoML for TinyML
                        with Once-for-All Network

                                                                   fewer engineers        small model

many engineers                                                                       less computation
                             large model
                                                           Less Engineer Resources: AutoML
                                                           Less Computational Resources: TinyML

      A lot of computation
                                           Once-for-All, ICLR’20
"AUTOML FOR TINYML WITH ONCE-FOR-ALL NETWORK" - SONG HAN - MIT APRIL 28, 2020 - TINYML FOUNDATION
Our 1st generation solution

                                                                                                                                                                                                                       HAQ: Hardware-aware
                                           DAC ’18, June 24–29, 2018, San Francisco, CA, USA                                                                                                                          Automated                 Quantization
                                                                                                                                                                                                                        Song Han and William J. Dally
                                                                                                                                                  Peaks: our RL agent automatically learns 1x1 convolutions
                                                                                                                                                  have less redundancy and can be pruned less.
                                                                                                                                                                                                                                        [CVPR 2019], oral

                                                                                                                                    Crests: our RL agent automatically learns 3x3 convolutions have

                                       AMC: AutoML for Model Compression
                                                                                                                                    more redundancy and can be pruned more.
                                                                                                                             Residual Block 1          Residual Block 2                Residual Block 3              Residual Block 4

                                                                                                     [ECCV 2018]
                                                                                         Figure 14: The pruning policy (sparsity ratio) given by our reinforcement learning agent for ResNet-50.
                                                                                   50%
                                                                                         50%           ResNet50 Density Pruned by Human Expert                                        neural network. In Proceedings of the 43rd International Symposium on Computer

                                             (#non-zero weights/ total #weights)
                                                                                                       ResNet50 Density Pruned by AMC (the lower the better)                          Architecture. IEEE Press, 243–254.
                                                                                               43%
                                                                                                                                                                                [9]   Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing
                                                                                   38%
                                                                                                                                                                                      deep neural networks with pruning, trained quantization and hu�man coding.
                                                                                                                                                                                      arXiv preprint arXiv:1510.00149 (2015).
                                                                                                     31%         31%
                                                                                                                                                                               [10]   Song Han, Je� Pool, John Tran, and William Dally. 2015. Learning both weights
                                                          Density

                                                                                                                              30%         30%                     29%
                                                                                                           28%         28%
                                                                                   25%
                                                                                                                                                                                      and connections for e�cient neural network. In Advances in Neural Information
                                                                                                                                    23%
                                                                                                                                                      20%               20%
                                                                                                                                                                                      Processing Systems. 1135–1143.
                                                                                                                                                19%
                                                                                                                                                                               [11]   Yihui He and Song Han. 2018. AMC: Automated Model Compression and Accel-
                                                                                   13%
                                                                                                                                                                                      eration with Reinforcement Learning. arXiv preprint arXiv:1802.03494 (2018).
                                                                                                                                                            10%                [12]   Geo�rey Hinton, Oriol Vinyals, and Je� Dean. 2015. Distilling the knowledge in
                                                                                                                                                                                      a neural network. arXiv preprint arXiv:1503.02531 (2015).
                                                                                   0%
                                                                                         Conv1       ResBlock1 ResBlock2 ResBlock3 ResBlock4            FC         Total
                                                                                                                                                                               [13]   Matthew B Johnson and Beth Stevens. 2018. Pruning hypothesis comes of age.
                                                                                                                                                                                      (2018).
Proxyless Neural Architecture Search       Figure 15: Our reinforcement learning agent (AMC) can
                                           prune the model to a lower density than achieved by human
                                                                                                                                                                               [14]   Norman P. Jouppi, Cli� Young, Nishant Patil, David Patterson, Gaurav Agrawal,
                                                                                                                                                                                      and et al. . 2017. In-Datacenter Performance Analysis of a Tensor Processing
                                                                                                                                                                                      Unit. In ISCA.
                                           experts without loss of accuracy. (Human expert: 3.4⇥ com-
             [ICLR 2019]                   pression on ResNet50. AMC : 5⇥ compression on ResNet50.)
                                                                                                                                                                               [15]   Alex Krizhevsky, Ilya Sutskever, and Geo�rey E Hinton. 2012. Imagenet classi�-
                                                                                                                                                                                      cation with deep convolutional neural networks. In NIPS.
                                                                                                                                                                               [16]   Mary Yvonne Lanzerotti, Giovanni Fiorenza, and Rick A Rand. 2005. Micro-
                                           5 CONCLUSION                                                                                                                               miniature packaging and integrated circuitry: The work of EF Rent, with an
"AUTOML FOR TINYML WITH ONCE-FOR-ALL NETWORK" - SONG HAN - MIT APRIL 28, 2020 - TINYML FOUNDATION
Challenge: Efficient Inference on Diverse Hardware Platforms

         Cloud AI                         Mobile AI               Tiny AI (AIoT)

                               less                          less
                             resource                      resource

• Memory: 32GB                     • Memory: 4GB                 • Memory:
"AUTOML FOR TINYML WITH ONCE-FOR-ALL NETWORK" - SONG HAN - MIT APRIL 28, 2020 - TINYML FOUNDATION
Challenge: Efficient Inference on Diverse Hardware Platforms

Training a single model:

for training iterations:
    forward-backward();

Design cost: 200 GPU hours. The design cost is calculated under the assumption of using MobileNet-v2.
Challenge: Efficient Inference on Diverse Hardware Platforms

Searching for a specialized model with conventional NAS:

(1) for search episodes: // meta controller
        for training iterations:              // expensive!
            forward-backward();
        if good_model: break;
    for post-search training iterations:      // expensive!
        forward-backward();

Design cost: 40K GPU hours. The design cost is calculated under the assumption of using MnasNet [1].

[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse hardware platforms span many device generations (2013, 2015, 2017, 2019) and many tiers: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS), …

Repeating the search for every device multiplies the cost:

(2) for many devices:
    (1) for search episodes: // meta controller
            for training iterations:              // expensive!
                forward-backward();
            if good_model: break;
        for post-search training iterations:      // expensive!
            forward-backward();

Design cost (GPU hours), calculated under the assumption of using MnasNet [1]:
• One device: 40K GPU hours → 11.4k lbs CO2 emission
• Four device generations: 160K GPU hours → 45.4k lbs CO2 emission
• Many devices: 1600K GPU hours → 454.4k lbs CO2 emission

1 GPU hour translates to 0.284 lbs CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.
Problem: TinyML (inference) comes at the cost of BigML (training/search).

We need Green AI: solve the environmental problem of NAS.

[Figure: CO2 emission of architecture search. The Evolved Transformer (ICML'19; emission figures from Strubell et al., ACL'19) versus ours, the Hardware-Aware Transformer (ACL'20), at 52 lbs: a reduction of 4 orders of magnitude.]
Our new solution, OFA: Decouple Training and Search

Conventional NAS:

(2) for devices:
    (1) for search episodes:
            for training iterations:
                forward-backward();
            if good_model: break;
        for post-search training iterations:
            forward-backward();

Once-for-All (training and search decoupled):

// training, done once
for OFA training iterations:
    forward-backward();

// search, per device
for devices:
    for search episodes:
        sample from OFA;
        if good_model: break;
    direct deploy without training;
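To make the scaling difference concrete, here is a toy cost model in Python. The 40K-GPU-hour per-device figure is the one quoted on the earlier slides; the split of the OFA side into one-time training plus cheap per-device search is an illustrative assumption, not a number from the talk.

    def conventional_nas_cost(num_devices, cost_per_device=40_000):
        # Conventional NAS repeats the full search and retraining per device.
        return num_devices * cost_per_device

    def ofa_cost(num_devices, train_once=1_200, search_per_device=40):
        # OFA trains the once-for-all network a single time; per-device search
        # only samples sub-networks, so it is far cheaper.
        # (1,200 and 40 GPU hours are illustrative assumptions.)
        return train_once + num_devices * search_per_device

    for n in (1, 4, 40):
        print(n, "devices:", conventional_nas_cost(n), "vs", ofa_cost(n), "GPU hours")

The conventional cost grows linearly with the number of deployment targets, while the OFA cost is dominated by the one-time training term.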
Challenge: Efficient Inference on Diverse Hardware Platforms

With training and search decoupled, one Once-for-All network serves every platform, from Cloud AI (10^12 FLOPS) to Mobile AI (10^9 FLOPS) to Tiny AI (10^6 FLOPS):

// training, done once
for OFA training iterations:
    forward-backward();

// search, per device
for devices:
    for search episodes:
        sample from OFA;
        if good_model: break;
    direct deploy without training;

[Figure: the per-device design costs of conventional NAS (40K → 160K → 1600K GPU hours, i.e. 11.4k → 45.4k → 454.4k lbs CO2 emission) are replaced by a single Once-for-All network. 1 GPU hour translates to 0.284 lbs CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.]
Once-for-All Network: Decouple Model Training and Architecture Design

[Figure, animated across several slides: a single once-for-all network from which one specialized sub-network after another is derived, …]
Challenge: how do we prevent different sub-networks from interfering with each other?
Solution: Progressive Shrinking

• More than 10^19 different sub-networks in a single once-for-all network, covering 4 dimensions: resolution, kernel size, depth, and width.
• Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.

Progressive Shrinking: train the full model → shrink the model (4 dimensions) → jointly fine-tune both large and small sub-networks → once-for-all network.

• Small sub-networks are nested in large sub-networks.
• The training process of the once-for-all network is cast as a progressive shrinking and joint fine-tuning process (sketched below).
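A minimal Python sketch of one joint fine-tuning step, assuming an OFA-style set_active_subnet API and a sample_config helper that draws from the currently allowed (already-shrunk) configuration space; this is an illustration, not the official implementation.

    def finetune_step(ofa_net, images, labels, loss_fn, optimizer, sample_config,
                      num_subnets=4):
        # Each step samples a few sub-networks and accumulates their gradients,
        # so large and small sub-networks are optimized jointly.
        optimizer.zero_grad()
        for _ in range(num_subnets):
            cfg = sample_config()                # e.g. {"ks": 5, "depth": 3, "width": 4}
            ofa_net.set_active_subnet(**cfg)     # assumed OFA-style API
            loss = loss_fn(ofa_net(images), labels)
            loss.backward()                      # gradients accumulate across sub-networks
        optimizer.step()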
Connection to Network Pruning

Network Pruning: train the full model → shrink the model (only width) → fine-tune the small net → a single pruned network.

Progressive Shrinking: train the full model → shrink the model (4 dimensions) → fine-tune both large and small sub-nets → once-for-all network.

• Progressive shrinking can be viewed as generalized network pruning with much higher flexibility across 4 dimensions.
Progressive Shrinking: Elastic Resolution

[Stage diagram: Full → Elastic Resolution → Elastic Kernel Size → Elastic Depth → Elastic Width; resolution is the first dimension made partial.]

Randomly sample the input image size for each batch.
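A minimal PyTorch sketch of elastic resolution, assuming the candidate grid from the paper's Figure 2 (R ∈ [128, 132, …, 224]); the batch is simply resized before the forward pass.

    import random
    import torch.nn.functional as F

    CANDIDATE_SIZES = list(range(128, 225, 4))   # 128, 132, ..., 224

    def elastic_resolution_batch(images):
        # Resize each training batch (NCHW) to a randomly sampled resolution
        # so the network learns to work at any supported input size.
        size = random.choice(CANDIDATE_SIZES)
        return F.interpolate(images, size=(size, size),
                             mode="bilinear", align_corners=False)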
Progressive Shrinking: Elastic Kernel Size

[Stage diagram: Full → Elastic Resolution (partial) → Elastic Kernel Size (partial) → Elastic Depth → Elastic Width.]

7x7 → (25x25 transform matrix) → 5x5 → (9x9 transform matrix) → 3x3

Start with the full kernel size. A smaller kernel takes the centered weights of the larger kernel, passed through a transformation matrix.
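A minimal PyTorch sketch of the elastic kernel for a depthwise convolution; the 25x25 and 9x9 transformation matrices are the ones named on the slide, initialized here to identity. This is an illustration in the spirit of the paper, not the official OFA code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ElasticKernelConv(nn.Module):
        # One 7x7 depthwise kernel whose centered weights also serve 5x5 and
        # 3x3 sub-kernels via learned linear transformation matrices.
        def __init__(self, channels):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(channels, 1, 7, 7) * 0.1)
            self.transform_5 = nn.Parameter(torch.eye(25))   # 5*5 -> 5*5
            self.transform_3 = nn.Parameter(torch.eye(9))    # 3*3 -> 3*3

        def get_kernel(self, ks):
            w = self.weight
            if ks <= 5:
                c = w[:, :, 1:6, 1:6].reshape(-1, 25)        # center 5x5 crop
                w = (c @ self.transform_5).reshape(-1, 1, 5, 5)
            if ks == 3:
                c = w[:, :, 1:4, 1:4].reshape(-1, 9)         # center 3x3 crop
                w = (c @ self.transform_3).reshape(-1, 1, 3, 3)
            return w

        def forward(self, x, ks=7):
            w = self.get_kernel(ks)
            return F.conv2d(x, w, padding=ks // 2, groups=x.size(1))

Note the nesting: the 3x3 kernel is derived from the transformed 5x5 kernel, which is itself derived from the full 7x7 weights, so all kernel sizes share one set of parameters.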
Progressive Shrinking: Elastic Depth

[Stage diagram: Full → Elastic Resolution (partial) → Elastic Kernel Size (partial) → Elastic Depth (partial) → Elastic Width.]

[Figure: unit i is first trained with full depth; the depth is then shrunk step by step, so the unit can produce its output after fewer layers (O1, O2, O3).]

Gradually allow later layers in each unit to be skipped to reduce the depth.
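A minimal PyTorch sketch of an elastic-depth unit: only the first depth layers run, so shallow sub-networks share (and are nested in) the deep network's early layers. block_fn is an assumed factory for whatever block the unit uses.

    import torch.nn as nn

    class ElasticUnit(nn.Module):
        # A unit of up to max_depth layers; at run time only the first
        # `depth` layers execute, skipping the later ones.
        def __init__(self, block_fn, max_depth=4):
            super().__init__()
            self.layers = nn.ModuleList(block_fn() for _ in range(max_depth))

        def forward(self, x, depth=None):
            depth = depth if depth is not None else len(self.layers)
            for layer in self.layers[:depth]:   # keep the first `depth` layers
                x = layer(x)
            return x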
Progressive Shrinking: Elastic Width

[Stage diagram: Full → Elastic Resolution (partial) → Elastic Kernel Size (partial) → Elastic Depth (partial) → Elastic Width (partial).]

[Figure: train with full width; compute a per-channel importance score (e.g. 0.02, 0.15, 0.85, 0.63); sort the channels by importance and reorganize the weights; then progressively shrink the width, keeping only the highest-ranked channels.]

Gradually shrink the width. Keep the most important channels when shrinking, via channel sorting.
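A minimal PyTorch sketch of channel sorting; the paper ranks channels by the L1 norm of their weights. The returned permutation must also be applied to the next layer's input channels so the network's function is preserved.

    import torch

    def sort_channels_by_importance(conv_weight):
        # conv_weight: (C_out, C_in, kH, kW). Rank output channels by L1 norm
        # and reorder so the most important channels come first; a width-shrunk
        # sub-network then simply keeps the first k channels.
        importance = conv_weight.abs().sum(dim=(1, 2, 3))
        order = torch.argsort(importance, descending=True)
        return conv_weight[order], order

Keeping a width of k is then just conv_weight[order][:k], plus the matching slice of the next layer's input-channel dimension.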
Progressive Shrinking: Putting It Together

[Figure 2 of the ICLR 2020 paper: illustration of the progressive shrinking process to support different depth D, width W, kernel size K, and resolution R. Train the full network (K = 7, D = 4, W = 6) with elastic resolution R ∈ [128, 132, …, 224]. Elastic kernel size (D = 4, W = 6): sample K ∈ [7, 5, 3] at each layer, generate kernel weights (Fig. 3), fine-tune the weights and transformation matrices. Elastic depth (W = 6, K ∈ [7, 5, 3]): sample D ∈ [4, 3] and then [4, 3, 2] at each unit along with K, keep the first D layers and skip the top 4 - D (Fig. 3), fine-tune the weights. Elastic width (D ∈ [4, 3, 2], K ∈ [7, 5, 3]): channel sorting (Fig. 4), sample W ∈ [6, 4] and then [6, 4, 3] at each layer along with K and D, fine-tune the weights. The result is the Once-for-All Network, a large space comprising more than 10^19 diverse sub-networks.]
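The whole schedule can be written down compactly as data. A Python sketch, assuming candidate sets matching the figure (K ∈ [7, 5, 3], D ∈ [4, 3, 2], W ∈ [6, 4, 3], R ∈ [128, …, 224]); a later stage samples from strictly more choices than an earlier one, which is the "progressive" part.

    import random

    STAGES = [
        {"ks": [7],       "depth": [4],       "width": [6]},        # full network
        {"ks": [7, 5, 3], "depth": [4],       "width": [6]},        # + elastic kernel
        {"ks": [7, 5, 3], "depth": [4, 3],    "width": [6]},        # + elastic depth (stage 1)
        {"ks": [7, 5, 3], "depth": [4, 3, 2], "width": [6]},        # + elastic depth (stage 2)
        {"ks": [7, 5, 3], "depth": [4, 3, 2], "width": [6, 4]},     # + elastic width (stage 1)
        {"ks": [7, 5, 3], "depth": [4, 3, 2], "width": [6, 4, 3]},  # + elastic width (stage 2)
    ]
    RESOLUTIONS = list(range(128, 225, 4))   # elastic resolution, active throughout

    def sample_config(stage):
        # Draw one sub-network configuration from the space allowed at `stage`.
        s = STAGES[stage]
        return {"resolution": random.choice(RESOLUTIONS),
                "ks": random.choice(s["ks"]),
                "depth": random.choice(s["depth"]),
                "width": random.choice(s["width"])}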
Performance of Sub-networks on ImageNet

[Figure: ImageNet top-1 accuracy of sub-networks trained with and without progressive shrinking (PS), across configurations D ∈ {2, 4}, W ∈ {3, 6}, K ∈ {3, 7} (D: depth, W: width, K: kernel size). PS improves top-1 accuracy by 2.5% to 3.7% in every configuration.]

• Progressive shrinking consistently improves the accuracy of sub-networks on ImageNet.
How about search?

Training and search are decoupled:

    # training: done once (expensive)
    for iteration in OFA_training_iterations:
        forward_backward()

    # search: repeated cheaply for every device / deployment scenario
    for device in devices:
        for episode in search_episodes:
            model = sample_from_OFA()        # e.g., with evolutionary search
            if good_model(model):
                break
        deploy(model)                        # direct deploy, without training
How to evaluate if good_model? — by Model Twin

Accuracy/latency predictors, RMSE ~0.2%:

[Diagram: sub-networks sampled from the OFA network build an accuracy dataset [architecture, accuracy] and a latency dataset [architecture, latency]; these train an accuracy prediction model and a latency prediction model, which drive a predictor-based architecture search that outputs the specialized sub-network.]
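
With the two predictors in hand, the search loop never trains or even benchmarks a model. A minimal evolutionary-search sketch; `acc_predictor`, `lat_predictor`, `random_arch`, and `mutate` are illustrative stand-ins for the trained model twins and OFA design-space helpers:

    import random

    def evolutionary_search(acc_predictor, lat_predictor, random_arch, mutate,
                            latency_limit, population=100, generations=30):
        """Maximize predicted accuracy subject to a latency constraint."""
        pool = [random_arch() for _ in range(population)]
        best = None
        for _ in range(generations):
            feasible = [a for a in pool if lat_predictor(a) <= latency_limit]
            feasible.sort(key=acc_predictor, reverse=True)
            if feasible and (best is None or acc_predictor(feasible[0]) > acc_predictor(best)):
                best = feasible[0]
            # Next generation: keep the top quarter, fill with mutated children
            parents = feasible[: population // 4] or [random_arch()]
            pool = parents + [mutate(random.choice(parents))
                              for _ in range(population - len(parents))]
        return best   # may be None if the latency limit is never met

The found architecture is then extracted from the OFA network and deployed directly, without retraining.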
2.6x faster than EfficientNet, 1.5x faster than MobileNetV3

[Plots: ImageNet top-1 accuracy vs. Google Pixel 1 latency. Left, OFA (78.7–80.1%) vs. EfficientNet (76.3–79.8%): OFA matches EfficientNet-level accuracy (~79.8% top-1) at 2.6× lower latency, and is 3.8% more accurate (80.1% vs. 76.3%) at comparable latency. Right, OFA (71.4–76.4%) vs. MobileNetV3 (67.4–75.2%): OFA matches MobileNetV3's accuracy at 1.5× lower latency, and is 4% more accurate (71.4% vs. 67.4%) at the same latency.]
More accurate than training from scratch

[Same accuracy–latency plots as above, with an additional curve for the same OFA architectures trained from scratch rather than inheriting weights from the once-for-all network.]

• Training from scratch cannot achieve the same level of accuracy.
OFA: 80% Top-1 Accuracy on ImageNet

[Scatter plot: ImageNet top-1 accuracy (higher is better) vs. MACs in billions (lower is better); marker size indicates model size. OFA (595M MACs, 80.0% top-1) sits above both handcrafted models (MobileNetV1/V2, ShuffleNet, IGCV3-D, InceptionV2/V3, Xception, ResNet-50/101, ResNeXt-50/101, DenseNet-121/169/264, DPN-92) and AutoML-designed models (MobileNetV3, ProxylessNAS, DARTS, PNASNet, NASNet-A, AmoebaNet), with up to 14× less computation.]

• Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile vision setting (< 600M MACs).
OFA Enables Fast Specialization on Diverse Hardware Platforms

[Six accuracy–latency plots comparing OFA, MobileNetV3, and MobileNetV2. Top row: Samsung S7 Edge, Google Pixel 2, and LG G8 mobile CPUs. Bottom row: NVIDIA 1080Ti GPU (batch size 64), Intel Xeon CPU (batch size 1), and Xilinx ZU3EG FPGA (batch size 1, quantized). On every platform and at every latency target, the specialized OFA sub-network is more accurate than both baselines.]
We need Green AI: Solve the Environmental Problem of NAS

How to save CO2 emission:
1. Once-for-All: amortize the search cost across many sub-networks and deployment scenarios (Once-for-All, ICLR'20).
2. Lite Transformer: human-in-the-loop design; apply human insights about hardware & ML rather than "just search it" (Lite Transformer, ICLR'20).
OFA has broad applications

•   Efficient Transformer

•   Efficient Video Recognition

•   Efficient 3D Vision

•   Efficient GAN Compression
OFA’s Application: Hardware-Aware Transformer

Efficient NLP on mobile devices enables real-time conversation between speakers of different languages:

    "Nice to meet you" → "Encantada de conocerte" (Spanish)
                         "만나서 반갑습니다" (Korean)
                         "Freut mich, dich kennenzulernen" (German)

Figure 9: The design cost, measured in pounds of CO2 emission. Our framework for searching HAT reduces the search cost by four orders of magnitude compared with Evolved Transformer (So et al., 2019).

    Source                               CO2 emission (lbs)
    Human Life (avg., 1 year)                        11,023
    American Life (avg., 1 year)                     36,156
    US Car w/ Fuel (avg., 1 lifetime)               126,000
    Evolved Transformer                             626,155
    HAT (Ours)                                           52   (12,041× less)

Table 5: K-means quantization of HAT models on WMT'14 En-Fr. 4-bit quantization reduces model size by 25× with only a 0.1 BLEU loss relative to the Transformer baseline; 8-bit quantization even increases BLEU by 0.1 over the float version.

    Model                  BLEU    Model Size    Reduction
    Transformer Float32    41.2    705MB         –
    HAT Float32            41.8    227MB         3×
    HAT 8 bits             41.9    57MB          12×
    HAT 4 bits             41.1    28MB          25×

HAT in summary: 3.7× smaller model size at the same performance on WMT'14 En-De; 3×, 1.6×, and 1.5× faster than the Transformer baseline on Raspberry Pi, CPU, and GPU; 12,000× less CO2 than Evolved Transformer.

HAT, ACL'20
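
As an aside on Table 5, k-means weight quantization can be sketched in a few lines. This toy version uses scikit-learn and illustrates the idea only; it is not HAT's implementation:

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_quantize(weights: np.ndarray, bits: int = 4):
        """Cluster weights into 2**bits centroids; store a small codebook
        plus one `bits`-bit index per weight instead of float32 values."""
        km = KMeans(n_clusters=2 ** bits, n_init=4).fit(weights.reshape(-1, 1))
        codebook = km.cluster_centers_.ravel()        # 2**bits float centroids
        indices = km.labels_.astype(np.uint8)         # one small index per weight
        dequantized = codebook[indices].reshape(weights.shape)
        return codebook, indices, dequantized

    # Example: 4-bit quantization of a random weight matrix (16 centroids)
    w = np.random.randn(256, 256).astype(np.float32)
    codebook, idx, w_q = kmeans_quantize(w, bits=4)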
OFA’s Application: Efficient Video Recognition

[Plot: Kinetics top-1 accuracy vs. computation (GFLOPs), comparing OFA + TSM (large and small) against ResNet50 + TSM, ResNet50 + I3D, and MobileNetV2 + TSM.]

• 7× less computation, same performance as TSM + ResNet50
• Same computation, 3% higher accuracy than TSM + MobileNetV2

TSM, ICCV'19
OFA’s Application: Efficient 3D Recognition

Today's 3D perception is compute-hungry: AR/VR needs a whole backpack of computers, and self-driving needs a whole trunk of GPUs.

[Plot: accuracy vs. latency tradeoff.]

• 4× FLOPs reduction and 2× speedup over MinkowskiNet
• 3.6% better accuracy under the same computation budget
• Follow-up of PVCNN, NeurIPS'19 (spotlight)
OFA’s Application: GAN Compression

• 8–21× FLOPs reduction on CycleGAN, Pix2pix, GauGAN
• 1.7–18.5× speedup on CPU/GPU & mobile CPU/GPU

GAN Compression, CVPR'20
Summary: Once-for-All Network
• We introduce the once-for-all network for efficient inference on diverse hardware platforms.
• We present an effective progressive shrinking approach for training once-for-all networks.

Progressive Shrinking: train the full model → shrink the model in 4 dimensions → fine-tune both large and small sub-nets → once-for-all network

• Once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios,
  setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
    • First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV’19
    • First place in the 4th Low-Power Computer Vision Challenge @NeurIPS’19, both classification & detection.

• Released 50+ different pre-trained OFA models on diverse hardware platforms (CPU/GPU/FPGA/DSP).
        net, image_size = ofa_specialized(net_id, pretrained=True)

• Released the training code & pre-trained OFA network that provides diverse sub-networks without training.
        ofa_network = ofa_net(net_id, pretrained=True)
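
  A slightly fuller usage sketch, following the README of the released repo (github.com/mit-han-lab/once-for-all) at the time of the talk; the net_id strings and method names below come from that README, so check the repo for the current API:

    from ofa.model_zoo import ofa_net, ofa_specialized

    # A specialized sub-network, already fine-tuned for one deployment scenario
    net, image_size = ofa_specialized('flops@595M_top1@80.0_finetune@75',
                                      pretrained=True)

    # The full once-for-all network: extract sub-networks without any training
    ofa_network = ofa_net('ofa_mbv3_d234_e346_k357_w1.0', pretrained=True)
    ofa_network.sample_active_subnet()                 # random (K, D, W) config
    subnet = ofa_network.get_active_subnet(preserve_weight=True)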

                              Project Page: https://ofa.mit.edu
References
    Model Compression & NAS:
-   Once-For-All: Train One Network and Specialize It for Efficient Deployment, ICLR’20
-   ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware, ICLR’19
-   APQ: Joint Search for Network Architecture, Pruning and Quantization Policy, CVPR’20
-   HAQ: Hardware-Aware Automated Quantization with Mixed Precision, CVPR’19
-   Defensive Quantization: When Efficiency Meets Robustness, ICLR’19
-   AMC: AutoML for Model Compression and Acceleration on Mobile Devices, ECCV’18

  Efficient Vision:
- GAN Compression: Learning Efficient Architectures for Conditional GANs, CVPR’20
- TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
- PVCNN: Point Voxel CNN for Efficient 3D Deep Learning, NeurIPS’19

  Efficient NLP:
- Lite Transformer with Long Short Term Attention, ICLR’20
- HAT: Hardware-aware Transformer, ACL’20

  Hardware & EDA:
- SpArch: Efficient Architecture for Sparse Matrix Multiplication, HPCA’20
- Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning, DAC’20

Make AI Efficient:
• Tiny Computational Resources
• Tiny Human Resources

Website: songhan.mit.edu    github.com/mit-han-lab   youtube.com/c/MITHANLab
Diverse Hardware Platforms, 50+ Pretrained Models are Released
OFA for FPGA Accelerators

[Bar charts, measured on the Xilinx ZU3EG FPGA, comparing MobileNetV2, MnasNet, and OFA (Ours): OFA achieves 40% higher arithmetic intensity (OPS/Byte) and 57% higher throughput (GOPS/s).]
• Non-specialized neural networks do not fully utilize the hardware resources; there is large room for
  improvement via neural network specialization.

Copyright Notice
This presentation in this publication was presented as a tinyML® Talks webcast. The content reflects the
opinion of the author(s) and their respective companies. The inclusion of presentations in this
publication does not constitute an endorsement by tinyML Foundation or the sponsors.

There is no copyright protection claimed by this publication. However, each presentation is the work of
the authors and their respective companies and may contain copyrighted material. As such, it is strongly
encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions
regarding the use of any materials presented should be directed to the author(s) or their companies.

tinyML is a registered trademark of the tinyML Foundation.

                                   www.tinyML.org