Binary Ensemble Neural Network: More Bits per Network or More Networks per Bit?

mPP Journal Club
August 6 2020

Sioni Summers
Introduction & Context

●   We recently published a paper on Binary & Ternary Neural Networks (BNNs, TNNs) in hls4ml [1]
●   BNNs & TNNs can be computed very efficiently → multiplication becomes a boolean ‘xnor’ & accumulation becomes counting ‘1s’ (see the sketch after the references below)
     –   Multipliers are the critical FPGA resource for NNs with more than a few bits for weights & activations
●   However, with the same architecture (number of neurons and layers) as a float-precision reference, performance can drop a lot
     –   Common approach → increase the network size
     –   Balance the accepted performance loss vs. compute efficiency
●   This paper [2] introduces ‘BENN’: Binary Ensemble Neural Networks
     –   Instead of increasing the network size, use several smaller networks combined with ensemble methods

●   [1] https://iopscience.iop.org/article/10.1088/2632-2153/aba042/meta
●   [2] https://ieeexplore.ieee.org/document/8954129
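
A minimal numpy sketch (mine, not from the slides or the BENN paper) of why binary multiply-accumulate reduces to ‘xnor’ plus bit counting: for vectors with entries in {+1, -1} encoded as bits {1, 0}, the products are +1 exactly where the bits agree, so the dot product equals 2 × popcount(xnor) − N.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 16
    w = rng.choice([-1, +1], size=N)   # binary weights
    x = rng.choice([-1, +1], size=N)   # binarized activations

    ref = int(np.dot(w, x))            # ordinary multiply-accumulate

    w_bits = (w > 0).astype(np.uint8)  # encode +1 -> 1, -1 -> 0
    x_bits = (x > 0).astype(np.uint8)

    xnor = 1 - (w_bits ^ x_bits)       # 1 where the bits agree
    dot = 2 * int(xnor.sum()) - N      # 'popcount' of the xnor result

    assert dot == ref                  # the two formulations agree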

Introduction & Context

●   “In this paper, we investigate BNNs systematically in terms of representation power, speed, bias, variance, stability, and their robustness. We find that BNNs suffer from severe intrinsic instability and non-robustness regardless of network parameter values. What implied by this observation is that the performance degradation of BNNs are not likely to be resolved by solely improving the optimization techniques; instead, it is mandatory to cure the BNN function, particularly to reduce the prediction variance and improve its robustness to noise”

Ensemble Methods

●   Combine multiple weak learners into a strong one
     –   Bagging & Boosting are the most common methods
●   Paraphrasing the paper: ensemble techniques are not normally useful for NNs, since NNs are not ‘weak classifiers’, but BNNs are :D
●   Bagging (Bootstrap aggregating), sketched below:
     –   Train different learners (BNNs) independently
     –   Bootstrapping → randomly draw samples from the dataset, with replacement, for each learner
     –   Train each learner on its sample, then aggregate the predictions (hard voting or soft voting)

                 https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
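
A minimal bagging sketch (mine, not the paper's code), with a scikit-learn decision tree standing in for a BNN weak learner and an artificial dataset: bootstrap-resample the training set, fit one learner per resample, and soft-vote by averaging the predicted class probabilities.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    n_learners = 5
    probas = []
    for _ in range(n_learners):
        # Bootstrapping: draw a training sample of the same size, with replacement
        idx = rng.integers(0, len(X_tr), size=len(X_tr))
        learner = DecisionTreeClassifier(max_depth=3).fit(X_tr[idx], y_tr[idx])
        probas.append(learner.predict_proba(X_te))

    # Soft voting: average the probabilities, then take the most likely class
    y_pred = np.mean(probas, axis=0).argmax(axis=1)
    print("bagged accuracy:", (y_pred == y_te).mean())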

Ensemble Methods

●   Boosting is a sequential fitting of the weak learners (sketched below)
     –   Samples which are badly predicted in one round are given more weight in the next → focus on the difficult examples
●   AdaBoost (adaptive boosting), as used in the BENN paper, aggregates the weak learners (BNNs)
●   Using NNs, you need to choose between a random initialisation of each learner and a ‘warm start’ from the previous round (the paper tries both)

               https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
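
A sketch of the AdaBoost reweighting loop (the classic discrete update, not necessarily the exact variant used in the BENN paper), again with decision stumps standing in for BNNs: badly predicted samples get more weight before the next round, and the final prediction is an alpha-weighted vote of the weak learners.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
    y_pm = 2 * y - 1                        # labels in {-1, +1}

    w = np.full(len(X), 1.0 / len(X))       # uniform sample weights to start
    learners, alphas = [], []
    for _ in range(5):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = 2 * stump.predict(X) - 1
        err = w[pred != y_pm].sum()         # weighted error of this round
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y_pm * pred)   # up-weight the misclassified samples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)

    # Aggregate: alpha-weighted vote of the weak learners
    score = sum(a * (2 * l.predict(X) - 1) for a, l in zip(alphas, learners))
    print("train accuracy:", ((score > 0).astype(int) == y).mean())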

Training BNNs, TNNs

●   Binary Neural Networks use 1 bit for the weights: {+1, -1}. TNNs additionally allow ‘0’
●   The {+1, -1} values are encoded as {1, 0} → use 1 bit, ‘*’ = ‘xnor’
●   We use the ‘Straight Through Estimator’ (STE) to train (see the sketch after this list):
     –   Weights are actually continuous floating-point values
     –   Clip & round them during the forward pass
     –   Compute the gradient with respect to the floating-point weight, and update the FP weight
●   Activations can also be 1 or 2 bits, but we see better performance with ReLU using a few bits (>~4)
●   In our paper, we used a Bayesian optimisation of hyperparameters to find the best model size → 7x more neurons per hidden layer (7² more ‘*’)
●   (The BENN paper only uses BNNs, but the technique should apply to TNNs too)
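
A minimal TensorFlow sketch of the STE trick (an illustrative stand-in, not the hls4ml/QKeras implementation): the forward pass sees binarized weights, while the gradient flows to the underlying float weights as if the binarization were the clipped identity.

    import tensorflow as tf

    def binarize_ste(w):
        # Forward: clip to [-1, 1] and take the sign (sign(0) treated as +1)
        w_clipped = tf.clip_by_value(w, -1.0, 1.0)
        w_bin = tf.where(w_clipped >= 0, 1.0, -1.0)
        # Backward: stop_gradient hides (w_bin - w_clipped) from autodiff,
        # so the gradient is that of the clipped identity
        return w_clipped + tf.stop_gradient(w_bin - w_clipped)

    w = tf.Variable([[-1.7, 0.3], [0.8, -0.2]])
    x = tf.constant([[1.0, 2.0]])
    with tf.GradientTape() as tape:
        y = tf.reduce_sum(tf.matmul(x, binarize_ste(w)))
    grad = tape.gradient(y, w)
    print(grad.numpy())   # non-zero where |w| <= 1, zero where the clip saturates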

Training BNNs, TNNs

●   From our paper [1]:
     –   Experiments with MNIST
     –   Experiments with jet high-level features

BNN Training problems

●   Non-robustness (overfitting)
●   BNNs show huge variation in loss and accuracy while training
●   (I think) this has a lot to do with 1) the extreme quantization and 2) the STE
●   In the literature this is called ‘gradient mismatch’
●   (Stochastic rounding, as in QKeras, is supposed to help with this; a sketch follows)
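
A tiny numpy sketch of the stochastic-rounding idea (the concept behind the QKeras option mentioned above, not QKeras's actual code): round up with probability equal to the fractional part, so the quantisation is unbiased in expectation.

    import numpy as np

    def stochastic_round(x, rng):
        floor = np.floor(x)
        frac = x - floor
        # Round up with probability equal to the fractional part
        return floor + (rng.random(x.shape) < frac)

    rng = np.random.default_rng(0)
    x = np.full(100_000, 0.3)
    print(stochastic_round(x, rng).mean())   # ~0.3, vs. 0.0 for deterministic rounding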

Experiments

●   Datasets: CIFAR-10, ImageNet
●   Many different flavours of BNN were tested
●   Generally, keeping some layers (in particular the 1st and last) at higher precision has been seen to help a lot, while keeping most of the compute savings

●   In the hls4ml BNN+TNN paper [1], we used something like ‘AB’ (= “BNN”) and ‘AQB’ (= “Hybrid BNN”)

Results

●   ‘-Indep’ = independent trainings, vs. ‘-Seq’ = warm start
●   ‘BENN-SB-5’ = 5 learners of type SB; the x-axis shows the boosting method

●   ‘SB’ better than ‘AB’ (no surprise)
●   BENN-SB-5 and BENN-SB-32 can do as well as the full-float model

Results - ImageNet

●   On the ImageNet dataset with the AlexNet model, the ‘BENN-SB-6, Boosting’ model does nearly as well as the Full-Precision one
●   On ImageNet with ResNet-18, the gap between Full-Precision and BENN is wider

BENN is hardware friendly

●   “BENN is hardware friendly: Using BENN with K ensembles is better than using one K-bit classifier”
●   I don’t think they can really conclude this from their study
     –   The only ‘K-bit’ classifier is the ‘full-precision’ one, which uses floats and is not really a fair comparison
     –   They didn’t include results for low-bitwidth QNNs to compare with, e.g., the 3-, 5-, and 6-member BENNs
     –   It would be interesting to compare against something like few-bit QKeras models
●   That said, an ensemble of N neural networks involves fewer operations (MACCs) than one NN with its neuron count increased N times (see the counting sketch below)
●   So the method could be very useful for cases which must use BNNs, but where the model size/architecture is free
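
A back-of-the-envelope counting sketch of the point above (the layer sizes are made up for illustration): widening every hidden layer of an MLP by a factor N grows the hidden-to-hidden MACCs by N², while an ensemble of N copies of the base network only grows the total by N.

    def mlp_maccs(layer_sizes):
        # Multiply-accumulates of a fully-connected network with these layer widths
        return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

    base = [16, 64, 64, 64, 5]   # illustrative architecture
    N = 7

    ensemble = N * mlp_maccs(base)
    widened = mlp_maccs([base[0]] + [N * n for n in base[1:-1]] + [base[-1]])

    print("base network   :", mlp_maccs(base))   # 9,536
    print("ensemble of 7  :", ensemble)           # 66,752
    print("7x wider hidden:", widened)            # 410,816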

Paper Conclusion

●   “In this paper, we proposed BENN, a novel neural network architecture which marries BNN with ensemble methods. The experiments showed a large performance gain in terms of accuracy, robustness, and stability. Our experiments also reveal some insights about trade-offs on bit width, network size, number of ensembles, etc. We believe that by leveraging specialized hardware such as FPGA and more advanced modern ensemble techniques with less overfitting, BENN can be a new dawn for deploying large deep neural networks into mobile and embedded systems.”

My Conclusion

●   It’s a cool idea to increase the performance and robustness of BNNs, which are known to be difficult to train and lossy compared to full precision
●   For me, it’s still not clear in the 1- to ~8-bit range whether a k-ensemble BENN or a k-bit QNN is better
●   But in a case which must use BNNs, it certainly seems BENN can ‘boost’ performance ;)

●   I tried to run my own tests with QKeras and scikit-learn boosting methods (a sketch of the wrapper follows this list)
     –   TF Keras has a ‘KerasClassifier’ wrapper for scikit-learn
     –   It doesn’t support ‘sample_weight’, so I had to add another wrapper of my own…
     –   sklearn BaggingClassifier worked; performance was slightly worse than a 4-bit QNN
     –   sklearn AdaBoost didn’t train → random performance, not sure why
     –   Need to fiddle with parameters and try again!
     –   Here’s what I tried: https://gist.github.com/thesps/b0c3d1636d5f3d7d8c35391e0155d592
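
A sketch of the kind of extra wrapper described above (my guess at the approach, not the author's gist): a scikit-learn compatible classifier whose fit() exposes ‘sample_weight’ explicitly, so sklearn's bagging and boosting utilities detect it and pass the weights through to Keras. The model returned by build_fn is a placeholder; a QKeras binary model would slot in there instead.

    import numpy as np
    import tensorflow as tf
    from sklearn.base import BaseEstimator, ClassifierMixin

    class KerasSampleWeightClassifier(BaseEstimator, ClassifierMixin):
        def __init__(self, build_fn=None, epochs=10, batch_size=128):
            self.build_fn = build_fn
            self.epochs = epochs
            self.batch_size = batch_size

        def fit(self, X, y, sample_weight=None):
            # The explicit sample_weight argument is what sklearn's boosting
            # and bagging classes look for before passing weights along
            self.classes_ = np.unique(y)
            self.model_ = self.build_fn()
            self.model_.fit(X, y, sample_weight=sample_weight,
                            epochs=self.epochs, batch_size=self.batch_size,
                            verbose=0)
            return self

        def predict_proba(self, X):
            p = self.model_.predict(X)
            # Handle a single sigmoid output as well as a softmax output
            return np.hstack([1.0 - p, p]) if p.shape[1] == 1 else p

        def predict(self, X):
            return self.classes_[self.predict_proba(X).argmax(axis=1)]

    def build_fn():
        # Placeholder dense model; swap in a (Q)Keras binary model here
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu", input_shape=(16,)),
            tf.keras.layers.Dense(5, activation="softmax"),
        ])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        return model

An instance, e.g. KerasSampleWeightClassifier(build_fn=build_fn), can then be handed to scikit-learn's BaggingClassifier or AdaBoostClassifier as the base estimator.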