Vector Engine and AI The NEC SX-Aurora TSUBASA - Dr. Erich Focht, NEC Deutschland GmbH April 2019

Page created by Amy Sullivan
 
Vector Engine and AI
The NEC SX-Aurora TSUBASA

Dr. Erich Focht, NEC Deutschland GmbH
April 2019

1      © NEC Corporation 2018
NEC Vector Supercomputers: High Sustained Performance

  Bytes/FLOP          Milestone
  8.23 Bytes/FLOP     First >1 GFLOPS (Tadashi Watanabe; multi-lane pipelines, vector caches)
  8 Bytes/FLOP        CMOS, air cooled
  4 Bytes/FLOP        Single-chip vector processor (Earth Simulator)
  2 Bytes/FLOP        ADB vector cache
  1 Byte/FLOP         Multi-core vector SoC (best HPCG efficiency! 10%)
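
The Bytes/FLOP figures on this slide express machine balance: memory bandwidth divided by peak floating-point rate. A minimal sketch of that arithmetic follows, together with a simple roofline-style bound. The function names and the bandwidth and peak numbers are hypothetical placeholders chosen for illustration, not NEC specifications.

```python
def bytes_per_flop(mem_bandwidth_gb_s: float, peak_gflops: float) -> float:
    """Machine balance: bytes the memory system can deliver per peak FLOP."""
    return mem_bandwidth_gb_s / peak_gflops


def attainable_gflops(mem_bandwidth_gb_s: float,
                      peak_gflops: float,
                      kernel_flops_per_byte: float) -> float:
    """Roofline-style bound: a kernel is limited by compute or by memory traffic."""
    return min(peak_gflops, mem_bandwidth_gb_s * kernel_flops_per_byte)


# Hypothetical machine: 1000 GB/s memory bandwidth, 2000 GFLOPS peak.
balance = bytes_per_flop(1000.0, 2000.0)

# A streaming kernel that performs 0.25 FLOP per byte moved is bandwidth-bound.
memory_bound = attainable_gflops(1000.0, 2000.0, 0.25)

# A kernel with 10 FLOP per byte is limited by the peak rate instead.
compute_bound = attainable_gflops(1000.0, 2000.0, 10.0)
```

Under this simple model, a higher Bytes/FLOP ratio lets kernels with low arithmetic intensity run closer to peak, which is the sense of "high sustained performance" in the slide title.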
ML / AI

Frovedis / Apache Spark
Torch + PyTorch
TensorFlow
Network Optimizer
Storage
Frovedis: FRamework Of VEctorized and DIStributed data analytics
       ▌C++ framework similar to Spark
         Supports Spark/Python interface
       ▌MPI is used for high performance communication
       ▌Optimized for SX-Aurora TSUBASA (also works on x86)
       ▌Open Source! github.com/frovedis

       Software stack (top to bottom):
         Spark / Python Interface
         Matrix Library | Machine Learning | DataFrame
         Frovedis Core
9     © NEC Corporation 2019
Machine Learning Library
 Implemented with Frovedis Core and Matrix Library
      Supports both dense and sparse data
      Sparse data support is important in large-scale machine learning

 ▌Supported algorithms:
      Linear model
       • Logistic Regression
       • Multinomial Logistic Regression
       • Linear Regression
       • Linear SVM
      ALS
      K-means
      Preprocessing
       • SVD, PCA
      Word2vec
      Factorization Machines
      Decision Tree
      Naïve Bayes
      Graph algorithms
       • Shortest Path, PageRank, Connected Components

 ▌Under development:
      Frequent Pattern Mining
      Spectral Clustering
      Hierarchical Clustering
      Latent Dirichlet Allocation
      Deep Learning (MLP, CNN)
      Random Forest
      Gradient Boosting Decision Tree

 ▌We will support more!
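
To illustrate why sparse data support matters, here is a minimal sketch of a sparse matrix-vector product that touches only the stored non-zeros. It uses the compressed sparse row (CSR) layout, which is one common format chosen here for illustration; the slides do not say which sparse formats Frovedis uses, and the function names are hypothetical.

```python
def dense_to_csr(rows):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x, visiting only the stored non-zero entries."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


dense = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
]
values, col_idx, row_ptr = dense_to_csr(dense)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0, 1.0])
```

For a doc-term or click-through matrix where almost every entry is zero, the work and memory traffic scale with the number of non-zeros rather than with rows times columns.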
DataFrame

 ▌Supports an interface similar to Spark DataFrame
      Select, Filter, Sort, Join, Group by/Aggregate
      (SQL interface is not supported yet)

 ▌Implemented as a distributed column store
      Each column is represented as a distributed vector
      Each operation only scans argument columns:
       other columns are created when necessary
       (late materialization)
      Reduces size of data to access

 [Figure: a table with columns A, B, C, D; each column is split across rank #0, rank #1, rank #2]
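
The column-store idea with late materialization can be sketched in a few lines. This is a single-process toy, not Frovedis code: the class `ColumnTable` and its methods are hypothetical names, and plain Python lists stand in for distributed vectors.

```python
class ColumnTable:
    """Toy column store: each column is its own list (stand-in for a distributed vector)."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def filter(self, column, predicate):
        """Scan only `column`; return matching row indices instead of copied rows."""
        return [i for i, v in enumerate(self.columns[column]) if predicate(v)]

    def materialize(self, row_ids, names):
        """Build the requested columns only when they are actually needed."""
        return {n: [self.columns[n][i] for i in row_ids] for n in names}


table = ColumnTable({
    "A": [1, 2, 3, 4],
    "B": [10, 20, 30, 40],
    "C": ["w", "x", "y", "z"],
    "D": [0.1, 0.2, 0.3, 0.4],
})

row_ids = table.filter("B", lambda v: v > 15)    # reads column B only
result = table.materialize(row_ids, ["A", "D"])  # reads columns A and D only
```

Column C is never read in this example, which is the point of the design: an operation pays only for the columns it names.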
Spark / Python Interface

     ▌Writing C++ programs is sometimes tedious, so we created a wrapper
      interface to Spark
      Call the framework through the same Spark API
      Users do not have to be aware of vector hardware

     ▌Implementation: created a server that exposes the framework's functionality
      Receives RPC requests from Spark and executes the ML algorithm, etc.
      Only pre-built algorithms can be used from Spark

     ▌Other languages can also be supported by this architecture
      Currently Python is supported (scikit-learn API)
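
The architecture described on this slide, a thin client that forwards requests to a server holding pre-built algorithms, can be sketched as follows. Everything here is hypothetical and in-process: `handle_request`, `PREBUILT`, and `MeanRegressor` are illustrative names, JSON strings stand in for the RPC wire format, and none of it reflects the actual Frovedis API.

```python
import json


def _fit_mean(params):
    """A trivially simple 'pre-built algorithm': predict the mean of y."""
    ys = params["y"]
    return {"mean": sum(ys) / len(ys)}


# Hypothetical server side: only algorithms registered here can be called.
PREBUILT = {"mean_regressor.fit": _fit_mean}


def handle_request(raw):
    """Decode a request, dispatch to a pre-built algorithm, encode the reply."""
    req = json.loads(raw)
    fn = PREBUILT.get(req["method"])
    if fn is None:
        return json.dumps({"error": "unknown method: " + req["method"]})
    return json.dumps({"result": fn(req["params"])})


class MeanRegressor:
    """Client-side wrapper with a scikit-learn-style fit/predict surface."""

    def __init__(self, transport=handle_request):
        # A real client would send the request over the network instead.
        self._send = transport
        self.mean_ = None

    def fit(self, X, y):
        request = {"method": "mean_regressor.fit", "params": {"y": list(y)}}
        reply = json.loads(self._send(json.dumps(request)))
        self.mean_ = reply["result"]["mean"]
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


model = MeanRegressor().fit([[0], [1], [2]], [1.0, 2.0, 3.0])
predictions = model.predict([[5], [6]])

# A method that was not pre-built on the server is rejected.
unknown = json.loads(handle_request(json.dumps({"method": "custom.fit", "params": {}})))
```

The caller sees only `fit` and `predict`, so the code looks the same whether the server runs on x86 or on a Vector Engine. The trade-off is the one the slide names: only algorithms already registered on the server can be called.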
Performance Evaluation: Machine Learning
▌Xeon (Gold 6126) 1 socket vs 1 VE, with sparse data (w/o I/O)
     LR uses CTR data provided by Criteo (1/4 of the original, 6GB)
     K-means and SVD used Wikipedia doc-term matrix (10GB)
     Spark version: 2.2.1

Workloads
▌ Web ads optimization (Logistic regression)
▌ Document clustering (K-means)
▌ Recommendation (Singular value decomposition)

Speed Up (Spark = 1)
                    LR       K-means    SVD
  Spark/x86         1        1          1
  Frovedis/x86      10.6     8.8        5.3
  Frovedis/VE       113.2    42.8       56.8
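
Because every bar is normalized to Spark = 1, dividing the Frovedis/VE speed-up by the Frovedis/x86 speed-up cancels the Spark baseline and gives the gain of the Vector Engine over one Xeon socket for the same framework. A short worked calculation follows; the assignment of values to bars is read from the chart layout and should be treated as an assumption.

```python
# Speed-ups relative to Spark on x86 (Spark = 1), as read from the chart.
speedup = {
    "LR":      {"Frovedis/x86": 10.6, "Frovedis/VE": 113.2},
    "K-means": {"Frovedis/x86": 8.8,  "Frovedis/VE": 42.8},
    "SVD":     {"Frovedis/x86": 5.3,  "Frovedis/VE": 56.8},
}

# Ratio of the two speed-ups: VE versus x86 running the same Frovedis code.
ve_over_x86 = {
    name: round(s["Frovedis/VE"] / s["Frovedis/x86"], 1)
    for name, s in speedup.items()
}
```

With these readings, the hardware contributes roughly 5x to 11x depending on the workload; the rest of the gap to Spark comes from the framework itself.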

Performance Evaluation: DataFrame

▌Evaluated with TPC-H SF-20
     Q1: group by/aggregate
     Q3: filter, join, group by/aggregate
     Q5: filter, join, group by/aggregate (larger join)
     Q6: filter, group by/aggregate

Speed Up (Spark = 1)
                    Q01      Q03      Q05      Q06
  Spark/x86         1        1        1        1
  Frovedis/x86      3.2      8.8      5.8      10.6
  Frovedis/VE       10.1     33.8     47.3     34.8

TensorFlow for Aurora (Beta)
▌Hand-optimized Vector Engine DNN Library: veDNN
▌Built with LLVM-VE
     –   Supports scalar code + vector intrinsics
     –   Region Vectorizer (RV) in guided mode (AKA “needs directives”)
     –   https://sx-aurora.github.io/posts/Testing-LLVM-VE-RV-update/
         for(int64_t i=0; i …
Why is this so difficult to optimize?
What data scientists see:
x = Conv(x, kernel=1x1, bias=True)
x = ReLU(x)
x = AvgPooling(x, kernel=13x13)

What HPC people see:
function(Conv):
     for(Batch, OutChannel, Y, X):
        for(InChannel, KernelY, KernelX):
              output[…] += input[…] * weight[…]
        output[…] += bias[…]

function(ReLU):
     for(Batch, OutChannel, Y, X):
        output[…] = max(0, input[…])

function(AvgPooling):
     for(Batch, OutChannel, Y, X):
        for(KernelY, KernelX):
              output[…] += input[…] / (13*13)

Why is this so difficult to optimize?
What we actually want:

function(FusedNetwork):
     for(Batch, OutChannel):
       float N[…]
       for(Y, X):
           for(InChannel, KernelY, KernelX):
                 N[…] += input[…] * weight[…]
           N[…] += bias[…]
           N[…] = max(0, N[…])
       for(Y, X):
           for(KernelY, KernelX):
                 output[…] += N[…] / (13*13)
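
The difference between the layer-by-layer view and the fused view can be checked with a small runnable sketch. It makes several simplifying assumptions: the batch dimension is dropped, the feature map is 2x2 instead of 13x13, the pooling window covers the whole map, and plain nested lists replace tensors. All function names are hypothetical.

```python
def conv1x1(x, w, b):
    """x[c_in][y][x], w[c_out][c_in], b[c_out] -> out[c_out][y][x]."""
    H, W = len(x[0]), len(x[0][0])
    return [[[b[o] + sum(w[o][c] * x[c][i][j] for c in range(len(x)))
              for j in range(W)] for i in range(H)] for o in range(len(w))]


def relu(t):
    return [[[max(0.0, v) for v in row] for row in ch] for ch in t]


def global_avg_pool(t):
    """Average over the whole feature map of each channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in t]


def unfused(x, w, b):
    # Three separate passes; each one writes a full intermediate tensor.
    return global_avg_pool(relu(conv1x1(x, w, b)))


def fused(x, w, b):
    H, W = len(x[0]), len(x[0][0])
    out = []
    for o in range(len(w)):
        # N is a small per-channel scratch buffer, like `float N[…]` on the slide.
        N = [[0.0] * W for _ in range(H)]
        for i in range(H):
            for j in range(W):
                for c in range(len(x)):
                    N[i][j] += w[o][c] * x[c][i][j]
                N[i][j] += b[o]
                N[i][j] = max(0.0, N[i][j])
        acc = 0.0
        for i in range(H):
            for j in range(W):
                acc += N[i][j] / (H * W)
        out.append(acc)
    return out


x = [[[1.0, 2.0], [3.0, 4.0]],
     [[-1.0, -2.0], [-3.0, -4.0]]]
w = [[1.0, 0.5], [-1.0, 1.0]]
b = [0.0, 1.0]

result_unfused = unfused(x, w, b)
result_fused = fused(x, w, b)
```

Both versions return the same values, but the fused one never writes the full convolution or ReLU output to memory; only one channel's scratch buffer is live at a time.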

Inference (128x Batched) Sol vs PyTorch v1.0.1

[Bar chart: Execution Time (ms), axis from 0 to 1000; bars for PyTorch 1.0.1 and Sol]