OBDP 2021 PRESENTATION
DR PABLO GHIGLINO
A Low Power And High Performance Artificial Intelligence Approach To Increase Guidance Navigation And Control Robustness

                                          pablo.ghiglino@klepsydra.com
                                                    www.klepsydra.com
KLEPSYDRA AI OVERVIEW

[Diagram: Klepsydra AI at the centre, taking a trained model and input data (images, timeseries, sensor data), surrounded by language bindings, performance analysis tooling, and basic and advanced feature sets.]
KLEPSYDRA AI DISTRIBUTION

     1. Developer station distribution:
             •   Libraries + header files
             •   Development tools (performance analysis, etc.)
             •   Unlimited seats.

     2. Target computer distribution:
             •   Libraries + header files
             •   License file per seat.
SUPPORT MATRIX

Architecture    Supported Processors

ARM             Quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5GHz
                Samsung Exynos5 Octa: ARM Cortex™-A15 quad @ 2GHz + Cortex™-A7 quad @ 1.3GHz
                64-bit Arm Neoverse core
                6-core Carmel ARM v8.2 64-bit CPU, 8MB L2 + 4MB L3

x86             Intel Xeon E5-2686 v4
                2.4GHz 8-core Intel Core i9
                2.9GHz quad-core Intel Core i7

OS Type         Supported OS

Linux           Ubuntu 18.04, 20.04, Docker, Embedded Linux, FreeRTOS, PykeOS

Core API

class DeepNeuralNetwork {
public:

     /**
      * @brief setCallback
      * @param callback. Callback function for the prediction result.
      */
     // Assumed callback arguments: the id returned by predict() and the prediction result.
     virtual void setCallback(
          std::function<void(unsigned long, const kpsr::ai::F32AlignedVector &)> callback) = 0;

     /**
      * @brief predict. Load input matrix as input to the network.
      * @param inputVector. An F32AlignedVector of floats containing the network input.
      *
      * @return Unique id corresponding to the input vector
      */
     virtual unsigned long predict(const kpsr::ai::F32AlignedVector & inputVector) = 0;

     /**
      * @brief predict. Copy-less version of predict.
      * @param inputVector. An F32AlignedVector of floats containing the network input.
      *
      * @return Unique id corresponding to the input vector
      */
     virtual unsigned long predict(const std::shared_ptr<kpsr::ai::F32AlignedVector> & inputVector) = 0;

};
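
A minimal usage sketch of this interface (not from the slides; the callback parameter types and the omitted Klepsydra AI headers are assumptions):

#include <iostream>
#include <memory>
// Klepsydra AI headers omitted; the include paths depend on the distribution.

void runInference(std::shared_ptr<DeepNeuralNetwork> dnn,
                  const kpsr::ai::F32AlignedVector & input)
{
     // Register the result handler before submitting any input.
     // Assumed callback arguments: the id returned by predict() and the prediction result.
     dnn->setCallback([](unsigned long id, const kpsr::ai::F32AlignedVector & /* output */) {
          std::cout << "prediction ready for input " << id << std::endl;
     });

     // Non-blocking: the returned id matches the id later passed to the callback.
     unsigned long id = dnn->predict(input);
     std::cout << "submitted input " << id << std::endl;
}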

ONNX API
class KPSR_API OnnxDNNImporter
{
public:

     /**
      * @brief imports an ONNX file and uses a default event loop factory for all processor cores
      * @param onnxFileName
      * @param testDNN
      * @return a shared pointer to a DeepNeuralNetwork object
      *
      * When the log level is debug, dumps the YAML configuration of the default factory.
      * It makes use of all processor cores.
      */
     static std::shared_ptr<DeepNeuralNetwork> createDNNFactory(const std::string & onnxFileName,
                                                                bool testDNN = false);

     /**
      * @brief imports an ONNX file and uses a default synchronous factory
      * @param onnxFileName
      * @param envFileName. Klepsydra AI configuration environment file.
      * @return a shared pointer to a DeepNeuralNetwork object
      *
      * This method is intended to be used for testing purposes only.
      */
     static std::shared_ptr<DeepNeuralNetwork> createDNNFactory(const std::string & onnxFileName,
                                                                const std::string & envFileName);
};
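
A minimal creation sketch (the file names are placeholders for illustration only):

#include <memory>
// Klepsydra AI headers omitted; the include paths depend on the distribution.

std::shared_ptr<DeepNeuralNetwork> loadModel()
{
     // Default event loop factory, using all processor cores.
     std::shared_ptr<DeepNeuralNetwork> dnn =
          OnnxDNNImporter::createDNNFactory("model.onnx");

     // Alternative for testing, configured via a Klepsydra AI environment file:
     // dnn = OnnxDNNImporter::createDNNFactory("model.onnx", "kpsr_ai_env.yaml");

     return dnn;
}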
APPROACHES TO CONCURRENT ALGORITHMIC EXECUTION

[Diagram: the two approaches, Parallelisation and Pipeline, shown side by side.]
BENCHMARK DESCRIPTION

Description
•   Given an input matrix A, a number of sequential multiplications is performed
    (a simplified sketch of this loop follows the technical spec below):
       •   Step 1: B = A x A, Step 2: C = B x B, …
       •   Matrix A is randomly generated for each new sequence
Parameters:
•   Matrix dimensions: 100x100
•   Data types: float, integer
•   Number of multiplications per matrix: [10, 60]
•   Processing frequency: [2Hz - 100Hz]
Technical Spec
•   Computer: Odroid XU4
•   OS: Ubuntu 18.04
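
A minimal single-threaded sketch of the benchmark loop described above (plain C++; the actual benchmark parallelises the multiplications with OpenMP or Klepsydra, and the rate control here is simplified):

#include <chrono>
#include <random>
#include <thread>
#include <vector>

constexpr int N = 100;                 // matrix dimension 100x100
using Matrix = std::vector<float>;     // row-major storage

// Naive dense matrix multiplication, C = A x B.
Matrix multiply(const Matrix & a, const Matrix & b)
{
    Matrix c(N * N, 0.0f);
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)
            for (int j = 0; j < N; ++j)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
    return c;
}

int main()
{
    const int steps = 30;              // multiplications per matrix, in [10, 60]
    const double rateHz = 10.0;        // processing frequency, in [2Hz, 100Hz]
    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    for (int sequence = 0; sequence < 100; ++sequence) {
        Matrix m(N * N);
        for (auto & x : m) x = dist(gen);                     // matrix A randomly generated per sequence

        for (int s = 0; s < steps; ++s) m = multiply(m, m);   // B = A x A, C = B x B, ...

        std::this_thread::sleep_for(std::chrono::duration<double>(1.0 / rateHz));
    }
    return 0;
}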
FLOAT PERFORMANCE RESULTS

[Charts: CPU usage, throughput and latency versus publishing rate (Hz), comparing OpenMP and Klepsydra, for runs of 30 steps (2-20 Hz) and 40 steps (2-14 Hz).]
KLEPSYDRA AI DATA PROCESSING APPROACH

[Diagram: Klepsydra AI threading model. Input data flows through the network layers; consecutive layers run in event loops on separate threads (Thread 1, Thread 2, ...) and the output data is produced at the end of the pipeline.]

Event loops are assigned to cores. Most layers can be parallelised and are vectorised.

The threading model consists of the following (a generic pipeline sketch follows this list):

-   Number of cores assigned to event loops

-   Number of event loops per core

-   Number of parallelisation threads for each layer
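
A generic sketch of the pipeline idea shown in the diagram above (plain C++ with std::thread and a small blocking queue; an illustration, not Klepsydra code): two stand-in layers run on two threads connected by queues.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal blocking queue used to connect two pipeline stages.
template <typename T>
class BlockingQueue
{
public:
    void push(T value)
    {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _queue.push(std::move(value));
        }
        _cv.notify_one();
    }

    T pop()
    {
        std::unique_lock<std::mutex> lock(_mutex);
        _cv.wait(lock, [this] { return !_queue.empty(); });
        T value = std::move(_queue.front());
        _queue.pop();
        return value;
    }

private:
    std::queue<T> _queue;
    std::mutex _mutex;
    std::condition_variable _cv;
};

using Vec = std::vector<float>;

// Stand-ins for two network layers; each runs on its own thread.
Vec layer1(Vec v) { for (auto & x : v) x *= 2.0f; return v; }
Vec layer2(Vec v) { for (auto & x : v) x += 1.0f; return v; }

int main()
{
    const int samples = 5;
    BlockingQueue<Vec> q1, q2;         // input -> layer1 -> layer2 -> output

    std::thread thread1([&] {          // Thread 1 runs layer 1
        for (int i = 0; i < samples; ++i) q2.push(layer1(q1.pop()));
    });
    std::thread thread2([&] {          // Thread 2 runs layer 2
        for (int i = 0; i < samples; ++i) {
            Vec out = layer2(q2.pop());
            std::cout << "output[0] = " << out[0] << std::endl;
        }
    });

    for (int i = 0; i < samples; ++i) q1.push(Vec(4, float(i)));   // input data
    thread1.join();
    thread2.join();
    return 0;
}

In Klepsydra AI, the stages are connected by the event loops' publish/subscribe queues, whose depth is the pool_size parameter described under performance tuning below.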
Performance tuning

Performance Criteria
•    CPU usage
•    RAM usage
•    Throughput (output data rate)
•    Latency

Performance parameters

• pool_size
Size of the internal queues of the event loop publish/subscribe pairs. High throughput requires large values, i.e., more RAM usage; low throughput allows smaller values, and therefore less RAM.

• number_of_cores
Number of cores across which the event loops are distributed (by default one event loop per core). High throughput requires more cores, i.e., more CPU usage; low throughput requires fewer cores, and therefore a substantial reduction in CPU usage.

• number_of_parallel_threads
Number of threads assigned to parallelise layers. For low latency requirements, assign large values (maximum = number of cores), i.e., more CPU usage; when latency is not critical, use small values (minimum = 1), and therefore a substantial reduction in CPU usage.
Example of performance benchmarks

[Chart: latency comparison for the same model: TensorFlow 56ms vs Klepsydra AI 35ms.]
ROADMAP

Q2 2021
•   No third party dependencies
       •   Binaries are C/C++ only
       •   Custom format for models

Q3 2021
•   FreeRTOS support (alpha version)
       •   Xilinx Ultrascale+ board
       •   Microchip SAM V71

Q4 2021
•   PykeOS support (alpha version)
       •   Xilinx Zedboard

Q1 2022
•   NVIDIA Jetson TX2 support (alpha release)
•   Quantisation support

Q2 2022
•   Graphs support
•   New memory allocation model
•   C support

Legend:
Hard deadlines
Flexible dates
CONTACT INFORMATION

Dr Pablo Ghiglino
pablo.ghiglino@klepsydra.com
+41786931544
www.klepsydra.com
linkedin.com/company/klepsydra-technologies