Machine Learning at Facebook: Understanding Inference at the Edge

          Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan,
       Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin Qiao,
            Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Peter Vajda, Xiaodong Wang,
              Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, Sungjoo Yoo∗, Peizhao Zhang

                                                      Facebook, Inc.

∗ Sungjoo Yoo is a Professor at Seoul National University. A large part of this work was performed during his sabbatical leave at Facebook.

ABSTRACT

At Facebook, machine learning provides a wide range of capabilities that drive many aspects of user experience, including ranking posts, content understanding, object detection and tracking for augmented and virtual reality, and speech and text translations. While machine learning models are currently trained on customized datacenter infrastructure, Facebook is working to bring machine learning inference to the edge. By doing so, user experience is improved with reduced latency (inference time) and becomes less dependent on network connectivity. Furthermore, this also enables many more applications of deep learning with important features only made available at the edge. This paper takes a data-driven approach to present the opportunities and design challenges faced by Facebook in order to enable machine learning inference locally on smartphones and other edge platforms.

Figure 1: The distribution of peak performance of smartphone SoCs running the Facebook mobile app exhibits a wide spread. The data samples represent over 85% of the entire market share and are sorted by the corresponding SoC release year. Peak performance can vary by over an order of magnitude, increasing the design challenge of performance optimization.
1.   INTRODUCTION

Machine Learning (ML) is used by most Facebook services. Ranking posts for News Feed, content understanding, object detection and tracking for augmented and virtual reality (VR) platforms, speech recognition, and translations all use ML. These services run both in datacenters and on edge devices. All varieties of machine learning models are being used in the datacenter, from RNNs to decision trees and logistic regression [1]. While all training runs exclusively in the datacenter, there is an increasing push to transition inference execution, especially deep learning, to the edge. Facebook makes over 90% of its advertising revenue from mobile [2] and has focused on providing its over 2 billion monthly active users the best possible experience [3]. In addition to minimizing users' network bandwidth and improving response time, executing inference on the edge makes certain deep learning services possible, for example, Instagram features that involve real-time machine learning at image capture time. Enabling edge inference requires overcoming many unique technical challenges stemming from the diversity of mobile hardware and software not found in the controlled datacenter environment.

While inference is generally less computationally demanding than training, the compute capabilities of edge systems (both hardware and software) running the Facebook app limit what is possible. Figure 1 shows the peak performance of edge devices, representing over 85% of the entire market share, over the corresponding release year of a System on a Chip (SoC). The figure shows a wide variation in performance that must be considered to enable efficient, real-time inference across all edge devices. Trends emerge over time to tell another story: while the average theoretical performance of SoCs is improving over time, there is a consistent, widespread peak performance regardless of the release year of the SoCs. To provide the best user experience despite limited performance scaling, Facebook has been proactive in developing tools and optimizations to enable all models/services to execute across the observed performance spectrum.

Optimizations include techniques for model architecture search, weight compression, quantization, algorithmic complexity reduction, and microarchitecture-specific performance tuning. These optimizations enable edge inference to run on mobile CPUs; only a small fraction of inference currently runs on mobile GPUs. This is no small feat considering the computational complexity of state-of-the-art deep learning models and that most CPUs are relatively low-performance. In our dataset, an overwhelming majority of mobile CPUs use in-order ARM Cortex-A53 and Cortex-A7 cores. While a great deal of academic work has focused on demonstrating the potential of co-processors (GPUs/DSPs) and accelerators, as we will show, in the field the potential performance benefits of mobile GPUs over CPUs for the Android market are not great. Considering theoretical peak FLOP performance, less than 20% of mobile SoCs have a GPU 3× more powerful than the CPU and, on a median mobile device, the GPU is only as powerful as the CPU. Inference sees limited co-processor use today as a result of the close performance between CPU clusters and GPUs as well as an immature programming environment.

In this paper we detail how Facebook runs inference on the edge. We begin by reviewing the hardware and software system stack the Facebook app runs on (Section 2). This highlights the degree of device diversity and divergence of software, presenting many design and optimization challenges. Next, we review the machine learning frameworks and tool sets, including PyTorch 1.0, and the execution flow for mobile inference (Section 3). To understand the optimization techniques Facebook has implemented to improve the performance and efficiency of inference, we present two case studies. For horizontally integrated devices (e.g., smartphones) we show how general optimizations including quantization and compression can be used across all devices (Section 4). Vertically integrated solutions enable control over the hardware-software stack. In the case of the Oculus virtual reality (VR) platform, we show how inference can easily be ported to run on DSPs to improve energy efficiency, execution time predictability, and performance (Section 5). The degree of performance variance found in inference on the same device is presented in Section 6; this is a problem for applications with real-time constraints. Finally, we conclude by discussing the ramifications of our findings and provide our take on what it means for potential research directions in architecture and systems (Section 7).

This paper makes the following key observations:

• Nearly all mobile inference runs on CPUs, and most deployed mobile CPU cores are old and low-end. In 2018, only a fourth of smartphones implemented CPU cores designed in 2013 or later. In a median Android device, the GPU provides only as much performance as its CPU. Only 11% of Android smartphones have a GPU that is 3 times more performant than its CPU.

• System diversity makes porting code to co-processors, such as DSPs, challenging. We find it more effective to provide general, algorithmic-level optimizations that can target all processing environments. When we have control over the system environment (e.g., Portal [4] or Oculus [5] virtual reality platforms) or when there is little diversity and a mature SW stack (e.g., iPhones), performance acceleration with co-processors becomes more viable.

• The main reason to switch to an accelerator/co-processor is power efficiency and stability in execution time. Speedup is largely a secondary effect.

• Inference performance variability in the field is much worse than standalone benchmarking results. Variability poses a problem for user-facing applications with real-time constraints. To study these effects, there is a need for system-level performance modeling.

Facebook expects rapid growth across the system stack. This growth will lead to performance and energy efficiency challenges, particularly for ML mobile inference. While today we have focused on optimizing tools and infrastructure for existing platforms, we are also exploring new design solutions to enable efficient deep learning inference at the edge.

2.   THE LAY OF THE LAND: A LOOK AT SMARTPHONES FACEBOOK RUNS ON

Facebook's neural network engine is deployed on over one billion mobile devices. These devices are comprised of over two thousand unique SoCs¹ running in more than ten thousand smartphones and tablets². In this section we present a survey of the devices that run Facebook services to understand mobile hardware trends.

2.1   There is no standard mobile chipset to optimize for

Figure 2 shows the cumulative distribution function (CDF) of the SoC market share. The data paints a clear picture: there is no "typical" smartphone or mobile SoC. The most commonly-used SoC accounts for less than 4% of all mobile devices. Moreover, the distribution shows an exceptionally long tail: there are only 30 SoCs with more than 1% market share, and their joint coverage is only 51% of the market.

In production, smartphone hardware is extremely fragmented. This diversity comes from the combination of multiple IP blocks in a SoC, which may include CPU(s), GPU clusters, shared caches, memory controllers, image processors, a digital signal processor (DSP), and even a specialized neural network accelerator (NPU). There are over 25 mobile chipset vendors, each of which mixes and matches its own custom-designed components with IP blocks licensed from other companies. The fragmentation of hardware is particularly acute on Android, where the Facebook app runs on over two thousand different SoCs compared to a little more than a dozen SoCs on iOS.

Figure 2: There is no standard mobile SoC to optimize for. The top 50 most common SoCs account for only 65% of the smartphone market.

¹ Some of the SoCs differ only by connectivity modules.
² SoC information is widely accessible through Android system properties and Linux kernel mechanisms, such as the /proc/cpuinfo file and the sysfs filesystem. Android developers commonly use SoC information to optimize performance. To allow developers to optimize ML-based application performance, we developed the cpuinfo library to decode SoC specifications and open sourced it at https://github.com/pytorch/cpuinfo.
2.2   Mobile CPUs show little diversity

The general availability and programmability of CPUs make them the default option for mobile inference. Thus, we pay close attention to the CPU microarchitecture differences between mobile SoCs. Figure 3 shows a breakdown of the year smartphone CPU cores were designed or released. 72% of the primary CPU cores being used in mobile devices today were designed over 6 years ago. Cortex A53 represents more than 48% of the entire population of mobile processors, whereas Cortex A7 represents more than 15%. When looking at more recent CPUs, the distribution is much more diverse, without dominating microarchitectures. The implication of the dominant Cortex A7 and Cortex A53 IPs for machine learning is that most of today's edge inference runs on in-order (superscalar) mobile processors with only one to four cores per cluster. Furthermore, this view of the world poses a new, real challenge for systems and computer architecture researchers: proposed mobile hardware optimizations and accelerators need to consider the long IP lifetime.

Figure 3: The most commonly-used mobile processors, Cortex A53, are at least six years old. In 2018, only a fourth of smartphones implemented CPU cores designed in 2013 or later.

We observe a similar multi-core trend in mobile as in desktop and server chips. 99.9% of Android devices have multiple cores and 98% have at least 4 cores. We find distinct design strategies between Android and iOS smartphones: iOS devices tend to use fewer, more powerful cores, while Android devices tend to have more cores, which are often less powerful. A similar observation was made in 2015 [6]. To optimize a production application for this degree of hardware diversity, we optimize for the common denominator: the cluster of most performant CPU cores.

About half of the SoCs have two CPU clusters: a cluster of high-performance cores and another cluster of energy-efficient cores. Only a small fraction include three clusters of cores. Cores in the different clusters may differ in microarchitectures, frequency settings, or cache sizes. A few SoCs even have two clusters consisting of identical cores. In nearly all SoCs, cores within the same cluster have a shared cache, but no cache level is shared between cores in the different clusters. The lack of a shared cache imposes a high synchronization cost between clusters. For this reason, Facebook apps target the high-performing cluster by, for example, matching thread and core count for neural network inference.
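A minimal sketch of that thread-to-cluster matching idea is shown below, assuming a hypothetical device whose big cluster has four cores. The production apps do this inside the Caffe2/NNPACK thread pools (and rely on the scheduler and affinity for placement); the public torch.set_num_threads API is only standing in here to illustrate capping intra-op parallelism at the size of the most performant cluster.

    import torch

    # Hypothetical topology: 4 big (e.g. Cortex-A73) + 4 LITTLE (e.g. Cortex-A53) cores.
    BIG_CLUSTER_CORES = 4

    # Match the inference thread count to the big-cluster size so no worker thread
    # has to synchronize with a core on the other cluster (no shared cache level).
    torch.set_num_threads(BIG_CLUSTER_CORES)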
2.3   The performance difference between a mobile CPU and GPU/DSP is narrow

High-performance GPUs continue to play an important role in the success of deep learning. It may seem natural that mobile GPUs play a similar part for edge neural network inference. However, today nearly all Android devices run inference on mobile CPUs due to the performance limitations of mobile GPUs as well as programmability and software challenges.

Figure 4: The theoretical peak performance difference between mobile CPUs and GPUs is narrow. In a median Android device, the GPU provides only as much performance as its CPU. Only 11% of the smartphones have a GPU that is 3 times more performant than its CPU.

Figure 4 shows the peak performance ratio between CPUs and GPUs across Android SoCs. In a median device, the GPU provides only as much theoretical GFLOPS performance as its CPU. 23% of the SoCs have a GPU at least twice as performant as their CPU, and only 11% have a GPU that is 3 times as powerful as its CPU. This performance distribution is not a historical artifact but a consequence of market segmentation: mid-end SoCs typically have CPUs that are 10-20% slower compared to their high-end counterparts. The performance distribution corresponds to a wider gap for the GPUs in SoCs targeted for different performance tiers; the performance gap for mobile GPUs is two to four times. Realizable mobile GPU performance is further bottlenecked by limited memory bandwidth capacities. Unlike high-performance discrete GPUs, no dedicated high-bandwidth memory is available on mobile. Moreover, mobile CPUs and GPUs typically share the same memory controller, competing for the scarce memory bandwidth.

2.4   Available co-processors: DSPs and NPUs

Compute DSPs are domain-specific co-processors well-suited for fixed-point inference. The motivation at Facebook to explore co-processor performance acceleration opportunities is the increased performance-per-watt efficiency benefit (higher performance with lower power consumption). However, DSPs face the same challenge GPUs do: "compute" DSPs are available in only 5% of the Qualcomm-based SoCs the Facebook apps run on. Most DSPs do not yet implement vector instructions. While all vendors are adding vector/compute DSPs, it is likely to take many years before we see a large market presence.

The regularity in the computational patterns of many DNN workloads makes NPUs exceptionally amenable to hardware acceleration. Many academic research projects, startups, and companies have proposed solutions in this space (Section 7 offers a thorough treatment). The most notable deployed NPUs are the Cambricon 1A in the Kirin 970 SoC [7] and the Neural Engine in the Apple A12 Bionic SoC [8]. While relatively few NPUs exist today, and fewer are programmable by third parties, we may be reaching an inflection point.
2.5   Programmability is a primary roadblock for using mobile co-processors

The major APIs used to program neural networks on mobile GPUs are OpenCL, OpenGL ES, and Vulkan on Android, and Metal on iOS.

OpenCL was designed to enable general-purpose programs to run on programmable co-processors. Thus, OpenCL does not provide graphics-specific functionality, e.g., 3D rendering. Focusing on general-purpose computations helps: OpenCL's API and intrinsic functions, as well as its support for memory address space management and efficient thread synchronization, make it easier to express computations compared to graphics-oriented APIs like OpenGL. However, while most Android devices ship with OpenCL drivers, OpenCL is not officially a part of the Android system, and its drivers do not go through the same conformance tests as OpenGL ES and Vulkan. As shown in Figure 5(a), a notable portion of Android devices ship with a broken OpenCL driver. In the worst case, 1% of the devices crash when the app tries to load the OpenCL library. The instability of OpenCL's library and driver makes it unreliable to use at scale.

Figure 5: Mobile GPUs have fragile usability and poor programmability. (a) OpenCL support; (b) OpenGL ES support; (c) Vulkan support.

OpenGL ES has proved to be a viable alternative. OpenGL ES is a trimmed variant of the OpenGL API specifically for mobile and embedded systems. Being a graphics API, OpenGL ES is not tailored to GPGPU programming. However, recent versions of the API provide sufficient capabilities to program neural network computations. Different versions dictate what we can do with mobile GPUs, and there are several versions of the OpenGL ES API on the market.

• OpenGL ES 2.0 is the first version of the API with a programmable graphics pipeline. All mobile devices running Facebook apps on Android support this version. With OpenGL ES 2.0 it is possible to implement neural network operators via the render-to-texture technique, but inherent limitations of the API make computations memory bound. All computations have to happen inside a fragment shader, which can write only 16 bits³ of output. Therefore, multi-channel convolution or matrix-matrix multiplication would require reading the same inputs multiple times. The computation patterns are similar to matrix-matrix multiplication on a CPU using a dot product function.

• OpenGL ES 3.0 (or newer) is supported on 83% of Android devices. It is the first version of OpenGL ES that is practical for neural network implementations. Similar to 2.0, all computations need to be implemented in fragment shaders, but OpenGL ES 3.0 supports several features for efficiency. For example, each invocation of a fragment shader can write up to 128 bits of data into each of the (up to 8) textures, while also using uniform buffers to load constant data (e.g., weights).

• OpenGL ES 3.1 (or newer) is supported on 52% of the Android devices. It introduces compute shaders that provide similar functionalities to those available in OpenCL 1.x and early versions of CUDA. For example, important compute features, such as launching kernels on the GPU with reduced overhead for the graphics pipeline, fast synchronization within a work-group, access to local memory shared by threads in a work-group, and arbitrary gather and scatter operations with random-access buffers, become available.

³ 32 bits with the OES_rgb8_rgba8 extension.
Figure 5(b) shows how, over the past year, the programmability of mobile GPUs on Android devices has steadily improved. Today, a median Android device has support for GPGPU programming with OpenGL ES 3.1 compute shaders.

Vulkan is a successor to OpenGL and OpenGL ES. It provides similar functionality to OpenGL ES 3.1, but with a new API targeted at minimizing driver overhead. Looking forward, Vulkan is a promising GPGPU API. Today, early adoption of Vulkan (see Figure 5(c)) is limited, being found on less than 36% of Android devices.

Metal is Apple's GPU programming language. Mobile GPUs on iOS devices paint a very different picture. Because Apple chipsets with the A-series mobile processors are vertically designed, the system stack is more tightly integrated for iPhones. Since 2013, all Apple mobile processors, starting with the A7, support Metal. Metal is similar to Vulkan but with much wider market share and more mature system stack support. 95% of the iOS devices support Metal. Moreover, the peak performance ratio between the GPU and the CPU is approximately 3 to 4 times, making Metal on iOS devices with GPUs an attractive target for efficient neural network inference. Guided by this data and experimental performance validation, Facebook apps enable GPU-powered neural network inference on iOS for several models.

In summary, Facebook takes a data-driven design approach: the heterogeneity of SoCs makes it inordinately challenging to perform fine-grained, device/SoC-specific optimization. Diverse SoCs pose a significant programmability challenge. It is difficult to deploy performance optimization techniques to SoCs that are implemented with different versions of device drivers, scopes of memory granularities, and consistency models.
3.   MACHINE LEARNING AT FACEBOOK

Facebook puts significant engineering effort into developing deep learning features for mobile platforms. Figure 6 illustrates the execution flow of machine learning, where a product leverages a series of inputs to build a parameterized model, which is then used to create a representation or a prediction. Hazelwood et al. presented the major products and services leveraging machine learning that run on Facebook's customized datacenter infrastructure [1]. To ensure reliable, high-performance, and high-quality training, this phase generally happens offline in the cloud. On the other hand, an inference phase that makes real-time predictions on pre-trained models runs either in the cloud or on mobile platforms. This paper focuses on mobile inference: making real-time predictions locally at the edge.

Figure 6: Execution flow of Facebook's machine learning for mobile inference.

3.1   Machine learning models and frameworks

We developed several internal platforms and frameworks to simplify the task of bringing machine learning into Facebook products. As an example, FBLearner is an ML platform that automates many tasks, such as training on clusters, and is the tool of choice for many teams experimenting with and developing custom ML at Facebook. In addition to production tooling, ML development at Facebook has been underpinned by Caffe2 and PyTorch, a set of distinct deep learning frameworks, both of which are open source. Caffe2 is optimized for production scale and broad platform support, while PyTorch was conceived with flexibility and expressibility in mind, allowing researchers to fully express the design space. Caffe2 provides cutting-edge mobile deep learning capabilities across a wide range of devices and is deployed broadly through the Facebook application family. In particular, it is deployed to over one billion devices, of which approximately 75% are Android based, with the remainder running iOS. Caffe2, in particular, is built with optimized mobile inference in mind to deliver the best experience possible for a broad set of mobile devices.

At the 2018 F8 Developer Conference, Facebook announced the road map for a new unified AI framework, PyTorch 1.0 [9]. PyTorch 1.0 combines the production scale of Caffe2 and the research flexibility of PyTorch. It supports the ONNX specification for ecosystem interoperability. With this, Facebook aims to accelerate AI innovation by streamlining the process of transitioning models developed through research exploration into production scale with little transition overhead.

PyTorch 1.0 adopts the ONNX specification for ecosystem interoperability. In addition to being able to export the ONNX model format, PyTorch 1.0 leverages ONNX's Interface for Framework Integration (ONNXIFI) as a stable interface for external backend integration. ONNXIFI enables PyTorch 1.0 to leverage external software libraries and hardware accelerators without requiring redundant integration work for each new backend, and it also supports embedded edge devices. We are also collaborating with mobile OS ecosystem partners such as Android to include a similar level of functionality natively within the OS-specific machine learning accelerator APIs.
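As a small illustration of the export path, the sketch below uses the public torch.onnx.export API on a toy model; the model, input shape, and file name are placeholders rather than anything shipped in the Facebook apps.

    import torch
    import torch.nn as nn

    # Toy stand-in for a production network (illustrative only).
    model = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 10),
    ).eval()

    dummy = torch.randn(1, 3, 224, 224)            # example NCHW input used for tracing
    torch.onnx.export(model, dummy, "model.onnx")  # ONNX graph an ONNXIFI backend can consume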
3.2   DNN Inference at the Edge

Smartphones are capable of realizing deep learning in real time without relying on the cloud, but there are also limitations. While smartphones have improved significantly in computation performance in recent years, these mobile platforms also have various resource constraints, such as power, memory, and compute capabilities. Putting all aspects of SoC components together leads to the landscape of a widely diverse set of SoCs, as presented in Section 2. As a result, mobile presents both an opportunity and, at the same time, a challenge for machine learning. Depending on the application and product domains, Facebook takes different approaches.

Here we review how inference works today by reviewing the mobile workflow. We then provide two case studies that give details on how we optimize for commodity devices (i.e., mobile SoCs) and vertically integrated, custom solutions (i.e., Oculus VR platforms).

3.3   Important Design Aspects for Mobile Inference and Potential Approaches

To develop mobile applications for the wild west of mobile SoCs, a simple approach is to target application development for the lowest common denominator, in this case, mobile CPUs. This, however, optimizes for ubiquity and programmability while sacrificing efficiency.

To fully harvest the potential performance benefits for edge inference, there are several important design tradeoffs that we consider. Iteration speed from an idea to product deployment takes weeks, an order of magnitude longer than the deployment cycle for cloud. Performance is key for edge inference; thus, performance optimization is critical for mobile. However, performance characterization and analysis is far more complicated for mobile than cloud because of the fragmentation of the mobile ecosystem (Section 2).

Performance is far more limited, with wider performance variance, for mobile than cloud. Most cloud inference runs on server-class CPUs with theoretical computation capability of up to several TFLOPS. On the other hand, mobile SoCs are orders of magnitude less capable, delivering between single-digit GFLOPS on the ultra low-end and a few hundred GFLOPS on the very high-end.

Furthermore, model and code sizes are imperative for mobile because of the limited memory capacity of a few GBs. Techniques such as weight pruning, quantization, and compression are commonly used to reduce the model size for mobile. Code size is a unique design point for mobile inference. For a good deployment experience, the amount of new code pushed into the app needs to be incremental. Several methods are available for application code size management and are potentially viable. The first option is to compile applications containing ML models to platform-specific object code using, for example, Glow [10], XLA [11], or TVM [12]. This often leads to larger model sizes (as the model now contains machine code) but enables a smaller interpreter. The second option is to directly use vendor-specific APIs, such as iOS CoreML [13], from operating system vendors. Another approach is to deploy a generic interpreter, such as Caffe2 or TF/TFLite, that executes the model using an optimized backend. The first approach is compiled execution, which treats ML models as code, whereas the latter approach is interpreted execution, which treats ML models as data. Techniques are chosen depending on the design tradeoffs suitable for different usage scenarios.
3.4   Mobile Inference Workflow

Facebook develops a collection of internal platforms and toolkits to simplify the task of leveraging machine learning within its products. FBLearner offers this ecosystem of machine learning tools, enabling workflow execution and management (FBLearner Flow), a registry of pointers to data sources, features, and models for training and inference (FBLearner Feature Store), optimal configurations for experiments (FBLearner AutoML), and a real-time prediction service (FBLearner Predictor), among many others [1].

Figure 6 depicts the execution flow for applying machine learning for edge inference. First, features are collected and selected for any ML modeling task from FBLearner Feature Store. The Feature Store is essentially a catalog of feature generators that is hosted on Facebook's data centers. Then, a workflow describing the architecture of a model and the steps for model training and evaluation is built with FBLearner Flow. After model training and evaluation, the next step is to export and publish the model so it can be served in one of Facebook's production inference tiers. Before models are deployed for edge inference, optimization techniques, such as quantization, can be applied in the Optimizer.

In general, to improve model accuracy, three approaches are used iteratively: increasing training data, refining feature sets, and changing model architectures, e.g., by increasing the number of layers or sharing embeddings for features. For performance and memory requirement reasons, we often quantize portions of models. One example is to reduce the precision of a large multi-GB embedding table from 32-bit single-precision float to 8-bit integers. This process takes place after we verify that there is little or no measurable impact to model accuracy. Then, for edge inference, to improve computational performance while maximizing efficiency, techniques such as quantization, k-means clustering, and execution engine selection are employed to create mobile-specific models. Once the model is deployed to a mobile platform, Caffe2 Runtime interprets models and calls kernels to process inputs. Depending on the hardware architecture and the system stack support, backend neural network libraries are used by Caffe2 Runtime for additional optimization.
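A minimal sketch of the embedding-table example above is given below: each row is quantized to 8-bit integers with its own scale and offset kept in floating point. The function names and the per-row scheme are illustrative assumptions, not the production implementation.

    import numpy as np

    def quantize_embedding_rows(table_fp32):
        """Per-row 8-bit affine quantization of an embedding table (illustrative)."""
        mins = table_fp32.min(axis=1, keepdims=True)
        maxs = table_fp32.max(axis=1, keepdims=True)
        scales = np.maximum(maxs - mins, 1e-8) / 255.0          # one fp32 scale per row
        codes = np.rint((table_fp32 - mins) / scales).astype(np.uint8)
        return codes, scales.astype(np.float32), mins.astype(np.float32)

    def dequantize_rows(codes, scales, mins):
        return codes.astype(np.float32) * scales + mins

    table = np.random.randn(1000, 64).astype(np.float32)        # stand-in for a multi-GB table
    codes, scales, offsets = quantize_embedding_rows(table)
    error = np.abs(dequantize_rows(codes, scales, offsets) - table).max()
    assert error < scales.max()   # reconstruction error stays below one quantization step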
4.   HORIZONTAL INTEGRATION: MAKING INFERENCE ON SMARTPHONES

Mobile inference is primarily used for image and video processing. Therefore, inference speed is typically measured as the number of inference runs per second. Another commonly-used metric is inference time, particularly for latency-sensitive applications. To exploit performance optimization opportunities before models are deployed onto mobile platforms and to ensure fast model transmission to the edge, Caffe2 implements specific features, such as compact image and weight representations [14, 15], channel pruning [16], and quantization. In addition, tuning of the spatial resolution that controls the processing time of middle layers is particularly useful for mobile. We also apply commonly-used techniques, such as pruning and quantization, to aggressively cut down the size of DNN models while maintaining reasonable quality [17].

In addition, to maintain certain performance levels for a good user experience, quantization is used for edge inference. The use of quantization is a standard industry practice, with support in, e.g., Google's GEMMLOWP [18] and Qualcomm's neural processing SDK [19]. A floating-point tensor is linearly quantized into 8 or fewer bits, and all nodes in the data flow graph operate on this quantized tensor value. To efficiently quantize node outputs, we need to precompute good quantization parameters prior to inference time. There are two approaches here. One is to modify the graph at training time to learn the quantization directly (quantization-aware training [20]). The other is to add a stage after training to compute appropriate quantizers (post-training quantization). More advanced quantization techniques at the execution front-end are under investigation [21, 22, 23].
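The sketch below shows one way such parameters could be precomputed for post-training quantization: observe a tensor's range during FP32 calibration runs, then freeze an 8-bit scale and zero point in the GEMMLOWP-style affine scheme. It is an illustrative calibration recipe, not the specific procedure used in Facebook's tooling.

    import numpy as np

    def choose_quant_params(x_min, x_max, num_bits=8):
        """Affine (asymmetric) quantization parameters from an observed range."""
        qmin, qmax = 0, 2 ** num_bits - 1
        x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # representable range must include 0
        scale = max(x_max - x_min, 1e-8) / (qmax - qmin)
        zero_point = int(round(qmin - x_min / scale))
        return scale, zero_point

    def quantize(x, scale, zero_point, num_bits=8):
        q = np.rint(x / scale) + zero_point
        return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

    # Calibration: run FP32 inference on sample inputs, record per-tensor min/max,
    # then freeze (scale, zero_point) for use by the quantized graph at inference time.
    activations = np.random.randn(1024).astype(np.float32)
    scale, zp = choose_quant_params(activations.min(), activations.max())
    q_activations = quantize(activations, scale, zp)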
To make the best possible use of limited computing resources, Caffe2 Runtime integrates two in-house libraries, NNPACK [24] and QNNPACK [25], which provide optimized implementations of convolution and other important CNN operations, and contain platform-specific optimizations tailored for mobile CPUs.

NNPACK (Neural Networks PACKage) performs computations in 32-bit floating-point precision and NCHW layout, and targets high-intensity convolutional neural networks, which use convolutional operators with large kernels, such as 3x3 or 5x5. NNPACK implements asymptotically fast convolution algorithms, based on either the Winograd transform or the Fast Fourier transform, which employ algorithmic optimization to lower the computational complexity of convolutions with large kernels by several times. With its algorithmic advantage and low-level microarchitecture-specific optimizations, NNPACK often delivers higher performance than a direct convolution implementation.

QNNPACK (Quantized NNPACK), on the other hand, performs computations in 8-bit fixed-point precision and NHWC layout. It is designed to augment NNPACK for low-intensity convolutional networks, e.g., neural networks with a large share of 1x1, grouped, depthwise, or dilated convolutions. These types of convolutions do not benefit from fast convolution algorithms, thus QNNPACK provides a highly efficient implementation of the direct convolution algorithm. The implementation in QNNPACK eliminates the overhead of the im2col transformation and other memory layout transformations typical for matrix-matrix multiplication libraries. Over a variety of smartphones, QNNPACK outperforms state-of-the-art implementations by an average of approximately two times.
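The split between the two backends follows from arithmetic intensity. The back-of-the-envelope sketch below (with made-up layer shapes and a stride-1, same-padding assumption) contrasts a dense 3x3 convolution, which performs tens of multiply-accumulates per byte moved and so rewards Winograd/FFT-style algorithms, with a depthwise 3x3 convolution, which performs roughly one multiply-accumulate per byte and is therefore dominated by the memory traffic that 8-bit storage cuts down.

    def conv_arith_intensity(h, w, cin, cout, k, depthwise=False, dtype_bytes=4):
        """MACs per byte of input+output+weight traffic (stride 1, same padding)."""
        if depthwise:
            macs, weights, out_ch = h * w * cin * k * k, cin * k * k, cin
        else:
            macs, weights, out_ch = h * w * cin * cout * k * k, cin * cout * k * k, cout
        traffic_bytes = (h * w * cin + h * w * out_ch + weights) * dtype_bytes
        return macs / traffic_bytes

    print(conv_arith_intensity(56, 56, 64, 64, 3))                  # dense 3x3: ~66 MACs/byte
    print(conv_arith_intensity(56, 56, 64, 64, 3, depthwise=True))  # depthwise 3x3: ~1 MAC/byte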
The choice of two mobile CPU backends helps Caffe2 Runtime deliver good performance across a variety of mobile devices and production use-cases. In the next section we present the relative performance comparison of three state-of-the-art DNN models running on QNNPACK with quantization compared to running on NNPACK in floating-point representation.

4.1   Performance optimization versus accuracy tradeoff

The primary performance benefits of reduced-precision computation come from (1) reduced memory footprint for storage of activations, (2) higher computation efficiency, and (3) improved performance for bandwidth-bound operators, such as depthwise convolutions and relatively small convolutions. Reduced-precision computation is beneficial for advanced model architectures. This inference time speedup is, however, not received equally well when the technique is applied directly onto all models.

We compare the inference time speedup of reduced-precision computation with 8-bit fixed-point over the baseline FP32 implementation (under an acceptable accuracy tradeoff). First, the UNet-based Person Segmentation model [26], which relies on 3x3 convolutions with relatively small spatial extent, experiences a performance regression in the quantized version. This regression is caused by the inability to leverage NNPACK's highly optimized Winograd-based convolution for both the low- and the high-end Android smartphones. Furthermore, for the quantized models, additional instructions are needed to extend elements from 8 to 16 bits for computation⁴, leading to additional performance overhead compared to the FP32 version, which can immediately use loaded elements in multiply-add operations.

⁴ This inefficiency is not inherent to 8-bit fixed-point convolutions, but is caused by restrictions of the NEON instruction set.

For style transfer models, a network with a relatively small number of channels and large spatial resolution is used with 3x3 convolutions. Here we start seeing a much better performance response to QNNPACK-powered reduced-precision computation. The efficiency reduction from losing Winograd is compensated by the reduced memory bandwidth for these large spatial domain convolutions.

Finally, when we look at a custom architecture derived from ShuffleNet [27], which leverages grouped 1x1 convolutions and depthwise 3x3 convolutions for the bulk of the model computation, we see a substantial inference performance improvement from the reduced memory bandwidth consumption for the depthwise convolutions. Reduced-precision computation on QNNPACK improves inference performance for the depthwise-separable models that are increasingly popular in mobile and embedded computer vision applications.

However, in order to maximize the performance benefit, we have to consider both algorithmic and quantization optimization. Currently, using algorithmic optimization with, e.g., the Winograd algorithm for CNNs can disallow quantization. Therefore, if the benefit from the Winograd transformation is greater than that of quantization, we see a relative slowdown for quantized models, calling for a better use of quantization that offers more consistent performance improvement.
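To make the Winograd side of that tradeoff concrete, the textbook minimal-filtering operation counts for the F(2x2, 3x3) variant commonly used for 3x3 kernels are reproduced below; these are standard counts for the algorithm, not measurements from this study.

    % Multiplications for one 2x2 output tile of a 3x3 convolution (per input/output channel pair):
    \text{direct: } (2 \cdot 2)\,\text{outputs} \times (3 \cdot 3)\,\text{taps} = 36
    \text{Winograd } F(2 \times 2,\, 3 \times 3): 4 \cdot 4 = 16
    \frac{36}{16} = 2.25\times \text{ fewer multiplications, before transform overheads}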
While this section represents the way designs are currently done today, as more hardware and better (co-)processors (e.g., CPUs/GPUs/DSPs) make their way onto mobile devices, we will take advantage of the additional computation performance by using more accurate models. Next we present two approaches for enabling mobile inference at Facebook. First is the horizontal integration that enables Facebook apps to run efficiently across a variety of mobile platforms. Second is the vertical integration for the VR application domain.

4.2   An augmented reality example

Smart cameras are designed to add intelligence to cameras, i.e., processing images directly on an edge device. This feature improves user experience by reducing latency and bandwidth usage. In addition to image classification models, advanced machine learning techniques are applied to estimate and improve the quality of photos and videos for various Facebook services, and to understand digital content locally, directly on smartphones.

The challenge with smart cameras is the large computational demand for delivering real-time inference. To enable smart cameras on mobile platforms running the Facebook App, we train mobile-specific models, compress weights for transmission, and execute quantized models. We design a collection of image classification models tailored for smartphones. To lessen the transmission cost, models can be compressed using a Deep Compression-like pipeline. As previously discussed, quantization is also considered. Finally, to improve inference performance, some of the models are processed using an 8-bit fixed-point datatype for the weights. Additionally, models shipped with the k-means quantization method typically use 5 or 6 bits for the weights.
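A compact sketch of that k-means ("codebook") weight quantization is shown below for a 5-bit codebook; it leans on scikit-learn's KMeans purely for brevity and is an illustration of the idea, not the production compression pipeline.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_quantize(weights, bits=5):
        """Cluster weights into 2**bits shared centroids; ship indices + codebook."""
        km = KMeans(n_clusters=2 ** bits, n_init=4, random_state=0)
        km.fit(weights.reshape(-1, 1))
        codes = km.labels_.astype(np.uint8).reshape(weights.shape)  # 5-bit codes (packed in practice)
        codebook = km.cluster_centers_.astype(np.float32).ravel()   # 32 fp32 centroids
        return codes, codebook

    w = np.random.randn(256, 128).astype(np.float32)  # stand-in for a conv/fc weight tensor
    codes, codebook = kmeans_quantize(w, bits=5)
    w_hat = codebook[codes]                           # dequantized weights used at inference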

                                                                                            Hand

                                                                                                                                                Pose

                                                                                                                                                              Action
ways directly correspond to inference performance. The

                                                                                                           Image

                                                                                                                               Image
newest generation of smartphones in the low-end tier
offer competitive inference performance as that in the
mid-end tier for both DNN models.
Furthermore, different DNN models respond differently to the degree of hardware resources available. For example, the DNN model used for human bounding box and keypoint detection (Mask-RCNN [28]) demands much higher compute and memory resource capacities. Thus, when comparing the inference time speedup of the latest generation of smartphones between the low-end and high-end performance tiers, we see a much higher speedup for smartphones equipped with more abundant resources: 5.62 times for Gen-4/High-End versus 1.78 times for Gen-4/Low-End. Although we still see higher inference time speedup in the latest generation of high-end smartphones, the speedup is less pronounced for the DNN model used for classification (ShuffleNet [27]).

This model-specific inference time comparison projects the performance of realistic DNN models onto the diverse spectrum of smartphone platforms described in Section 2. In addition to the previously shown peak performance analysis for the deployed smartphones in the wild, Figure 7 shows how different generations of smartphones across the different performance tiers react to two realistic DNN models. It is important to continue strong performance scaling for the remaining smartphones in the entire market for higher product penetration.

5.   VERTICAL INTEGRATION: PROCESSING VR INFERENCE FOR OCULUS

Oculus platforms create new forms of interaction by running multiple DNNs for tasks including hand, face, and body tracking. To provide a high-quality user experience, the models must run at a steady frame rate of 30 to 60 FPS, three times greater than mobile image and video performance requirements. Moreover, headsets include multiple cameras to cover a wide field of view. This puts the performance requirements of VR platforms on the order of many hundreds of inferences per second. This presents a significant challenge for embedded devices, as all processing has to be performed on-device with high performance delivery. To overcome the particular design challenges of the VR design space for mobile inference, Facebook takes a vertical design approach. In the case of Oculus, we explore and assess the suitability of Qualcomm Hexagon DSPs for DNN models by offloading the most computationally demanding DNNs and comparing the performance of the models running on DSPs and CPUs.

[Figure 8 is a chart of inferences per second (log scale) on CPU and DSP for Hand Tracking, Image Classification M1, Image Classification M2, Pose Estimation, and Action Segmentation.]
Figure 8: Inference time performance comparison between CPU and DSP.

5.1   DNN models and hardware used for VR platforms

The Oculus VR platform explores many state-of-the-art DNN models. Models are programmed in PyTorch 1.0 and the weights are quantized with PyTorch 1.0's int8 feature for mobile inference. Table 1 shows some of the key DNN models explored by Oculus: Hand Tracking, Image Classification Model-1, Image Classification Model-2, Pose Estimation, and Action Segmentation.
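As a rough illustration of what int8 weight quantization involves (independent of the specific PyTorch feature used in production), the sketch below performs symmetric per-tensor quantization with NumPy; the layer shape is hypothetical.

    import numpy as np

    def quantize_int8(w):
        """Symmetric per-tensor int8 quantization: w is approximated by scale * q, q in [-127, 127]."""
        max_abs = np.abs(w).max()
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, np.float32(scale)

    def dequantize_int8(q, scale):
        """Recover approximate float32 weights from the int8 codes and the scale."""
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 256).astype(np.float32)   # hypothetical layer weights
    q, scale = quantize_int8(w)
    w_hat = dequantize_int8(q, scale)
    print("storage: %d -> %d bytes" % (w.nbytes, q.nbytes))
    print("max abs error:", np.abs(w - w_hat).max())

The 4x storage reduction comes directly from replacing 32-bit floats with 8-bit integers plus a single scale factor per tensor.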
For mobile inference, the DNN models are offloaded using PyTorch 1.0's CPU backend and Facebook's BoltNN DSP inference backend. The mobile platform has a big.LITTLE CPU cluster with 4 Cortex-A73 and 4 Cortex-A53 cores and a Hexagon 620 DSP. All CPU cores are set to the maximum performance level, and the four high-performance CPU cores are used by the DNN models.
The DSP shares the same memory space with the mobile CPU cores and has a separate layer of caches, making it convenient to program but also isolated enough to prevent cache thrashing for other concurrent processes running on the mobile CPU. As we will later see in Section 6, dependable, stable execution is an important feature to have to guarantee user experience.

[Figure 9 is a chart of FPS, power (W), and temperature (C) over time (seconds) for CPU and DSP.]
Figure 9: Inference frames-per-second performance, power, and temperature comparison for a vision model for Oculus. Thermal throttling (marked with dotted lines) prevents the CPU from operating at an acceptable FPS performance level.

5.2   DSP evaluation results and analysis

Figure 8 compares the FPS of the DSP and CPUs for all models. The DSP clearly outperforms the CPU for all of the models, which span a variety of complexities and architectures, achieving an average speedup of 1.91x (ranging from 1.17x to 2.90x). The highest speedup comes from models with simple convolution operations, such as the Hand Tracking and Image Classification models.

When intensive memory-bound operations are involved, such as the depth-wise convolutions in the ShuffleNet-based image classification and pose estimation models, the speedup of the DSP over the CPU becomes less pronounced. This is because the memory load-store operations are at the granularity of the vector width or coarser, e.g., more than 128B in Hexagon DSPs. Thus, additional memory transformations are needed, introducing extra performance overhead. Furthermore, for memory-bound layers, such as grouped convolutions or depth-wise convolutions, extra computations are required to optimize the memory layout of activations and filters in order to fully take advantage of the SIMD units. Finally, across all models, additional system overhead can come from remote procedure calls that flush the L2 cache on the chipset.
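To illustrate the kind of memory-layout transformation referred to above, the sketch below converts activations from NCHW to NHWC with NumPy so that the channel values of each pixel become contiguous, which is the layout SIMD-friendly depth-wise kernels typically prefer. This is a simplified, hypothetical example; production backends fold such transforms into their operator implementations.

    import numpy as np

    def nchw_to_nhwc(x):
        """Reorder activations so that per-pixel channel values are contiguous in memory."""
        return np.ascontiguousarray(x.transpose(0, 2, 3, 1))

    x = np.random.randn(1, 32, 56, 56).astype(np.float32)   # NCHW activations (hypothetical shape)
    y = nchw_to_nhwc(x)                                      # NHWC copy; the transform itself costs extra memory traffic
    print(x.shape, "->", y.shape)

The copy is exactly the extra overhead described above: on memory-bound layers the cost of reshuffling data can eat into whatever speedup the vector units provide.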
In addition to using the amount of performance speedup to determine where a DNN model should be executed, designs for AR/VR wearables must consume as little power as possible for prolonged battery life and for the ergonomic requirement of platform temperature. We compare the performance, power, and temperature of the pose estimation model running on the CPU versus the DSP. Figure 9 shows that the CPU implementation initially consumes twice as much power as the DSP implementation. Then, thermal throttling kicks in, so the power consumption of the CPU implementation drops while still using 18% more power than the DSP. The thermal throttling has a significant effect on performance, degrading the FPS to 10 frames per second. For lower platform power consumption and operating temperature, Facebook takes the vertical design approach of offloading DNN models using the BoltNN DSP backend for its VR platforms.

Despite the higher performance and lower power consumption and operating temperature, the DSP implementation comes with significantly higher programming overhead. First, because most DSP architectures support fixed-point data types and operations, DNN models need to be quantized. Depending on the application domain and models, this may cause substantial accuracy loss. It also requires developers to port model operators to fixed-point implementations; otherwise, these operators can easily become the performance bottleneck for light-weight operations. Furthermore, developers must pay additional attention to optimizing memory layout; otherwise the memory hierarchy can become a contentious resource, leading to additional delay.

Last but not least, an important, yet less explored and understood, factor in determining where DNN models should be run (CPU versus accelerators) is inference time variation. Even if one can hand-optimize a CPU implementation such that the inference time meets the application-specific performance target, and the power and temperature results are competitive, offloading ML models to accelerators may still be more desirable, despite the higher programming overhead. We next introduce the role of performance variability for mobile inference.

6.   MAKING INFERENCE IN THE WILD: PRACTICAL CONSIDERATIONS FOR PROCESSING ON MOBILE DEVICES

Performance variability is a practical concern for Facebook because it is challenging to make guarantees on quality of service. Real-time constraints and model accuracy are often competing objectives: higher-quality models take longer to process but provide more accuracy. For example, we might conservatively use a smaller, less computationally expensive model to meet a 95% performance target across all devices and all App instances. However, if we had a better way to model and predict performance variability, we could put tighter bounds on it and use different models tuned to maximize accuracy while meeting real-time performance/FPS metrics, providing the best user experience (FPS) and service (model accuracy). In this section we show how much performance can vary and suggest a simple way to model it.
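One simple way to act on such a variability model is sketched below: among hypothetical candidate models, pick the most accurate one whose observed 95th-percentile latency still fits the frame budget. The model names, accuracies, and latency samples are all illustrative, not Facebook data.

    import numpy as np

    FRAME_BUDGET_MS = 33.3   # roughly a 30 FPS real-time target

    # Hypothetical per-model accuracy and measured on-device latency samples (ms).
    candidates = {
        "small":  {"accuracy": 0.88, "latency_ms": np.random.normal(12.0, 2.0, 1000)},
        "medium": {"accuracy": 0.92, "latency_ms": np.random.normal(22.0, 5.0, 1000)},
        "large":  {"accuracy": 0.95, "latency_ms": np.random.normal(35.0, 9.0, 1000)},
    }

    def pick_model(candidates, budget_ms, percentile=95):
        """Return the most accurate model whose tail latency fits within the budget."""
        feasible = [
            (spec["accuracy"], name)
            for name, spec in candidates.items()
            if np.percentile(spec["latency_ms"], percentile) <= budget_ms
        ]
        return max(feasible)[1] if feasible else None

    print("selected model:", pick_model(candidates, FRAME_BUDGET_MS))

Using a tail percentile rather than the mean is what turns a variability model into a quality-of-service guarantee.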
[Figure 10 is a chart of inference time (ms, log scale) versus chipset generation (A6 through A11).]
Figure 10: The inference time performance improves over generations of iPhones. However, within each generation, significant inference performance variability is observed with a large number of outliers.

[Figure 11 is a histogram of the percentage of samples versus inference time (0 to 16 ms) for the A9, A10, and A11 chipsets.]
Figure 11: The inference time follows an approximate Gaussian distribution with the mean centered at 2.02ms and the standard deviation of 1.92ms.

6.1   Performance variability observed in the production environment

To arrive at an optimal design point, we perform rigorous evaluations for Facebook services and use the performance characterization results to drive better solutions. A key observation we derive from the performance data is that mobile inference performance exhibits significant variability, even across the same device running the same software stack. Figure 10 shows the inference time of the most time-consuming convolutional neural network layer of a key model across several generations of iPhone SoCs (x-axis). As expected, we see that the inference time (y-axis) is the lowest for the most recent generation of iPhones, i.e., Apple A11 processors. Somewhat less intuitive is the wide variability of inference performance, even within the same generation of iPhone SoCs.

We examine the inference performance results internally using our small-scale smartphone benchmarking lab. While we see a general trend of performance variability across key machine learning models, the degree of performance variability is much less pronounced, usually less than 5%. This presents a challenge, as ideally we would benchmark new models under the exact conditions we expect the models to run under. From our observations, this undertaking seems impractical, as it would require a fleet of devices.

The much higher performance variability in the production environment is likely due to higher system activity in deployed smartphones and the environment the smartphones are in (e.g., the ambient temperature or how many Apps a user allows to run concurrently). Concurrent processes or background activities cause resource contention, leading to performance perturbation [32]. Furthermore, the performance of mobile processors is limited not only by processor junction temperature but also by smartphone surface temperature for ergonomic requirements [33, 34]. This means that, depending on how and where smartphones are used, the likelihood of thermal throttling is potentially much higher in the production environment, which represents more realistic usage scenarios. Finally, process variation and battery aging also contribute to performance variability. To have representative performance results and analysis, it is important to perform in-field studies for machine learning designs and performance evaluation in the mobile space.

6.2   Do the performance variability characteristics follow certain trends or statistical distributions?

It is clear that inference performance on smartphones is non-deterministic and follows a wide distribution. This is highly undesirable, as the non-deterministic inference time translates directly into varied quality of user experience. If we were able to model and predict variability, we could optimize designs by, for example, customizing networks to best suit different mobile platforms and users depending on the situation. In-field functionality and performance evaluation is an important part of our iterative model fine-tuning process.

Figure 11 illustrates the histogram of the inference time of the key machine learning layer across three different generations of iPhone mobile SoCs. In particular, the inference time for A11 follows an approximate Gaussian distribution with the mean centered at 2.02ms and a standard deviation of 1.92ms. A recent work by Gaudette et al. [35] shares similar observations for mobile applications in general and proposes modeling techniques to predict the non-determinism in performance with general forms of Gaussian distributions. The follow-on work [36] takes a data-driven approach using arbitrary polynomial chaos expansions, which approximate stochastic systems with a set of orthogonal polynomial bases, without any assumption about the workload/system statistical distribution. With the ability to model performance variability, a certain level of inference performance can be guaranteed, leading to overall better quality of user experience.
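A minimal sketch of this kind of modeling is shown below: fit a Gaussian to measured inference-time samples and derive a conservative latency bound from the fitted parameters. The samples are synthetic, generated with the mean and standard deviation reported above, rather than real device measurements.

    import numpy as np

    # Synthetic inference-time samples (ms), drawn to mimic the reported A11 distribution
    # (mean 2.02 ms, standard deviation 1.92 ms).
    samples_ms = np.random.normal(loc=2.02, scale=1.92, size=10_000)
    samples_ms = samples_ms[samples_ms > 0]          # latencies cannot be negative

    mu = samples_ms.mean()
    sigma = samples_ms.std(ddof=1)
    p95 = np.percentile(samples_ms, 95)

    print(f"fitted mean = {mu:.2f} ms, std = {sigma:.2f} ms, p95 = {p95:.2f} ms")
    # A conservative per-device latency estimate can then be taken as, e.g., mu + 2 * sigma,
    # instead of the mean alone.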
In summary, the significant performance variability observed for mobile inference translates into varied user experience. If taking the classic approach of modeling and evaluating ML model performance and energy efficiency with an average value over experimental runs, designers risk missing the required level of performance quality. Thus, particularly for mobile inference benchmarking, it is critical to describe how severe performance variability is for a design. One option is to report evaluation results (e.g., inference time) with the average, maximum, minimum, and standard deviation of the experimental measurement values. Furthermore, our observation here also pinpoints the importance of in-field studies for machine learning designs.
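The reporting recommendation above amounts to summarizing a latency distribution rather than a single mean; a small helper along the following lines (with synthetic measurements standing in for real ones) is sufficient.

    import numpy as np

    def summarize(latencies_ms):
        """Report the spread of measured inference times, not just the average."""
        a = np.asarray(latencies_ms, dtype=np.float64)
        return {
            "mean": a.mean(),
            "std": a.std(ddof=1),
            "min": a.min(),
            "max": a.max(),
            "p50": np.percentile(a, 50),
            "p95": np.percentile(a, 95),
            "p99": np.percentile(a, 99),
        }

    measured = np.random.lognormal(mean=1.0, sigma=0.4, size=500)   # synthetic measurements (ms)
    for name, value in summarize(measured).items():
        print(f"{name}: {value:.2f} ms")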
7.   DISCUSSION AND FUTURE DIRECTIONS

This section discusses the implications of the results shown in this paper, which influence important design decisions within Facebook. We also highlight research directions for the years to come.

The majority of mobile inference runs on CPUs. Given all the engineering effort put into accelerating DNN inference with co-processors and accelerators, it is somewhat counterintuitive that inference on Android devices is processed on mobile CPUs. The reality is that it is currently too challenging to maintain code bases optimized to perform well across the wide range of Android devices (Section 2). Moreover, as illustrated in Figure 1, even if we did port all inference to run on co-processors, the performance gains would not be substantial enough to justify the implementation effort.

Most inference runs on CPUs that are at least six years old. Future-facing research is important, but the reality is that having large-scale, global impact on smartphones may be further off than we think. As presented in Section 2, most inferences are made on processors released in 2011 and 2012. This is not just a case of old smartphones that are still out there or being left on; a major portion of these smartphones were sold within the last one to two years. To provide the same experience to all Facebook users, substantial software optimization effort is targeted at inference on these CPUs, the ones that represent the largest market share.

The performance difference between a mobile CPU and GPU/DSP is not 100x. Given that the performance gap between server CPUs and GPUs is usually 60-100x, one might suspect that a similar trend holds on the mobile side. However, this is not the case. Mobile GPUs, and even DSPs, are less than 9x faster than mobile CPUs. A similar finding is reported in [6]. This is largely because mobile GPUs were not designed to process the same class of high-resolution graphics rendering that discrete GPUs are. Mobile GPUs help offload image processing in a relatively low-end environment. DSPs are slightly more promising: mobile inference is slowly transitioning to execute on DSPs. Furthermore, many mobile CPUs come with a decently provisioned SIMD unit, which, when properly programmed, provides sufficient performance for vision-based inference.

Programmability is a primary roadblock to using mobile co-processors/accelerators. As seen in Section 2, one of the main challenges with using GPUs and DSPs is programmability. For Android smartphones, OpenCL is not reliable enough for a business at the scale of Facebook. DSPs have more robust software stacks. However, porting code still takes a long time, as the implementation must be signed and whitelisted by DSP vendors. The story for Apple devices is better, partially because there is much less variety of devices and software. Metal also plays a large role, as it is relatively straightforward to use. Therefore, many iPhone inferences are run on mobile GPUs. With the introduction of Vulkan and ongoing DSP engineering efforts, inference is making its way onto co-processors. Looking forward, research and engineering effort put into making existing mobile GPU and DSP hardware more amenable to processing DNN inference would have a high impact on ML adoption at the edge.

Co-processors and accelerators are used for power and stable performance; speedup is often secondary. The main reason mobile inference is ported to a co-processor is improved efficiency and dependable, predictable execution time. While there are applications that require specialized hardware for performance, we suspect this finding is not a Facebook- or DNN-specific phenomenon. Because our main focus is end-user usability, unless the performance gain is significant (e.g., 50x) and achieved using better tools and infrastructure, it is unlikely that most of these accelerators will actually be utilized when found on mobile devices.

Accuracy is a priority, but it must come with a reasonable model size. The accuracy of a DNN model can be tied directly to user experience [1]. It is also generally true that larger models result in higher accuracy. When it comes to mobile, it is important to maximize accuracy while keeping model sizes reasonable. Facebook focuses on model architecture optimization to identify highly accurate models while minimizing the number of parameters and MACs. Trained models are then further refined for efficiency with aggressive quantization and weight/channel pruning. Looking forward, methods to improve architecture search, including techniques such as BayesOpt [37, 38], AutoML [39], and [40], are of particular interest.
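As an example of the kind of pruning step mentioned above (not Facebook's actual tooling), the sketch below removes the weakest output channels of a convolution layer by L1-norm ranking. In practice, the following layer's input channels must be adjusted accordingly and the model fine-tuned afterwards to recover accuracy.

    import numpy as np

    def prune_channels(conv_w, keep_ratio=0.75):
        """Rank output channels of a conv weight tensor (O, I, kH, kW) by L1 norm and
        keep only the strongest fraction; returns pruned weights and kept channel indices."""
        norms = np.abs(conv_w).sum(axis=(1, 2, 3))            # one L1 norm per output channel
        n_keep = max(1, int(round(keep_ratio * conv_w.shape[0])))
        keep = np.sort(np.argsort(norms)[-n_keep:])           # indices of the retained channels
        return conv_w[keep], keep

    w = np.random.randn(64, 32, 3, 3).astype(np.float32)      # hypothetical conv layer weights
    w_pruned, kept = prune_channels(w, keep_ratio=0.5)
    print(w.shape, "->", w_pruned.shape)                       # (64, 32, 3, 3) -> (32, 32, 3, 3)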
There is also a big push for generally applicable optimizations. Recent work on hardware for machine learning and efficient training and inference has substantially advanced the state of the art [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77]. However, many of the ideas published in the top architecture conferences are specific to custom hardware and not applicable to general-purpose existing SoCs.

Performance variability is a real problem. An often overlooked issue in the architecture research community is performance variability, which is a serious concern for real-time and user-facing systems. Optimizing designs for the average case risks the user experience for a large portion of the market share, while targeting designs for all mobile devices in the market implies conservative design decisions. It is important to present performance results that take the performance distribution into account, particularly for mobile inference.

Researchers need to consider full-picture and …