Moby: A Mobile Benchmark Suite for Architectural Simulators


Yongbing Huang∗†, Zhongbin Zha∗†, Mingyu Chen∗, Lixin Zhang∗
∗ State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
† University of Chinese Academy of Sciences, Beijing, China
Email: {huangyongbing, zhazhongbin, cmy, zhanglixin}@ict.ac.cn

Abstract: Mobile devices such as smartphones and tablets have become the primary consumer computing devices, and their rate of adoption continues to grow. The applications that run on these mobile platforms vary in how they use hardware resources, and their diversity is increasing. Performance and power limitations also vary widely across mobile platforms. Thus there is a growing need for tools to help computer architects design systems to meet the needs of mobile workloads. Full-system simulators are invaluable tools for designing new architectures, but we still need appropriate benchmark suites that capture the behaviors of emerging mobile applications. Current benchmark suites cover only a small range of mobile applications, and many cannot run directly in simulators due to their user interaction requirements.

In this paper, we introduce and characterize Moby, a benchmark suite designed to make it easier to use full-system architectural simulators to evaluate microarchitectures for mobile processors. Moby contains popular Android applications, including a web browser, a social networking application, an email client, a music player, a video player, a document processing application, and a map program. To facilitate microarchitectural exploration, we port the Moby benchmark suite to the popular gem5 simulator. We characterize the architecture-independent features of Moby applications on the simulator and analyze the architecture-dependent features on a current-generation mobile platform. Our results show that mobile applications exhibit complex instruction execution behaviors and poor code locality, but current mobile platforms, especially their instruction-related components, cannot meet their requirements.

I. INTRODUCTION

Mobile devices, especially smartphones and tablets, have become an important world-wide market. From the application point of view, a wide variety of mobile applications are now widely used; these include web browsers, social networks, e-mail clients, audio and video players, document processing systems, and map programs, to name a few. Different types of applications present different requirements for the hardware components of mobile platforms. From the mobile operating system point of view, Android [1] and iOS [2] have the highest market share and growth rate. Android adoption has ramped up quickly, gaining in popularity six times faster than iOS. Android and iOS use different programming languages and execution models, and they differ in their utilization of hardware resources. Therefore, the requirements placed on hardware resources for different mobile applications vary.

As for mobile platforms, ARM-based [3] mobile processors such as Apple’s Ax [4], TI’s OMAP [5], and Qualcomm’s Snapdragon [6] are more prevalent than processors like Intel’s Atom [7]. Generally, as the performance of these mobile processors improves, their microarchitectures become more complicated. For example, mobile processors with four cores, an out-of-order execution model, and two-level caches have become mainstream. Mobile system designers must consider how application and OS diversity affect their design choices in this increasingly complex design space.

Benchmarking and architectural simulation are two important tools for processor design and computer architecture research. To be relevant, a benchmark suite for architectural research must satisfy at least two properties. First, workloads in the benchmark suite should be diverse enough to exhibit the range of behaviors of the target applications. Second, all the applications should be portable to architectural simulators. However, most current mobile benchmark suites represent only a small subset of mobile applications [8], [9], [10], [11], [12], [13], and some cannot be run directly in simulators due to user-interaction requirements (e.g., the interactive games and audio player of Gutierrez et al. [14]). Meanwhile, existing benchmarks such as SPEC CPU2006 exhibit significantly different behaviors from interactive mobile applications [13], [14].

In this paper, we develop Moby, a benchmark suite designed to evaluate microarchitectures of mobile platforms in full-system architectural simulators. Generally, there are two design issues that drive our benchmark suite. First, mobile applications on different operating systems are incompatible. Since Android is the most commonly used operating system for mobile devices, Moby contains only mobile applications that run on the Android OS. Second, most popular mobile applications are commercial, and thus their source code is not generally available. We choose only applications that can be freely downloaded, in order to avoid licensing issues. In total, Moby contains 10 mobile applications spanning nine categories, including a web browser, a social networking application, email, audio, video, document, map, and game. Except for the web browser application BBench [14], all applications are selected from the Google Play Store [15].

Since our benchmark suite is intended to drive architectural simulators, all applications should be executable without manual user inputs. Although AutoGUI [13] provides a user interface automation tool to record and deterministically replay user actions, we use an alternative method to bypass user interaction by executing only typical representative operations for these mobile applications. While Moby can be executed
on many simulators that support Android OS, we take the commonly used gem5 simulator [16] as an example to test and characterize Moby. The gem5 disk image for Moby has already been made public [17].

We measure microarchitecture-independent features on the gem5 simulator with the ARM ISA and microarchitecture-dependent features on the ARM-based Pandaboard development board [18]. The microarchitecture-independent features include instruction mix, working set size, data and instruction locality, and binary execution behaviors. The instruction features show that the representative operations for all applications execute several billion instructions and that nearly 70% of branches are conditional. Furthermore, most applications spawn about 20 processes and invoke more than 20 libraries. On the Pandaboard, measured microarchitecture-dependent features mainly include CPI and the behaviors of the branch predictor, cache, TLB, and memory components. Experimental results demonstrate that all applications present high CPIs, which implies that these mobile applications and current ARM-based mobile platforms are not well matched. In particular, the instruction-related resources (branch predictor, instruction cache, and TLB) suffer from high miss rates.

In summary, we make three contributions:
  • We present Moby, a new mobile benchmark suite that contains a diverse set of applications and is appropriate for simulation-based design space exploration.
  • We extract typical representative operations of interactive applications in Moby and automatically execute them on full-system architectural simulators.
  • We describe both microarchitecture-independent and microarchitecture-dependent characteristics of all Moby applications.

The rest of this paper is organized as follows. Section II describes the applications included in Moby. Section III introduces our experimental platforms and tools. Then the microarchitecture-independent features and microarchitecture-dependent features of Moby are illustrated in Section IV and Section V, respectively. Finally, we describe related work in Section VI and conclude in Section VII.

II. THE MOBY BENCHMARK SUITE

The goal of this work is to define a benchmark suite that can be used to design and optimize mobile processors. In this section, we first present our methods to select suitable applications for such a suite, and then we describe each included application in detail.

A. Benchmark Selection Methods

1) Requirements: As a mobile benchmark suite for architectural simulators, Moby should both contain emerging and diverse applications and also support research.

The rapid innovation of the mobile Internet spawns many emerging applications with new and varied behaviors. Mobile platform architects need tools to model the diverse behaviors and resource requirements of current and emerging mobile applications in different application markets.

Benchmark suites intended to support architectural design space exploration and system research should make it possible to instrument, manipulate, and model the constituent applications in detail. However, most popular mobile applications are commercial, which complicates instrumentation because their source code is unavailable. Although most Moby applications lack source code, all can be downloaded for free. Note that most mobile applications involve user interaction and require a network connection, both of which are difficult to implement or model in architectural simulators. The dependence on networks can easily be removed by buffering any required remote data in local storage.

2) User Interaction: A major difficulty in analyzing interactive mobile applications is generating reproducible results without manual user inputs. Moreover, the slow execution speed of full-system simulators makes it impractical to incorporate user-action inputs in experiments. There are two main solutions to cope with user interaction in simulators. Tools like AutoGUI [13] and Xnee [19] provide automation capabilities to record and replay user inputs. Using similar tools, we are working to identify representative code pieces that suffer from large response latencies. However, most current automation tools still suffer from shortcomings like nondeterministic replay.

The other, simpler solution is to avoid user interaction in simulators. We find that most main activities of mobile applications can be executed as separate processes, and user interactions are mainly used to specify inputs for these activities. By launching these activities from the command line with their inputs specified manually, user interactions are no longer required. Compared to using automation tools, this method is both simple and efficient, and thus we adopt this approach for Moby.

The main activities illustrated above are considered to be representative operations of mobile applications, and they can be extracted from the AndroidManifest.xml file for mobile applications running on the Android OS. In the current version of Moby, only typical operations for each application are executed in architectural simulators. In addition, we can combine several typical operations together, in order to improve simulation accuracy.

3) Selection Steps: We take five steps to select the applications in the Moby benchmark suite. Given the popularity and maturity of the Android ecosystem, we study mobile applications executed on Android OS. Initially, we choose commonly used programs from the Google Play Store [15] as our application pool. Then, for each category in the Google Play Store, we focus only on popular applications that are free and have high download rates. A subset of applications studied can be found in Zhang et al. [20]. Next, we measure microarchitectural characteristics of these applications on a real platform (see Section V). After that, according to their characteristics, we select several applications to represent each category. Finally, we extract representative operations for selected applications, and verify whether these operations can be automatically executed without interaction with users and whether necessary data downloaded from networks can be buffered and replayed offline. Moby includes only applications that pass these tests.

As a result, we have chosen nine applications from the
Google Play Store for our benchmark suite; summaries for these are shown in Table I. Almost all applications are commercial, but they can be downloaded for free. Moreover, we choose BBench [14] to represent web-browser applications, creating a suite of 10 applications. We will add more applications to our benchmark suite as more mainstream or popular applications emerge.

TABLE I. SUMMARY OF MOBY

 Bench            Category         Typical OP              Input
 BBench∗          Web Browser      Load web pages          Web pages
 K9Mail           Email            Load/Show emails        Buffered emails
 SinaWeibo        Social Network   Load information        Buffered texts
 NeteaseNews      News             Check and load news     Buffered news
 KingsoftOffice   Document         Open doc/xls/ppt file   A doc file
 AdobeReader      Document         Open pdf file           A PDF file
 BaiduMap         Map              Load an area’s map      Buffered maps
 MXPlayer         Video            Play a video            A video file
 TTPod            Audio            Play a song             A music file
 FrozenBubble     Game             Load game               Null
 ∗: BBench is from Gutierrez et al. [14]

B. Benchmark Descriptions

In this subsection, we describe each application’s use and features in detail.

1) BBench [14] is an automated web-browser page-rendering benchmark that tests rendering performance. It comprises a sequence of snapshots of a varied selection of the most popular sites. The webpages included in the benchmark contain diverse content and page styles (e.g., dynamic content, JavaScript, video, images, Flash, CSS, HTML5, etc.). Its typical operation is simply loading a webpage.

2) K9Mail [21] is an open-source email client running on the Android platform, which supports the commonly used POP3 and IMAP4 protocols. Although K9Mail supports features like sending/receiving email, searching, and multi-folder syncing, our benchmark only chooses loading and displaying emails buffered in local storage as its typical operations. This requires no network connection, and it can be easily automated without user interaction.

3) SinaWeibo [22] is a client for one of China’s biggest social networking and microblogging services. It allows users to publish information instantly and share it with others. The information includes text, pictures, music, and video. Loading and displaying information is the typical operation for social networking applications. Again, this information is buffered locally.

4) NeteaseNews [23] is a news reader application. Users can obtain news by subscribing to magazines, newspapers, and other resources. The typical operation for news readers is checking the news from the server and listing articles. In our benchmark, we substitute local data for data on the remote server.

5) KingsoftOffice [24] is an efficient mobile office application. It contains rich editing features, and supports 23 kinds of files, including DOC, XLS, PPT, and PDF. Writer, presentation, and spreadsheet are commonly used KingsoftOffice programs to manipulate DOC, PPT, and XLS files, respectively. Hence, the typical operations for KingsoftOffice are opening files with these formats.

6) AdobeReader [25] is an application for reliably viewing and interacting with PDF documents. Its typical operation is viewing a PDF file.

7) BaiduMap [26] is a mobile map client from China’s biggest search engine and is similar to Google Maps. The map client presents detailed maps with 3D buildings, supports navigation, and displays neighboring restaurant and hotel information. Loading the map for a specific area from offline maps is its typical operation.

8) MXPlayer [27] is a video player that supports almost every movie format. It applies hardware acceleration to all the videos with the help of a new H/W decoder, and it supports multi-core decoding. Its typical operation consists of playing an MP4 video clip stored on local disk.

9) TTPod [28] is a music player for a wide variety of different audio formats. It provides high-quality decoding, highly accurate lyrics, and album art downloads. It supports a rich graphical user interface with built-in graphics, a customizable equalizer function, and floating lyrics. The operation for this application is to play the first minute of an MP3 file.

10) FrozenBubble [29] is a puzzle game for Android. Interactive games have become important applications on mobile devices with the introduction of high performance CPUs and mobile GPUs. However, interactive games heavily rely on users and thus cannot be automatically executed in simulators. This game is chosen because it can be fully loaded and simply played without any user interaction.

C. Input Sets

Generally, each benchmark suite should provide several input sets that represent various usage scenarios. For example, PARSEC [30] applications contain six input sets, each of which processes different amounts of data. Unlike those applications designed for high performance servers, most mobile applications do not focus on data processing. Therefore, in the current version of Moby, we use just one input size for each application. Nevertheless, multiple input sets for Moby applications can be easily produced. For applications like KingsoftOffice, AdobeReader, MXPlayer, and TTPod (which mainly execute or process input files), input sets can easily be constructed by selecting input files with varied types and sizes. The input sets of other mobile applications are actually buffered network data, which the users can obtain from real mobile platforms by performing different web queries. These kinds of applications include K9Mail, SinaWeibo, NeteaseNews, and BaiduMap. For example, users can construct input sets for BaiduMap by downloading maps of various areas from anywhere on the internet.

III. METHODOLOGY

In this section, we explain how we characterize the Moby benchmark suite in terms of platforms and tools.
TABLE II. MOBY INSTRUCTION SUMMARY

 Bench            Instruction Count   Branches   Branches        Loads     Stores    Working Set
                  (Billions)          Total      Cond./Total∗                        Size (MB)
 BBench†          2.48                14.43%     69.5%           23.05%    12.16%     80
 K9Mail           1.18                11.00%     72.60%          20.03%    9.34%      64
 SinaWeibo        2.23                16.92%     68.35%          27.21%    14.68%    114
 NeteaseNews      2.65                16.58%     69.01%          25.85%    12.22%    104
 KingsoftOffice   2.24                16.59%     68.73%          26.13%    14.06%     87
 AdobeReader      2.09                15.17%     70.47%          23.74%    12.19%     83
 BaiduMap         3.53                14.31%     72.50%          22.79%    12.29%    102
 MXPlayer‡        3.84                18.22%     70.64%          23.79%    12.76%     97
 TTPod‡           3.87                15.18%     68.45%          25.49%    12.84%    126
 FrozenBubble     0.28                15.59%     71.76%          21.53%    9.66%      47
 ∗: Percent of all branch instructions that are conditional
 †: BBench only loads each page once
 ‡: TTPod and MXPlayer each play about three seconds of music/video
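
The percentages in Table II can be turned into absolute event counts with simple arithmetic. The following Python sketch is illustrative only: the dictionary and function names are hypothetical, and the only data used are the values transcribed from the table. It multiplies each instruction count by the corresponding percentages and averages the conditional-branch share across applications, which comes out close to the "nearly 70%" figure quoted in the introduction.

```python
# Rows transcribed from Table II. Each tuple holds:
# (instructions in billions, branch %, conditional share of branches %,
#  load %, store %). Names below are illustrative, not part of Moby.
TABLE_II = {
    "BBench":         (2.48, 14.43, 69.50, 23.05, 12.16),
    "K9Mail":         (1.18, 11.00, 72.60, 20.03,  9.34),
    "SinaWeibo":      (2.23, 16.92, 68.35, 27.21, 14.68),
    "NeteaseNews":    (2.65, 16.58, 69.01, 25.85, 12.22),
    "KingsoftOffice": (2.24, 16.59, 68.73, 26.13, 14.06),
    "AdobeReader":    (2.09, 15.17, 70.47, 23.74, 12.19),
    "BaiduMap":       (3.53, 14.31, 72.50, 22.79, 12.29),
    "MXPlayer":       (3.84, 18.22, 70.64, 23.79, 12.76),
    "TTPod":          (3.87, 15.18, 68.45, 25.49, 12.84),
    "FrozenBubble":   (0.28, 15.59, 71.76, 21.53,  9.66),
}


def absolute_counts(name):
    """Convert one row's percentages into absolute instruction counts."""
    instr_billions, branch_pct, cond_pct, load_pct, store_pct = TABLE_II[name]
    total = instr_billions * 1e9
    branches = total * branch_pct / 100
    return {
        "branches": branches,
        "conditional_branches": branches * cond_pct / 100,
        "loads": total * load_pct / 100,
        "stores": total * store_pct / 100,
    }


def mean_conditional_share():
    """Unweighted mean of the Cond./Total column, in percent."""
    return sum(row[2] for row in TABLE_II.values()) / len(TABLE_II)


if __name__ == "__main__":
    for name in TABLE_II:
        counts = absolute_counts(name)
        millions = counts["conditional_branches"] / 1e6
        print(f"{name:15s} conditional branches: {millions:8.1f} M")
    print(f"mean conditional share: {mean_conditional_share():.2f}%")
```

The mean here is unweighted across applications; weighting by instruction count would be an equally reasonable choice and gives a slightly different number.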

A. Platforms

We measure the Moby benchmark suite on both the gem5 simulator [16] and the Pandaboard ES [18] development board running Android 4.0 (ICS). The gem5 simulator is a widely used architecture simulator which supports the Alpha, ARM, SPARC, MIPS, POWER, and x86 ISAs. By default, the gem5 simulator provides several machine configurations for ARM ISAs. These machine configurations, which contain the parameters of main hardware components, are almost the same as the configurations of real ARM-based development boards such as Versatile Express [31].

The Pandaboard ES board comes with a market-quality OMAP 4460 system-on-chip (SoC) equipped with a dual-core Cortex A9 processor [32] manufactured on the 45 nm process node and 1 GB LPDDR2 DRAM. The Cortex A9 processor is a complex out-of-order four-wide superscalar core with eight pipeline stages. It has 32 KB 4-way set associative L1 I/D caches and a 512 KB 16-way set associative L2 cache.

In our experiments, the configurations of main hardware components such as cache and memory in gem5 are set to match those of the Pandaboard. Their operating systems and disk images are also nearly the same.

B. Tools

In order to study instruction behaviors, we modify the gem5 simulator to collect the instruction trace for each application. Meanwhile, we can also map instructions to their binaries such as libraries, the OS kernel, and the application binary file by dumping mapping tables between instructions’ virtual addresses and binaries. The mapping tables are just the contents of the proc file "/proc/pid/maps" in the Android file system. This information is maintained in the virtual memory structure and can be tracked by the task structure of processes. Thus, in the simulator, we only need to find these task structures for different processes and then read out the corresponding contents.

To study the performance of processor and memory components, we measure the microarchitecture characteristics of the Moby suite using hardware performance counters [3] on the Pandaboard. The hardware performance events provided by the Cortex A9 processor cover most main components, including the processor pipeline, cache, and TLB. The Cortex A9 provides six core counters that can count up to six events simultaneously, one extra cycle counter, and two L2 cache counters. However, the metrics shown in Section V cannot be directly acquired or computed using the above counters when running each application just once. Hence, we repeat each experiment multiple times with different combinations of counters, and report average values from ten measurements. All performance event data are collected using the lightweight performance counter tool TopMC [33].

IV. MICROARCHITECTURE-INDEPENDENT CHARACTERIZATION

Microarchitecture-independent characteristics enable us to understand the inherent nature of applications. In this section, we provide an overview of the microarchitecture-independent characteristics of Moby in terms of instruction mix, working set size, spawned processes, invoked libraries, and code and data locality. Note that most mobile applications execute many short activities, where each activity only accounts for a few billion instructions. Compared to the trillions of instructions for SPEC CPU2006 applications [34], executing these few billion instructions is much more suitable for slow, full-system simulators. We use the gem5 full-system simulator to execute the representative operations for each workload shown in Table I¹, and collect all the following microarchitecture-independent metrics. We find that our workloads share similar instruction profiles even though their working-set sizes vary significantly.

¹ We have already released the corresponding gem5 disk images and execution scripts for all Moby applications [17].

A. Instruction Mix

The mix of instructions reflects the requirements on different hardware resources. For example, load and store instructions rely on cache and memory resources. Different types of branch instructions indirectly reveal the complexity of programs and their demands on branch predictors. As shown in Table II, load and store instructions account for about 25% and 12%, respectively, for most applications. Compared to most integer benchmarks of SPEC CPU2006, which present diverse load and store behaviors [35], the percentages of load and store
[Figure 1: ten panels, (a) BBench, (b) K9Mail, (c) SinaWeibo, (d) NeteaseNews, (e) KingsoftOffice, (f) AdobeReader, (g) BaiduMap, (h) MXPlayer, (i) TTPod, (j) FrozenBubble; x-axis: Reuse Distance; y-axis: Percent of Requests; series: Inst., Data]

Fig. 1.                       L1 reuse distance distributions

[Figure 2: CDF versus Reuse Distance, one curve per application (the same ten applications as in Fig. 1); annotation: 16-way set associative cache]

Fig. 2.                       L2 reuse distance distributions

instructions are similar for Moby applications. Another 15%                                              implies that only a small portion of each touched page is used.
of instructions are branches for all applications except MXPlayer                                         Given this, these applications are likely to suffer frequent TLB
and K9Mail. Meanwhile, conditional branches account for                                                  misses.
nearly 70% of these branch instructions across all applications.
Generally, each conditional branch instruction may result in
executing a wrong path and consequently require out-of-order                                             C. Locality
processors to roll back. The high percentage of conditional                                                  In order to gain a deeper understanding of the code and data
branch instructions is likely to trigger many mispredictions                                             locality in Moby applications, we analyze the reuse distances
with large penalties, which will affect the overall performance.                                         (the number of distinct references between two successive uses
                                                                                                         of a line) of all references to two different cache levels. Fig-
B. Working Sets                                                                                          ure 1 shows the reuse distances of instruction and data requests
                                                                                                         for each Moby application. All the requests are captured when
   Working-set size can be measured at cacheline or page                                                 they access L1 instruction or data caches. Figure 2 shows the
granularities, depending on our purpose. We choose pages                                                 cumulative distribution function (CDF) of reuse distances for L2
(4KB) as our basic granularity in this paper because we                                                  cache requests. The requests studied are those that
aim at studying main memory access behaviors. Note that                                                  miss in the 32KB L1 instruction/data caches.
most typical operations of Moby applications only last several
seconds, and thus we consider working-set sizes to be the total                                              As shown in Figure 1, Moby applications present similar
number of pages touched during the whole execution.                                                      instruction and data locality behaviors. Typically, only about
                                                                                                         30% of instruction references have reuse distances less than
    Half the working sets in Table II approach or exceed                                                 four, which is the set associativity of the Pandaboard instruc-
100 megabytes. Only K9Mail and FrozenBubble, which only                                                  tion cache. Memory references with larger reuse distances
execute around a billion instructions, have working sets smaller                                         suffer misses under LRU replacement. Some instructions have
than 65 MB. Even so, all working sets exceed the capacity                                                a zero reuse distance because whenever a line is fetched into
of the last-level cache. In contrast to these large working sets,                                        the instruction queue, subsequent instructions will also be
mobile input sets are usually small, and the applications do not                                         found in the queue: no cache access is required. The figure
execute sustained memory accesses. For example, SinaWeibo                                                shows that highly associative instruction caches (64-way or more)
typically loads tens of small text messages at a time, which                                             could service over 80% of instruction references.
[Figure 3: bar chart; series: Processes, Libraries]

Fig. 3.   Numbers of processes and invoked libraries
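
The reuse-distance metric used in the locality analysis (the number of distinct lines referenced between two successive uses of a line) can be illustrated with a small sketch. This is not the tooling used for the measurements; the function name, the list-of-addresses trace format, and the 32-byte line size are assumptions made for illustration only.

```python
from collections import OrderedDict


def reuse_distances(trace, line_size=32):
    """Return one reuse distance per reference (None for a first touch).

    `trace` is an iterable of byte addresses; `line_size` (an assumed value)
    maps each address to a cache line.
    """
    stack = OrderedDict()  # lines ordered from least to most recently used
    distances = []
    for addr in trace:
        line = addr // line_size
        if line in stack:
            keys = list(stack.keys())
            # Every line to the right of `line` was touched since its last use.
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)
            stack[line] = True
    return distances


if __name__ == "__main__":
    # Lines touched: 0, 1, 2, 0, 0, 1
    print(reuse_distances([0, 32, 64, 0, 0, 32]))
```

A distance of zero corresponds to back-to-back references to the same line, which is the case the text attributes to instructions already present in the instruction queue.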

    As for data requests, Figure 1 indicates that Moby ap-
plications generally present good data locality. For a four-        Fig. 4.   Instruction flow distributions for KingsoftOffice
way set associative data cache, nearly 70% of lines can be
reused. Requests with a reuse distance of one constitute 40% of
all accesses, which implies that data within a cacheline enjoy      colorful segments imply that those binaries are executed con-
high temporal locality.                                             tinuously without suffering interference from other binaries.
                                                                    Figure 4 illustrates that the Android kernel and five other
    Figure 2 shows the reuse distance distribution of L2 cache      binaries (dalvik-jit-code-cache, libdvm.so, libcutils.so, libc.so,
accesses. For SinaWeibo, TTPod, and NeteaseNews, about              and libnativehelper.so) dominate the execution of Kingsoft-
40% of memory references have reuse distances smaller than          Office. These binaries can be organized into three groups,
16 (the associativity of the Pandaboard L2 cache). At this          Java-language related, C-language related, and system related.
associativity, reused memory locations for the remaining ap-        Moreover, the execution switches among different binaries
plications only reach 20%. Moreover, as reuse distance grows,       frequently. For instance, the executions of the libc library
these “reusability ratios” increase only gradually until they       and the Android kernel are interleaved. In such situations,
level off at 512. This implies that 16 is a good choice for         instruction locality and branch prediction accuracy may be
L2 cache associativity on mobile platforms.                         affected, which results in poor performance for instruction-
                                                                    related components.
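
The "reusability ratios" discussed for the L2 cache can be read as points on the cumulative distribution of reuse distances. The sketch below shows one way to compute them. The function name and the choice to keep first-touch references (recorded as None) in the denominator are assumptions, and the sample distances are made up for illustration.

```python
def reusability_ratios(distances, thresholds):
    """Fraction of all references whose reuse distance is below each threshold.

    First-touch references (None) are never counted as reused; this is an
    assumed convention, not one stated in the text.
    """
    total = len(distances)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {
        t: sum(1 for d in distances if d is not None and d < t) / total
        for t in thresholds
    }


if __name__ == "__main__":
    sample = [None, 0, 1, 3, 20, 100, None, 2]  # illustrative values only
    # Thresholds mirror the associativities and the plateau named in the text.
    print(reusability_ratios(sample, [4, 16, 512]))
```

Comparing the ratio at a threshold equal to the cache associativity with the ratio at much larger thresholds indicates how much additional reuse a more associative cache could capture.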
D. Instruction Execution Flow
    The instructions executed by most mobile applications exhibit       E. PCA Analysis
complex behavior. Mobile applications tend to depend on                 Diversity is an important metric for evaluating the representativeness
GUI-based display systems. Furthermore, for high portability        of a benchmark suite. We use principal component
and programmer productivity, most Android applications are          analysis (PCA) to demonstrate the diversity of Moby appli-
written in the object-oriented Java language. Thus, Moby            cations by analyzing both their microarchitecture-independent
applications may invoke many libraries and generate many            and microarchitecture-dependent behaviors. PCA applies an
instructions.                                                       orthogonal transformation to a group of possibly correlated
    Figure 3 depicts the number of processes spawned and            variables to convert them into several uncorrelated variables
libraries invoked. Most applications create tens of process-        (principal components) with different weights. Similar PCA
es/threads and access more than 15 libraries. Six of the Moby       analysis has been conducted on mobile applications and tradi-
applications invoke more than 20 libraries, which increases         tional SPEC benchmarks by Sunwoo et al. [13], whose results
code footprints and puts pressure on all instruction-related        demonstrate that mobile applications differ greatly from SPEC
microarchitectural resources. Furthermore, multiple processes       benchmarks, especially in instruction-side behaviors.
running in parallel inevitably cause interference in the caches,         Figure 5 depicts the PCA map of the above
the TLB, and the predictors.                                        microarchitecture-independent metrics for Moby applications,
    To better understand dynamic instruction behaviors, we          showing only two main principal components. The X-axis
collect instruction traces and map dynamic instructions back        (i.e., Dim 1) shows the first principal component, which
to the static binaries. Given the many background processes         represents more than 65% of the variability. Dim 2, shown
running within the Android OS, we record only instructions          on the Y-axis, accounts for another 20% of the variability. Hence,
closely related to the target application. A memory map             these two principal components capture most of the differences
file for each process assists translation (as described in          among all Moby applications. The distance between points
Section III-B).                                                     on the PCA map implies the dissimilarity of applications.
                                                                    The closer two points are, the more similar the applications.
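
Mapping a dynamic instruction address back to a static binary with a per-process memory map can be sketched as follows. The text in `SAMPLE_MAPS` is made-up data in the standard Linux /proc/pid/maps line format, and the function names are hypothetical; the sketch does not reproduce the simulator-side implementation.

```python
import bisect

# Made-up excerpt in /proc/<pid>/maps format:
# address-range perms offset dev inode pathname
SAMPLE_MAPS = """\
40000000-40045000 r-xp 00000000 b3:02 1201 /system/lib/libc.so
40045000-40048000 rw-p 00045000 b3:02 1201 /system/lib/libc.so
40100000-40190000 r-xp 00000000 b3:02 1302 /system/lib/libdvm.so
"""


def parse_maps(text):
    """Return sorted (start, end, path) tuples for executable, named mappings."""
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6 or "x" not in fields[1]:
            continue  # skip anonymous and non-executable mappings
        start, end = (int(part, 16) for part in fields[0].split("-"))
        regions.append((start, end, fields[5]))
    regions.sort()
    return regions


def binary_for(pc, regions):
    """Return the path of the mapping that contains `pc`, or None."""
    starts = [region[0] for region in regions]
    i = bisect.bisect_right(starts, pc) - 1
    if i >= 0 and pc < regions[i][1]:
        return regions[i][2]
    return None


if __name__ == "__main__":
    regions = parse_maps(SAMPLE_MAPS)
    print(binary_for(0x40000010, regions))
```

Addresses that fall outside every user-space mapping return None here; how such addresses are attributed (for example, to the kernel) is left open in this sketch.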
    As an example, we present a part of the instruction execu-      As shown in the figure, 10 Moby applications are evenly
tion flow of KingsoftOffice in Figure 4. The X-axis depicts         scattered in different regions. This phenomenon means that
the number of instructions executed, and the Y-axis shows           the mobile applications we choose are diverse with respect to
the corresponding static binary files for these instructions. The   their inherent characteristics.
Fig. 5.   PCA results for microarchitecture-independent metrics

Fig. 6.   CPI results

Fig. 7.   Contribution to overall cycles broken down by component (y-axis: Percent of Overall Cycles; series: Stalled Due to TLB, Stalled Due to Dcache, Stalled Due to Icache)

Fig. 8.   Branch misprediction rates (y-axis: % Branches Mispredicted)
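
The PCA step described above (standardize the metrics, apply an orthogonal transformation, and report how much variability each principal component explains) can be sketched in plain Python. This is an illustration only: it uses power iteration with deflation rather than a linear algebra library, all function names are hypothetical, and the input rows are made up rather than taken from Table II.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit (population) variance."""
    cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        cols.append([(v - mean) / std if std else 0.0 for v in col])
    return [list(r) for r in zip(*cols)]


def covariance(rows):
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=500):
    """Power iteration: dominant eigenvalue and eigenvector of `cov`."""
    d = len(cov)
    v = [float(i + 1) for i in range(d)]  # asymmetric start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return lam, v


def pca(rows, k=2):
    """Return (scores, explained-variance ratios) for the top k components."""
    z = standardize(rows)
    cov = covariance(z)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    comps, ratios = [], []
    for _ in range(k):
        lam, v = top_component(cov)
        comps.append(v)
        ratios.append(lam / total if total else 0.0)
        # Deflate: remove the component just found before the next pass.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(r[i] * c[i] for i in range(d)) for c in comps] for r in z]
    return scores, ratios


if __name__ == "__main__":
    # Rows are applications, columns are metrics; values are illustrative.
    data = [[25.0, 12.0, 15.0], [27.0, 10.0, 9.0],
            [24.0, 13.0, 16.0], [30.0, 8.0, 11.0]]
    scores, ratios = pca(data)
    print(ratios)
```

Each row of `scores` gives the coordinates of one application on a two-dimensional PCA map, and `ratios` corresponds to the share of variability attributed to each axis.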

                                                                       2) Stalled Cycle per Component: The processor’s pipeline
            V.    MICROARCHITECTURE-DEPENDENT                     will stall if components fill and cannot allocate additional
                       CHARACTERIZATION                            resources for incoming requests. There are many such components
                                                                   in ARM processors, including the cache, TLB, reorder
   We explain the microarchitecture-dependent results of Mo-       buffer, load/store buffer, and reservation stations.
by on the Pandaboard development board in this section. These
metrics are obtained from the hardware performance counters            Figure 7 depicts the percent of stalled cycles caused by
provided by ARM processors.                                        cache and TLB resources. Since other components cause very
                                                                   few stall cycles, the remaining cycles can be considered to be
                                                                   the processor’s active cycles. Two interesting observations can
A. Overall Performance                                             be made from Figure 7. First, nearly 2% and 5% of processor
                                                                   cycles are stalled waiting for the TLB and instruction cache, respectively,
    1) CPI: Cycles per instruction (CPI) gives an initial view of  for almost all Moby applications, whereas the pipeline stall cycles
the overall performance of a target application on the measured    incurred by the data cache vary from 3% to 20% for different
platform. Applications with high CPI perform poorly, which         applications. Second, for applications such as K9Mail and
means that the microarchitecture of the measured platform          TTPod, the instruction cache stalls the processor’s pipeline
could be improved to better cope with these applications.          more often than the data cache. Unlike desktop applica-
    Figure 6 depicts the CPI results for all Moby applications.    tions [36] and server applications [14], whose data cache
Six out of ten applications have CPI higher than 3, and the        dominates the pipeline stalls, the TLB and instruction cache of
CPIs of the remaining applications are around 2. Note that the     mobile processors are primarily responsible for the observed
ideal CPI for the Cortex A9 processor with its two-issue width     performance degradation. Therefore, more attention should be
is 0.5, and hence these applications perform poorly. Mobile        paid to optimizing mobile processor TLBs and instruction
processors like the Cortex A9 could be better optimized            caches.
for workloads like Moby. Moreover, we observe that the
four applications with relatively low CPI — KingsoftOffice,        B. Branch Misprediction Rate
AdobeReader, MXPlayer, and TTPod — process large amounts
of data, unlike the applications with higher CPI. This implies         The branch predictor plays an important role in ensuring
that instruction-related components might hinder the overall       efficient out-of-order execution and exploiting instruction level
performance.                                                       parallelism.
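
The CPI and stall-cycle figures discussed in this section follow from simple ratios of raw counter values. The sketch below shows the arithmetic; the function names, the dictionary layout, and the counter values are assumptions for illustration and do not correspond to named Cortex A9 events.

```python
IDEAL_CPI_TWO_ISSUE = 1 / 2  # a two-issue core retires at most 2 instructions per cycle


def cpi(cycles, instructions):
    """Cycles per instruction from two raw counter values."""
    return cycles / instructions


def stall_breakdown(cycles, stalled):
    """Percent of cycles stalled per component, plus the remaining active share.

    `stalled` maps a component name to its stalled-cycle count (assumed layout).
    """
    pct = {name: 100.0 * count / cycles for name, count in stalled.items()}
    pct["active"] = 100.0 - sum(pct.values())
    return pct


if __name__ == "__main__":
    # Made-up counter values, chosen only to exercise the arithmetic.
    measured = cpi(3_000_000, 1_000_000)
    print(measured, measured / IDEAL_CPI_TWO_ISSUE)
    print(stall_breakdown(1000, {"tlb": 20, "icache": 50, "dcache": 100}))
```

Dividing a measured CPI by the ideal value of 0.5 gives a rough sense of how far an application is from the peak issue rate of the core.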
Fig. 9.   Cache and TLB miss rates (y-axis: Misses per 1K Instructions; series: Icache, Dcache, ITLB, DTLB)

Fig. 10.  L2 cache miss rates (y-axis: % Miss Ratio; series: Data, Inst., Total)
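
The metrics plotted in Figures 8 to 10 are derived from event counts in the usual way: misses per thousand instructions (MPKI) for the L1 caches and TLBs, and percentage ratios for L2 misses and branch mispredictions. A minimal sketch follows, with hypothetical function names and made-up counts.

```python
def mpki(misses, instructions):
    """Misses per 1K instructions."""
    return 1000.0 * misses / instructions


def percent(part, whole):
    """Percentage of `part` in `whole`; returns 0.0 when `whole` is zero."""
    return 100.0 * part / whole if whole else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(mpki(45_000, 1_000_000))   # e.g. instruction-cache MPKI
    print(percent(12, 100))          # e.g. branch misprediction rate
    print(percent(50, 100))          # e.g. L2 data miss ratio
```

MPKI normalizes by instruction count, so it allows comparison across structures with very different access rates, whereas a miss ratio normalizes by the number of accesses to the structure itself.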

    As shown in Figure 8, the branch misprediction rates for
NeteaseNews and BaiduMap reach up to 12%. This happens
because nearly 70% of branches are conditional, as shown
in Table II, and the execution of these applications switches
frequently among different binaries, as illustrated in Figure 4.
Each time execution switches to a different binary, branch
mispredictions are likely to occur. Note that unpredictable user
behaviors for interactive mobile applications can further
exacerbate the branch misprediction rate.

C. Cache & TLB & Memory
    1) L1 I/D Cache & I/D TLB: As illustrated by Gutierrez
et al. [14], Jiang et al. [36] and Ferdman et al. [37], the miss
rates of the instruction cache and instruction TLB are high                          Fig. 11.      Ratio of data requests in L2 cache. The remaining requests are
                                                                                     instructions.
due to the large code size of interactive applications and the
limited cache size of mobile processors. Figure 9 shows the
same observation.                                                                    fraction of all memory requests, since there will be many DMA
    According to the L1 cache reuse distance shown in Fig-                           requests issued by other I/O devices like the GPU and LCD
ure 1, less than 35% of instruction references have reuse                            display system.
distances smaller than four (the associativity of the Pandaboard
instruction and data caches). Given that data references with                        D. Core Utilization
similarly small reuse distances reach nearly 70%, it is obvious                          Mobile processors are still improving, in terms of both
that the data cache outperforms the instruction cache.                               frequency and numbers of cores. In order to study the core
    Furthermore, since most mobile applications do not manip-                        utilization, we count the cycles executed by different cores.
ulate large amounts of data, their data references are relatively                    In Figure 12, we depict the ratio of cycles executed by Core
few compared to their instruction references.                                        0 compared to the total cycles executed by both cores on the
                                                                                     Pandaboard. Except for MXPlayer, Moby applications do most
    The DTLB suffers a higher miss rate than the data cache                          of their work on Core 0. This suggests that most mobile applications
for several applications (e.g., BBench). This suggests that                          are programmed without considering the existence of
the number of DTLB entries is insufficient to hold randomly                          multicore platforms, and thus they cannot fully utilize precious
distributed data references.                                                         processor resources. Under this condition, simply integrating
    2) L2 Cache: Figure 10 depicts the miss rates of different                       more cores in mobile processors just consumes more power
kinds of requests to the L2 cache. More than 10% of instruction                      without improving performance.
requests miss in the L2 cache for all Moby applications, and several
suffer more than 25% instruction misses. This result again                           E. PCA Analysis
demonstrates the large code footprints for mobile applications.
                                                                                         Figure 13 depicts the top two principal components of
Although mobile applications present good locality in the L1
                                                                                     PCA analysis based on the above microarchitecture-dependent
data cache, nearly 50% of the data requests miss in L2 cache.
                                                                                     characteristics. Dim1 is the primary principal component, and
    It is interesting to observe that data references no longer                      its main contributor is the L2 miss rate. Dim2 is the second
dominate L2 requests, as shown in Figure 11. Except for KingsoftOffice,              principal component, and it includes L1D MPKI, DTLB MPKI,
AdobeReader, and FrozenBubble, the L2 cache receives                                 and the branch misprediction rate. Although data points
more instruction requests than data requests. However, from                          for applications such as SinaWeibo and FrozenBubble on the
the view of memory, instruction requests account for a small                         Y-axis are a bit closer, these applications are widely spread
                                                                                                              has become less suitable for current microarchitectural analysis.
                                                                                                              Gutierrez et al. [14] present an interactive game, a video
                                                                                                              player, a media player, and BBench as typical benchmarks for
                                                                                                              smartphones. MobileBench [12] contains several web browsing
                                                                                                              applications, a photo rendering application, and a video player.
                                                                                                              Sunwoo et al. [13] study several smartphone workloads
                                                                                                              (AndEBench, CaffeineMark, RL Benchmark, Angry Birds,
                                                                                                              and KingsoftOffice) to measure the performance of the Dalvik
                                                                                                              virtual machine, SQLite, and the whole system. Compared
                                                                                                              to these benchmarks, Moby contains some similar applications,
                                                                                                              and some with diverse behaviors not yet found in other suites.

                                                                                                                                      VII.     CONCLUSION
Fig. 12.                                   Ratio of cycles executed by Core 0. Core 1 accounts for the           Mobile devices have already become the primary consumer
remainder.                                                                                                    computing devices, and their use is still growing rapidly.
                                                                                                              Efficient mobile processor design requires knowledge of typ-
                                                                                                              ical mobile applications. In this paper, we have presented a
mobile benchmark suite, Moby, that includes popular applications executed under Android OS. Our analysis finds them to be sufficiently diverse to be considered representative.

In this study, we fully characterize Moby in order to assist other researchers in using it for their studies. We use the gem5 simulator and the hardware performance counters provided by ARM processors to evaluate Moby’s microarchitecture-independent features (instruction mix, working set size, data and instruction locality, and binary execution behavior) and microarchitecture-dependent features (CPI and the behaviors of the branch predictor, caches, TLBs, and other memory components). We will continue to add mobile applications to the Moby benchmark suite as more mainstream or popular applications emerge. Furthermore, we will integrate user-action automation tools to model the effects of user inputs on applications.

Fig. 13. PCA results for microarchitecture-dependent characteristics on the Pandaboard

across the primary principal component. Given the PCA results of both microarchitecture-independent and microarchitecture-dependent behaviors, we can conclude that the behaviors of Moby applications vary significantly and in many ways.

VI. RELATED WORK

There are many kinds of benchmarks for evaluating the performance of mobile devices. In industry, commonly used benchmarks such as EEMBC [38], SiSoft Sandra [39], AnTuTu [8], 3D GLBenchmark [9], and Geekbench [10] can measure the peak performance of mobile device components, including the CPU, memory, GPU, and multimedia support. However, some of these benchmarks are not freely available to academia. Moreover, the peak performance of each component cannot represent the total performance of the system. Other benchmarks such as SunSpider [11] and BrowserMark [40] only test the performance of specific applications or classes of applications (e.g., embedded Java benchmarks [43] or the MEVBench computer vision applications [42]).

In the research community, MiBench [41] has been widely used for embedded systems. Although it contains 35 embedded applications covering six categories, the applications differ greatly from current mobile applications in terms of diversity, coding language, code size, and functionality. Hence, MiBench

ACKNOWLEDGMENT

We would like to thank Sally McKee for her useful suggestions and hard work improving the writing quality. We also thank Yungang Bao, Kun Zhang, and other teammates from ICT, and the anonymous reviewers for helpful suggestions and insightful feedback. This research is supported by the National Basic Research Program of China (973 Program) under grant number 2011CB302502, the National Natural Science Foundation of China (NSFC) under grant numbers 60925009, 61272132, and 61221062, the Strategic Priority Research Program of the Chinese Academy of Sciences under grant number XDA06010401, and the Huawei Research Program under grant number YBCB2011030.

REFERENCES

[1] “Android operating system for mobile devices,” http://www.android.com.
[2] “iOS operating system for apple,” http://www.apple.com/ios.
[3] “ARM architecture reference manual: ARM v7-A and ARM v7-R edition.”
[4] “Apple system on chips,” http://en.wikipedia.org/wiki/Apple_System_on_Chips.
[5] “OMAP applications processors,” http://www.ti.com/lsds/ti/omap-applications-processors/features.page.
[6] “Qualcomm Snapdragon processors,” http://www.qualcomm.com/snapdragon.
[7] “Intel Atom processor,” http://www.intel.com/content/www/us/en/processors/atom/atom-processor.html.
[8] “AnTuTu,” http://www.antutu.com/index.shtml.
[9] “Gfxbench,” https://gfxbench.com/result.jsp.
[10] “Geekbench,” http://www.primatelabs.com/geekbench.
[11] “SunSpider,” http://www.webkit.org/perf/sunspider/sunspider.html.
[12] D. Pandiyan, S.-Y. Lee, and C.-J. Wu, “Performance, energy characterizations and architectural implications of an emerging mobile platform benchmark suite - MobileBench,” in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2013.
[13] D. Sunwoo, W. Wang, M. Ghosh, C. Sudanthi, G. Blake, C. Emmons, and N. Paver, “A structured approach to the simulation, analysis and characterization of smartphone applications,” in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2013.
[14] A. Gutierrez, R. G. Dreslinski, T. F. Wenisch, T. Mudge, A. Saidi, C. Emmons, and N. Paver, “Full-system analysis and characterization of interactive smartphone applications,” in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2011, pp. 81–90.
[15] “Google Play Store,” https://play.google.com/store.
[16] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[17] “Moby: A Mobile Benchmark Suite,” http://asg.ict.ac.cn/projects/moby, 2013.
[18] OMAP4460 Pandaboard ES System Reference Manual, pandaboard.org, 2011.
[19] “GNU Xnee webpage,” http://www.gnu.org/software/xnee.
[20] K. Zhang, Y. Huang, and M. Chen, “Architecture characteristics and analysis of mobile device applications,” in National Annual Conference on High Performance Computing, China (In Chinese), 2013, pp. 81–90.
[21] “K9Mail,” https://github.com/k9mail/k-9, http://t.cn/zTlAnPO.
[22] “SinaWeibo,” http://www.weibo.com, http://t.cn/zTYHOxK.
[23] “NeteaseNews,” http://www.163.com, http://t.cn/zTYmGMj.
[24] “KingsoftOffice,” http://www.kingsoftstore.com, http://t.cn/zTYsBQC.
[25] “AdobeReader,” http://www.adobe.com/products/eulas, http://t.cn/zTTPgDj.
[26] “BaiduMap,” http://map.baidu.com, http://t.cn/zTT7y0Y.
[27] “MXPlayer,” https://sites.google.com/site/mxvpen, http://t.cn/zTTAq7Q.
[28] “TTPod,” http://www.ttpod.com, http://t.cn/zTT2cNg.
[29] “FrozenBubble,” http://t.cn/zTTLjD8.
[30] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: Characterization and architectural implications,” in Proceedings of the 17th international conference on Parallel Architectures and Compilation Techniques. ACM, 2008, pp. 72–81.
[31] “Versatile Express products,” http://www.arm.com/products/tools/development-boards/versatile-express/index.php.
[32] “ARM Cortex A9,” http://www.arm.com/products/processors/cortex-a/cortex-a9.php.
[33] “TopMC,” http://asg.ict.ac.cn/projects/topmc, 2011.
[34] J. L. Henning, “SPEC CPU2006 benchmark descriptions,” ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
[35] S. Bird, A. Phansalkar, L. K. John, A. Mericas, and R. Indukuru, “Performance characterization of SPEC CPU benchmarks on Intel’s Core microarchitecture based processor,” in SPEC Benchmark Workshop, 2007.
[36] T. Jiang, R. Hou, L. Zhang, K. Zhang, L. Chen, M. Chen, and N. Sun, “Micro-architectural characterization of desktop cloud workloads,” in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2012, pp. 131–140.
[37] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the clouds: a study of emerging scale-out workloads on modern hardware,” in Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2012, pp. 37–48.
[38] “EDN embedded microprocessor benchmark consortium,” http://www.eembc.org.
[39] “SiSoft Sandra,” http://www.sisoftware.net.
[40] “Browsermark,” http://browsermark.rightware.com.
[41] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, “MiBench: A free, commercially representative embedded benchmark suite,” in IEEE International Workshop on Workload Characterization. IEEE, 2001, pp. 3–14.
[42] J. Clemons, H. Zhu, S. Savarese, and T. Austin, “MEVBench: A mobile computer vision benchmarking suite,” in IEEE International Symposium on Workload Characterization. IEEE, 2011, pp. 91–102.
[43] C. Isen, L. John, J. P. Choi, and H. J. Song, “On the representativeness of embedded Java benchmarks,” in IEEE International Symposium on Workload Characterization, 2008. IEEE, 2008, pp. 153–162.