Moby: A Mobile Benchmark Suite for Architectural Simulators

Yongbing Huang∗†, Zhongbin Zha∗†, Mingyu Chen∗, Lixin Zhang∗
∗ State Key Laboratory of Computer Architecture, Institute of Computing Technology,
Chinese Academy of Sciences, Beijing, China
† University of Chinese Academy of Sciences, Beijing, China
Email:{huangyongbing, zhazhongbin, cmy, zhanglixin}@ict.ac.cn

Abstract: Mobile devices such as smartphones and tablets have become the primary consumer computing devices, and their rate of adoption continues to grow. The applications that run on these mobile platforms vary in how they use hardware resources, and their diversity is increasing. Performance and power limitations also vary widely across mobile platforms. Thus there is a growing need for tools to help computer architects design systems to meet the needs of mobile workloads. Full-system simulators are invaluable tools for designing new architectures, but we still need appropriate benchmark suites that capture the behaviors of emerging mobile applications. Current benchmark suites cover only a small range of mobile applications, and many cannot run directly in simulators due to their user interaction requirements.

In this paper, we introduce and characterize Moby, a benchmark suite designed to make it easier to use full-system architectural simulators to evaluate microarchitectures for mobile processors. Moby contains popular Android applications, including a web browser, a social networking application, an email client, a music player, a video player, a document processing application, and a map program. To facilitate microarchitectural exploration, we port the Moby benchmark suite to the popular gem5 simulator. We characterize the architecture-independent features of Moby applications on the simulator and analyze the architecture-dependent features on a current-generation mobile platform. Our results show that mobile applications exhibit complex instruction execution behaviors and poor code locality, but that current mobile platforms, especially their instruction-related components, cannot meet these requirements.

I. INTRODUCTION

Mobile devices, especially smartphones and tablets, have become an important world-wide market. From the application point of view, a wide variety of mobile applications are now in common use; these include web browsers, social networks, e-mail clients, audio and video players, document processing systems, and map programs, to name a few. Different types of applications present different requirements for the hardware components of mobile platforms. From the mobile operating system point of view, Android [1] and iOS [2] have the largest market share and the fastest growth. Android adoption has ramped up quickly, gaining in popularity six times faster than iOS. Android and iOS use different programming languages and execution models, and they differ in their utilization of hardware resources. Therefore, the requirements placed on hardware resources for different mobile applications vary.

As for mobile platforms, ARM [3] based mobile processors such as Apple's Ax [4], TI's OMAP [5], and Qualcomm's Snapdragon [6] are more prevalent than processors like Intel's Atom [7]. Generally, as the performance of these mobile processors improves, their microarchitectures become more complicated. For example, mobile processors with four cores, an out-of-order execution model, and two-level caches have become mainstream. Mobile system designers must consider how application and OS diversity affect their design choices in this increasingly complex design space.

Benchmarking and architectural simulation are two important tools for processor design and computer architecture research. To be relevant, a benchmark suite for architectural research must satisfy at least two properties. First, workloads in the benchmark suite should be diverse enough to exhibit the range of behaviors of the target applications. Second, all the applications should be portable to architectural simulators. However, most current mobile benchmark suites represent only a small subset of mobile applications [8], [9], [10], [11], [12], [13], and some cannot be run directly in simulators due to user-interaction requirements (e.g., the interactive games and audio player of Gutierrez et al. [14]). Meanwhile, existing benchmarks such as SPEC CPU2006 exhibit significantly different behaviors from interactive mobile applications [13], [14].

In this paper, we develop Moby, a benchmark suite designed to evaluate microarchitectures of mobile platforms in full-system architectural simulators. Two design issues drive our benchmark suite. First, mobile applications on different operating systems are incompatible. Since Android is the most commonly used operating system for mobile devices, Moby contains only mobile applications that run on the Android OS. Second, most popular mobile applications are commercial, and thus their source code is not generally available. We choose only applications that can be freely downloaded, in order to avoid licensing issues. In total, Moby contains 10 mobile applications spanning nine categories, including web browser, social networking, email, audio, video, document, map, and game applications. Except for the web browser application BBench [14], all the applications are selected from the Google Play Store [15].

Since our benchmark suite is intended to drive architectural simulators, all applications should be executable without manual user inputs. Although AutoGUI [13] provides a user interface automation tool to record and deterministically replay user actions, we use an alternative method that bypasses user interaction by executing only typical representative operations for these mobile applications. While Moby can be executed
on many simulators that support Android OS, we take the commonly used gem5 simulator [16] as an example to test and characterize Moby. The gem5 disk image for Moby is publicly available [17].

We measure microarchitecture-independent features on the gem5 simulator with the ARM ISA and microarchitecture-dependent features on the ARM-based Pandaboard development board [18]. The microarchitecture-independent features include instruction mix, working set size, data and instruction locality, and binary execution behaviors. The instruction features show that the representative operations for all applications execute several billion instructions and that nearly 70% of branches are conditional. Furthermore, most applications spawn about 20 processes and invoke more than 20 libraries. On the Pandaboard, the measured microarchitecture-dependent features mainly include CPI and the behaviors of the branch predictor, cache, TLB, and memory components. Experimental results demonstrate that all applications present high CPIs, which implies that these mobile applications and current ARM-based mobile platforms are not well matched. In particular, the instruction-related resources (branch predictor, instruction cache, and TLB) suffer from high miss rates.

In summary, we make three contributions:
• We present Moby, a new mobile benchmark suite that contains a diverse set of applications and is appropriate for simulation-based design space exploration.
• We extract typical representative operations of interactive applications in Moby and automatically execute them on full-system architectural simulators.
• We describe both microarchitecture-independent and microarchitecture-dependent characteristics of all Moby applications.

The rest of this paper is organized as follows. Section II describes the applications included in Moby. Section III introduces our experimental platforms and tools. Then the microarchitecture-independent features and microarchitecture-dependent features of Moby are illustrated in Section IV and Section V, respectively. Finally, we describe related work in Section VI and conclude in Section VII.

II. THE MOBY BENCHMARK SUITE

The goal of this work is to define a benchmark suite that can be used to design and optimize mobile processors. In this section, we first present our methods to select suitable applications for such a suite, and then we describe each included application in detail.

A. Benchmark Selection Methods

1) Requirements: As a mobile benchmark suite for architectural simulators, Moby should contain emerging and diverse applications and should also support research.

The rapid innovation of the mobile Internet spawns many emerging applications with new and varied behaviors. Mobile platform architects need tools to model the diverse behaviors and resource requirements of current and emerging mobile applications in different application markets.

Benchmark suites intended to support architectural design space exploration and system research should make it possible to instrument, manipulate, and model the constituent applications in detail. However, most popular mobile applications are commercial, which complicates instrumentation because their source code is unavailable. Although most Moby applications lack source code, all can be downloaded for free. Note that most mobile applications involve user interaction and require a network connection, both of which are difficult to implement or model in architectural simulators. The dependence on networks can easily be removed by buffering any required remote data in local storage.

2) User Interaction: A major difficulty in analyzing interactive mobile applications is generating reproducible results without manual user inputs. Moreover, the slow execution speed of full-system simulators makes it impractical to incorporate user-action inputs in experiments. There are two main solutions to cope with user interaction in simulators. Tools like AutoGUI [13] and Xnee [19] provide automation capabilities to record and replay user inputs. Using similar tools, we are working to identify representative code pieces that suffer from large response latencies. However, most current automation tools still suffer from shortcomings like nondeterministic replay.

The other, simpler solution is to avoid user interaction in simulators. We find that most main activities of mobile applications can be executed as separate processes, and that user interactions are mainly used to specify inputs for these activities. If these activities are launched from the command line with their inputs specified manually, user interactions are no longer required. Compared to using automation tools, this method is both simple and efficient, and thus we adopt this approach for Moby.

The main activities described above are considered to be representative operations of mobile applications, and they can be extracted from the AndroidManifest.xml file for mobile applications running on the Android OS. In the current version of Moby, only typical operations for each application are executed in architectural simulators. In addition, we can combine several typical operations together, in order to improve simulation accuracy.

3) Selection Steps: We take five steps to select the applications in the Moby benchmark suite. Given the popularity and maturity of the Android ecosystem, we study mobile applications executed on Android OS. Initially, we choose commonly used programs from the Google Play Store [15] as our application pool. Then, for each category in the Google Play Store, we focus only on popular applications that are free and have high download rates. A subset of the applications studied can be found in Zhang et al. [20]. Next, we measure microarchitectural characteristics of these applications on a real platform (see Section V). After that, according to their characteristics, we select several applications to represent each category. Finally, we extract representative operations for the selected applications, and verify whether these operations can be automatically executed without interaction with users and whether necessary data downloaded from networks can be buffered and replayed offline. Moby includes only applications that pass these tests.

As a result, we have chosen nine applications from the
Google Play Store for our benchmark suite; summaries for these are shown in Table I. Almost all applications are commercial, but they can be downloaded for free. Moreover, we choose BBench [14] to represent web-browser applications, creating a suite of 10 applications. We will add more applications to our benchmark suite as more mainstream or popular applications emerge.

TABLE I. SUMMARY OF MOBY

Bench            Category         Typical OP              Input
BBench∗          Web Browser      Load web pages          Web pages
K9Mail           Email            Load/Show emails        Buffered emails
SinaWeibo        Social Network   Load information        Buffered texts
NeteaseNews      News             Check and load news     Buffered news
KingsoftOffice   Document         Open doc/xls/ppt file   A doc file
AdobeReader      Document         Open pdf file           A PDF file
BaiduMap         Map              Load an area's map      Buffered maps
MXPlayer         Video            Play a video            A video file
TTPod            Audio            Play a song             A music file
FrozenBubble     Game             Load game               Null
∗: BBench is from Gutierrez et al. [14]

B. Benchmark Descriptions

In this subsection, we describe each application's use and features in detail.

1) BBench [14] is an automated web-browser page-rendering benchmark that tests rendering performance. It comprises a sequence of snapshots of a varied selection of the most popular sites. The webpages included in the benchmark contain diverse content and page styles (e.g., dynamic content, JavaScript, video, images, Flash, CSS, HTML5, etc.). Its typical operation is simply loading a webpage.

2) K9Mail [21] is an open-source email client running on the Android platform, which supports the commonly used POP3 and IMAP4 protocols. Although K9Mail supports features like sending/receiving email, searching, and multi-folder syncing, our benchmark chooses only loading and displaying emails buffered in local storage as its typical operations. This requires no network connection, and it can be easily automated without user interaction.

3) SinaWeibo [22] is a client for one of China's biggest social networking and microblogging services. It allows users to publish information instantly and share it with others. The information includes text, pictures, music, and video. Loading and displaying information is the typical operation for social networking applications. Again, this information is buffered locally.

4) NeteaseNews [23] is a news reader application. Users can obtain news by subscribing to magazines, newspapers, and other resources. The typical operation for news readers is checking the news from the server and listing articles. In our benchmark, we substitute local data for data on the remote server.

5) KingsoftOffice [24] is an efficient mobile office application. It contains rich editing features and supports 23 kinds of files, including DOC, XLS, PPT, and PDF. Writer, presentation, and spreadsheet are commonly used KingsoftOffice programs to manipulate DOC, PPT, and XLS files, respectively. Hence, the typical operations for KingsoftOffice are opening files with these formats.

6) AdobeReader [25] is an application for reliably viewing and interacting with PDF documents. Its typical operation is viewing a PDF file.

7) BaiduMap [26] is a mobile map client from China's biggest search engine and is similar to Google Maps. The map client presents detailed maps with 3D buildings, supports navigation, and displays neighboring restaurant and hotel information. Loading the map for a specific area from offline maps is its typical operation.

8) MXPlayer [27] is a video player that supports almost every movie format. It applies hardware acceleration to all videos with the help of a new H/W decoder, and it supports multi-core decoding. Its typical operation consists of playing an MP4 video clip stored on local disk.

9) TTPod [28] is a music player for a wide variety of audio formats. It provides high-quality decoding, highly accurate lyrics, and album art downloads. It supports a rich graphical user interface with built-in graphics, a customizable equalizer function, and floating lyrics. The operation for this application is to play the first minute of an MP3 file.

10) FrozenBubble [29] is a puzzle game for Android. Interactive games have become important applications on mobile devices with the introduction of high-performance CPUs and mobile GPUs. However, interactive games rely heavily on users and thus cannot be automatically executed in simulators. This game is chosen because it can be fully loaded and simply played without any user interaction.

C. Input Sets

Generally, each benchmark suite should provide several input sets that represent various usage scenarios. For example, PARSEC [30] applications contain six input sets, each of which processes a different amount of data. Unlike those applications designed for high-performance servers, most mobile applications do not focus on data processing. Therefore, in the current version of Moby, we use just one input size for each application. Nevertheless, multiple input sets for Moby applications can be easily produced. For applications like KingsoftOffice, AdobeReader, MXPlayer, and TTPod (which mainly execute or process input files), input sets can easily be constructed by selecting input files with varied types and sizes. The input sets of the other mobile applications are actually buffered network data, which users can obtain from real mobile platforms by performing different web queries. These applications include K9Mail, SinaWeibo, NeteaseNews, and BaiduMap. For example, users can construct input sets for BaiduMap by downloading maps of various areas from the internet.

III. METHODOLOGY

In this section, we explain how we characterize the Moby benchmark suite in terms of platforms and tools.

TABLE II. MOBY INSTRUCTION SUMMARY

Bench            Instruction Count   Branches   Branches         Loads     Stores    Working Set
                 (Billions)          (Total)    (Cond./Total∗)                       Size (MB)
BBench†          2.48                14.43%     69.5%            23.05%    12.16%    80
K9Mail           1.18                11.00%     72.60%           20.03%    9.34%     64
SinaWeibo        2.23                16.92%     68.35%           27.21%    14.68%    114
NeteaseNews      2.65                16.58%     69.01%           25.85%    12.22%    104
KingsoftOffice   2.24                16.59%     68.73%           26.13%    14.06%    87
AdobeReader      2.09                15.17%     70.47%           23.74%    12.19%    83
BaiduMap         3.53                14.31%     72.50%           22.79%    12.29%    102
MXPlayer‡        3.84                18.22%     70.64%           23.79%    12.76%    97
TTPod‡           3.87                15.18%     68.45%           25.49%    12.84%    126
FrozenBubble     0.28                15.59%     71.76%           21.53%    9.66%     47
∗: Percent of all branch instructions that are conditional
†: BBench only loads each page once
‡: TTPod and MXPlayer each play about three seconds of music/video
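
The columns of Table II can in principle be derived from a per-instruction trace. The following is a minimal sketch of that bookkeeping, not the authors' gem5 modification: the `TraceRecord` format, the instruction `kind` labels, and the `summarize` function are hypothetical, and the working set is assumed here to be the number of distinct 4 KB pages touched by instruction fetches and data accesses.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

PAGE_SIZE = 4096  # 4 KB pages, the working-set granularity used in the paper


@dataclass
class TraceRecord:
    """One executed instruction (hypothetical format, not gem5's trace output)."""
    pc: int                          # virtual address of the instruction
    kind: str                        # "load", "store", "cond_branch", "uncond_branch", "other"
    data_addr: Optional[int] = None  # address accessed by a load or store


def summarize(trace: Iterable[TraceRecord]) -> Dict[str, float]:
    """Compute an instruction mix and a page-granularity working-set size."""
    counts = {"load": 0, "store": 0, "cond_branch": 0, "uncond_branch": 0, "other": 0}
    pages = set()  # distinct pages touched (assumption: code and data pages both count)
    total = 0
    for rec in trace:
        total += 1
        counts[rec.kind] += 1
        pages.add(rec.pc // PAGE_SIZE)
        if rec.data_addr is not None:
            pages.add(rec.data_addr // PAGE_SIZE)

    branches = counts["cond_branch"] + counts["uncond_branch"]

    def pct(part: int, whole: int) -> float:
        return 100.0 * part / whole if whole else 0.0

    return {
        "instructions": total,
        "branch_pct": pct(branches, total),                         # "Branches (Total)"
        "cond_of_branches_pct": pct(counts["cond_branch"], branches),  # "Cond./Total"
        "load_pct": pct(counts["load"], total),
        "store_pct": pct(counts["store"], total),
        "working_set_bytes": len(pages) * PAGE_SIZE,
    }
```

Streaming over the trace keeps memory use proportional to the number of distinct pages rather than to the trace length, which matters when a single operation executes billions of instructions.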

A. Platforms

We measure the Moby benchmark suite on both the gem5 simulator [16] and the Pandaboard ES [18] development board running Android version ICS 4.0. The gem5 simulator is a widely used architecture simulator which supports Alpha, ARM, SPARC, MIPS, POWER and x86 ISAs. By default, the gem5 simulator provides several machine configurations for ARM ISAs. These machine configurations, which contain the parameters of the main hardware components, are almost the same as the configurations of real ARM-based development boards such as Versatile Express [31].

The Pandaboard ES board comes with a market-quality OMAP 4460 system-on-chip (SoC) equipped with a dual-core Cortex A9 processor [32] manufactured on the 45 nm process node and 1 GB of LPDDR2 DRAM. The Cortex A9 processor is a complex out-of-order four-wide superscalar core with eight pipeline stages. It has 32 KB 4-way set associative L1 I/D caches and a 512 KB 16-way set associative L2 cache.

In our experiments, the configurations of the main hardware components in gem5, such as cache and memory, are set to match those of the Pandaboard. Their operating systems and disk images are also nearly the same.

B. Tools

In order to study instruction behaviors, we modify the gem5 simulator to collect the instruction trace for each application. We can also map instructions to their binaries (libraries, the OS kernel, and the application binary file) by dumping mapping tables between instructions' virtual addresses and binaries. The mapping tables are just the contents of the proc file "/proc/pid/maps" in the Android file system. This information is maintained in the virtual memory structure and can be reached from the task structure of each process. Thus, in the simulator, we only need to locate the task structures of the different processes and then read out the corresponding contents.

To study the performance of processor and memory components, we measure the microarchitecture-dependent characteristics of the Moby suite using hardware performance counters [3] on the Pandaboard. The hardware performance events provided by the Cortex A9 processor cover most main components, including the processor pipeline, cache, and TLB. The Cortex A9 provides six core counters that can count up to six events simultaneously, one extra cycle counter, and two L2 cache counters. However, the metrics shown in Section V cannot be directly acquired or computed using the above counters when running each application just once. Hence, we repeat each experiment multiple times with different combinations of counters, and report average values from ten measurements. All performance event data are collected using the lightweight performance counter tool TopMC [33].

IV. MICROARCHITECTURE-INDEPENDENT CHARACTERIZATION

Microarchitecture-independent characteristics enable us to understand the inherent nature of applications. In this section, we provide an overview of the microarchitecture-independent characteristics of Moby in terms of instruction mix, working set size, spawned processes, invoked libraries, and code and data locality. Note that most mobile applications execute many short activities, where each activity accounts for only a few billion instructions. Compared to the trillions of instructions for SPEC CPU2006 applications [34], executing these few billion instructions is much more suitable for slow, full-system simulators. We use the gem5 full-system simulator to execute the representative operations for each workload shown in Table I¹, and collect all the following microarchitecture-independent metrics. We find that our workloads share similar instruction profiles even though their working-set sizes vary significantly.

¹ We have already released the corresponding gem5 disk images and execution scripts for all Moby applications [17].

A. Instruction Mix

The mix of instructions reflects the requirements on different hardware resources. For example, load and store instructions rely on cache and memory resources. Different types of branch instructions indirectly reveal the complexity of programs and their demands on branch predictors. As shown in Table II, load and store instructions account for about 25% and 12%, respectively, for most applications. Compared to most integer benchmarks of SPEC CPU2006, which present diverse load and store behaviors [35], the percentages of load and store

Fig. 1. L1 reuse distance distributions (percent of requests vs. reuse distance, for instruction and data requests): (a) BBench, (b) K9Mail, (c) SinaWeibo, (d) NeteaseNews, (e) KingsoftOffice, (f) AdobeReader, (g) BaiduMap, (h) MXPlayer, (i) TTPod, (j) FrozenBubble.

Fig. 2. L2 reuse distance distributions (CDF vs. reuse distance for each Moby application; 16-way set associative cache).
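
Reuse distance, the quantity plotted in Fig. 1 and Fig. 2, is the number of distinct cache lines referenced between two successive uses of the same line. The following is a minimal sketch of how it can be computed from an address stream with an LRU stack; it is an illustration rather than the authors' tool. The function name and the 32-byte default line size are assumptions, and the list-based stack costs O(n·m) time for n references and m distinct lines, so a tree-based structure would be preferable for long traces.

```python
from typing import Iterable, List, Optional


def reuse_distances(addresses: Iterable[int], line_size: int = 32) -> List[Optional[int]]:
    """Return the reuse distance of each reference, or None for a first use.

    line_size is an assumed cache-line size in bytes (not taken from the paper).
    """
    stack: List[int] = []  # LRU stack of line numbers; most recently used is last
    distances: List[Optional[int]] = []
    for addr in addresses:
        line = addr // line_size
        if line in stack:
            idx = stack.index(line)
            # Lines above this one in the stack were touched since its last use.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)  # cold reference: no previous use
        stack.append(line)
    return distances
```

A histogram of the returned values gives a distribution of the kind shown in Fig. 1; accumulating that histogram gives a CDF of the kind shown in Fig. 2.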

instructions are similar for Moby applications. Another 15% of instructions are branches for all
applications except MXPlayer and K9Mail. Meanwhile, conditional branches make up nearly 70% of
these branch instructions across all applications. Generally, each conditional branch instruction
may result in executing a wrong path and consequently require out-of-order processors to roll back.
The high percentage of conditional branch instructions is likely to trigger many mispredictions
with large penalties, which will affect the overall performance.

B. Working Sets

Working-set size can be measured at cacheline or page granularities, depending on our purpose. We
choose pages (4KB) as our basic granularity in this paper because we aim at studying main memory
access behaviors. Note that most typical operations of Moby applications only last several seconds,
and thus we consider working-set sizes to be the total number of pages touched during the whole
execution.

Half the working sets in Table II approach or exceed 100 megabytes. Only K9Mail and FrozenBubble,
which only execute around a billion instructions, have working sets smaller than 65 MB. Even so, all
working sets exceed the capacity of the last-level cache. In contrast to these large working sets,
mobile input sets are usually small, and the applications do not execute sustained memory accesses.
For example, SinaWeibo typically loads tens of small text messages at a time, which implies that
only a small portion of each touched page is used. Given this, these applications are likely to
suffer frequent TLB misses.

C. Locality

In order to gain a deeper understanding of the code and data locality in Moby applications, we
analyze the reuse distances (the number of distinct references between two successive uses of a
line) of all references to two different cache levels. Figure 1 shows the reuse distances of
instruction and data requests for each Moby application. All the requests are captured when they
access L1 instruction or data caches. Figure 2 shows the cumulative distribution function (CDF) of
reuse distances for L2 cache requests. The requests studied are those that miss in the 32KB L1
instruction/data cache.

As shown in Figure 1, Moby applications present similar instruction and data locality behaviors.
Typically, only about 30% of instruction references have reuse distances less than four, which is
the set associativity of the Pandaboard instruction cache. Memory references with larger reuse
distances suffer misses under LRU replacement. Some instructions have a zero reuse distance because
whenever a line is fetched into the instruction queue, subsequent instructions will also be found
in the queue: no cache access is required. The figure shows that highly associative instruction
caches (64 or more) could service over 80% of instructions.
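
The reuse-distance definition used in this subsection (the number of distinct lines referenced between two successive uses of a line) can be illustrated with a small LRU-stack computation. The following is only a sketch under stated assumptions, not the tooling used in the paper: the 32-byte line size, the function names, and the toy address trace are all hypothetical, and first uses of a line are counted as non-reused when the CDF point is computed.

```python
from collections import OrderedDict

LINE_SIZE = 32  # bytes per cache line; an assumed value for illustration


def reuse_distances(addresses, line_size=LINE_SIZE):
    """Return one reuse distance per access (None for the first use of a line)."""
    stack = OrderedDict()  # cache lines ordered from least to most recently used
    distances = []
    for addr in addresses:
        line = addr // line_size
        if line in stack:
            # Distinct lines touched since the previous use of this line.
            keys = list(stack)
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)
            stack[line] = True
    return distances


def cdf_at(distances, threshold):
    """Fraction of all accesses whose reuse distance is below `threshold`."""
    reused = sum(1 for d in distances if d is not None and d < threshold)
    return reused / len(distances)


# Hypothetical byte addresses, not taken from any Moby trace.
trace = [0x1000, 0x1004, 0x2000, 0x3000, 0x1008, 0x2000]
dists = reuse_distances(trace)
print(dists)             # [None, 0, None, None, 2, 2]
print(cdf_at(dists, 4))  # 0.5
```

Evaluating `cdf_at` at the cache associativity gives the kind of data point read off Figures 1 and 2. The list scan makes this sketch quadratic in the worst case, so a real trace would need a tree-based stack-distance algorithm instead.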

[Figure 3 plot: bar chart with two series, Processes and Libraries; y-axis from 0 to 40.]

Fig. 3. Numbers of processes and invoked libraries

As for data requests, Figure 1 indicates that Moby applications generally present good data
locality. For a four-way set associative data cache, nearly 70% of lines can be reused. Requests
with a reuse distance of one constitute 40% of all accesses, which implies that data within a
cacheline enjoy high temporal locality.

Figure 2 shows the reuse distance distribution of L2 cache accesses. For SinaWeibo, TTPod, and
NeteaseNews, about 40% of memory references have reuse distances smaller than 16 (the associativity
of the Pandaboard L2 cache). At this associativity, reused memory locations for the remaining
applications only reach 20%. Moreover, as reuse distance grows, these “reusability ratios” increase
only gradually until they level off at 512. This implies that 16 is a good choice for L2 cache
associativity on mobile platforms.

D. Instruction Execution Flow

The instructions executed by most mobile applications exhibit complex behavior. Mobile applications
tend to depend on GUI-based display systems. Furthermore, for high portability and programmer
productivity, most Android applications are written in the object-oriented Java language. Thus, Moby
applications may invoke many libraries and generate many instructions.

Figure 3 depicts the number of processes spawned and libraries invoked. Most applications create
tens of processes/threads and access more than 15 libraries. Six of the Moby applications invoke
more than 20 libraries, which increases code footprints and puts pressure on all instruction-related
microarchitectural resources. Furthermore, multiple processes running in parallel inevitably cause
interference in the caches, the TLB, and the predictors.

To better understand dynamic instruction behaviors, we collect instruction traces and map dynamic
instructions back to the static binaries. Given the many background processes running within the
Android OS, we record only instructions closely related to the target application. A memory map
file for each processor assists translation (as described in Section III-B).

As an example, we present a part of the instruction execution flow of KingsoftOffice in Figure 4.
The X-axis depicts the number of instructions executed, and the Y-axis shows the corresponding
static binary files for these instructions. The colorful segments imply that those binaries are
executed continuously without suffering interference from other binaries. Figure 4 illustrates that
the Android kernel and five other binaries (dalvik-jit-code-cache, libdvm.so, libcutils.so, libc.so,
and libnativehelper.so) dominate the execution of KingsoftOffice. These binaries can be organized
into three groups: Java-language related, C-language related, and system related. Moreover, the
execution switches among different binaries frequently. For instance, the executions of the libc
library and the Android kernel are interleaved. In such situations, instruction locality and branch
prediction accuracy may be affected, which results in poor performance for instruction-related
components.

Fig. 4. Instruction flow distributions for KingsoftOffice

E. PCA Analysis

Diversity is an important metric for evaluating the representativeness of a benchmark suite. We use
principal component analysis (PCA) to demonstrate the diversity of Moby applications by analyzing
both their microarchitecture-independent and microarchitecture-dependent behaviors. PCA applies an
orthogonal transformation to a group of possibly correlated variables to convert them into several
uncorrelated variables (principal components) with different weights. Similar PCA analysis has been
conducted on mobile applications and traditional SPEC benchmarks by Sunwoo et al. [13], whose
results demonstrate that mobile applications differ greatly from SPEC benchmarks, especially in
instruction-side behaviors.

Figure 5 depicts the PCA map of the above microarchitecture-independent metrics for Moby
applications, showing only the two main principal components. The X-axis (i.e., Dim 1) shows the
first principal component, which represents more than 65% of the variability. Dim 2, shown on the
Y-axis, accounts for another 20% of the variability. Hence, these two principal components account
for most of the differences among all Moby applications. The distance between points on the PCA map
implies the dissimilarity of applications. The closer two points are, the more similar the
applications. As shown in the figure, the 10 Moby applications are evenly scattered in different
regions. This means that the mobile applications we choose are diverse with respect to their
inherent characteristics.
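
To make the PCA step concrete, the sketch below standardizes a small metric matrix, extracts the two leading principal components, and reports the share of variance each one explains, which is how figures such as "more than 65%" for Dim 1 are obtained. It is a hedged illustration rather than the authors' procedure: it uses power iteration with deflation (swapped in here to avoid any library dependency), the function names are hypothetical, and the input rows are made-up placeholder values, not Moby measurements. It also assumes that no metric column is constant.

```python
import math


def standardize(rows):
    """Scale every column (metric) to zero mean and unit variance."""
    n, cols = len(rows), list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def covariance(z):
    """Covariance matrix of already centered data."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(d)] for i in range(d)]


def top_component(cov, iters=500):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # deterministic, non-uniform start
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, lam


def pca_2d(rows):
    """Return 2-D coordinates per row and the variance share of each component."""
    z = standardize(rows)
    cov = covariance(z)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    comps, ratios = [], []
    for _ in range(2):
        vec, lam = top_component(cov)
        comps.append(vec)
        ratios.append(lam / total)
        # Deflation: remove the component just found before searching for the next.
        cov = [[cov[i][j] - lam * vec[i] * vec[j] for j in range(d)] for i in range(d)]
    coords = [[sum(a * b for a, b in zip(r, c)) for c in comps] for r in z]
    return coords, ratios


# Placeholder rows (one per hypothetical application); columns might stand for
# branch fraction, working-set size in MB, and share of references with reuse distance < 4.
metrics = [
    [0.15, 120.0, 0.30],
    [0.10, 60.0, 0.35],
    [0.16, 150.0, 0.28],
    [0.14, 90.0, 0.33],
]
coords, ratios = pca_2d(metrics)
```

Plotting `coords` gives a map in the style of Figure 5, and `ratios` plays the role of the variability percentages quoted for Dim 1 and Dim 2.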

Fig. 5. PCA results for microarchitecture-independent metrics

Fig. 6. CPI results

[Figure 7 plot: y-axis "Percent of Overall Cycles"; series "Stalled Due to TLB", "Stalled Due to Dcache", "Stalled Due to Icache".]

Fig. 7. Contribution to overall cycles broken down by component

[Figure 8 plot: y-axis "% Branches Mispredicted".]

Fig. 8. Branch misprediction rates

V. MICROARCHITECTURE-DEPENDENT CHARACTERIZATION

We explain the microarchitecture-dependent results of Moby on the Pandaboard development board in
this section. These metrics are obtained from the hardware performance counters provided by ARM
processors.

A. Overall Performance

1) CPI: Cycles per instruction (CPI) gives a first characterization of the overall performance of a
target application on the measured platform. Applications with high CPI perform poorly, which means
that the microarchitecture of the measured platform could be improved to better cope with these
applications. Figure 6 depicts the CPI results for all Moby applications. Six out of ten
applications have a CPI higher than 3, and the CPIs of the remaining applications are around 2. Note
that the ideal CPI for the Cortex A9 processor with its two-issue width is 0.5, and hence these
applications perform poorly. Mobile processors like the Cortex A9 could be better optimized for
workloads like Moby. Moreover, we observe that the four applications with relatively low CPI
(KingsoftOffice, AdobeReader, MXPlayer, and TTPod) process large amounts of data, unlike the
applications with higher CPI. This implies that instruction-related components might hinder the
overall performance.

2) Stalled Cycle per Component: The processor’s pipeline will stall if components fill and cannot
allocate additional resources for incoming requests. There are many such components in ARM
processors, including the cache, TLB, reorder buffer, load/store buffer, and reservation stations.

Figure 7 depicts the percent of stalled cycles caused by cache and TLB resources. Since other
components cause very few stall cycles, the remaining cycles can be considered to be the
processor’s active cycles. Two interesting observations can be made from Figure 7. First, for
almost all Moby applications, nearly 2% and 5% of processor cycles are stalled waiting for the TLB
and the instruction cache, respectively, whereas the pipeline stall cycles incurred by the data
cache vary from 3% to 20% across applications. Second, for applications such as K9Mail and TTPod,
the instruction cache stalls the processor’s pipeline more often than the data cache. Unlike
desktop applications [36] and server applications [14], whose data cache dominates the pipeline
stalls, the TLB and instruction cache of mobile processors are primarily responsible for the
observed performance degradation. Therefore, more attention should be paid to optimizing mobile
processor TLBs and instruction caches.

B. Branch Misprediction Rate

The branch predictor plays an important role in ensuring efficient out-of-order execution and
exploiting instruction level parallelism.
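
Every metric reported in this section (CPI, per-component stall share, branch misprediction rate, and misses per 1K instructions) is a ratio of raw event counts, so the arithmetic can be sketched briefly. The snippet below is illustrative only: the dictionary keys are hypothetical labels rather than ARM performance-counter event names, and the sample counts are invented, not measured on the Pandaboard.

```python
ISSUE_WIDTH = 2  # the Cortex A9 issues up to two instructions per cycle


def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions


def stall_share(stall_cycles, cycles):
    """Percent of all cycles spent stalled on one component."""
    return 100.0 * stall_cycles / cycles


def mispredict_rate(mispredicted, branches):
    """Percent of executed branches that were mispredicted."""
    return 100.0 * mispredicted / branches


def mpki(misses, instructions):
    """Misses per 1K instructions."""
    return 1000.0 * misses / instructions


# Invented counter values for one hypothetical run.
counters = {
    "cycles": 3_000_000,
    "instructions": 1_000_000,
    "dcache_stall_cycles": 300_000,
    "branches": 150_000,
    "branch_mispredicts": 12_000,
    "icache_misses": 20_000,
}

ideal_cpi = 1 / ISSUE_WIDTH
run_cpi = cpi(counters["cycles"], counters["instructions"])
print(run_cpi, run_cpi / ideal_cpi)  # 3.0 6.0
print(stall_share(counters["dcache_stall_cycles"], counters["cycles"]))  # 10.0
print(mispredict_rate(counters["branch_mispredicts"], counters["branches"]))  # 8.0
print(mpki(counters["icache_misses"], counters["instructions"]))  # 20.0
```

Comparing `run_cpi` with `ideal_cpi` shows why a CPI of 3 on a two-issue core is read as poor performance: the run takes six times as many cycles as the issue width would allow.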

[Figure 9 plot: y-axis "Misses per 1K Instructions"; series Icache, Dcache, ITLB, DTLB.]

Fig. 9. Cache and TLB miss rates

[Figure 10 plot: y-axis "% Miss Ratio"; series Data, Inst., Total.]

Fig. 10. L2 cache miss rates

As shown in Figure 8, the branch misprediction rates for NeteaseNews and BaiduMap reach up to 12%.
This happens because nearly 70% of branches are conditional, as shown in Table II, and the
execution of these applications switches frequently among different binaries, as illustrated in
Figure 4. Each time the instruction stream switches, branch mispredictions are likely to occur.
Note that unpredictable user behaviors for interactive mobile applications can further exacerbate
the branch misprediction rate.

C. Cache & TLB & Memory

1) L1 I/D Cache & I/D TLB: As illustrated by Gutierrez et al. [14], Jiang et al. [36] and Ferdman
et al. [37], the miss rates of the instruction cache and instruction TLB are high due to the large
code size of interactive applications and the limited cache size of mobile processors. Figure 9
supports the same observation.

According to the L1 cache reuse distance shown in Figure 1, less than 35% of instruction references
have reuse distances smaller than four (the associativity of the Pandaboard instruction and data
caches). Given that data references with similarly small reuse distances reach nearly 70%, it is
obvious that the data cache outperforms the instruction cache. Furthermore, since most mobile
applications do not manipulate large amounts of data, their data references are relatively few
compared to their instruction references.

The DTLB suffers a higher miss rate than the data cache for several applications (e.g., BBench).
This suggests that the number of DTLB entries is insufficient to hold randomly distributed data
references.

2) L2 Cache: Figure 10 depicts the miss rates of different kinds of requests to the L2 cache. More
than 10% of instruction requests miss in the L2 cache for all Moby applications, and several suffer
more than 25% instruction misses. This result again demonstrates the large code footprints of
mobile applications. Although mobile applications present good locality in the L1 data cache,
nearly 50% of the data requests miss in the L2 cache.

It is interesting to observe that data references no longer dominate L2 requests, as shown in
Figure 11. For all applications except KingsoftOffice, AdobeReader, and FrozenBubble, the L2 cache
receives more instruction requests than data requests. However, from the view of memory,
instruction requests account for a small fraction of all memory requests, since there will be many
DMA requests issued by other I/O devices like the GPU and LCD display system.

[Figure 11 plot: y-axis "Percent of Overall Requests".]

Fig. 11. Ratio of data requests in L2 cache. The remaining requests are instructions.

D. Core Utilization

Mobile processors are still improving, in terms of both frequency and numbers of cores. In order to
study the core utilization, we count the cycles executed by different cores. In Figure 12, we
depict the ratio of cycles executed by Core 0 compared to the total cycles executed by both cores
on the Pandaboard. Except for MXPlayer, Moby applications do most of their work on Core 0. This
suggests that most mobile applications are programmed without considering the existence of
multicore platforms, and thus they cannot fully utilize precious processor resources. Under this
condition, simply integrating more cores in mobile processors just consumes more power without
improving performance.

E. PCA Analysis

Figure 13 depicts the top two principal components of PCA analysis based on the above
microarchitecture-dependent characteristics. Dim1 is the primary principal component, and its main
contributor is the L2 miss rate. Dim2 is the second principal component, and it includes L1D MPKI,
DTLB MPKI, and the branch misprediction rate. Although the data points for applications such as
SinaWeibo and FrozenBubble are fairly close along the Y-axis, these applications are widely spread

across the primary principal component. Given the PCA results of both microarchitecture-independent
and microarchitecture-dependent behaviors, we can conclude that the behaviors of Moby applications
vary significantly and in many ways.

[Figure 12 plot: y-axis "Percent of Overall Core Cycles", from 0 to 90.]

Fig. 12. Ratio of cycles executed by Core 0. Core 1 accounts for the remainder.

Fig. 13. PCA results for microarchitecture-dependent characteristics on the Pandaboard

VI. RELATED WORK

There are many kinds of benchmarks to evaluate the performance of mobile devices. In the industry
community, commonly used benchmarks such as EEMBC [38], SiSoft Sandra [39], AnTuTu [8], 3D
GLBenchmark [9], and Geekbench [10] can measure the peak performance of mobile device components,
including the CPU, memory, GPU, and multimedia support. On one hand, some of these benchmarks are
not freely available to academia. On the other hand, the peak performance of each component cannot
represent the total performance of the system. Other benchmarks such as SunSpider [11] and
BrowserMark [40] only test the performance of specific applications or classes of applications
(e.g., embedded Java benchmarks [43] or the MEVBench computer vision applications [42]).

In the research community, MiBench [41] has been widely used for embedded systems. Although it
contains 35 embedded applications covering six categories, the applications differ greatly from
current mobile applications in terms of diversity, coding language, code size, and functionality.
Hence, MiBench has become less suitable for current microarchitectural analysis. Gutierrez et
al. [14] present an interactive game, a video player, a media player, and BBench as typical
benchmarks for smartphones. MobileBench [12] contains several web browsing applications, a photo
rendering application, and a video player. Sunwoo et al. [13] study several smartphone workloads
(AndEBench, CaffeineMark, RL Benchmark, Angry Birds, and KingsoftOffice) to measure the performance
of the Dalvik virtual machine, SQLite, and the whole system. Compared to these benchmarks, Moby
contains some similar applications, and some with diverse behaviors not yet found in other suites.

VII. CONCLUSION

Mobile devices have already become the primary consumer computing devices, and their use still
exhibits rapid growth. Efficient mobile processor design requires knowledge of typical mobile
applications. In this paper, we have presented a mobile benchmark suite, Moby, that includes
popular applications executed under Android OS. Our analysis finds them to be sufficiently diverse
to be considered representative.

In this study, we fully characterize Moby in order to assist other researchers in using it for
their studies. We use the gem5 simulator and the hardware performance counters provided by ARM
processors to evaluate Moby’s microarchitecture-independent features (instruction mix, working set
size, data and instruction locality, and binary execution behavior) and microarchitecture-dependent
features (CPI and the behaviors of the branch predictor, caches, TLBs, and other memory
components). We will continue to add more mobile applications to the Moby benchmark suite as more
mainstream or popular applications emerge. Furthermore, we will integrate user-action automation
tools to model the effects of user inputs on applications.

ACKNOWLEDGMENT

We would like to thank Sally McKee for her useful suggestions and hard work improving the writing
quality. We also thank Yungang Bao, Kun Zhang, and other teammates from ICT, and the anonymous
reviewers for helpful suggestions and insightful feedback. This research is supported by the
National Basic Research Program of China (973 Program) under grant number 2011CB302502, the
National Natural Science Foundation of China (NSFC) under grant numbers 60925009, 61272132, and
61221062, the Strategic Priority Research Program of the Chinese Academy of Sciences under grant
number XDA06010401, and the Huawei Research Program under grant number YBCB2011030.

REFERENCES

[1] “Android operating system for mobile devices,” http://www.android.com.
[2] “iOS operating system for apple,” http://www.apple.com/ios.
[3] “ARM architecture reference manual: ARM v7-A and ARM v7-R edition.”
[4] “Apple system on chips,” http://en.wikipedia.org/wiki/Apple_System_on_Chips.
[5] “OMAP applications processors,” http://www.ti.com/lsds/ti/omap-applications-processors/features.page.
[6] “Qualcomm Snapdragon processors,” http://www.qualcomm.com/snapdragon.

[7] “Intel Atom processor,” http://www.intel.com/content/www/us/en/processors/atom/atom-processor.html.
[8] “AnTuTu,” http://www.antutu.com/index.shtml.
[9] “Gfxbench,” https://gfxbench.com/result.jsp.
[10] “Geekbench,” http://www.primatelabs.com/geekbench.
[11] “SunSpider,” http://www.webkit.org/perf/sunspider/sunspider.html.
[12] D. Pandiyan, S.-Y. Lee, and C.-J. Wu, “Performance, energy characterizations and architectural implications of an emerging mobile platform benchmark suite - MobileBench,” in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2013.
[13] D. Sunwoo, W. Wang, M. Ghosh, C. Sudanthi, G. Blake, C. Emmons, and N. Paver, “A structured approach to the simulation, analysis and characterization of smartphone applications,” in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2013.
[14] A. Gutierrez, R. G. Dreslinski, T. F. Wenisch, T. Mudge, A. Saidi, C. Emmons, and N. Paver, “Full-system analysis and characterization of interactive smartphone applications,” in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2011, pp. 81–90.
[15] “Google Play Store,” https://play.google.com/store.
[16] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[17] “Moby: A Mobile Benchmark Suite,” http://asg.ict.ac.cn/projects/moby, 2013.
[18] OMAP4460 Pandaboard ES System Reference Manual, pandaboard.org, 2011.
[19] “GNU Xnee webpage,” http://www.gnu.org/software/xnee.
[20] K. Zhang, Y. Huang, and M. Chen, “Architecture characteristics and analysis of mobile device applications,” in National Annual Conference on High Performance Computing, China (In Chinese), 2013, pp. 81–90.
[21] “K9Mail,” https://github.com/k9mail/k-9, http://t.cn/zTlAnPO.
[22] “SinaWeibo,” http://www.weibo.com, http://t.cn/zTYHOxK.
[23] “NeteaseNews,” http://www.163.com, http://t.cn/zTYmGMj.
[24] “KingsoftOffice,” http://www.kingsoftstore.com, http://t.cn/zTYsBQC.
[25] “AdobeReader,” http://www.adobe.com/products/eulas, http://t.cn/zTTPgDj.
[26] “BaiduMap,” http://map.baidu.com, http://t.cn/zTT7y0Y.
[27] “MXPlayer,” https://sites.google.com/site/mxvpen, http://t.cn/zTTAq7Q.
[28] “TTPod,” http://www.ttpod.com, http://t.cn/zTT2cNg.
[29] “FrozenBubble,” http://t.cn/zTTLjD8.
[30] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: Characterization and architectural implications,” in Proceedings of the 17th international conference on Parallel Architectures and Compilation Techniques. ACM, 2008, pp. 72–81.
[31] “Versatile Express products,” http://www.arm.com/products/tools/development-boards/versatile-express/index.php.
[32] “ARM Cortex A9,” http://www.arm.com/products/processors/cortex-a/cortex-a9.php.
[33] “TopMC,” http://asg.ict.ac.cn/projects/topmc, 2011.
[34] J. L. Henning, “SPEC CPU2006 benchmark descriptions,” ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
[35] S. Bird, A. Phansalkar, L. K. John, A. Mericas, and R. Indukuru, “Performance characterization of SPEC CPU benchmarks on Intel’s Core microarchitecture based processor,” in SPEC Benchmark Workshop, 2007.
[36] T. Jiang, R. Hou, L. Zhang, K. Zhang, L. Chen, M. Chen, and N. Sun, “Micro-architectural characterization of desktop cloud workloads,” in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2012, pp. 131–140.
[37] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the clouds: a study of emerging scale-out workloads on modern hardware,” in Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2012, pp. 37–48.
[38] “EDN embedded microprocessor benchmark consortium,” http://www.eembc.org.
[39] “SiSoft sandra,” http://www.sisoftware.net.
[40] “Browsermark,” http://browsermark.rightware.com.
[41] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, “MiBench: A free, commercially representative embedded benchmark suite,” in IEEE International Workshop on Workload Characterization. IEEE, 2001, pp. 3–14.
[42] J. Clemons, H. Zhu, S. Savarese, and T. Austin, “MEVBench: A mobile computer vision benchmarking suite,” in IEEE International Symposium on Workload Characterization. IEEE, 2011, pp. 91–102.
[43] C. Isen, L. John, J. P. Choi, and H. J. Song, “On the representativeness of embedded Java benchmarks,” in IEEE International Symposium on Workload Characterization, 2008. IEEE, 2008, pp. 153–162.
