Semantics-Aware Android Malware Classification Using Weighted Contextual API Dependency Graphs


Mu Zhang        Yue Duan        Heng Yin        Zhiruo Zhao
Department of EECS, Syracuse University, Syracuse, NY, USA
{muzhang,yuduan,heyin,zzhao11}@syr.edu

ABSTRACT

The drastic increase of Android malware has led to a strong interest in developing methods to automate the malware analysis process. Existing automated Android malware detection and classification methods fall into two general categories: 1) signature-based and 2) machine learning-based. Signature-based approaches can be easily evaded by bytecode-level transformation attacks. Prior learning-based works extract features from application syntax, rather than program semantics, and are also subject to evasion. In this paper, we propose a novel semantic-based approach that classifies Android malware via dependency graphs. To battle transformation attacks, we extract a weighted contextual API dependency graph as program semantics to construct feature sets. To fight against malware variants and zero-day malware, we introduce graph similarity metrics to uncover homogeneous application behaviors while tolerating minor implementation differences. We implement a prototype system, DroidSIFT, in 23 thousand lines of Java code. We evaluate our system using 2200 malware samples and 13500 benign samples. Experiments show that our signature detection can correctly label 93% of malware instances; our anomaly detector is capable of detecting zero-day malware with a low false negative rate (2%) and an acceptable false positive rate (5.15%) for vetting purposes.

Categories and Subject Descriptors

D.2.4 [Software Engineering]: Software/Program Verification—Validation; D.4.6 [Operating Systems]: Security and Protection—Invasive software

General Terms

Security

Keywords

Android; Malware classification; Semantics-aware; Graph similarity; Signature detection; Anomaly detection

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CCS'14, November 3–7, 2014, Scottsdale, Arizona, USA.
Copyright 2014 ACM 978-1-4503-2957-6/14/11 ...$15.00.
http://dx.doi.org/10.1145/2660267.2660359.

1. INTRODUCTION

The number of new Android malware instances has grown exponentially in recent years. McAfee reports [3] that 2.47 million new mobile malware samples were collected in 2013, which represents a 197% increase over 2012. Greater and greater amounts of manual effort are required to analyze the increasing number of new malware instances. This has led to a strong interest in developing methods to automate the malware analysis process.

Existing automated Android malware detection and classification methods fall into two general categories: 1) signature-based and 2) machine learning-based. Signature-based approaches [18, 37] look for specific patterns in the bytecode and API calls, but they are easily evaded by bytecode-level transformation attacks [28]. Machine learning-based approaches [5, 6, 27] extract features from an application's behavior (such as permission requests and critical API calls) and apply standard machine learning algorithms to perform binary classification. Because the extracted features are associated with application syntax, rather than program semantics, these detectors are also susceptible to evasion.

To directly address malware that evades automated detection, prior works distill program semantics into graph representations, such as control-flow graphs [11], data dependency graphs [16, 21], and permission event graphs [10]. These graphs are checked against manually crafted specifications to detect malware. However, these detectors tend to seek an exact match for a given specification and therefore can potentially be evaded by polymorphic variants. Furthermore, the specifications used for detection are produced from known malware families and cannot be used to battle zero-day malware.

In this paper, we propose a novel semantic-based approach that classifies Android malware via dependency graphs. To battle transformation attacks [28], we extract a weighted contextual API dependency graph as program semantics to construct feature sets. The subsequent classification then depends on more robust semantic-level behavior rather than program syntax. It is much harder for an adversary to use an elaborate bytecode-level transformation to evade such a training system. To fight against malware variants and zero-day malware, we introduce graph similarity metrics to uncover homogeneous essential application behaviors while tolerating minor implementation differences. Consequently, new or polymorphic malware that has a unique implementation, but performs common malicious behaviors, cannot evade detection.

To our knowledge, when compared to traditional semantics-aware approaches for desktop malware detection, we are the first to examine program semantics within the context of Android malware classification. We also take a step further to defeat malware variants and zero-day malware by comparing the similarity of these programs to that of known malware at the behavioral level.

We build a database of behavior graphs for a collection of Android apps. Each graph models the API usage scenario and program semantics of the app that it represents. Given a new app, a
query is made for the app's behavior graphs to search for the most similar counterpart in the database. The query result is a similarity score which sets the corresponding element in the feature vector of the app. Every element of this feature vector is associated with an individual graph in the database.
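The feature-vector construction described here can be sketched as follows. This is a minimal illustration under stated assumptions: the Jaccard-overlap similarity stand-in and the API-label sets are hypothetical, whereas DroidSIFT itself scores graphs by weighted graph edit distance.

```python
# Hypothetical sketch of DroidSIFT-style feature-vector construction.
# The similarity function and API names are illustrative stand-ins,
# not the paper's actual implementation.

def similarity(g1, g2):
    """Toy graph similarity: Jaccard overlap of API-node labels.
    (DroidSIFT uses weighted graph edit distance instead.)"""
    n1, n2 = set(g1), set(g2)
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0

def feature_vector(app_graphs, database):
    """One element per database graph; an element is non-zero only if
    that graph is the best match for one of the app's graphs."""
    vec = [0.0] * len(database)
    for g in app_graphs:
        scores = [similarity(g, ref) for ref in database]
        best = max(range(len(database)), key=scores.__getitem__)
        vec[best] = max(vec[best], scores[best])
    return vec

db = [{"createFromPdu", "getMessageBody", "execute"},
      {"sendTextMessage", "getLine1Number"}]
app = [{"createFromPdu", "getMessageBody", "setEntity", "execute"}]
print(feature_vector(app, db))  # [0.75, 0.0]
```

Each element stays zero unless its database graph is some app graph's best match, mirroring the description above.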
We build graph databases for two sets of behaviors: benign and malicious. Feature vectors extracted from these two sets are then used to train two separate classifiers for anomaly detection and signature detection. The former is capable of discovering zero-day malware, and the latter is used to identify malware variants.

We implement a prototype system, DroidSIFT, in 23 thousand lines of Java code. Our dependency graph generation is built on top of Soot [2], while our graph similarity query leverages a graph matching toolkit [29] to compute graph edit distance. We evaluate our system using 2200 malware samples and 13500 benign samples. Experiments show that our signature detection can correctly label 93% of malware instances; our anomaly detector is capable of detecting zero-day malware with a low false negative rate (2%) and an acceptable false positive rate (5.15%) for vetting purposes.

In summary, this paper makes the following contributions:

• We propose a semantic-based malware classification to accurately detect both zero-day malware and unknown variants of known malware families. We model program semantics with weighted contextual API dependency graphs, introduce graph similarity metrics to capture homogeneous semantics in malware variants or zero-day malware samples, and encode similarity scores into graph-based feature vectors to enable classifiers for anomaly and signature detections.

• We implement a prototype system, DroidSIFT, that performs static program analysis to generate dependency graphs. DroidSIFT builds graph databases and produces graph-based feature vectors by performing graph similarity queries. It then uses the feature vectors of benign and malware apps to construct two separate classifiers that enable anomaly and signature detections, respectively.

• We evaluate the effectiveness of DroidSIFT using 2200 malware samples and 13500 benign samples. Experiments show that our signature detection can correctly label 93% of malware instances; our anomaly detector is capable of detecting zero-day malware with a low false negative rate (2%) and an acceptable false positive rate (5.15%) for vetting purposes.

Paper Organization. The rest of the paper is organized as follows. Section 2 describes the deployment of our learning-based detection mechanism and gives an overview of our malware classification technique. Section 3 defines our graph representation of Android program semantics and explains our graph generation methodology. Section 4 elaborates on graph similarity queries, the extraction of graph-based feature vectors, and learning-based anomaly & signature detection. Section 5 shows the experimental results of our approach. Discussion and related work are presented in Sections 6 and 7, respectively. Finally, the paper concludes with Section 8.

2. OVERVIEW

In this section, we present an overview of the problem and the deployment and architecture of our proposed solution.

2.1 Problem Statement

An effective vetting process for discovering malicious software is essential for maintaining a healthy ecosystem in the Android app markets. Unfortunately, existing vetting processes are still fairly rudimentary. As an example, consider the Bouncer [22] vetting system that is used by the official Google Play Android market. Though the technical details of Bouncer are not publicly available, experiments by Oberheide and Miller [24] show that Bouncer performs dynamic analysis to examine apps within an emulated Android environment for a limited period of time. This method of analysis can be easily evaded by apps that perform emulator detection, contain hidden behaviors that are timer-based, or otherwise avoid triggering malicious behavior during the short time period when the app is being vetted. Signature detection techniques adopted by the current anti-malware products have also been shown to be trivially evaded by simple bytecode-level transformations [28].

Figure 1: Deployment of DroidSIFT

We propose a new technique, DroidSIFT, illustrated in Figure 1, that addresses these shortcomings and can be deployed as a replacement for existing vetting techniques currently in use by the Android app markets. This technique is based on static analysis, which is immune to emulation detection and is capable of analyzing the entirety of an app's code. Further, to defeat bytecode-level transformations, our static analysis is semantics-aware and extracts program behaviors at the semantic level. More specifically, we achieve the following design goals:

• Semantic-based Detection. Our approach detects malware instances based on their program semantics. It does not rely on malicious code patterns, external symptoms, or heuristics. Our approach is able to perform program analysis for both the interpretation and demonstration of inherent program dependencies and execution contexts.

• High Scalability. Our approach is scalable to cope with millions of unique benign and malicious Android app samples. It addresses the complexity of static program analysis, which can be considerably expensive in both time and memory when performed precisely.

• Variant Resiliency. Our approach is resilient to polymorphic variants. It is common for attackers to implement malicious functionalities in slightly different manners and still be able to perform the expected malicious behaviors. This malware polymorphism can defeat detection methods that are based on exact behavior matching, which is the method prevalently adopted by existing signature-based detection and graph-based model checking. To address this, our technique is able to measure the similarity of app behaviors and tolerate such implementation variants by using similarity scores.

Consequently, we are able to conduct two kinds of classifications: anomaly detection and signature detection. Upon receiving a new app submission, our vetting process will conduct anomaly detection to determine whether it contains behaviors that significantly deviate from the benign apps within our database. If such a deviation is discovered, a potential malware instance is identified. Then, we conduct further signature detection on it to determine if this app falls into any malware family within our signature database. If so, the app is flagged as malicious and bounced back to the developer immediately.
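The two-stage decision process just described (anomaly detection against the benign-graph database first, then signature detection against the malware-graph database) can be sketched as follows. The detector and classifier classes here are toy stand-ins, not DroidSIFT's actual trained classifiers.

```python
# Hypothetical sketch of the two-stage vetting flow. The threshold
# detector and lookup classifier are illustrative assumptions only.

class ThresholdAnomalyDetector:
    """Toy stand-in: an app is anomalous when even its best match
    among the benign behavior graphs is a poor one."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def is_anomalous(self, benign_vec):
        return max(benign_vec, default=0.0) < self.threshold

class LookupSignatureClassifier:
    """Toy stand-in: exact lookup of a malware-graph feature vector
    in a table mapping vectors to family names."""
    def __init__(self, table):
        self.table = table

    def classify(self, malware_vec):
        return self.table.get(tuple(malware_vec))

def vet_app(benign_vec, malware_vec, detector, classifier):
    """Return 'benign', 'known_malware:<family>', or 'suspicious'
    (a potential zero-day that needs further investigation)."""
    if not detector.is_anomalous(benign_vec):
        return "benign"                   # close to known-benign behavior
    family = classifier.classify(malware_vec)
    if family is not None:
        return f"known_malware:{family}"  # bounced back to the developer
    return "suspicious"

detector = ThresholdAnomalyDetector(threshold=0.5)
classifier = LookupSignatureClassifier({(0.0, 0.9): "Zitmo"})
print(vet_app([0.1, 0.2], [0.0, 0.9], detector, classifier))  # known_malware:Zitmo
```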
Figure 2: Overview of DroidSIFT (Behavior Graph Generation, Scalable Graph Similarity Query, Graph-based Feature Vector Extraction, Anomaly & Signature Detection)
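To make the similarity scores that flow through this pipeline concrete, here is a rough, simplified illustration. The actual system computes weighted graph edit distance via a graph-matching toolkit [29]; the tiny graphs and uniform edit costs below are assumptions for illustration only.

```python
# Simplified stand-in for a graph-similarity score: count node and
# edge insertions/deletions between two labeled API graphs, then
# normalize into [0, 1]. DroidSIFT's real metric is a weighted graph
# edit distance; these graphs and costs are illustrative only.

def edit_cost(g1, g2):
    """g1, g2: (nodes, edges) where nodes is a set of API labels and
    edges is a set of (src_label, dst_label) dependency pairs."""
    n1, e1 = g1
    n2, e2 = g2
    node_ops = len(n1 ^ n2)   # symmetric difference = insertions + deletions
    edge_ops = len(e1 ^ e2)
    return node_ops + edge_ops

def similarity(g1, g2):
    """Normalize the edit cost into a [0, 1] similarity score."""
    max_cost = len(g1[0] | g2[0]) + len(g1[1] | g2[1])
    return 1.0 - edit_cost(g1, g2) / max_cost if max_cost else 1.0

zitmo = ({"createFromPdu", "getMessageBody", "execute"},
         {("createFromPdu", "getMessageBody"),
          ("getMessageBody", "execute")})
variant = ({"createFromPdu", "getMessageBody", "sendTextMessage"},
           {("createFromPdu", "getMessageBody"),
            ("getMessageBody", "sendTextMessage")})
print(round(similarity(zitmo, variant), 2))  # 0.43
```

A real implementation would additionally weight node costs by API criticality, context, and constant parameters; this sketch only counts unweighted node and edge operations.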
If the app passes this hurdle, it is still possible that a new malware species has been found. We bounce the app back to the developer with a detailed report when suspicious behaviors that deviate from benign behaviors are discovered, and we request justification for the deviation. The app is approved only after the developer makes a convincing justification for the deviation. Otherwise, after further investigation, we may confirm it to indeed be a new malware species. By placing this information into our malware database, we further improve our signature detection to detect this new malware species in the future.

It is also possible to deploy our technique via a more ad-hoc scheme. For example, our detection mechanism can be deployed as a public service that allows a cautious app user to examine an app prior to its installation. An enterprise that maintains its own private app repository could utilize such a security service. The enterprise service conducts vetting prior to adding an app to the internal app pool, thereby protecting employees from apps that contain malware behaviors.

2.2 Architecture Overview

Figure 2 depicts the workflow of our graph-based Android malware classification, which proceeds in the following steps:

(1) Behavior Graph Generation. Our malware classification considers graph similarity as the feature vector. To this end, we first perform static program analysis to transform Android bytecode programs into their corresponding graph representations. Our program analysis includes entry point discovery and call graph analysis to better understand the API calling contexts, and it leverages both forward and backward dataflow analysis to explore API dependencies and uncover any constant parameters. The result of this analysis is expressed via Weighted Contextual API Dependency Graphs that expose security-related behaviors of Android apps.

(2) Scalable Graph Similarity Query. Having generated graphs for both benign and malicious apps, we then query the graph database for the one graph most similar to a given graph. To address the scalability challenge, we utilize a bucket-based indexing scheme to improve search efficiency. Each bucket contains graphs bearing APIs from the same Android packages, and it is indexed with a bitvector that indicates the presence of such packages. Given a graph query, we can quickly seek to the corresponding bucket index by matching the package's vector to the bucket's bitvector. Once a matching bucket is located, we further iterate over this bucket to find the best-matching graph. Finding the best-matching graph, instead of an exact match, is necessary to identify polymorphic malware.

(3) Graph-based Feature Vector Extraction. Given an app, we attempt to find the best match for each of its graphs from the database. This produces a similarity feature vector. Each element of the vector is associated with an existing graph in the database. This vector bears a non-zero similarity score in one element only if the corresponding graph is the best match to one of the graphs for the given app.

(4) Anomaly & Signature Detection. We have implemented a signature classifier and an anomaly detector. We have produced feature vectors for malicious apps, and these vectors are used to train the classifier for signature detection. The anomaly detector discovers zero-day Android malware, and the signature classifier uncovers the type (family) of the malware.

3. WEIGHTED CONTEXTUAL API DEPENDENCY GRAPH

In this section, we describe how we capture the semantic-level behaviors of Android malware in the form of graphs. We start by identifying the key behavioral aspects that must be captured, present a formal definition, and then present a real example to demonstrate these aspects.

3.1 Key Behavioral Aspects

We consider the following aspects as essential when describing the semantic-level behaviors of an Android malware sample:

1) API Dependency. API calls (including reflective calls to the private framework functions) indicate how an app interacts with the Android framework. It is essential to capture what API calls an app makes and the dependencies among those calls. Prior works on semantic- and behavior-based malware detection and classification for desktop environments all make use of API dependency information [16, 21]. Android malware shares the same characteristics.

2) Context. An entry point of an API call is a program entry point that directly or indirectly triggers the call. From a user-awareness point of view, there are two kinds of entry points: user interfaces and background callbacks. Malware authors commonly exploit background callbacks to enable malicious functionalities without the user's knowledge. From a security analyst's perspective, it is a suspicious behavior when a typically user-interactive API (e.g., AudioRecord.startRecording()) is called stealthily [10]. As a result, we must pay special attention to APIs activated from background callbacks.

3) Constant. Constants convey semantic information by revealing the values of critical parameters and uncovering fine-grained API semantics. For instance, Runtime.exec() may execute
varied shell commands, such as ps or chmod, depending upon the input string constant. Constant analysis also discloses the data dependencies of certain security-sensitive APIs whose benignness depends upon whether an input is constant. For example, a sendTextMessage() call taking a constant premium-rate phone number as a parameter is more suspicious than a call to the same API receiving the phone number from user input via getText(). Consequently, it is crucial to extract information about the usage of constants for security analysis.

Once we look at app behaviors using these three perspectives, we perform similarity checking, rather than seeking an exact match, on the behavioral graphs. Since each individual API node plays a distinctive role in an app, it contributes differently to the graph similarity. With regard to malware detection, we emphasize security-sensitive APIs combined with critical contexts or constant parameters. We assign weights to different API nodes, giving greater weights to the nodes containing critical calls, to improve the "quality" of behavior graphs when measuring similarity. Moreover, the weight generation is automated. Thus, similar graphs have higher similarity scores by design.

3.2 Formal Definition

To address all of the aforementioned factors, we describe app behaviors using Weighted Contextual API Dependency Graphs (WC-ADG). At a high level, a WC-ADG consists of API operations, where some of the operations have data dependencies. A formal definition is presented as follows.

Definition 1. A Weighted Contextual API Dependency Graph is a directed graph G = (V, E, α, β) over a set of API operations Σ and a weight space W, where:

Figure 3: WC-ADG of Zitmo. Each node lists an API call with its entry point (BroadcastReceiver.onReceive) and constant set; one node carries Setc = {"http://softthrifty.com/security.jsp"}.

Figure 4: Callgraph for asynchronously sending an SMS message. "e" and "a" stand for "event handler" and "action" respectively. Event handlers shown: OnClickListener.onClick, Runnable.run, Handler.handleMessage; actions: Runnable.start, Handler.sendMessage, SmsManager.sendTextMessage.

manner. This graph contains five API call nodes. Each node contains the call's prototype, a set of any constant parameters, and the entry points of the call. Dashed arrows that connect a pair of nodes indicate that a data dependency exists between the two calls in those nodes.

By combining the knowledge of API prototypes with the data dependency information shown in the graph, we know that the app is forwarding an incoming SMS to the network. Once an SMS is received by the mobile phone, Zitmo creates an SMS object from the raw Protocol Data Unit by calling createFromPdu(byte[]). It extracts the sender's phone number and message content by calling getOriginatingAddress() and getMessageBody(). Both strings are encoded into an UrlEncodedFormEntity object and enclosed into HttpEntityEnclosingRequestBase by using the setEntity() call. Finally, this HTTP request is sent to the network via AbstractHttpClient.execute().

Zitmo variants may also exploit various other communication-related API calls for the sending purpose. Another Zitmo instance uses SmsManager.sendTextMessage() to deliver the stolen information as a text message to the attacker's phone. Such variations motivate us to consider graph similarity metrics, rather than an exact matching of API call behavior, when determining whether a sample app is benign or malicious.

The context provided by the entry points of these API calls informs us that the user is not aware of this SMS forwarding behavior. These consecutive API invocations start within the entry point method onReceive() with a call to createFromPdu(byte[]). onReceive() is a broadcast receiver callback registered by the app to receive incoming SMS messages in the background. Therefore, the createFromPdu(byte[]) and subsequent API calls are activated from a non-user-interactive entry point and are hidden from the user.

Constant analysis of the graph further indicates that the forwarding destination is suspicious. The parameter of execute() is nei-
   • The set of vertices V corresponds to the contextual API opera-
                                                                         ther the sender (i.e., the bank) nor any familiar parties from the
tions in Σ;
                                                                         contacts. It is a constant URL belonging to an unknown third-party.
   • The set of edges E ⊆ V × V corresponds to the data dependen-
cies between operations;
   • The labeling function α : V → Σ associates nodes with the la-       3.4 Graph Generation
bels of corresponding contextual API operations, where each label           We have implemented a graph generation tool on top of Soot [2]
is comprised of 3 elements: API prototype, entry point and constant      in 20k lines of code. This tool examines an Android app to conduct
parameter;                                                               entry point discovery and perform context-sensitive, flow-sensitive,
   • The labeling function β : V → W associates nodes with their         and interprocedural dataflow analyses. These analyses locate API
corresponding weights, where ∀w ∈ W , w ∈ R, and R represents            call parameters and return values of interest, extract constant pa-
the space of real numbers.                                               rameters, and determine the data dependencies among the API calls.

3.3      A Real Example                                                    Entry Point Discovery.
  Zitmo is a class of banking trojan malware that steals a user’s             Entry point discovery is essential to revealing whether the user
SMS messages to discover banking information (e.g., mTANs).                is aware that a certain API call has been made. However, this
Figure 3 presents an example WC-ADG that depicts the malicious             identification is not straightforward. Consider the callgraph seen
behavior of a Zitmo malware sample in a concise, yet complete,             in Figure 4. This graph describes a code snippet that registers a
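As an illustrative sketch (not the authors' tooling), Definition 1 maps directly onto a small data structure. The Zitmo-style labels, the onReceive entry point, and the weights below are hypothetical values chosen only for the example:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class APILabel:
    """A contextual API operation: prototype, entry point, constant parameter."""
    prototype: str    # API prototype
    entry_point: str  # entry point context, e.g. "onReceive"
    constant: str     # extracted constant parameter, "" if none

@dataclass
class WCADG:
    """Weighted Contextual API Dependency Graph (Definition 1)."""
    alpha: dict = field(default_factory=dict)  # vertex id -> APILabel (labeling function)
    beta: dict = field(default_factory=dict)   # vertex id -> real-valued weight
    edges: set = field(default_factory=set)    # data dependencies: (src, dst) pairs

    def add_vertex(self, v, label, weight=1.0):
        self.alpha[v] = label
        self.beta[v] = weight

    def add_edge(self, u, v):
        # Edges are data dependencies between existing API operation nodes.
        assert u in self.alpha and v in self.alpha
        self.edges.add((u, v))

# Illustrative Zitmo-style graph: SMS fields flow into an HTTP request.
g = WCADG()
g.add_vertex(0, APILabel("SmsMessage.createFromPdu(byte[])", "onReceive", ""), 5.0)
g.add_vertex(1, APILabel("SmsMessage.getOriginatingAddress()", "onReceive", ""), 1.0)
g.add_vertex(2, APILabel("SmsMessage.getMessageBody()", "onReceive", ""), 1.0)
g.add_vertex(3, APILabel("AbstractHttpClient.execute()", "onReceive", "http://..."), 5.0)
g.add_edge(0, 1); g.add_edge(0, 2); g.add_edge(1, 3); g.add_edge(2, 3)
```

The two labeling functions α and β are kept as separate dictionaries to mirror the definition; a production implementation would likely attach them to node objects.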
Algorithm 1 Entry Point Reduction for Asynchronous Callbacks
  Mentry ← {Possible entry point callback methods}
  CMasync ← {Pairs of (BaseClass, RunMethod) for asynchronous calls in framework}
  RSasync ← {Map from RunMethod to StartMethod for asynchronous calls in framework}
  for mentry ∈ Mentry do
      c ← the class declaring mentry
      base ← the base class of c
      if (base, mentry) ∈ CMasync then
          mstart ← Lookup(mentry) in RSasync
          for ∀ call to mstart do
              r ← "this" reference of call
              PointsToSet ← PointsToAnalysis(r)
              if c ∈ PointsToSet then
                  Mentry = Mentry − {mentry}
                  BuildDependencyStub(mstart, mentry)
              end if
          end for
      end if
  end for
  output Mentry as reduced entry point set

Table 1: Calling Convention of Asynchronous Calls
  Top-level Class   Run Method          Start Method
  Runnable          run()               start()
  AsyncTask         onPreExecute()      execute()
  AsyncTask         doInBackground()    onPreExecute()
  AsyncTask         onPostExecute()     doInBackground()
  Message           handleMessage()     sendMessage()

public class AsyncTask{
    public AsyncTask execute(Params... params){
        executeStub(params);
    }
    public AsyncTask executeStub(Params... params){
        onPreExecute();
        Result result = doInBackground(params);
        onPostExecuteStub(result);
    }
    public void onPostExecuteStub(Result result){
        onPostExecute(result);
    }
}
Figure 5: Stub code for dataflow of AsyncTask.execute
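Algorithm 1 can be rendered as straight-line Python if the framework model and the static-analysis results (callgraph and points-to sets) are stubbed out as plain mappings. All app class and method names here are hypothetical; the real tool operates on Soot's analysis results:

```python
def reduce_entry_points(m_entry, cm_async, rs_async, base_class,
                        calls_to_start, points_to):
    """Algorithm 1 sketch: drop asynchronous 'callee' callbacks from the
    entry point set. A callback is a (declaring_class, method) pair."""
    stubs = []
    for (c, m) in sorted(m_entry):
        base = base_class[c]                         # top-level base class of c
        if (base, m) not in cm_async:                # not an async run-method
            continue
        m_start = rs_async[m]                        # matching "caller" start method
        for ref in calls_to_start.get(m_start, []):  # each call site of m_start
            if c in points_to[ref]:                  # "this" may point to class c
                m_entry.discard((c, m))              # remove the callee entry point
                stubs.append((m_start, (c, m)))      # BuildDependencyStub
    return m_entry, stubs

# Hypothetical app: MyTask.run() is started via Thread.start() inside onClick().
entries, stubs = reduce_entry_points(
    m_entry={("MyTask", "run"), ("MyActivity", "onClick")},
    cm_async={("Runnable", "run")},
    rs_async={"run": "start"},
    base_class={"MyTask": "Runnable", "MyActivity": "Activity"},
    calls_to_start={"start": ["r1"]},
    points_to={"r1": {"MyTask"}},
)
```

After the reduction, only onClick() survives as an entry point, matching the single-entry-point call chain described in the text.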
   Consider the callgraph seen in Figure 4. This graph describes a code snippet that registers an onClick() event handler for a button. From within the event handler, the code starts a thread instance by calling Thread.start(), which invokes the run() method implementing Runnable.run(). The run() method passes an android.os.Message object to the message queue of the hosting thread via Handler.sendMessage(). A Handler object created in the same thread is then bound to this message queue, and its Handler.handleMessage() callback will process the message and later execute sendTextMessage().
   The sole entry point to the graph is the user-interactive callback onClick(). However, prior work [23] on the identification of program entry points does not consider asynchronous calls and recognizes all three callbacks in the program as individual entry points. This confuses the determination of whether the user is aware that an API call has been made in response to a user-interactive callback. To address this limitation, we propose Algorithm 1 to remove any potential entry points that are actually part of an asynchronous call chain with only a single entry point.
   Algorithm 1 accepts three inputs and provides one output. The first input is Mentry, which is a set of possible entry points. The second is CMasync, which is a set of (BaseClass, RunMethod) pairs. BaseClass represents a top-level asynchronous base class (e.g., Runnable) in the Android framework, and RunMethod is the asynchronous call target (e.g., Runnable.run()) declared in this class. The third input is RSasync, which maps RunMethod to StartMethod. RunMethod and StartMethod are the callee and caller in an asynchronous call (e.g., Runnable.run() and Runnable.start()). The output is a reduced Mentry set.
   We compute the Mentry input by applying the algorithm proposed by Lu et al. [23], which discovers all reachable callback methods defined by the app that are intended to be called only by the Android framework. To further consider the logical order between Intent senders and receivers, we leverage Epicc [25] to resolve the inter-component communications and then remove the Intent receivers from Mentry.
   Through examination of the Android framework code, we generate a list of 3-tuples consisting of BaseClass, RunMethod and StartMethod. For example, we capture the Android-specific calling convention of AsyncTask, with AsyncTask.onPreExecute() being triggered by AsyncTask.execute(). When a new asynchronous call is introduced into the framework code, this list is updated to include the change. Table 1 presents our current model for the calling convention of top-level base asynchronous classes in the Android framework.
   Given these inputs, our algorithm iterates over Mentry. For every method mentry in this set, it finds the class c that declares this method and the top-level base class base that c inherits from. Then, it searches for the pair of base and mentry in the CMasync set. If a match is found, the method mentry is a "callee" by convention. The algorithm thus looks up mentry in the map RSasync to find the corresponding "caller" mstart. Each call to mstart is further examined, and a points-to analysis is performed on the "this" reference making the call. If class c of method mentry belongs to the points-to set, we can ensure the calling relationship between the caller mstart and the callee mentry and remove the callee from the entry point set.
   To indicate the data dependency between these two methods, we introduce a stub which connects the parameters of the asynchronous call to the corresponding parameters of its callee. Figure 5 depicts the example stub code for AsyncTask, where the parameter of execute() is first passed to doInBackground() through the stub executeStub(), and then the return value from this asynchronous execution is further transferred to onPostExecute() via onPostExecuteStub().
   Once the algorithm has reduced the number of entry point methods in Mentry, we explore all code reachable from those entry points, including both synchronous and asynchronous calls. We further determine the user interactivity of an entry point by examining its top-level base class. If the entry point callback overrides a counterpart declared in one of the three top-level UI-related interfaces (i.e., android.graphics.drawable.Drawable.Callback, android.view.accessibility.AccessibilityEventSource, and android.view.KeyEvent.Callback), we then consider the derived entry point method as user-interactive.

Constant Analysis.
   We conduct constant analysis for any critical parameters of security-sensitive API calls. These calls may expose security-related behaviors depending upon the values of their constant parameters. For example, Runtime.exec() can directly execute shell commands, and file or database operations can interact with distinctive targets by providing the proper URIs as input parameters.
   To understand these semantic-level differences, we perform backward dataflow analysis on selected parameters and collect all possible constant values on the backward trace. We generate a constant set for each critical API argument and mark the parameter as "Constant" in the corresponding node on the WC-ADG. While a more complete string constant analysis is also possible, the computation of regular expressions is fairly expensive for static analysis. The substring set currently generated effectively reflects the semantics of a critical API call and is sufficient for further feature extraction.
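The backward constant collection described above can be sketched as a walk over def-use chains. The toy IR below, in which every value is either a string constant or an operation over other values, is an assumption for illustration, not the Soot-based implementation:

```python
def collect_constants(value, defs):
    """Backward dataflow sketch: gather all string constants that can
    reach `value`. `defs` maps a value id either to ("const", s) or to
    ("op", [operand ids]); a visited set cuts cycles in the chains."""
    consts, seen, stack = set(), set(), [value]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        kind, payload = defs[v]
        if kind == "const":
            consts.add(payload)       # a constant reached on the backward trace
        else:
            stack.extend(payload)     # an operation: follow its operands backward
    return consts

# Toy backward trace for Runtime.exec(cmd), where cmd may be "su" or "sh"
# merged at a phi node, concatenated with a statically unresolvable part.
defs = {
    "cmd":  ("op", ["phi1", "args"]),
    "phi1": ("op", ["c1", "c2"]),
    "c1":   ("const", "su"),
    "c2":   ("const", "sh"),
    "args": ("op", []),               # no constants reach this operand
}
```

The resulting set {"su", "sh"} is the kind of substring set that gets attached to the API node as its "Constant" attribute.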
API Dependency Construction.
   We perform global dataflow analysis to discover data dependencies between API nodes and build the edges of the WC-ADG. However, it is very expensive to analyze every single API call made by an app. For computational efficiency, and given our interest in security analysis, we choose to analyze only the security-related API calls. Permissions are strong indicators of security sensitivity in Android systems, so we leverage the API-permission mapping from PScout [8] to focus on permission-related API calls.
   Our static dataflow analysis is similar to the "split"-based approach used by CHEX [23]. Each program split includes all code reachable from a single entry point. Dataflow analysis is performed on each split, and then cross-split dataflows are examined. The difference between our analysis and that of CHEX lies in the fact that we compute larger splits due to the consideration of asynchronous calling conventions.
   We make a special consideration for reflective calls within our analysis. In Android programs, reflection is realized by calling the method java.lang.reflect.Method.invoke(). The "this" reference of this API call is a Method object, which is usually obtained by invoking either getMethod() or getDeclaredMethod() from java.lang.Class. The class is often acquired in a reflective manner too, through Class.forName(). This API call resolves a string input and retrieves the associated Class object.
   We consider any reflective invoke() call as a sink and conduct backward dataflow analysis to find any prior data dependencies. If such an analysis reaches string constants, we are able to statically resolve the class and method information. Otherwise, the reflective call is not statically resolvable. However, statically unresolvable behavior is still represented within the WC-ADG as nodes which contain no constant parameters. Instead, this reflective call may have several preceding APIs, from a dataflow perspective, which are the sources of its metadata.

4.    ANDROID MALWARE CLASSIFICATION

   We generate WC-ADGs for both benign and malicious apps. Each unique graph is associated with a feature that we use to classify Android malware and benign applications.

4.1    Graph Matching Score

   To quantify the similarity of two graphs, we first compute a graph edit distance. To our knowledge, all existing graph edit distance algorithms treat nodes and edges uniformly. However, in our case, the graph edit distance calculation must take into account the different weights of different API nodes. At present, we do not consider assigning different weights to edges because this would lead to prohibitively high complexity in graph matching. Moreover, to emphasize the differences between two nodes with different labels, we do not seek to relabel them. Instead, we delete the old node and then insert the new one. This is because the node "relabeling" cost, in our context, is not the string edit distance between the API labels of two nodes; it is the cost of deleting the old node plus that of adding the new node.
   Definition 2. The Weighted Graph Edit Distance (WGED) of two Weighted Contextual API Dependency Graphs G and G', with a weight function β, is the minimum cost to transform G to G':

   wged(G, G', β) = min( Σ_{v_I ∈ {V'−V}} β(v_I) + Σ_{v_D ∈ {V−V'}} β(v_D) + |E_I| + |E_D| )    (1)

, where V and V' are respectively the vertices of the two graphs, v_I and v_D are individual vertices inserted to and deleted from G, while E_I and E_D are the edges added to and removed from G.
   WGED represents the absolute difference between two graphs. This implies that wged(G, G', β) is roughly proportional to the sum of the graph sizes, and therefore two larger graphs are likely to be more distant from one another. To eliminate this bias, we normalize the resulting distance and further define Weighted Graph Similarity based upon it.
   Definition 3. The Weighted Graph Similarity of two Weighted Contextual API Dependency Graphs G and G', with a weight function β, is

   wgs(G, G', β) = 1 − wged(G, G', β) / (wged(G, ∅, β) + wged(∅, G', β))    (2)

, where ∅ is an empty graph. wged(G, ∅, β) + wged(∅, G', β) then equates to the maximum possible edit cost to transform G to G'.

4.2    Weight Assignment

   Instead of manually specifying the weights on different APIs (in combination with their attributes), we wish to find a near-optimal weight assignment.

Selection of Critical API Labels.
   Given a large number of API labels (unique combinations of API names and attributes), it is unrealistic to automatically assign weights for every one of them. Our goal is malware classification, so we concentrate on assigning weights to labels for the security-sensitive APIs and critical combinations of their attributes. To this end, we perform concept learning to discover critical API labels. Given a positive example set (PES) containing malware graphs and a negative example set (NES) containing benign graphs, we seek a critical API label (CA) based on two requirements: 1) frequency(CA, PES) > frequency(CA, NES) and 2) frequency(CA, NES) is less than the median frequency of all critical API labels in NES. The first requirement guarantees that a critical API label is more sensitive to a malware sample than to a benign one, while the second requirement ensures the infrequent presence of such an API label in the benign set. Consequently, we have selected 108 critical API labels. Our goal becomes the assignment of appropriate weights to these 108 labels while assigning a default weight of 1 to all remaining API labels.

Weight Assignment.
   Intuitively, if two graphs come from the same malware family and share one or more critical API labels, we must maximize the similarity between the two. We call such a pair of graphs a "homogeneous pair". Conversely, if one graph is malicious and the other is benign, even if they share one or more critical API labels, we must minimize the similarity between the two. We call such a pair of graphs a "heterogeneous pair". Therefore, we cast the problem of weight assignment as an optimization problem.
   Definition 4. The Weight Assignment is an optimization problem to maximize the result of an objective function for a given set of graph pairs {<G, G'>}:

   max f({<G, G'>}, β) = Σ_{<G,G'> is a homogeneous pair} wgs(G, G', β) − Σ_{<G,G'> is a heterogeneous pair} wgs(G, G', β)
   s.t.  1 ≤ β(v) ≤ θ, if v is a critical node;  β(v) = 1, otherwise.    (3)

, where β is the weight function that requires optimization and θ is the upper bound of a weight. Empirically, we set θ to be 20.
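Definitions 2 and 3 can be exercised on toy graphs. Exact WGED requires solving a node-correspondence problem (the paper uses an enhanced bipartite matcher for this); the sketch below handles only the easy special case where every label is unique within its graph, so the optimal edit simply inserts and deletes the non-shared labels. The graphs and weights are hypothetical:

```python
def wged(g, h):
    """Weighted graph edit distance (Eq. 1), specialized to graphs whose
    vertex labels are unique. g and h are (weights, edges) pairs: weights
    maps label -> weight, edges is a set of (label, label) tuples."""
    (wg, eg), (wh, eh) = g, h
    ins = sum(w for l, w in wh.items() if l not in wg)   # Σ β(v_I) terms
    dele = sum(w for l, w in wg.items() if l not in wh)  # Σ β(v_D) terms
    return ins + dele + len(eg ^ eh)                     # + |E_I| + |E_D|

def wgs(g, h):
    """Weighted graph similarity (Eq. 2): one minus the normalized distance."""
    empty = ({}, set())
    return 1 - wged(g, h) / (wged(g, empty) + wged(empty, h))

# Two hypothetical variants: one exfiltrates via HTTP, one via SMS.
a = ({"createFromPdu": 5, "execute": 5}, {("createFromPdu", "execute")})
b = ({"createFromPdu": 5, "sendTextMessage": 5}, {("createFromPdu", "sendTextMessage")})
```

Identical graphs score 1.0, and the two variants above stay fairly close because they share the heavily weighted createFromPdu node, which is exactly the effect the weighting is meant to produce.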
[Figure: homogeneous and heterogeneous graph pairs feed the objective function f({<G, G'>}, β); a Hill Climber adjusts β in a feedback loop.]
 Figure 6: A Feedback Loop to Solve the Optimization Problem
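The feedback loop of Figure 6 reduces to a coordinate-wise hill climber over the weight vector. The sketch below uses plain sets of labels as graphs and a simplified form of the Eq. 2 similarity; it is a hypothetical stand-in for illustration, not the paper's implementation:

```python
def wgs(g, h, beta):
    """Toy similarity under weight function beta (label -> weight, default 1)."""
    w = lambda labels: sum(beta.get(l, 1) for l in labels)
    dist = w(g - h) + w(h - g)   # cost of deleting/inserting non-shared labels
    denom = w(g) + w(h)          # maximum possible edit cost
    return 1 - dist / denom

def objective(homo, hetero, beta):
    """Eq. 3: reward similar homogeneous pairs, penalize heterogeneous ones."""
    return (sum(wgs(g, h, beta) for g, h in homo)
            - sum(wgs(g, h, beta) for g, h in hetero))

def hill_climb(homo, hetero, critical, theta=20, step=1):
    """Adjust one weight at a time; keep any change that improves Eq. 3."""
    beta = {l: 1 for l in critical}
    best = objective(homo, hetero, beta)
    improved = True
    while improved:
        improved = False
        for l in critical:
            for delta in (step, -step):
                cand = dict(beta)
                cand[l] = min(theta, max(1, cand[l] + delta))  # 1 <= beta <= theta
                score = objective(homo, hetero, cand)
                if score > best:
                    beta, best, improved = cand, score, True
    return beta, best

# Two malware graphs share the critical label "sendTextMessage"; a benign
# graph shares only the mundane label "openFile" with one of them.
m1 = {"sendTextMessage", "openFile"}
m2 = {"sendTextMessage", "readContacts"}
benign = {"openFile", "drawButton"}
beta, score = hill_climb([(m1, m2)], [(m1, benign)], critical={"sendTextMessage"})
```

The climber drives the critical label's weight up to the bound θ, since raising it simultaneously pulls the homogeneous pair together and pushes the heterogeneous pair apart.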

   To achieve the optimization of Equation 3, we use the Hill Climbing algorithm [30] to implement a feedback loop that gradually improves the quality of the weight assignment. Figure 6 presents such a system, which takes two sets of graph pairs and an initial weight function β as inputs. β is a discrete function which is represented as a weight vector. At each iteration, Hill Climbing adjusts a single element in the weight vector and determines whether the change improves the value of the objective function f({<G, G'>}, β). Any change that improves f({<G, G'>}, β) is accepted, and the process continues until no change can be found that further improves the value.

4.3    Implementation

   To compute the weighted graph similarity, we use a bipartite graph matching tool [29]. We cannot use this graph matching tool directly because it does not support assigning different weights to different nodes in a graph. To work around this limitation, we enhanced the bipartite algorithm to support weights on individual nodes.

4.4    Graph Database Query

   Given an app, we match its WC-ADGs against all existing graphs in the database. The number of graphs in the database can be fairly large, so the design of the graph query must be scalable.
   Intuitively, we could insert graphs into individual buckets, with each bucket labeled according to the presence of critical APIs. Instead of comparing a new graph against every existing graph in the database, we limit the comparison to only the graphs within the particular bucket whose graphs contain the corresponding set of critical APIs. Critical APIs generally have higher weights than regular APIs, so graphs in other buckets will not be very similar to the input graph and are safe to ignore. However, API-based bucket indexing may be overly strict, because APIs from the same package usually share similar functionality. For instance, both getDeviceId() and getSubscriberId() are located in the TelephonyManager package, and both retrieve identity-related information. Therefore, we instead index buckets based on the package names of critical APIs.
   More specifically, to build a graph database, we must first build an API package bitvector for all existing graphs within the database. Such a bitvector has n elements, each of which indicates the presence of a particular Android API package. For example, a graph that calls sendTextMessage() and getDeviceId() will set the corresponding bits for the android.telephony.SmsManager and android.telephony.TelephonyManager packages. Graphs that share the same bitvector (i.e., the same API package combination) are then placed into the same bucket. When querying a new graph, we compare it only against the graphs in the bucket that matches its bitvector; if necessary, we can further make a transition to a hierarchical database structure, such as a vantage point tree [20], under each bucket.

[Figure 7: Bucket-based Indexing of Graph Database — the Zitmo WC-ADG (createFromPdu, getOriginatingAddress, getMessageBody, setEntity, execute) is mapped to an API-package bitvector such as [0, 0, 1, …, 0, 0, 0, 0].]

   Figure 7 demonstrates the bucket query for the WC-ADG of Zitmo shown in Figure 3. This graph contains six API calls, three of which belong to a critical package: android.telephony.SmsManager. The generated bitvector for the Zitmo graph indicates the presence of this API package, and an exact match on the bitvector is performed against the bucket index. Notice that the presence of a single critical package is different from that of a combination of multiple critical packages. Thus, the result bucket used by this search contains graphs that include android.telephony.SmsManager as the only critical package in use. SmsManager, being a critical package, helps capture the SMS retrieval behavior and narrow down the search range. Because HTTP-related API packages are not considered critical, such an exact match over the index will not exclude Zitmo variants that use other I/O packages, such as SMS.

4.5    Malware Classification

Anomaly Detection.
   We have implemented a detector to conduct anomaly detection. Given an app, the detector provides a binary result that indicates whether the app is abnormal or not. To achieve this goal, we build a graph database for benign apps. The detector then attempts to match the WC-ADGs of the given app against the ones in the database. If a sufficiently similar match for any of the behavior graphs is not found, an anomaly is reported by the detector. We have set the similarity threshold to be 70% per our empirical studies.

Signature Detection.
   We next use a classifier to perform signature detection. Our signature detector is a multi-label classifier designed to identify the malware families of unknown malware instances.
   To enable classification, we first build a malware graph database. To this end, we conduct static analysis on the malware samples from the Android Malware Genome Project [1, 36] to extract WC-ADGs. In order to consider only the unique graphs, we remove any graphs that have a high level of similarity to existing ones. Based on experimental study, we consider a high similarity to be greater than 80%. Further, to guarantee the distinctiveness of malware behaviors, we
against the database, we encode its API package combination into         compare these malware graphs against our benign graph set and
a bitvector and compare that bitvector against each database index.      remove the common ones.
Notice that, to ensure the scalability, we implement the bucket-            Next, given an app, we generate its feature vector for classifi-
based indexing with a hash map where the key is the API package          cation purpose. In such a vector, each element is associated with
bitvector and the value is a corresponding graph set.                    a graph in our database. And, in turn, all the existing graphs are
   Empirically, we found this one-level indexing efficient enough        projected to a feature vector. In other words, there exists a one-to-
for our problem. If the database grows much larger, we can tran-         one correspondence between the elements in a feature vector and
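The bucket-based indexing of Section 4.4 can be sketched in a few lines. The sketch below is illustrative only: the graph representation (a bare set of API names), the `CRITICAL_PACKAGES` list, and the `GraphDB` class are hypothetical stand-ins for the paper's implementation, not its actual code.

```python
# Sketch of package-bitvector bucket indexing (illustrative, not DroidSIFT's code).
# A behavior graph is reduced here to the set of fully qualified APIs it calls.

# Hypothetical critical-package list; the real system derives criticality from weights.
CRITICAL_PACKAGES = [
    "android.telephony.SmsManager",
    "android.telephony.TelephonyManager",
]

def package_of(api):
    # "android.telephony.SmsManager.sendTextMessage()" -> "android.telephony.SmsManager"
    return api.rsplit(".", 1)[0]

def bitvector(graph_apis):
    # Encode which critical packages the graph touches as an integer bitmask.
    bits = 0
    for i, pkg in enumerate(CRITICAL_PACKAGES):
        if any(package_of(a) == pkg for a in graph_apis):
            bits |= 1 << i
    return bits

class GraphDB:
    # Hash map: package bitvector -> bucket (list) of graphs sharing that combination.
    def __init__(self):
        self.buckets = {}

    def insert(self, graph_apis):
        self.buckets.setdefault(bitvector(graph_apis), []).append(graph_apis)

    def candidates(self, graph_apis):
        # The expensive weighted-graph comparison is limited to this one bucket.
        return self.buckets.get(bitvector(graph_apis), [])
```

A query graph that calls sendTextMessage() plus non-critical HTTP APIs still maps to the SmsManager-only bucket, mirroring the Zitmo discussion above: non-critical I/O packages do not perturb the index.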
                 G1    G2    G3    G4    G5    G6    G7    G8   ...  G861  G862
   ADRD          0     0     0     0     0     0.8   0.9   0    ...  0     0
   DroidDream    0.9   0     0     0     0.8   0.7   0.7   0    ...  0     0
   DroidKungFu   0     0.7   0     0     0.6   0     0.6   0    ...  0     0.9

               Figure 8: An Example of Feature Vectors
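The feature vectors of Figure 8 can be sketched in code: each of an app's behavior graphs contributes its similarity score at the position of its best-matching database graph, and all other positions stay zero. This is a simplified illustration, not the paper's implementation: graphs are reduced to API sets and the weighted-graph similarity is replaced by a plain Jaccard score, and all names are hypothetical.

```python
# Illustrative feature-vector construction (not DroidSIFT's implementation).
# A behavior graph is reduced to a set of API names; the paper's weighted graph
# similarity is replaced by a plain Jaccard score for brevity.

def jaccard(g1, g2):
    # Similarity of two graphs represented as API sets, in [0, 1].
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0

def feature_vector(app_graphs, db_graphs):
    # One vector position per database graph; each app graph puts its similarity
    # score at the position of its best-matching database graph.
    vec = [0.0] * len(db_graphs)
    for g in app_graphs:
        scores = [jaccard(g, d) for d in db_graphs]
        best = max(range(len(db_graphs)), key=scores.__getitem__)
        vec[best] = max(vec[best], scores[best])
    return vec
```

With a two-graph database, an app whose first graph exactly matches database graph 1 and whose second graph partially overlaps database graph 2 yields a vector like [1.0, 0.33]; attaching the app's family label to such vectors gives training rows of the kind shown in Figure 8.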
To construct the feature vector of a given app, we produce its
WC-ADGs and then query the graph database for all the generated
graphs. For each query, a best matching graph is found. The
similarity score is then put into the feature vector at the position
corresponding to this best matching graph. Specifically, the feature
vector of a known malware sample is attached with its family label
so that the classifier can learn the discrepancy between different
malware families.
   Figure 8 gives an example of feature vectors. In our malware
graph database of 862 graphs, a feature vector of 862 elements is
constructed for each app. The two behavior graphs of ADRD are
most similar to graphs G6 and G7, respectively, from the database.
The corresponding elements of the feature vector are set to the sim-
ilarity scores of those matches. The rest of the elements remain set
to zero.
   Once we have produced the feature vectors for the training sam-
ples, we can use them to train a classifier. We select the Naïve
Bayes algorithm for classification. In fact, we could choose dif-
ferent algorithms for the same purpose; however, since our graph-
based features are fairly strong, even Naïve Bayes produces sat-
isfying results. Naïve Bayes also has several advantages: it re-
quires only a small amount of training data; parameter adjustment
is straightforward; and runtime performance is favorable.

5.    EVALUATION

5.1    Dataset & Experiment Setup
   We collected 2200 malware samples from the Android Malware
Genome Project [1] and McAfee, Inc., a leading antivirus company.
To build a benign dataset, we received a number of benign samples
from McAfee, and we downloaded a variety of popular, highly
ranked apps from Google Play. To further sanitize this benign
dataset, we sent these apps to the VirusTotal service for inspection.
The final benign dataset consisted of 13500 samples. We performed
the behavior graph generation, graph database creation, graph sim-
ilarity query, and feature vector extraction using this dataset. We
conducted the experiments on a test machine equipped with an Intel(R)
Xeon(R) E5-2650 CPU (20M cache, 2GHz) and 128GB of physi-
cal memory. The operating system is Ubuntu 12.04.3 (64-bit).

5.2    Summary of Graph Generation
   Figure 9 summarizes the characteristics of the behavior graphs
generated from both benign and malicious apps. Among them, Fig-
ures 9a and 9b illustrate the number of graphs generated from
benign and malicious apps. On average, 7.8 graphs are computed
from each benign app, while 9.8 graphs are generated from each
malware instance. Most apps focus on limited functionalities and
do not produce a large number of behavior graphs. In 92% of be-
nign samples and 98% of malicious ones, no more than 20 graphs
are produced from an individual app.
   Figures 9c and 9d present the number of nodes of benign
and malicious behavior graphs. A benign graph, on average, has
15 nodes, while a malicious graph carries 16.4. Again, most of
the activities are not intensive, so the majority of these graphs have
a small number of nodes. Statistics show that 94% of the benign
graphs and 91% of the malicious ones carry fewer than 50 nodes.

[Figure 9: Graph Generation Summary. Four histograms: (a) Graphs
per Benign App; (b) Graphs per Malware; (c) Nodes per Benign
Graph; (d) Nodes per Malware Graph.]

These facts serve as the basic requirements for the scalability of our
approach, since the runtime performance of graph matching and
query is largely dependent upon the number of nodes and graphs,
respectively.

5.3    Classification Results

Signature Detection.
   We use multi-label classification to identify the malware fam-
ilies of unrecognized malicious samples. Therefore, we expect to
include only those malware behavior graphs that are well labeled
with family information in the database. To this end, we rely on
the malware samples from the Android Malware Genome Project
and use them to construct the malware graph database. Conse-
quently, we built a database of 862 unique behavior graphs,
with each graph labeled with a specific malware family.
   We then selected 1050 malware samples from the Android Mal-
ware Genome Project and used them as a training set. Next, we
would like to collect testing samples from the rest of our collec-
tion. However, a majority of the malware samples from McAfee are
not strongly labeled. Over 90% of the samples are coarsely labeled
as “Trojan” or “Downloader” but in fact belong to a specific mal-
ware family (e.g., DroidDream). Moreover, even VirusTotal can-
not provide reliable malware family information for a given sam-
ple, because the antivirus products used by VirusTotal seldom reach
a consensus. This fact tells us two things: 1) it is a non-trivial
task to collect evident samples as ground truth in the context of
multi-label classification; 2) multi-label malware detection or clas-
sification is, in general, a challenging real-world problem.
   Despite the difficulty, we obtained 193 samples, each of which
is detected as the same malware by the major AVs. We then used those
samples as testing data. The experimental result shows that our clas-
sifier can correctly label 93% of these malware instances.
   Among the successfully labeled malware samples are two types
of Zitmo variants. One uses HTTP for communication, and the
other uses SMS. While the former is present in our malware
database, the latter was not. Nevertheless, our signature de-
tector is still able to capture this variant. This indicates that our
similarity metrics effectively tolerate variations in behavior graphs.
   We further examined the 7% of the samples that were misla-
beled. It turns out that the mislabeled cases fall roughly into
two categories. First, DroidDream samples are labeled as Droid-
KungFu. DroidDream and DroidKungFu share multiple malicious
behaviors, such as gathering privacy-related information and hidden
network I/O. Consequently, there exists a significant overlap be-
tween their WC-ADGs. Second, Zitmo, Zsone, and YZHC instances
are labeled as one another. These three families are SMS Trojans.
Though their behaviors are slightly different from each other,
they all exploit sendTextMessage() to deliver the user’s informa-
tion to an attacker-specified phone number. Despite the mislabeled
cases, we still manage to successfully label 93% of the malware
samples with a Naïve Bayes classifier. Applying a more advanced
classification algorithm would further improve the accuracy.

Anomaly Detection.
   Since we wish to perform anomaly detection using our benign
graph database, the coverage of this database is essential. In the-
ory, the more benign apps the database collects, the more be-
nign behaviors it covers. However, in practice, it is extremely diffi-
cult to retrieve benign apps exhaustively. Luckily, different benign
apps may share the same behaviors. Therefore, we can focus on
unique behaviors (rather than unique apps). Moreover, as more
and more apps are fed into the benign database, the database size
grows more and more slowly. Figure 10 depicts our discovery. When
the number of apps increases from 3000 to 4000, there is a sharp
increase (2087) in unique graphs. However, when the number of
apps grows from 10000 to 11000, only 220 new unique graphs are
generated, and the curve begins to flatten.

[Figure 10: Convergence of Unique Graphs in Benign Apps (number
of unique graphs vs. number of benign apps).]

   We built a database of 10420 unique graphs from 11400 benign
apps. Then, we tested 2200 malware samples against the benign
classifier. The false negative rate was 2%, which indicates that
42 malware instances were not detected. However, we noted that
most of the missed samples are exploits or Downloaders. In these
cases, their bytecode programs do not bear significant API-level
behaviors, and therefore the generated WC-ADGs do not necessarily
look abnormal when compared to benign ones. At this point, we
have only considered the presence of constant parameters in an API
call; we did not further differentiate API behaviors based upon
constant values. Therefore, we cannot distinguish the behaviors
of Runtime.exec() calls or network I/O APIs with varied string
inputs. Nevertheless, if we create a custom filter for these string
constants, we will be able to identify these malware samples,
and the false negative rate will drop to 0.
   Next, we used the remaining 2100 benign apps as test samples to
evaluate the false positive rate of our anomaly detector. The result
shows that 5.15% of clean apps are mistakenly recognized as sus-
picious during anomaly detection. This means that if our anomaly
detector were applied to Google Play, then among the approximately
1200 new apps per day [4], around 60 apps would be mislabeled as
containing anomalies and be bounced back to the developers. We
believe that this is an acceptable ratio for vetting purposes. More-
over, since we do not reject the suspicious apps immediately, but
rather ask the developers for justifications, we can further eliminate
these false positives during this interactive process. In addition, as
we add more benign samples into the dataset, the false positive rate
will further decrease.
   Further, we would like to evaluate our anomaly detector with
malicious samples from new malware families. To this end, we
retrieved a new piece of malware called Android.HeHe, which was
first found and reported in January 2014 [12]. Android.HeHe ex-
ercises a variety of malicious functionalities, such as SMS inter-
ception, information stealing, and command-and-control. This new
malware family does not appear in the samples from the Android
Malware Genome Project, which were collected from August 2010
to October 2011, and therefore cannot be labeled via signature de-
tection. DroidSIFT generates 49 WC-ADGs for this sample and,
once these graphs are presented to the anomaly detector, a warn-
ing is raised indicating the abnormal behaviors expressed by these
graphs.

Detection of Transformation Attacks.
   We collected 23 DroidDream samples, which are all intention-
ally obfuscated using a transformation technique [28], and 2 be-
nign apps that are deliberately disguised as malware instances by
applying the same technique. We ran these samples through our
anomaly detection engine and then sent the detected abnormal ones
through the signature detector. The result shows that while the 23
true malware instances are flagged as abnormal in anomaly de-
tection, the 2 clean ones correctly pass the detection without
raising any warnings. We then compared our signature detection
results with antivirus products. To obtain the detection results of an-
tivirus software, we sent these samples to VirusTotal and selected the
10 anti-virus (AV) products (i.e., AegisLab, F-Prot, ESET-NOD32,
DrWeb, AntiVir, CAT-QuickHeal, Sophos, F-Secure, Avast, and
Ad-Aware) that bear the highest detection rates. Notice that we
consider the detection to be successful only if the AV can correctly
flag a piece of malware as DroidDream or its variant. In fact, in our
observation, many AV products can provide partial detection re-
sults based upon the native exploit code included in the app pack-
age or common network I/O behaviors. As a result, they usually
recognize these DroidDream samples as “exploits” or “Download-
ers” while missing many other important malicious behaviors. Fig-
ure 11 presents the detection ratios for “DroidDream” across dif-
ferent detectors. While none of the antivirus products can achieve
a detection rate higher than 61%, DroidSIFT successfully flags
all the obfuscated samples as DroidDream instances. In addition,
we also notice that AV2 produces a relatively high detection ra-
tio (52.17%), but it also mistakenly flags the two clean samples
as malicious apps. Since the disguising technique simply renames
the benign app package to one commonly used by DroidDream
(and thus confuses this AV detector), such false positives again show
that external symptoms are not robust and reliable features for
malware detection.

[Figure 11: Detection Ratio for Obfuscated Malware (true and false
positives per detector).]

5.4    Runtime Performance
   Figure 12 illustrates the runtime performance of DroidSIFT. Specif-
ically, it demonstrates the cumulative time consumption of graph