SkillVet: Automated Traceability Analysis of Amazon Alexa Skills - arXiv

Page created by Timothy Carter
 
CONTINUE READING
SkillVet: Automated Traceability Analysis of Amazon Alexa Skills - arXiv
1

                                                SkillVet: Automated Traceability Analysis of
                                                            Amazon Alexa Skills
                                                                 Jide S Edu, Xavier Ferrer-Aran, Jose M Such, Guillermo Suarez-Tangil

                                             Abstract—Third-party software, or skills, are essential components in Smart Personal Assistants (SPA). The number of skills has
                                             grown rapidly, dominated by a changing environment that has no clear business model. Skills can access personal information and this
                                             may pose a risk to users. However, there is little information about how this ecosystem works, let alone the tools that can facilitate its
                                             study. In this paper, we present the largest systematic measurement of the Amazon Alexa skill ecosystem to date. We study
                                             developers’ practices in this ecosystem, including how they collect and justify the need for sensitive information, by designing a
                                             methodology to identify over-privileged skills with broken privacy policies. We collect 199,295 Alexa skills and uncover that around 43%
arXiv:2103.02637v1 [cs.CR] 3 Mar 2021

                                             of the skills (and 50% of the developers) that request these permissions follow bad privacy practices, including (partially) broken data
                                             permissions traceability. In order to perform this kind of analysis at scale, we present SkillVet that leverages machine learning and
                                             natural language processing techniques, and generates high-accuracy prediction sets. We report a number of concerning practices,
                                             including how developers can bypass Alexa’s permission system through account linking and conversational skills, and offer
                                             recommendations on how to improve transparency, privacy and security. Resulting from the responsible disclosure we have conducted,
                                             13% of the reported issues no longer pose a threat at submission time.

                                             Index Terms—Voice Assistants, Smart Personal Assistants, Alexa, Smart Home, Privacy, Permission, Personal Data, Access Control.

                                                                                                                    F

                                        1    I NTRODUCTION                                                              behaviors [16]. Authors in [17] have very recently shown
                                        Smart Personal Assistants (SPA) are changing how users                          that the Amazon skill certification process can be subverted
                                        interact with technology and the way they consume ser-                          by crafting mock policy-violating skills. However, there is a
                                        vices [1], [2], driving new experiences and expectations [3].                   critical knowledge gap in understanding what are the actual
                                        Advances in natural language processing enable SPA to take                      data collection practices of skill developers in the wild. This
                                        voice commands, which they process using machine learn-                         gap prompts four core research questions: 1) What is the
                                        ing to deduce users’ intentions and fulfill users’ requests.                    current state of affairs in the third-party ecosystem of skills? 2)
                                        SPA are becoming very popular. According to recent reports,                     Is the collection of personal information explained? 3) Can we
                                        nearly 90 million adults in the US use at least one SPA                         pinpoint fine-grained privacy issues (i.e., at the permission level)?
                                        device [4]. The adoption is expected to rocket even further                     4) How can we do this in an automated way at scale?
                                        as SPA integrate into smartphones [5] or as a chatbot [6].                          In this paper, we perform the first systematic mea-
                                            Despite their growing popularity, intrinsic features of                     surement study of Amazon Alexa skills across the entire
                                        SPA expose users to various risks [7], [8], including 1) the                    marketplace of 11 different countries to shed light on the
                                        open nature of the voice channel, 2) the complexity of their                    third-party skills ecosystem and developers’ practices. We
                                        architecture, and 3) the use of a wide range of underlying                      characterize skills using their market categories, names,
                                        technologies. Although considerable research has been re-                       and developers, among others. We also characterize the
                                        cently conducted to understand and address these risks,                         data permissions they request during runtime and their
                                        many of them remain unresolved [7]. In particular, much                         traceability with data practice statements in the skill’s pri-
                                        of the research has focused on architectural elements of SPA                    vacy policy. Since it is time-consuming to analyze privacy
                                        devices (e.g., Amazon Echo) [9], [10], [11] and the speech                      policies manually, we also propose an automated machine-
                                        recognition capabilities of SPA [8], [12]. However, SPA in-                     learning-based system, SkillVet, to help with this at scale.
                                        corporate voice-driven applications generally developed by                      SkillVet identifies relevant policy statements and assesses
                                        third parties, so-called skills. The skills ecosystem widens the                the traceability as complete, partial, or broken. Our system
                                        attack surface of SPA [7], as malicious third-party actors may                  overcomes challenges in modeling privacy policies, such
                                        develop potentially harmful or unwanted software [13],                          as contradictions like negations and statements related to
                                        [14]. Unfortunately, it is unclear what are the risks third-                    multiple types of personal information.
                                        party skills may pose to users, beyond the potential for skill                      We then analyze suspicious skills in the look for concern-
                                        impersonation [14], [15] and other potentially suspicious                       ing practices. This is done following a black-box approach
                                                                                                                        since the skills’ code or executable is not available — skills
                                                                                                                        run in the cloud instead of the users’ device, e.g., as an AWS
                                        •   J. Edu, X.Ferrer Aran, J.Such was with the Department of Informatics,
                                            King’s College London.                                                      Lambda function [18] or in a server controlled by the skill
                                            E-mail: {jide.edu, xavier.ferrer aran,jose.such}@kcl.ac.uk.                 developer [19].
                                        •   G.Suarez-Tangil was with IMDEA Networks Institute.                              Our study makes the following key contributions:
                                            E-mail: guillermo.suarez-tangil@ imdea.org
                                                                                                                          1) We perform the largest systematic study of the Alexa
SkillVet: Automated Traceability Analysis of Amazon Alexa Skills - arXiv
skill ecosystem across 11 Amazon marketplaces (§2).              run in the Amazon cloud (e.g. as an AWS Lambda function),
    We take a snapshot of all markets and measure the                or in a server controlled by the developer [19]. Therefore, the
    distribution of skills, developers’ diversity, and the           source code of skills is not available and instrumentation
    invocation names used for 199k skills.                           beyond voice interaction extremely difficult. A skill can
 2) We study the permission model Amazon offers to de-               simply be activated with an invocation utterance, i.e.: the
    velopers, and investigate how skills make use of per-            trigger phrase and the skill invocation name. For example,
    missions to collect personal data (§3). We then design           to invoke “My Chef” skill, a user can say ‘Alexa, open My
    a methodology that systematically identifies potential           Chef’, or ‘Alexa, start My Chef’. “Alexa, open” and “Alexa,
    privacy issues by analyzing the traceability between             start” are trigger phrases to put the SPA into recording
    the permissions and the data practices stated by devel-          mode. When a skill is successfully invoked, specific phrases
    opers. We find a wealth of over-privileged skills with           can then be uttered to call intents (functions) within the
    broken traceability.                                             skill. Note that, however, in some cases, a skill needs to be
 3) We design a system to automate the traceability analy-           enabled before it can be invoked, for instance, when the skill
    sis at scale, SkillVet, based on machine learning and            requires permissions. Skills can be enabled via compatible
    natural language processing (§4). Our system can be              apps or online at the skill store.
    deployed by SPA providers to automatically conduct
    traceability analysis across marketplaces regularly and
                                                                     2.2   Crawling Amazon Alexa
    by developers to check the traceability of their skills.
    We demonstrate SkillVet’s accurate performance.                  Amazon lists the top most popular skills organized by
 4) We provide insights regarding the developers’ busi-              category (and sub-category) up until a maximum of 400
    ness models and concerning practices we observe (§5).            pages per category (23 categories and 66 subcategories in
    These practices include skills bypassing the permission          total). However, the skill marketplace is not entirely listed
    system of Alexa to collect personal data from users              in the Amazon Alexa index. Thus, crawling this index alone
    by: a) account linking, and b) conversational skills.            does not make the data collection exhaustive.
    We provide evidence of skills (and their traceability)               We overcome this limitation by looking into sub-
    conducting these practices and discuss the implications          categories for each category per marketplace. Every subcat-
    of our findings (§6).                                            egory across marketplaces returns less than 400 pages, so
To foster research in the area, we release our datasets to-          we can crawl all the skills within them. There are only two
gether with the ground-truth we have produced and our                subcategories with more than 400 pages, both in the US (out
code in https://github.com/jideedu/SkillVet                          of a total of 66 subcategories): Knowledge & Trivia (from
                                                                     the Games & Trivia category), and Education & Reference
                                                                     (from the same category). For these two specific cases, we
2     T HE S KILL E COSYSTEM                                         then use a best effort approach and re-crawl skills in those
In this section, we present our characterization of all Alexa        subcategories with a different sorting of the skill listing (i.e.,
skills across 11 markets.                                            ordering by average costumer review in addition to the default
                                                                     one — featured). Thus, we are confident that we capture
                                                                     most, if not all, of the market space, as we collect all the
2.1   SPA Architecture & Skills                                      skills available across all 11 marketplaces but the US. In the
SPA devices such as Amazon Echo are equipped with a                  US, we collect more skills than the sum of all approximate
microphone and a voice interpreter that records utterances.          numbers Amazon gives per subcategory,1 which is ≈57,000
Voice interpreters are often pre-activated and run in the            at crawling time (see below for the date), and we are able to
background, where they wait for the users to say the wake-           crawl 61,362 skills from the US market, as detailed later.
up word [20]. Once they identify the wake-up word, the SPA
device puts itself into recording mode. In recording mode,           2.3   Collection and Characterization
utterances are sent to the SPA cloud [21], where they are
processed using Machine Learning (ML) and Natural Lan-               Amazon uses different online marketplaces to offer targeted
guage Processing (NLP) to deduce the user’s intention. Once          services to several market segments. Amazon has 14 online
the intention is identified, the SPA delegates the request to        marketplaces, out of which 11 have an online store with
the most suitable voice application — called skill in Amazon         third-party skills. We crawl these 11 markets: US, UK, India,
Alexa — that matches the user intention. The answers or              Australia, Canada, Germany, Japan, Italy, Spain, France, and
information generated by the skill, as a response to the user        Mexico. Skills are only available to registered users and a
request, will then be given back to the users by the SPA.            user can only be linked with one marketplace at a time.
Skills are, in a way, similar to mobile apps, as they enhance            For all the skills we collect, we make a characterization
the capability of the ecosystem they are part of. They extend        of the market category they belong to, the utterances that
SPA functionality, allowing them to offer extensive services         activate the skill, and the developer that has created it. Two
that are tailored to users’ interests and experiences. The SPA       further key elements we extract are i) the permissions that
skill ecosystem provides an environment that offers the user         the skill requires to access personal information from the
the ability to run functions such as music playback, online          Amazon user’s account via the Alexa API at runtime, which
purchase, appointment booking, and different smart home
                                                                       1. Note that Amazon does not give the exact number of skills per
automation tasks. Unlike mobile apps, however, skills do             category/subcategory when they contain 1,000 skills or more, but it
not run on any user-controlled device. Instead, skills either        rounds it down to the closest number in units of thousands.

                                                                 2
Amazon makes publicly available in the marketplace; and ii)                                          TABLE 1
the privacy policy declared by the developers. We then use               Number of skills & developers. English-speaking markets represent
                                                                                          82.74% (92k skills and 36k devs).
these two elements, together with an interactive analysis,
to study when there is a privacy violation or a concerning                                                                Skills               Developers
practice that may harm users.                                                  Marketplace
                                                                                                                      N        Percent        N      Percent
    Unlike other platforms like Android, Amazon Alexa                          US                                  61,362      30.79%      27,030    30.58%
runs the skills in the cloud and the code of the skills is not                 UK                                  32,822      16.47%      14,487    16.39%
                                                                               India                               29,344      14.72%      12,823    14.51%
publicly available. To analyze the ecosystem of skills, we built               Canada                              26,027      13.06%      11,509    13.02%
a Web scrapper that recursively crawls all available mar-                      Australia                           23,909      12.00%      10,854    12.28%
ketplaces and extracts the metadata published by Amazon                        Germany                              9,096       4.56%       3,385     3.83%
about the skills. Our crawler exhaustively collects all skills                 Spain                                4,759       2.39%       2,441     2.76%
                                                                               Italy                                4,203       2.11%       2,049     2.37%
available in the Amazon Alexa marketplaces (for details                        Japan                                3,513       1.76%       1,407     1.59%
refer to Appendix 2.2). Note that one skill can belong to                      France                               2,288       1.15%       1,194     1.35%
different categories and be hosted in multiple marketplaces.                   Mexico                               1,972       0.99%       1,212     1.37%
                                                                                       Total                      199,295 100.00%          88,391 100.00%
In these cases, the URL may have different parameters but                            Unique                       112,029       56.2%      47,520    53.76%
the path displays a unique identifier per skill. We identify
unique skills posted across markets in a process we call de-
duplication.                                                           name or the URL of the framework that produced the skill.
    We use the following attributes (which we scrape from              Popular among them are VoiceApps.com, VoiceXP.com and
the marketplace) to characterize every skill: invocation               VoiceFlow.com. These platforms provide free sample skills
name, permissions, the category, developer’s information,              that a developer can customize regardless of their technical
privacy policy, terms of use, skill description, rating infor-         background. In particular, they offer visual, drag-and-drop
mation and reviews. While new skills are added daily, we               editors that lowers down any development barriers. We
base our study on those skills accessible to users between             mine all descriptions in our dataset and see that at least 7%
the 15th of July 2020 and the 29th of July 2020. This is the           of the skills have been developed with these platforms. In
time-frame that took our crawler to visit all marketplaces.            particular, we observe a ratio of 1 developer to 30 skills with
Since our data collection requires issuing a request to web            the most100000
                                                                                 popular automated platform (VoiceApps.com).
services, we avoid generating unnecessary traffic that may
affect their normal operations and limit the rate at which we
generate our requests.                                                                           10000
                                                                                Number of Developers

                                                                                                                                                     UK
                                                                                                                                                     JP
2.4   Skills and Developers                                                                            1000
                                                                                                                                                     AU
                                                                                                                                                     IN
Table 1 shows the breakdown of the total number of Alexa                                                                                             CA

skills and developers across Amazon marketplaces. Overall,                                                                                           DE
                                                                                                                                                     IT
                                                                                                        100
we see 199,295 skills published by 88,391 developers. Note                                                                                           MX
                                                                                                                                                     ES
that most of these skills have been developed in just a                                                                                              FR
few years.2 English-speaking marketplaces host the highest                                               10
                                                                                                                                                     US

number of skills and developers. On the other end, the
smallest market is Mexico with 1,972 skills and representing
1% of our dataset. However, there is overlap across mar-                                                  1
                                                                                                              1         10               100          1000     10000
kets as developers can publish the same skills in different                                                                    Number of Skills
markets [23]. This overlap is larger among English-speaking            Fig. 1. Developers vs the number of skills they publish per marketplace.
markets such as the US, UK, CA, and AU where Amazon
tends to migrate existing skills [24]. In particular, we observe
that close to 20,000 different skills in the UK market are also        2.5   Invocation Name Reuse
in the US market. When we de-duplicate skills, we observe
that 53.76% (47,520) of the developers and 56.2% (112,029)             There are 99,521 skill invocation names out of all 112,029
of the skills we collect are unique. In what follows we look           unique skills in our dataset. Figure 2 shows a scatter plot
at unique skills unless we analyze specific markets.                   with the aggregate of invocation names per marketplace.
Developers. We illustrate the relationship between develop-            In particular, the figure shows the total number of skill
ers and skills per marketplace in Figure 1. While most of the          invocation names (Y axis) w.r.t. how many other skills have
developers publish only one skill, a few of them have tens,            the same invocation name (X axis). We see that the number
and a handful have hundreds and even thousands of skills.              of skills with unique invocation names (X = 0) ranges from
In particular, 125 developers have published more than 50              1,914 (Mexico) to 54,486 (US) and totals 66,087 after cross-
skills, 69 developers more than 100 skills, and 5 developers           market skill deduplication. Recall that we only consider
more than 1K skills. We then study the description of all the          unique skills when aggregating results across marketplaces.
skills published by prominent developers and see that many             When we look at X > 0, we see 24,468 skills with in-
implicitly indicate that the skill has been developed with an          vocation name reuse (10,480 developers reuse 24,468 skill
automated platform. This is usually done by referring to the           invocation names). As values of X increase, we see more
                                                                       popular names. An important number of skills share the
  2. Alexa had only 135 skills in early 2016 [22].                     same invocation name in different markets. We observe that

                                                                   3
India predominantly appears in cases with large values of                                                                                      180

                                                                                                                        NUMBER OF OCCURENCES
                                                                                                                                               160        AU     CA    IN     UK   US
X , with about 34% of their invocation names being reused.                                                                                     140
                                                                                                                                               120
                                                                                                                                               100
                                10000
                                                                                                                                                80
                                                                       IN     AU                                                                60
                                                                       UK     JP                                                                40
                                 1000                                  CA     DE                                                                20
                                                                       IT     MX                                                                 0
        Skill Invoca on Names

                                                                       ES     FR
                                                                       US
                                  100

                                   10                                                                                                                SKILL INVOCATION NAMES

                                                                                                            Fig. 3. Top 10 skills invocation names (English-speaking).
                                    1
                                        0   5   10   15      20   25   30     35    40   45   50   55

                                                          Number of Skills                                  are. To do this, we use the Levenshtein edit distance [26]
                                                                                                            to compute the similarity between skill invocation names.
Fig. 2. Number of skills with the same name across Alexa marketplaces.
                                                                                                            We also consider the phonetic similarity between invocation
Most of the skills use a unique invocation name, but a high proportion of
skills use the same invocation name.                                                                        names [27], i.e., the phonetic transcription of an invocation
    Next, we study English-Speaking Markets (ESM) alone.                                                    name. This is because both textual and phonetic invocation
Table 2 shows the number of skills that reuse names by                                                      name similarity may be leveraged to exploit the speech
market category in ESM. Games and Trivia turn out                                                           recognition capabilities of SPA for an impersonation attack,
to be the category with the highest number of reused                                                        known as skill squatting [14], [15]. We use the open-
skill invocation names. We also observe 968 skills with                                                     source Carnegie Mellon University Pronouncing Dictionary
the same skill invocation name and developer name in                                                        (CMUdict) to get the phonetic transcription of words.3 For
a single market (IN-280, CA-121, DE-36, UK-192, ES-                                                         every skill in every English-speaking market, we lowercase
14, FR-8, US-279 , AU-38). One good example is the                                                          its invocation name, remove all non-ASCII characters and
“Africa Facts” in the UK market by Yasemin Woodward                                                         punctuation, and replace digits with their equivalent texts
with skill ids B07PGN7T9D, B07PHRMML4, B07PK4GNYC,                                                          (e.g., “1” becomes “one”). Afterwards, we use the CMU
B07PKVMYTB, B07PR67XTQ, and B07PFQ39Y6. This sug-                                                           dictionary to obtain the phonetic transcription of every
gests some developers might be attempting a form of sybil                                                   skill invocation name, ignoring those skill invocation names
attack [25], where they try to have multiple skills with the                                                without phonetic transcription. For every resulting skill, we
same invocation name (but different ID) to increase the                                                     estimate the shortest phonetic Levenshtein distance to any
chance for one of them to be selected.                                                                      other skill invocation name in the same market to iden-
                                                                                                            tify skills with similar sounding invocation names. In the
                                TABLE 2                                                                     process we factor in the dynamic lengths of different skills
      Number of reused skill invocation names per category in the                                           invocation name by normalizing the Levenshtein distance
                     English-speaking markets.                                                              by the length (in phonemes) of the longest skill invocation
                                                                                                            name compared.
 Category                                       Invocation names             %       Skills    %
 Games & Trivia                                 7,251                        30%     26,580    29%              Table 3 shows the quantity and percentage over the
 Education & Ref.                               4,250                        17%     13,863    15%          total number of skill invocation names we found with a
 Music & Audio                                  3,814                        16%     12,121    13%
 Lifestyle                                      1,697                        7%      6,074     7%           phonetic translation and a minimum edit distance when
 Smart Home                                     1,130                        5%      3,456     4%           compared to the other skills of ≤ 0.1 and ≤ 0.2 across the
 Health & Fitness                               977                          4%      3,505     4%
 Food & Drink                                   597                          2%      2,111     2%           five English-speaking marketplaces. We observe that in all
 Business & Finance                             783                          3%      2,494     3%           English-speaking marketplaces (except India) we have an
 Kids                                           683                          3%      2,356     3%
 Others                                         3,286                        13%     15,235    22%
                                                                                                            average of 6 to 8% skills invocation names with minimum
 Total                                          24,468                       100%    92,451    100%         Levenshtein distance ≤ 0.1, and around 26% with minimum
                                                                                                            distance ≤ 0.2. Most of the skills with lower phonetic
    Figure 3 shows the most popular skill invocation names                                                  Levenshtein distances ≤ 0.1 are plural versions of the orig-
in the 5 English-speaking marketplaces. “Cat Facts” is the                                                  inal skill, such as “Fun Fact Quiz” and “Fun Facts Quiz”
most popular skill invocation name used 171 times. Like-                                                    (India), or “Panda Facts” and “Panda Fact” (US). However,
wise, the word “Fact” is used by 1,979 developers appearing                                                 there are others, such as “Sweet Facts” and “Wheat Facts”
3,905 times across all marketplaces. We observe that many                                                   (US), which would have a higher edit distance without
of the fact skills are single interaction skills (they terminate                                            transcription but that when compared after transcription
their interaction after performing one task), out of which                                                  have a Levenshtein distance ≤ 0.1. In either case, we can
over 40% are developed using automated platforms. This                                                      see that there is a considerable amount of skill invocation
may be related to these platforms offering free fact skill                                                  names that are quite similar to each other across the five
templates as discussed above.                                                                               English-speaking markets. For instance, we see that in the
                                                                                                            US, 5% of skills (3,215) have a Levenshtein edit distance
                                                                                                            (after phonetic transcription) ≤ 0.1, and 15% of skills (7,332)
2.6   Invocation Name Similarity                                                                            have a distance ≤ 0.2%.
Because of the slight differences in invocation name, we
also sought to measure how similar skill invocation names                                                     3. https://github.com/cmusphinx/cmudict (version 0.7b).

                                                                                                        4
TABLE 3                                                                    TABLE 4
      skill invocation names with Levenshtein distance ≤ 0.2 across                     Permissions distribution by skills and marketplace.
                         English-speaking markets.
                                                                                                                      No of Unique Permissions
         Market           lev ≤ 0.1          lev ≤ 0.2    Total                    Skills           1                   2            3         4                ≥5
                                                                                                  N (%)               N (%)       N (%)      N (%)             N (%)
         US          3,830 (8.82%)    11,929 (27.47%)    43,410                 US (1,698)    1,169 (68.85)        301 (17.73)  143 (8.42) 51 (3.00)         34 (2.00)
         CA          1,514 (7.88%)     4,836 (25.17%)    19,209                  UK (581)      418 (71.94)         120 (20.65)   26 (4.48)  7 (1.20)         10 (1.72)
         AU            673 (6.86%)     4,439 (24.69%)    17,974                  CA (366)      262 (71.58)          74 (20.22)   14 (3.83)  6 (1.64)        10 (2.7.73)
                                                                                 IN (358)      250 (69.83)          66 (18.44)   20 (5.59)  7 (1.96)         15 (4.19)
         IN         3,048 (14.25%)     6,699 (31.31%)    21,389                  AU (348)      245 (70.40)          78 (22.41)   10 (2.87)  5 (1.44)         10 (2.87)
         UK          1,714 (7.80%)     5,321 (24.24%)    21,950                  DE (341)      236 (69.21)          83 (24.34)   12 (3.52)  7 (2.05)          3 (0.88)
                                                                                 JP (126)      100 (79.37)          21 (16.67)   4 (3.17)   1 (0.79)
                                                                                  ES (89)       56 (62.92)          24 (26.97)   6 (6.74)   3 (3.37)
2.7    Characterization Take-aways                                                IT (85)       58 (68.24)          21 (24.71)   5 (5.88)   1 (1.18)
                                                                                  FR (63)       44 (69.84)          15 (23.81)   4 (6.35)
Our market characterization provides a high-level overview                       MX (46)        30 (65.22)          13 (28.26)   3 (6.52)
                                                                                Total 4,101   2,868 (69.93)        816 (19.90)  247 (6.02) 88 (2.15)        82 (2.00)
of the skill ecosystem and how Alexa marketplaces are                           Uniq. 2,608   1,807 (69.29)        499 (19.13)  191 (7.32) 67 (2.57)        44 (1.69)
structured. In summary, the main takeaways include:
   • The skill ecosystem has grown considerably fast in
     recent years and it currently has few developers with                  generally used to offer services based on the user’s location.
     thousand of skills each. English-speaking markets gen-                 Figure 4 lists the 13 Alexa permissions together with the
     erally dominate and push skills to other markets.                      number of skills that use them per market.
   • We see skills developed with automated platforms. The
     use of these platforms lower down the development                                                 Timer
     barriers and enable bulk skill creation.                                                   Amazon Pay
   • We observe a high prevalence of skills with the same or                                     First Name
     similar name, with the associated potential for imper-                               Locaon Services
     sonation and sybil attacks as shown in previous works.                                Mobile Number
                                                                                         Lists Write Access
                                                                                          Lists Read Access
3     P ERMISSIONS AND T RACEABILITY
                                                                                                  Full Name
In this section, we first explore the data permissions in all                                    Reminders
Alexa marketplaces, analyzing which permissions are more                    Device Country and Postal Code
often requested by Alexa skills and developers among the                                      Email Address
different markets, and how they are distributed between                                 Alexa No caons
skills. Then, we look at skill privacy policies to understand                               Device Address
how developers disclose and justify the data permissions                                                           0        200           400    600        800         1000
they ask for. To do so, we collect and tag the largest                                                                            Number of Skills
traceability dataset for Alexa known to date, namely the                                      AU    CA        FR       DE    IN      IT     JP   MX    ES      UK         US
traceability-by-policy dataset (TBPD), which includes the                   Fig. 4. Permissions distribution across the markets.
traceability analysis of 1,758 skills requesting data per-
missions and their policies from the five English-speaking                      Overall, 2,608 (2.3%) skills across markets ask for data
marketplaces (Australia, UK, US, Canada, and India). Using                  access permissions. This could, in principle, be seen as good
the dataset, we analyze the skill traceability at the skill cat-            news. Still, 2.6K skills is a sizable set of third-party software
egory and developer dimensions and identify developers’                     accessing personal information, and the practices of their
practices.                                                                  developers require scrutiny. We also argue that skills may
                                                                            collect personal information using other means, i.e., through
                                                                            account linking or conversations, as discussed in §5.
3.1    Data Permissions
                                                                                Table 4 shows the permission distribution of skills per
Skills can request access to user information through the                   marketplace. As shown, most skills ask for one permission
Alexa skill API. This information becomes available on a                    (typically, the device address as discussed above), with all
per-skill and per-data-type basis when the user consents.                   marketplaces showing a similar pattern overall — albeit
To manage the consent: i) Alexa declares the permissions                    with slight differences between those markets with more
of skill on their store and gives users control through the                 or less number of skills. There is also an important minority
Web or the mobile app, and ii) skills have to inform users                  of skills asking for two or three permissions. Finally, there
in their privacy policies how they will use the personal in-                are few exceptions where skills ask for over three permis-
formation they collect. Alexa currently supports 11 types of                sions. The most noticeable one is a skill called “BILH Staff
permissions [28]. These are: Device Address (15.0%), Loca-                  Chatbot” by BIDMC (the Beth Leahy hospital). The skill is
tion Services (4.1%), Email Address (15.7%), Device Country                 available in the US , and it is requesting several unique
and Postal Code (13.7%), Reminder (9.1%), Customer Name                     permissions (i.e., Name, Email Address, Mobile Number,
(10.3%), List Read Access (5.3%), List write Access (4.9%),                 Reminders, Location Services, and Device Country and
Mobile Number (4.3%), Amazon Pay (1.8%), and Skill Per-                     Postal Code). The skill is a chatbot for employees to help
sonalization (0.0%).4 The most prevalent permissions are                    them fill out a COVID-19 staff daily health questionnaire.
                                                                            While the skill is not intended to diagnose users (according
  4. At the time of our analysis, we do not see any skills with Skill
Personalization yet. Also, our analysis reveals Notification (15.5%),       to the description), it asks for 13 different symptoms. The
which is now deprecated and replaced by Reminder and Timers (0.3%).         use of this skill poses an important privacy risk to employ-

                                                                        5
ees, because the Web form version of the questionnaire does            is partially traceable as the skills collect Device Address,
not ask for data such as postal code or location, so it is not         and this refers to the user’s full address, including the zip
clear why these are needed in Amazon.                                  code and the street number. Examples that provide partial
                                                                       information in their privacy policy include: 1) “Vote Sam
                                                                       Feldt” by Fanspoke that collects Mobile Number 2) “Flight
3.2   Traceability Analysis
                                                                       Level - An Aircraft Radar” by O. Schafer that collects Device
We look at privacy policies to understand how developers               Country and Postal Code.
disclose the permissions they request. Specifically, we study          Broken: If a skill requests for permissions but does not have
the traceability between data operations obvious to users              a privacy policy or the link to the privacy policy given
(recall users need to enable skills requesting permissions)            is not working, we consider the traceability broken. Even
and the data actions defined by developers in the skill                when providing a privacy policy, A skill is said to have
policies. Note that Amazon’s privacy requirements for skill            broken traceability if it provides no data implication in its
developers mandate that a skill must come with an adequate             privacy policy document. For instance, when a skill collects
privacy policy if it collects personal information. In particu-        the device address, and its privacy document states nothing
lar, any collection and use of personal information need to            related to device address, we mark such skill data practices
comply with what is stated in the privacy policy [29].                 disclosure as broken. We only mark a broken policy when
    To evaluate traceability between permissions and poli-             permissions are not traceable to the data practices defined
cies, we collect and tag what is the largest policy traceability       in the privacy document. In the UK marketplace, skills like
dataset for Alexa known to date, containing the traceability           “Daily Yoga” by Siva Pandeti collects the Device Country
analysis of more than 1,758 skills requesting data permis-             and Postal Code, “Bike Weather Man” by Marc Easen collect
sions in the five English-speaking marketplaces (Australia,            Device Country & Postal Code, and “Smoggy Alerts” by Teal
UK, US, Canada, and India). We develop a Selenium module               Dreams Software, Inc. collects the Device Country and Postal
in Python that automatically visits and downloads every                Code, and they all exhibit broken traceability.
skill’s privacy policy page, cleaning the HTML code to re-                 When analyzing traceability, a skill requesting “Device
move unnecessary markup code, normalizing punctuation,                 Country and Postal Code” permission is tagged to have
and extra white-spaces discarding non-ASCII characters,                adequately disclosed its data practices if “Device Address”
and finally converting text to lowercase (this is later ref-           is mentioned in the privacy policy. This is because “Device
ered as the pre-processing phase). Afterwards, each skill is           Address” refers to the user’s full address, including the
analyzed and evaluated as having broken, partial or complete           device country and postal code and the street number. Also,
traceability, following previous studies of traceability anal-         we did not find any skills using Skill Personalization, and
ysis in other domains [30], [31], [32], [33].                          grouped “List Write Access” and “List Read Access” under
    The type of traceability is identified by comparing the            the “Personal Information” category as acknowledging per-
permissions requested by the skill through the Amazon                  sonal information collection should be sufficient to disclose
Alexa API with the data practices covered in the skill policy.         these permissions. Finally, reminders/notifications are not
The different traceability types are explained next:                   explicitly considered in the privacy requirements for skills
Complete: A skill is said to offer complete traceability if it         by Amazon, as they may not have a personally identify-
provides adequate information in its privacy policy docu-              ing implication. Therefore, the traceability of all skills was
ment about its data practices, i.e., the data action defined           evaluated considering: “Location Services”, “Amazon Pay”,
in the privacy policy document can be completely mapped                “Mobile Number”, “EmailAddress”, “Name”, “Device Ad-
to the access of data permissions. For instance, a skill               dress”, “Device Country and Postal Code”, “Personal Infor-
like the “Aircraft Radar” in the UK market developed by                mation”.
Chris Dzombak offers complete traceability since it provides
adequate information about its data practices in its privacy
policy. The statement “aircraft radar uses your devices ad-            3.3   Traceability Results
dress to find your location and search for aircraft around             A total of 1,758 skills request data permissions in English-
you” can be mapped with the Device Address permission                  speaking marketplaces. Out of the 1,758 skills: 442 (25%)
the skill collects.                                                    have broken traceability, 306 (18%) have partial traceability,
Partial: This is when the transparency of data action de-              and 1,010 (57%) have complete traceability. When we ex-
fined in the privacy policy document maps partially to                 clude the complete ones, we see 43% of the skills displaying
the access permissions. For example, when the privacy                  bad privacy practices. Figure 5 shows the results by market,
document states that a skill collects personally identifiable          where we can see a very similar trend across marketplaces,
information without explicitly stating what this information           with UK appearing to have slightly more broken and less
is. A statement such as “we may require you to provide                 complete traceability results, but it also has the least number
us with certain personally identifiable information” offers            of skills and developers.
partial traceability since it does not explicitly state what           Traceability by Categories. To understand how traceability
information is collected. A skill is also said to offer partial        affects different type of skills, we first compare our results
traceability if not all its data permissions are covered in its        considering the 1,758 unique skills collected across different
privacy document. Partial disclosure of data practices may             skill subcategories and English-speaking marketplaces, as
also occur when data practices in a skill privacy document             shown in Table 5. We show the breakdown using subcate-
are not well mapped with the skill’s data permission. For              gories to give a more fine-grained perspective of particular
instance, the statement “we may also collect your zip code”            types of (or sector-specific) skills and associated traceability.

                                                                   6
gory. This subcategory has a high proportion of complete
                                                                                  traceability (44%), but it also has a very sizable proportion
                                                                                  of broken traceability skills (43%). For instance, in this
                                                                                  category, we find a very interesting case of “A Sales Guy”
                                                                                  by VoiceXP. This skill is supposed to provide information
                                                                                  about Keenan, that according to his website, is a person who
                                                                                  has been “selling something to someone for his entire life”.
                                                                                  Users that want to know about Keenan would have to give
                                                                                  up their mobile number, email address, full name, or device
                                                                                  address. The developer, VoiceXP, is a company that does so-
                                                                                  called voice domain registration services. We refer the reader to
Fig. 5. Traceability for skills requesting data permissions in English-
speaking markets.                                                                 §6 for further discussions on the type of concerning practices
                                                                                  used by skill developers.
We rank the different subcategories based on the number                               Regarding complete traceability, it is worth highlighting
of issues (broken and partial) normalized by the number of                        the bottom part of Table 5. Particularly, the Music & Audio
well defined policies (complete). Note that we do not list                        subcategory has the largest number of complete traceability
subcategories with less than 3 skills for the sake of clarity.                    skills (221), which is also a sizable proportion of skills within
                                                                                  the subcategory (69%). This is closely followed by Shopping,
                                TABLE 5                                           with many complete traceability skills (90) that make up
       Traceability per subcategory in English-speaking markets.                  71% of skills within that subcategory. Note that these two
    Categories                R          B             P             C
                                                                                  categories relate to industries with a larger tradition of
    Sports                     1   43 (91.49%)     1 (2.13%)     3 (6.38%)        offering services in the Web, where privacy has been in
    Social                     2   98 (85.96%)     7 (6.14%)     9 (7.89%)
    (Uncategorised)            3   56 (90.32%)     1 (1.61%)     5 (8.06%)        scrutiny for longer. Still, adding up the skills with broken
    Novelty & Humour           4    5 (71.43%)    2 (28.57%)     0 (0.00%)
    Kids                       5   10 (83.33%)     1 (8.33%)     1 (8.33%)
                                                                                  and partial traceability across the two subcategories gives a
    Games & Trivia             6   89 (57.42%)   34 (21.94%)   32 (20.65%)        total of 135 skills. Therefore, even the best ranking categories
    Schools                    7   3 (100.00%)     0 (0.00%)     0 (0.00%)
    Utilities                  8   17 (56.67%)    5 (16.67%)    8 (26.67%)        have a sizable number of broken/partial traceability skills.
    News                       9   30 (63.83%)     4 (8.51%)   13 (27.66%)
    Health & Fitness          10   60 (48.78%)   25 (20.33%)   38 (30.89%)            Finally, and rather interestingly, the Schools and Kids
    Organizers & Assistants   11    7 (63.64%)     1 (9.09%)    3 (27.27%)        subcategories are ranked in positions #7 and #5, and have
    Lifestyle                 12   74 (46.54%)   32 (20.13%)   53 (33.33%)
    Weather                   13   22 (41.51%)   13 (24.53%)   18 (33.96%)        100% and 0%, and 83% and 8% of their skills broken or
    Food & Drink              14   39 (42.86%)   20 (21.98%)   32 (35.16%)
    Streaming Services        15    4 (57.14%)    1 (14.29%)    2 (28.57%)        partially broken, respectively, for a total of 11 skills. This
    Productivity              16   42 (51.85%)    9 (11.11%)   30 (37.04%)
    Business & Finance        17   82 (52.23%)   16 (10.19%)   59 (37.58%)        is especially concerning as they may collect children’s per-
    Smart Home                18   44 (57.14%)     4 (5.19%)   29 (37.66%)        sonal information, as we discuss more thoroughly in §6.
    Travel & Transportation   19   28 (37.33%)   18 (24.00%)   29 (38.67%)
    Education & Reference     20   59 (43.70%)   16 (11.85%)   60 (44.44%)        Traceability by Permissions. To understand how traceabil-
    Movies & TV               21    5 (50.00%)    1 (10.00%)    4 (40.00%)
    Navigation & Trip         22    2 (40.00%)    1 (20.00%)    2 (40.00%)        ity varies across the different type of permissions, we also
    Connected Car             23    9 (39.13%)    3 (13.04%)   11 (47.83%)
    Public Transportation     24    2 (16.67%)    4 (33.33%)    6 (50.00%)        look at the traceability of skills per permission requested.
    Novelty & Humor           25    3 (25.00%)    3 (25.00%)    6 (50.00%)        Table 6 shows the distribution of traceability across different
    Home Services             26    5 (19.23%)    6 (23.08%)   15 (57.69%)
    Self Improvement          27    2 (50.00%)     0 (0.00%)    2 (50.00%)        permissions for the 1,758 analyzed skills. The permissions
    Wine & Beverages          28    1 (20.00%)    1 (20.00%)    3 (60.00%)
    Music & Audio             29   94 (29.38%)     5 (1.56%)   221 (69.06%)       are first grouped into Broken, Partial, Complete, with re-
    Shopping                  30   26 (20.63%)    10 (7.94%)   90 (71.43%)
    Calendars & Reminders     31    1 (14.29%)    1 (14.29%)    5 (71.43%)        spect to the policies of the skills where these permissions
    Pets & Animals            32    1 (33.33%)     0 (0.00%)    2 (66.67%)        are requested. A total of 2,616 permissions are requested
    Event Finders             33    1 (25.00%)     0 (0.00%)    3 (75.00%)
    Local                     34     0 (0.00%)    1 (25.00%)    3 (75.00%)        (622 by skills with broken traceability, 485 by skills with
    Knowledge & Trivia        35     1 (8.33%)     0 (0.00%)   11 (91.67%)
   B = Broken, P = Partial, C = Complete, R (Rank) ∼ (B+P)/(C+1)                  partial traceability (75 of these are traceable), and 1,509 by
                                                                                  skills that exhibit complete traceability). The most requested
    Sports has the largest proportion of skills with broken                       permission is the Device Address which is requested 648
traceability, totaling 91% of all the skills in the subcategory.                  times by 464 developers. While Amazon Pay is the least
For instance, the “Western States” skill by de Peck, Inc col-                     asked permission requested only 56 times by 33 developers,
lects the Device Address without any privacy document to                          it tends to be requested more by skills that have com-
disclose the data practices. Like many other subcategories,                       plete traceability. In contrast, Location Services permission
there are no skills with complete traceability in Novelty &                       requested by 90 unique developers is found more in skills
Humour. When looking at the combination of broken and                             that exhibit broken traceability.
partially broken traceability, Games & Trivia is also one of                      The Good, the Bad and the Ugly Developers. We next
the most problematic subcategories, particularly consider-                        study how many “good”, “bad” and “average” developers
ing the number of skills it has. For instance, we find “Pixated                   there are. Table 7 shows the number of developers per
Salat”, a prayer skill “to find Salah or Prayer time”. The skill                  type of traceability considering the 5 English marketplaces.
is developed by Pixated ltd, a strategic advertising group,                       Overall, we see a total aggregate across markets of 1,730
and asks for the device’s location. However, the link to the                      cases where developers request permissions in their skills,
policy provided is broken, and it links to the main site of the                   out of which 1,123 are unique developers that post skills
advertising group instead (http://pixated.agency).                                in several markets. When looking at unique developers, we
    It is also important to note that even in categories that                     see:
do not rank high, there are also skills with concerning data                      The Good. There are 566 developers with all their skills
collection practices, like the Education & Reference cate-                        showing complete traceability. All the skills developed by

                                                                              7
TABLE 6                                                                 Broken    Paral   Complete
 Distribution of traceability across different permissions for the 1,758
        analyzed skills in the 5 English-speaking marketplaces.
                                                                                          100%

 Permission          D         R     B            P            C                           80%
 Device Address      464      648    188 (29% )   130 (20% )   330 (51%)
 Device Country      330      569    106 (19% )   77 (14%)     386 (68% )                  60%
 Email Address       251      428    99 (23% )    77 (18%)     252 (59% )
 Personal Info.      144      324    67 (21% )    41 (13%)     216 (67% )                  40%
 Name                173      350    86 (25% )    82 (23%)     182 (52% )
                                                                                           20%
 Mobile Number       97       139    36 (26% )    37 (27%)     66 (47% )
 Location Services   90       102    35 (34% )    31 (30%)     36 (35% )
                                                                                            0%
 Amazon Pay          33        56    5 (9% )      10 (18%)     41 (73% )
                                                                                                    CA       US      UK        IN      AU
             Total   1,582   2,616   622 (24% )   485 (19% )   1,509 (58% )
           Unique    1,123   1,758   422 (25% )   306 (18% )   1,010 (57% )       Fig. 6. Traceability in skills reusing privacy policies (English-speaking
                                                                                  markets). Total: 1498 skills, broken: 196, partial: 358, complete: 944.
D = Developer, R = Requested, B = Broken, P = Partial, C = Complete,
                                                                                  privacy policies of others. Thus, we measure the prevalence
                              TABLE 7                                             of reused policies and study the impact this practice has
Developers’ disclosure practices (Broken, Partial or Complete) in all 5           on traceability by systematically mapping policy links to
                  English-speaking marketplaces.                                  skills in the same English-speaking marketplace. Figure 6
                                                                                  shows an overview of our results, where 175 privacy pol-
        Markets   Developers       B     P       C    P+C
           US          731        219   145     330   15
                                                                                  icy links are reused across marketplaces by 1,498 skills
          UK           402        197    41     146   7                           (with the following breakdown5 : CA-172, US-825, UK-180,
          CA           224         88    27     102   2                           IN-172 and AU-149) from 235 developers (CA-37, AU-26,
           IN          215        107    23      79   5                           IN-9, UK-39, US-124). Out of all the skills where we see
          AU           202         90    22      82   3
            Total     1,730       701   258     739   32                          reusing policies, only 29% offer a comprehensive policy
         Unique       1,123       350   207     566   21                          that adequately describes their practices, 44% show bro-
       D = Developer, B = Broken, P = Partial, C = Complete                       ken traceability between policy documentation and data
                                                                                  permissions, while 27% show partial traceability. For ex-
these developers have statements in their privacy policies                        ample, the skills “Curious City” and “Hits 94” developed
clearly stating and justifying the permissions they request.                      by Voxmatic.io both with data permission Device Address
This accounts for about 50% of the developers. For instance,                      have a privacy policy hosted in a domain that is down
developers such as GoVocal.AI, Blutag Inc, and Ixartz in                          (http://voxmatic.io/privacy), though both are fully func-
the US marketplace have complete traceability between the                         tional when spoken to.
permissions their skills ask and their privacy policies.                              Interestingly, although many of the skills with the same
                                                                                  policy are published by the same developer and collect the
The Bad. There are 350 developers with all their skills broken.
                                                                                  same permissions, there are cases where the skill collects
This accounts for about 31% of the developers. Skills do not
                                                                                  a different set of permissions (CA-111, AU-92, IN-109, UK-
offer an adequate explanation in general when we analyze
                                                                                  63, US-548). For example, in the Canada marketplace, the
the partial skills, their privacy statements, and we look
                                                                                  skills “Big Sky” and “I’m Driving” (both from developer
at their reviews. One example is Tagrem Corp in the US
                                                                                  Philosophical Creations and with partial traceabilities) use
market, with all the skills they developed exhibiting broken
                                                                                  the privacy policy link https://driving.big-sky-alexa.com/
traceability, as it does not acknowledge the collection of any
                                                                                  privacy. However, the first asks from Alexa Notifications
personal information in the skills’ privacy policy while the
                                                                                  and Device Address permissions while the second asks for
skills do request for permissions.
                                                                                  Email Address and Mobile Number. There are also some
The Ugly. We see 207 developers with all their skills with
                                                                                  cases where completely different developers still reused the
partial traceability. This accounts for about 18% of the
                                                                                  same privacy policy. We see 24 unique instances involving
developers. They appear to have a sloppy attitude when
                                                                                  54 different developers across all of the English speaking
writing privacy policies and informing users of how the
                                                                                  markets. For instance, in the US market, 4 developers,
personal information they request is used. For example,
                                                                                  Evezilla Ltd, Kavson Ltd, Rai Integration Ltd, and Mikk London,
the developer Geekycoders with over 30 skills in the US
                                                                                  use the same policy link https://www.starfishmint.com/
market have partial traceability in all of them. While the
                                                                                  policy/privacy.html.
developer requests for only one type of permission (First
                                                                                      Finally, another interesting observation is that while
Name), the statement “we do not collect your personal
                                                                                  some skills reuse broken policies and, therefore, automat-
information when you use any part of our products, unless
                                                                                  ically inherit the broken traceability, other skills reusing
the app specifically lists in any pre-download description”
                                                                                  policies have different traceability due to the different
only offers partial traceability since it does not explicitly
                                                                                  permissions the skill request. For instance, the “Master-
state what information is collected.
                                                                                  ing Python Networking Facts” skill by Network Automa-
 This shows that both average and bad developing prac-                            tion Nerds LLC, which asks for Device Country and Postal
 tices are commonplace and widespread in Alexa across                             Code, exhibits complete traceability with the privacy policy
 marketplaces, particularly in the US. Interventions for                          at “https://www.alexa.com/help/privacy”. However, the
 developers, instead of just for individual skills, may also                      “Stock Price” skill by JJ exhibits partial traceability when the
 be an effective way of vetting marketplaces.
                                                                                    5. The breakdown is read as follows: CA-172 means 172 skills using
Reused Policies. We observe that some skills are reusing the                      one of the policies in the Canadian market.

                                                                              8
we strive to have representative statements for all categories
         Sentence                    Policy Tags
                                                   Traceability                   in Alexa, some less actionable permissions such as “Device
         Classifier                                 Analyzer                      Address” or other recently incorporated permissions like
                                                                   Complete
              Preprocessing                        Preprocessing
                                                                                  “Amazon Pay” are still not commonly used among skills,
                                                                                  and they are therefore not easy to find and tag. To mitigate
                                                                                  this issue, we extended the PBSD with sentences from the
Policy
                                                                    Partial       350 Android privacy policies (APP-350) [33] only for per-
            Model        Model
                                                                                  missions that map with Alexa’s data permission categories,
         Permission 1 Permission N                                                i.e.: the Device Address, Device country and postal code, Email
                                              Permissions                         Address, Location Services, and Mobile Number.
MARKET                                         Requested            Broken
                                                                                                                 TABLE 8
                                                                                          Sentence count per permission in PBSD and the extended
Fig. 7. Overview of SkillVet.                                                                                 PBSD+APP350.

same policy is used while asking for the Lists Read Access                                       Alexa Data Permission     #PBSD     #PBSD+APP350
permission, which differs from the other skill’s permissions.                                             Amazon Pay           135             135
                                                                                                        Device Address         254             776
                                                                                        Device Country and Postal Code         244             518
4     AUTOMATED T RACEABILITY A NALYSIS                                                                  Email Address         205            2181
                                                                                                      Location Services        186             533
To facilitate the traceability analysis of skills, we create a                                         Mobile Number            97             941
system, SkillVet, based on machine learning and natural                                                          Name          416             416
                                                                                                  Personal Information       2,482           2,482
language processing. Given a skill, it first identifies and
                                                                                                                  None       6,390           6,390
classifies all statements in its privacy policy that relate to                                                    Total     10,409          14,372
data practices over personal information. We map each data
                                                                                      The problem we tackle is a multi-label classification
statement with one or more Alexa permissions. These are
                                                                                  problem, as sentences can be referring to more than one
the permissions the skill justifies in the privacy policy. We
                                                                                  permission. For instance, the sentence “We collect your
then compare these permissions with the ones the skill
                                                                                  name and postal address” indicates the collection of per-
is authorized to request through the Amazon API during
                                                                                  missions Name and Device country and postal code. One well-
runtime. Depending on if the permissions requested match
                                                                                  known way to tackle multi-label classification problems is to
those found in the policy, the skill is then classified as having
                                                                                  transform the problem into a number of binary classification
a complete, partial, or broken privacy policy. Figure 7 presents
                                                                                  problems.6 In particular, the sentence classifier is formed
an overview of the system. The automated traceability
                                                                                  by 9 binary models, one per each of the 8 Alexa data
analysis system consists of two parts, the Sentence Classifier
                                                                                  permission categories we model, and one for the sentences
and the Traceability Analyzer. The Sentence Classifier (§4.1)
                                                                                  that do not belong to any permission category. All binary
spawns several models that perform the mapping men-
                                                                                  classifiers are trained following a one-vs-all strategy. For
tioned above. The Traceability Analyzer (§4.2) collects all
                                                                                  every classifier modeling a particular permission, we label
data permissions found in the privacy policy and compares
                                                                                  all sentences relating to that permission as the positive class
them with the original set of data permissions requested by
                                                                                  and all other sentences as the negative one. Given a sentence
the skill. This is done to determine the traceability between
                                                                                  from a privacy policy, each of the classifiers then evaluates
the use of a permissions and its justification in the policy.
                                                                                  whether that sentence belongs to the permission category
                                                                                  being tested or not. Therefore, we obtain a multi-label
4.1      Sentence Classifier                                                      classification of the sentence. That is, every classifier makes
To identify the privacy permissions requested in each sen-                        an independent assessment and as a result we are able to
tence automatically, we manually created the permission-by-                       model sentences describing more than one permission.
sentence dataset (PBSD). The PBSD compiles 10,409 anno-
tated sentences from 532 original Alexa policies randomly                         4.2     Traceability Analyzer
chosen from the TBPD described in §3 (the rest of TBPD is
                                                                                  The traceability analyzer collects the data permissions out-
used as unseen data for the evaluation as detailed in §4.3).
                                                                                  puted by the Sentence Classifier, and compares them with
The annotations are tags that associate each statement to one
                                                                                  the permissions requested by the skill in two steps:
(or more) of the Alexa permission categories of interest. That
is, PBSD contains sentences with permissions needed for the                       Policy Preprocessing. Given an Alexa policy, the Trace-
traceability analysis as discussed in detail in §3.2 (Amazon                      ability Analyzer first splits the policy into sentences and
Pay, Device Address, Device country and postal code, Email                        preprocesses each of the sentences in the same way as
Address, Location Services, Mobile Number, Name, and Personal                     presented in §3.2. After this, it removes sentences non-
Information) and the negative label, named None, which                            related to data permissions, such as those in which the
includes sentences that do not belong to any of the Alexa                         authors provide some contact details, e.g.: “if you have any
permission classes considered. The distribution of tagged                         inquiries about this skill, please contact us by email”. These
sentences per permission category in the PBSD is presented                        sentences are often structured similarly in all policies, use
in Table 8. Note the uneven distribution of sentences among
                                                                                     6. Note that other transformations may be possible [34], e.g., trans-
classes. The permissions with the largest number of state-                        forming the multi-label classification problem into a multi-class classi-
ments are “Personal Information” and “Name”. Although                             fication problem but that would require 28 classes.

                                                                              9
similar terms, and are usually wrongly assessed by classi-                                                        TABLE 9
fiers, which are unable to identify that the sentence does not                                 K-fold validation for the Sentence Classifier.
describe a data permission request. To detect this special set                                   Alexa Data Permission         F1    Acc
of sentences, we use a keyword blacklist containing words                                        Amazon Pay                  .987    .991
such as ‘contact us’ or ‘call us’. The same is done with                                         Device Address              .919    .908
                                                                                                 Device Country + P.C.       .921    .909
negation terms such as “does not” or “doesn’t”. This step                                        Email Address               .967    .965
filters out negative sentences such as “this skill does not                                      Location Services           .960    .973
need your email address”, which are quite common among                                           Mobile Number               .978    .977
skill policies. If a sentence contains any of these blacklisted                                  Name                        .977    .983
                                                                                                 Personal Information        .986    .982
keywords, the sentence is ignored and not classified, as                                         None                        .993    .991
modeling negative statements is important to determine the
correct meaning of a privacy policy [35].                                                                  TABLE 10
Traceability. After the policy preprocessing, the Traceability                 Confusion matrix comparing SkillVet with a human analysis for the
Analyzer runs each of the 9 classifiers through all the                                           972 unseen skills from TBPD.
remaining sentences, collecting all data permissions in the
                                                                                                       Actual Traceability on 972 unseen policies
policy. Note that the None classifier prevails over the others                   SkillVet
                                                                                                  Broken (B)     Partial (P)    Complete (C)     Total
to handle contradictions in data policies [35]. That is, we                               B       264 (98.5%) 8 (4.9%)          11 (2.0%)        283
consider that a policy does not have a (proper) data state-                               P       0 (0.0%)       134 (81.7%) 22 (4.1%)           156
ment if there are permissions associated to a policy in addi-                             C       4 (1.5%)       22 (13.4%)     507 (93.9%)      533
                                                                                       Total      268            164            540              972
tion to a positive classification of the None classifier. Next,
through the comparison between permissions requested by
the skill and those automatically found in its policy, the                     classifiers are able to differentiate and classify each of the
Traceability Analyzer classifies the traceability of the skill                 permissions correctly, obtaining F1-scores and an accuracy
as broken, partial or complete, in the same way as in §3.                      of over .9 for all data permissions.
                                                                               SkillVet Evaluation. To evaluate the performance of
                                                                               SkillVet as a whole, we compare the traceability re-
4.3   Implementation and Evaluation                                            sults we obtained with the unseen remaining subset of 972
Sentence Classifier Training and Validation. Each of the                       skills and their policies from the traceability-by-policy dataset
9 models is a binary SVM classifier built on top of an n-                      (TBPD)8 . That is, the policies of these skills are not used to
gram binary vectorizer and a tf-idf layer7 . We test various                   create the PBSD as detailed in §4.1, so they are not used at
parameters over a set of 5-fold cross-validation runs for                      any point for training, validation, fine-tuning, or evaluation
each of the 9 models, including the size of the n-grams,                       of the sentence classifiers. As shown in Table 10, SkillVet
the SVM loss method, the SVM alpha value, and differ-                          is able to correctly classify 93.1% (905) of the previously
ent oversampling and undersampling strategies to balance                       unseen 972 policies. The highest accuracy measures are
the classes of the permission-by-sentence dataset (PBSD). We                   obtained when identifying broken and complete policies
make sure that each data permission is represented consis-                     (98.5% and 93.9% accuracy), while partial traceability seems
tently among the instances selected for training and testing                   to work comparatively worse, with a few miss-classified as
in order to increase the robustness of the classifiers against                 complete.
sentences of all types. The parameters that consistently                           We further investigate the reasons behind the 67 errors
returned better F1 and accuracy scores over the different                      in total (out of 972 skills), which we organize into the
5-fold cross-validation are then used to train with all data.                  following three main types: i) Class error (32 cases, 48%)
In a nutshell, the best parameters are: we use n-grams                         occurs when the policy is wrongly analyzed due to an error
of size 1, 2 and 3, stratified random undersampling and                        in one of the sentence classifiers producing the wrong class;
oversampling as balancing strategy, modified huber loss as the                 ii) Fair error (15 cases, 22%) occurs when even a human
SVM loss function, and an svm alpha of 10−5 for the SVM                        struggles to identify the correct class because the sentence
classifier. After undersampling and oversampling certain                       can be interpreted in different ways, e.g., the most common
classes, on average, each classifier was trained and tested                    cases are sentences that refer to address but it is not clear
using between 2K and 8K of the total sentences found in the                    whether it is the device address or the email address, in fact,
permission-by-sentence dataset. Interestingly, we observe that                 the manual analysis for traceability would consider this as
the classifiers performs best when we use at the same time                     partial precisely because it is not exactly clear what address
n-grams of different sizes (1,2,3).                                            the sentence is referring to; iii) Filter error (20 cases, 30%)
    The final version of all binary classifiers that are de-                   is when the error is attributed to the preprocessing stage
ployed in SkillVet, are trained using 100% of the the                          of SkillVet, e.g., most of the cases are due to sentences
PBSD dataset, and then evaluated using a set of extra unseen                   where there is a complex contradiction, in the sense that
523 sentences to check their performance at a sentence-                        there is a negation but associated to a positive non-collection
permission level (see below for the evaluation of SkillVet                     action, actually suggesting a good practice (e.g., not shared
as a whole). The average F1-score and accuracy metrics
obtained are reported in Table 9. Note how the sentence                          8. Note that we did not consider a further 254 skills in TBPD, because
                                                                               their traceability is trivial: they have missing policies, dead links,
  7. Note that approaches such as sentence embeddings [36] could be            or empty policies. A simple check without sentence classification is
used, but the repetition of similar constructs among policies indicates        enough to assess them broken. Including them would potentially bias
that a simpler, n-gram classifier is also appropriate.                         the evaluation away from the more complex, error-prone cases.

                                                                          10
You can also read