Diffusion of User Tracking Data in the Online Advertising Ecosystem

Proceedings on Privacy Enhancing Technologies; 2018 (4):85–103

Muhammad Ahmad Bashir and Christo Wilson

Muhammad Ahmad Bashir: Northeastern University, E-mail: ahmad@ccs.neu.edu
Christo Wilson: Northeastern University, E-mail: cbw@ccs.neu.edu

Abstract: Advertising and Analytics (A&A) companies have started collaborating more closely with one another due to the shift in the online advertising industry towards Real Time Bidding (RTB). One natural way to understand how user tracking data moves through this interconnected advertising ecosystem is by modeling it as a graph. In this paper, we introduce a novel graph representation, called an Inclusion graph, to model the impact of RTB on the diffusion of user tracking data in the advertising ecosystem. Through simulations on the Inclusion graph, we provide upper and lower estimates on the tracking information observed by A&A companies. We find that 52 A&A companies observe at least 91% of an average user’s browsing history under reasonable assumptions about information sharing within RTB auctions. We also evaluate the effectiveness of blocking strategies (e.g., AdBlock Plus), and find that major A&A companies still observe 40–90% of user impressions, depending on the blocking strategy.

Keywords: Online Tracking, RTB, Cookie Matching

DOI 10.1515/popets-2018-0033
Received 2018-02-28; revised 2018-06-15; accepted 2018-06-16.

1 Introduction

In the last decade, the online display advertising industry has massively grown in size and scope. According to the Interactive Advertising Bureau (IAB), revenue from the online display ad industry in the U.S. totaled $88B in 2017, a growth of 21.4% from 2016 [63]. This increased spending is fueled by advances that enable advertisers to target users with increasing levels of precision, even across different devices and platforms.

Another recent change in the online display advertising ecosystem is the shift from ad networks to ad exchanges, where advertisers bid on impressions being sold in Real Time Bidding (RTB) auctions. The rise of RTB has forced Advertising and Analytics (A&A) companies to collaborate more closely with one another, in order to exchange data about users and facilitate bidding on impressions [10, 58]. The move towards RTB has also caused A&A companies to specialize into particular roles. For example, Supply-Side Platforms (SSPs) work with publishers (e.g., CNN) to help manage their relationship with ad exchanges, while Demand-Side Platforms (DSPs) try to optimize ad placement and bidding on behalf of advertisers. In short, due to RTB, the online advertising ecosystem has become enormously complex.

A natural way to model this complex ecosystem is in the form of a graph. Graph models that accurately capture the relationships between publishers and A&A companies are extremely important for practical applications, such as estimating revenue of A&A companies [26], predicting whether a given domain is a tracker [34], or evaluating the effectiveness of domain-blocking strategies on preserving users’ privacy.

However, to date, technical limitations have prevented researchers from developing accurate graph models of the online advertising ecosystem. For example, Gomer et al. [29] propose a Referer graph, where nodes represent publishers or A&A domains, and two nodes a_i and a_j are connected if an HTTP message to a_j is observed with a_i as the HTTP Referer. Unfortunately, as we will show, graphs built using Referer information may contain erroneous edges in cases where a third-party script is embedded directly into a first-party context (i.e., is not sandboxed in an iframe).

In this paper, to model the diffusion of user tracking data within RTB auctions, we propose a novel and accurate representation of the advertising graph called an Inclusion graph. The Inclusion graph corrects the technical problem of the Referer graph by using the actual inclusion relationships between domains to represent edges, rather than imprecise Referer relationships. We are able to construct Inclusion graphs, thanks to advances in browser instrumentation that allow researchers to conduct web crawls that record the exact provenance of all HTTP(S) requests [6, 10, 41].

We use crawled data consisting of around 2M impressions from popular e-commerce websites collected
by a specially instrumented version of Chrome [10] to construct the Inclusion graph. In § 4, we examine the fundamental graph properties of the Inclusion graph and compare it to a Referer graph created using the same dataset, to understand their salient differences. In § 5, we demonstrate a concrete use case for the Inclusion graph by using simulations to model the flow of tracking data to A&A companies. Furthermore, we compare the efficacy of different real-world and graph-theoretic “blocking” strategies (e.g., AdBlock Plus [2], Ghostery [25], and Disconnect [18]) at reducing the flow of tracking information to A&A companies.

Overall, we make the following key contributions:
– We introduce the Inclusion graph as a model for capturing the complexity of the online advertising ecosystem. We use the Inclusion graph as a substrate for modeling the flow of impressions to A&A companies by taking into account the browsing behavior of users and the dynamics of RTB auctions.
– We find that the Inclusion graph has substantive differences in graph structure compared to the Referer graph because 48.4% of resource inclusions in our crawled data have an inaccurate Referer.
– Through simulations, we find that 52 A&A companies are each able to observe 91% of an average user’s impressions as they browse, under modest assumptions about data sharing in RTB auctions. 636 A&A companies are able to observe at least 50% of an average user’s impressions. Even under the strictest simulation assumptions, the top 10 A&A companies observe 89–99% of all user impressions.
– We simulate the effect of five blocking strategies, and find that AdBlock Plus (the world’s most popular ad blocking browser extension [45, 62]) is ineffective at protecting users’ privacy because major ad exchanges are whitelisted under the Acceptable Ads program [73]. In contrast, Disconnect blocks the most information flows to A&A companies, followed by removal of the top 10% of A&A nodes. However, even with strong blocking, major A&A companies still observe 40–80% of user impressions.

The raw data we use in this study is publicly available.¹ We have also publicly released the source code and data from this study.²

¹ http://personalization.ccs.neu.edu/Projects/Retargeting/
² http://personalization.ccs.neu.edu/Projects/AdGraphs/

2 Background and Related Work

In this section, we review technical details of and current computer science research on the online display advertising ecosystem. We start by discussing related work on user privacy and tracking. Next, we present examples of the current display ad serving process and define the roles of different actors in the ecosystem, followed by a brief overview of efforts to empirically measure these processes. Lastly, we examine prior work that modeled the ad ecosystem as a graph.

2.1 Tracking and Blocking

To show relevant ads to users, advertisers rely heavily on collecting information about users as they browse the web. This data collection is achieved by embedding trackers into webpages that gather browsing information about each user.

The area of tracking has been well studied. Krishnamurthy et al. and others have documented the pervasiveness of trackers and the associated user privacy implications over time [15, 20, 26, 33, 37–39]. Furthermore, tracking techniques have evolved over time. Persistent cookies [35], local state in browser plugins [7, 68, 69], and various browser fingerprinting methods [1, 21, 36, 51, 55, 57, 65] are some of the techniques that have been deployed to track users. Englehardt et al. [20] found evidence of tracking via the Audio and Battery Status JavaScript APIs. In addition to tracking users themselves, advertisers try to maximize their knowledge of each user’s interest profile by sharing information with each other via cookie matching [1, 10, 23, 58]. Falahrastegar et al. examine how tracking differs across geographic regions [22].

Users have become increasingly concerned with the amount and types of tracking information collected about them [47, 70]. Several surveys have investigated users’ concerns about targeted ads, their preferences towards tracking, and usage of privacy tools [8, 42, 48, 66, 71]. Concerns about the privacy implications of tracking (as well as the insecurity of online ad networks [75]) have led to increased adoption of tools that block trackers and ads. Two studies have examined the usage of ad blockers in-the-wild [45, 62], while Walls et al. looked at efforts to whitelist “acceptable advertisers” [73].

Merzdovnik et al. critically examined the effectiveness of tracker blocking tools [49]; in contrast, Nithyanand et al. studied advertisers’ efforts to counter
ad blockers [56]. Mughees et al. examined the prevalence of anti-ad blockers in the wild [53]. In this work, we expand on the existing blocking literature by taking the effects of ad auctions and cookie matching into account.

The research community has proposed a variety of mechanisms to stop online tracking that go beyond blacklists of domains and URLs. Li et al. [43] and Ikram et al. [32] used machine learning to identify trackers, while Papaodyssefs et al. [60] examined the use of private cookies to avoid being tracked. Nikiforakis et al. propose the complementary idea of adding entropy to the browser to evade fingerprinting [54]. However, despite these efforts, third-party trackers are still pervasive and pose real privacy issues to users [49].

2.2 The Online Advertising Ecosystem

Numerous studies have chronicled the online advertising ecosystem, which is composed of companies that: track users, serve ads, act as platforms between publishers (websites that rely on advertising revenue to pay for content creation) and advertisers, or all of the above. Mayer et al. present an accessible introduction to this topic in [46]. In this work, we collectively refer to companies engaged in analytics and advertising as A&A companies.

Recently, the online ad ecosystem has begun to shift from ad networks to ad exchanges, which implement Real Time Bidding (RTB) auctions to sell impressions to advertisers. In the advertising industry, the term “impression” is used when advertising or tracking content is rendered in a user’s browser after they visit a webpage [17]. To participate in RTB auctions, A&A companies must implement cookie matching, which is a process by which different A&A companies exchange their unique tracking identifiers for specific users. Several studies have examined the emergence of cookie matching [1, 10, 23, 58]. Ghosh et al. theoretically model the incentives for A&A companies to collaborate with their competitors in RTB auction systems [24].

[Figure 1. Panels: (a) Cookie Matching Example; (b) RTB Example with Two Exchanges and Two Auctions. Legend: Publisher, SSP, Exchange, DSP/Advertiser; HTTP(S) Request/Response, RTB Bidding.]

Fig. 1. Examples of (a) cookie matching and (b) showing an ad to a user via RTB auctions. (a) The user visits publisher p1 ➊ which includes JavaScript from advertiser a1 ➋. a1’s JavaScript then cookie matches with exchange e1 by programmatically generating a request that contains both of their cookies ➌. (b) The user visits publisher p2, which then includes resources from SSP s1 and exchange e2 ➊–➌. e2 solicits bids ➍ and sells the impression to e1 ➎ ➏, which then holds another auction ➐, ultimately selling the impression to a1 ➑ ➒.

Figure 1(a) illustrates the typical process used by A&A companies to match cookies. When a user visits a website ➊, JavaScript code from a third-party advertiser a1 is automatically downloaded and executed in the user’s browser ➋. This code may set a cookie in the user’s browser, but this cookie will be unique to a1, i.e., it will not contain the same unique identifiers as the cookies set by any other A&A companies. Furthermore, the Same Origin Policy (SOP) prevents a1’s code from reading the cookies set by any other domain. To facilitate bidding in future RTB auctions, a1 must match its cookie to the cookie set by an ad exchange like e1. As shown in the figure, a1’s JavaScript accomplishes this by programmatically causing the browser to send a request to e1 ➌. The JavaScript includes a1’s cookie in the request, and the browser automatically adds a copy of e1’s cookie, thus allowing e1 to create a match between its cookie and a1’s.

Figure 1(b) shows an example of how an ad may be shown on publisher p2 using RTB auctions. When a user visits p2 ➊, JavaScript code is automatically downloaded and executed either from a Supply Side Platform (SSP) ➋ or an ad exchange. SSPs are A&A companies that specialize in maximizing publisher revenue by forwarding impressions to the most lucrative ad exchange. Eventually the impression arrives at the auction held by ad exchange e2 ➌, and e2 solicits bids from advertisers and Demand Side Platforms (DSPs) ➍. DSPs are A&A companies that specialize in executing ad campaigns on behalf of advertisers. Note that all participants in the auction observe the impression; however, because only e2’s cookie is available at this point, auction participants that have not matched cookies with e2 will not be able to identify the user.

The process of filling an impression may continue even after an RTB auction is won, because the winner may be yet another ad exchange or ad network. As shown in Figure 1(b), the impression is purchased from e2 by e1 ➎ ➏, who then holds another auction ➐ and ultimately sells to a1 (the advertiser from the cookie matching example) ➑ ➒. Ad exchanges and ad networks
routinely match cookies with each other to facilitate the flow of impression inventory between markets.

Measurement Studies. Barford et al. broadly characterized the web adscape and identified systematically important ad networks [9]. Rodriguez et al. measured the ad ecosystem that serves mobile devices [72], while Zarras et al. specifically examined ad networks that serve malicious ads [75]. Gill et al. modeled the revenue earned by different A&A companies [26], while other studies have used empirical measurements to determine the value of individual users to online advertisers [58, 59]. Many studies have used a variety of methods to study the targeted ads that are displayed to users under a variety of circumstances [9–11, 16, 30, 44].

2.3 Ad Ecosystem Graphs

A natural structure for modeling the online ad ecosystem is a graph, where nodes represent publishers and A&A companies, and edges capture relationships between these entities. Gomer et al. [29] built and analyzed graphs of the ad ecosystem by making use of the Referer field from HTTP requests. In this representation, a relationship d_i → d_j exists if there is an HTTP request to domain d_j with a Referer header from domain d_i.

While Gomer et al. provided interesting insights into the structure of the ad ecosystem, their referral-based graph representation has a significant limitation. As we describe in § 3.3, relying on the HTTP Referer does not always capture the correct relationships between A&A parties, thus leading to incorrect graphs of the ad ecosystem. We re-create this graph representation using our dataset (see § 3) and compare its properties to a more accurate representation in § 4.

Kalavri et al. [34] created a bipartite graph of publishers and associated A&A domains, then transformed it to create an undirected graph consisting solely of A&A domains. In their representation, two A&A domains are connected if they were included by the same publisher. This construction leads to a highly dense graph with many complete cliques. Kalavri et al. leveraged the tight community structure of A&A domains to predict whether new, unknown URLs were A&A or not. However, this co-occurrence representation has a conceptual shortcoming: it may include edges between A&A domains that do not directly communicate or have any business relationship. Due to this shortcoming, we do not explore this graph representation in this work.

3 Methodology

Our goal is to capture the most accurate representation of the online advertising ecosystem, which will allow us to model the effect of RTB on diffusion of user tracking data. In this section, we introduce the dataset used in this study and describe how we use it to build a graph representation of the ad ecosystem.

3.1 Dataset

In this work, we use the dataset provided by Bashir et al. [10]. The goal of [10] was to causally infer the information sharing relationships between A&A companies by (1) crawling products from popular e-commerce websites and then (2) observing corresponding retargeted ads on publishers. Bashir et al. conducted web crawls that covered 738 major e-commerce websites (e.g., Amazon) and 150 popular publishers (e.g., CNN).³ The authors chose top e-commerce sites from Alexa’s hierarchical list of online shops [4], and manually chose publishers from the Alexa Top-1K. They crawled 10 manually selected products per e-commerce site to signal strong intent to trackers and advertisers, followed by 15 randomly chosen pages per publisher to elicit display ads. In total, Bashir et al. repeated the entire crawl nine times, resulting in data for around 2M impressions.

³ For simplicity, we refer to these e-commerce websites as publishers, to distinguish them from A&A domains.

3.2 Inclusion Trees

Bashir et al. [10] used a specially instrumented version of Chromium for their web crawls. Their crawler recorded the inclusion tree for each webpage, which is a data structure that captures the semantic relationships between elements in a webpage (as opposed to the DOM, which captures syntactic relationships) [6, 41]. The crawler also recorded all HTTP request and response headers associated with each visited URL.

To illustrate the importance of inclusion trees, consider the example webpage shown in Figure 2(a). The DOM shows that the page from publisher p ultimately includes resources from four third-party domains (a1 through a4). It is clear from the DOM that the request to a3 is responsible for causing the request to a4, since the script inclusion is within the iframe. However, it
is not clear which domain generated the requests to a2 and a3: the img and iframe could have been embedded in the original HTML from p, or these elements could have been created dynamically by the script from a1. In this case, the inclusion tree shown in Figure 2(b) reveals that the image from a2 was dynamically created by the script from a1, while the iframe from a3 was embedded directly in the HTML from p.

[Figure 2. Panels: (a) DOM Tree for http://p.com/index.html; (b) Inclusion Tree, with nodes p.com/index.html, a2.com/pixel.jpg, a3.com/banner.html, and a4.com/ads.js; (c) Inclusion Graph; (d) Referer Graph. Legend: Publisher, A&A.]

Fig. 2. An example HTML document and the corresponding inclusion tree, Inclusion graph, and Referer graph. In the DOM representation, the a1 script and a2 img appear at the same level of the tree; in the inclusion tree, the a2 img is a child of the a1 script because the latter element created the former. The Inclusion graph has a 1:1 correspondence with the inclusion tree. The Referer graph fails to capture the relationship between the a1 script and a2 img because they are both embedded in the first-party context, while it correctly attributes the a4 script to the a3 iframe because of the context switch.

The instrumented Chromium binary used by Bashir et al. was able to correctly determine the provenance of webpage elements, regardless of how they were created (e.g., directly in HTML, via inline or remotely included script tags, dynamically via eval(), etc.), or where they were located (in the main context or within iframes). This was accomplished by tagging all scripts with provenance information (i.e., first-party for inline scripts), and then dynamically monitoring the execution of each script. New scripts created during the execution of a given script (e.g., via document.write()) were linked to their parent.⁴ More details about how Chromium was instrumented and inclusion trees were extracted are available in [6].

⁴ Note that JavaScript within a given page context executes serially, so there is no ambiguity created by concurrency. Although Web Workers may execute concurrently, they cannot include third party scripts or modify the DOM.

Cookie Matching. The Bashir et al. dataset also includes labels on edges of the inclusion trees indicating cases where cookie matching is occurring. These labels are derived from heuristics (e.g., string matching to identify the passing of cookie values in HTTP parameters) and causal inferences based on the presence of retargeted ads. We use this data in § 5 to constrain some of our simulations.

3.3 Graph Construction

A natural way to model the online ad ecosystem is using a graph. In this model, nodes represent A&A companies, publishers, or other online services. Edges capture relationships between these actors, such as resource inclusion or information flow (e.g., cookie matching).

Canonicalizing Domains. We use the data described in § 3.1 to construct a graph for the online advertising ecosystem. We use effective 2nd-level domain names to represent nodes. For example, x.doubleclick.net and y.doubleclick.net are represented by a single node labeled doubleclick. Throughout this paper, when we say “domain”, we are referring to an effective 2nd-level domain name.⁵

⁵ None of the publishers and A&A domains in our dataset have two-part TLDs, like .co.uk, which simplifies our analysis.

Simplifying domains to the effective 2nd-level is a natural encoding for advertising data. Consider two inclusion trees generated by visiting two publishers: publisher p1 forwards the impression to x.doubleclick.net and then to advertiser a1. Publisher p2 forwards to y.doubleclick.net and advertiser a2. This does not imply that x.doubleclick.net and y.doubleclick.net only sell impressions to a1 and a2, respectively. In reality, DoubleClick is a single auction, regardless of the subdomain, and a1 and a2 have the opportunity to bid on all impressions. Individual inclusion trees are snapshots of how one particular impression was served; only in aggregate can all participants in the auctions be enumerated. Further, 3rd-level domains may read 2nd-level cookies without violating the Same Origin Policy [52]: x.doubleclick.net and y.doubleclick.net may both access cookies set by .doubleclick.net, and do in practice.

The sole exception to our domain canonicalization process is Amazon’s Cloudfront Content Delivery Network (CDN). We routinely observed Cloudfront hosting ad-related scripts and images in our data. We manually examined the 50 fully-qualified Cloudfront domains

(e.g., d31550gg7drwar.cloudfront.net) that were preceded or followed by A&A domains in our data, and mapped each one to the corresponding A&A company (e.g., adroll in this case).

Inclusion graph. We propose a novel representation called an Inclusion graph that is the union of all inclusion trees in our dataset. Our representation is a directed graph of publishers and A&A domains. An edge d_i → d_j exists if we have ever observed domain d_i including a resource from d_j. Edges may exist from publishers to A&A domains, or between A&A domains. Figure 2(c) shows an example Inclusion graph.

Referer graph. Gomer et al. [29] also proposed a directed graph representation consisting of publishers and A&A domains for the online advertising ecosystem. In this representation, each publisher and A&A domain is a node, and edge d_i → d_j exists if we have ever observed an HTTP request to d_j with Referer d_i. Figure 2(d) shows an example Referer graph corresponding to the given webpage. The Bashir et al. [10] dataset includes all HTTP request and response headers from the crawl, and we use these to construct the Referer graph.

Although the Referer and Inclusion graphs seem similar, they are fundamentally different for technical reasons. Consider the examples shown in Figure 2: the script from a_1 is included directly into p's context, thus p is the Referer in the request to a_2. This results in a Referer graph with two edges that does not correctly encode the relationships between the three parties: p → a_1 and p → a_2. In other words, HTTP Referer headers are an indirect method for measuring the semantic relationships between page elements, and the headers may be incorrect depending on the syntactic structure of a page. Our Inclusion graph representation fixes the ambiguity in the Referer graph by explicitly relying on the inclusion relationships between elements in webpages. We analyze the salient differences between the Referer and Inclusion graphs in § 4.

Weights. We also create a weighted version of these graphs. In the Inclusion graph, the weight of d_i → d_j encodes the number of times a resource from d_i sent an HTTP request to d_j. In the Referer graph, the weight of d_i → d_j encodes the number of HTTP requests with Referer d_i and destination d_j.

3.4 Detection of A&A Domains

For us to understand the role of A&A companies in the advertising graph, we must be able to distinguish A&A domains from publishers and non-A&A third parties like CDNs. In the inclusion trees from the Bashir et al. dataset [10], each resource is labeled as A&A or non-A&A using the EasyList and EasyPrivacy rule lists. For all the A&A labeled resources, we extract the associated 2nd-level domain. To eliminate false positives, we only consider a 2nd-level domain to be A&A if it was labeled as A&A more than 10% of the time in the dataset.

[Figure 3: x-axis "Top x A&A Domains" (0 to 1000); y-axis "% Overlap with A&A from Alexa Top-5K" (0 to 100).]
Fig. 3. Overlap between frequent A&A domains and A&A domains from Alexa Top-5K.

[Figure 4: x-axis "# Pages Crawled" (0 to 15K); y-axis "# Unique External A&A Domains" (0 to 900).]
Fig. 4. Unique A&A domains contacted by each A&A domain as we crawl more pages.

3.5 Coverage

There are two potential concerns with the raw data we use in this study: does the data include a representative set of A&A domains? and does the data contain all of the outgoing edges associated with each A&A domain? To answer the former question, we plot Figure 3, which shows the overlap between the top x A&A domains in our dataset (ranked by inclusion frequency by publishers) and all of the A&A domains included by the Alexa Top-5K websites.6 We observe that 99% of the 150 most frequent A&A domains appear in both samples, while 89% of the 500 most frequent appear in both. These findings confirm that our dataset includes the vast majority of prominent A&A domains that users are likely to encounter on the web.

6 Our dataset and the Alexa Top-5K data were both collected in December 2015, so they are temporally comparable.

To answer the second question, we plot Figure 4, which shows the number of unique external A&A domains contacted by A&A domains in our dataset as the crawl progressed (i.e., starting from the first page crawled, and ending with the last). Recall that the dataset was collected over nine consecutive crawls spanning two weeks, each of which visited 9,630 individual pages spread over 888 domains.

We observe that the number of A&A → A&A edges rises quickly initially, going from 0 to 800 in 3,600

Graph Type   |V|    |E|     |V_WCC|   |E_WCC|   Avg. Deg. (In)   Avg. Deg. (Out)   Avg. Path Length   Cluster. Coef.   S^Δ [31]   Degree Assort.
Inclusion    1917   26099   1909      26099     13.612           13.612            2.748†             0.472‡           31.254‡    -0.31‡
Referer      1923   41468   1911      41468     21.564           21.564            2.429†             0.235‡           10.040‡    -0.29‡

Table 1. Basic statistics for the Inclusion and Referer graphs. We show sizes for the largest WCC in each graph. † denotes that the metric is calculated on the largest SCC. ‡ denotes that the metric is calculated on the undirected transformation of the graph.

crawled pages. Then, the growth slows down, requiring an additional 12,000 page visits to increase from 800 to 900. In other words, almost all A&A edges were discovered by half-way through the very first crawl; eight subsequent iterations of the crawl only uncovered 12.5% more edges. This demonstrates that the crawler reached the point of diminishing returns, indicating that the vast majority of connections between A&A domains that existed at the time are contained in the dataset.

that should be in the core of the network are incorrectly attached to publishers along the periphery.

Structure and Connectivity. As shown in Table 1, the Inclusion graph has large, well-connected components. The largest Weakly Connected Component (WCC) covers all but eight nodes in the Inclusion graph, meaning that very few nodes are completely disconnected. This highlights the interconnectedness of the ad ecosystem. The average node degree in the Inclusion graph is 13.6, and

network that have disassortative connectivity, which we examine in the next section.

[Figure 5: x-axis "k" (0 to 70); y-axis "|WCC|" (0 to 2000).]
Fig. 5. k-core: size of the Inclusion graph WCC as nodes with degree ≤ k are recursively removed.

Betweenness Centrality    Weighted PageRank
google-analytics          doubleclick
doubleclick               googlesyndication
googleadservices          2mdn
facebook                  adnxs
googletagmanager          google
googlesyndication         adsafeprotected
adnxs                     google-analytics
google                    scorecardresearch
addthis                   krxd
criteo                    rubiconproject

Table 2. Top 10 nodes ranked by betweenness centrality and weighted PageRank in the Inclusion graph.

4.2 Cores and Communities

We now examine how nodes in the Inclusion graph connect to each other using two metrics: k-cores and community detection. The k-core of a graph is the subset of a graph (nodes and edges) that remains after recursively removing all nodes with degree ≤ k. By increasing k, the loosely connected periphery of a graph can be stripped away, leaving just the dense core. In our scenario, this corresponds to the high-degree ad exchanges, ad networks, and trackers that facilitate the connections between publishers and advertisers.

Figure 5 plots k versus the size of the WCC for the Inclusion graph. The plot shows that the core of the Inclusion graph rapidly declines in size as k increases, which highlights the interdependence between A&A domains and the lack of a distinct core.

Next, to examine the community structure of the Inclusion graph, we utilized three different community detection algorithms: label propagation by Raghavan et al. [64], Louvain modularity maximization [12], and the centrality-based Girvan–Newman [27] algorithm. We chose these algorithms because they attempt to find communities using fundamentally different approaches.

Unfortunately, after running these algorithms on the largest WCC, the results of our community analysis were negative. Label propagation clustered all nodes into a single community. Louvain found 14 communities with an overall modularity score of 0.44 (on a scale of -1 to 1 where 1 is entirely disjoint clusters). The largest community contains 771 nodes (40% of all nodes) and 3,252 edges (12% of all edges). Out of 771 nodes, 37% are A&A. However, none of the 14 communities corresponded to meaningful groups of nodes, either segmented by type (e.g., publishers, SSPs, DSPs, etc.) or segmented by ad exchange (e.g., customers and partners centered around DoubleClick). This is a known deficiency in modularity maximization based methods: they tend to produce communities with no real-world correspondence [5]. Girvan–Newman found 10 communities, with the largest community containing 1,097 nodes (57% of all nodes) and 16,424 edges (63% of all edges). Out of 1,097 nodes, 64% are A&A. However, the modularity score was zero, which means that the Girvan–Newman communities contain a random assortment of internal and external (cross-cluster) edges.

Overall, these results demonstrate that the web display ad ecosystem is not balkanized into distinct groups of companies and publishers that partner with each other. Instead, the ecosystem is highly interdependent, with no clear delineations between groups or types of A&A companies. This result is not surprising considering how dense the Inclusion graph is.

4.3 Node Importance

In this section, we focus on the importance of specific nodes in the Inclusion graph using two metrics: betweenness centrality and weighted PageRank. As before, we focus on the largest WCC. The betweenness centrality for a node n is defined as the fraction of all shortest paths on the graph that traverse n. In our scenario, nodes with high betweenness centrality represent the key pathways for tracking information and impressions to flow from publishers to the rest of the ad ecosystem. For weighted PageRank, we weight each edge in the Inclusion graph based on the number of times we observe it in our raw data. In essence, weighted PageRank identifies the nodes that receive the largest amounts of tracking data and impressions throughout each graph.
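To make the weighted PageRank metric from § 4.3 concrete, the following is a minimal sketch using plain power iteration. It is not the implementation used in this study: the function name weighted_pagerank, the damping factor of 0.85, the uniform redistribution of rank from nodes without outgoing edges, and the toy edge list are all assumptions made for illustration.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank where rank flows in proportion to edge weight.

    edges maps (src, dst) -> weight, e.g. how often the edge was observed.
    The damping value and dangling-node handling are illustrative assumptions.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# Hypothetical toy graph: two publishers and three A&A domains.
toy_edges = {
    ("p1", "exchange"): 10,
    ("p2", "exchange"): 5,
    ("p1", "tracker"): 1,
    ("exchange", "dsp"): 8,
    ("exchange", "tracker"): 2,
}
ranks = weighted_pagerank(toy_edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{node}: {ranks[node]:.3f}")
```

In this toy graph the publishers receive no incoming edges, so they end up with the lowest scores, while the domains that heavily weighted edges point to accumulate the most rank.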

Table 2 shows the top 10 nodes in the Inclusion graph based on betweenness centrality and weighted PageRank. Prominent online advertising companies are well represented, including AppNexus (adnxs), Facebook, and Integral Ad Science (adsafeprotected). Similar to prior work, we find that Google's advertising domains (including DoubleClick and 2mdn) are the most prominent overall [29]. Unsurprisingly, these companies all provide platforms, i.e., SSPs, ad exchanges, and ad networks. We also observe trackers like Google Analytics and Tag Manager. Interestingly, among the 15 unique domains across the two lists, ten appear in only one list. This suggests that the most important domains in terms of connectivity are not necessarily the ones that receive the highest volume of HTTP requests.

5 Information Diffusion

In § 4, we examined the descriptive characteristics of the Inclusion graph, and discussed the implications of this graph structure on our understanding of the online advertising ecosystem. In this section, we take the next step and present a concrete use case for the Inclusion graph: modeling the diffusion of user tracking data across the ad ecosystem under different types of ad and tracker blocking (e.g., AdBlock Plus and Ghostery). We model the flow of information across the Inclusion graph, taking into account different blocking strategies, as well as the design of RTB systems and empirically observed transition probabilities from our crawled dataset.

5.1 Simulation Goals

Simulation is an important tool for helping to understand the dynamics of the (otherwise opaque) online advertising industry. For example, Gill et al. used data-driven simulations to model the distribution of revenue amongst online display advertisers [26].

Here, we use simulations to examine the flow of browsing history data to trackers and advertisers. Specifically, we ask:
1. How many user impressions (i.e., page visits) to publishers can each A&A domain observe?
2. What fraction of the unique publishers that a user visits can each A&A domain observe?
3. How do different blocking strategies impact the number of impressions and fraction of publishers observed by each A&A domain?

These questions have direct implications for understanding users' online privacy. The first two questions are about quantifying a user's online footprint, i.e., how much of their browsing history can be recorded by different companies. In contrast, the third question investigates how well different blocking strategies perform at protecting users' privacy.

5.2 Simulation Setup

To answer these questions, we simulate the browsing behavior of typical users using the methodology from Burklen et al. [14].9 In particular, we simulate a user browsing publishers over discrete time steps. At each time step our simulated user decides whether to remain on the current publisher according to a Pareto distribution (exponent = 2), in which case they generate a new impression on that publisher. Otherwise, the user browses to a new publisher, which is chosen based on a Zipf distribution over the Alexa ranks of the publishers. Burklen et al. developed this browsing model based on large-scale observational traces, and derived the distributions and their parameters empirically. This browsing model has been successfully used to drive simulated experiments in other work [40].

9 To the best of our knowledge, there are no other empirically validated browsing models besides [14].

We generated browsing traces for 200 users. On average, each user generated 5,343 impressions on 190 unique publishers. The publishers are selected from the 888 unique first-party websites in our dataset (see § 3.1).

During each simulated time step the user generates an impression on a publisher, which is then forwarded to all A&A domains that are directly connected to the publisher. This emulates a webpage with multiple slots for display ads, each of which is serviced by a different SSP or ad exchange. However, it is insufficient to simply forward the impression to the A&A domains directly connected to each publisher; we also must account for ad exchanges and RTB auctions [10, 58], which may cause the impression to spread farther on the graph. We discuss this process next. The simulated time step ends when all impressions arrive at A&A domains that do not forward them. Once all outstanding impressions have terminated, time increments and our simulated user generates a new impression, either from their currently selected publisher or from a new publisher.
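The browsing model described in § 5.2 (stay on the current publisher or move to a new one chosen by rank) can be sketched as follows. This is only an illustration under stated assumptions, not the simulator used in this study: we interpret the Pareto distribution as governing the number of consecutive impressions on a publisher, we assume a Zipf exponent of 1, and the function name simulate_browsing and the publisher names are hypothetical.

```python
import random


def simulate_browsing(publishers, n_impressions, pareto_alpha=2.0,
                      zipf_s=1.0, seed=0):
    """Return one publisher per time step for a single simulated user.

    publishers must be ordered by popularity rank, most popular first.
    Assumption: the number of consecutive impressions on a publisher is a
    Pareto(pareto_alpha) draw rounded down, which is always at least 1.
    """
    rng = random.Random(seed)
    # Zipf-like popularity: the publisher at rank r has weight 1 / r**zipf_s.
    weights = [1.0 / (rank ** zipf_s)
               for rank in range(1, len(publishers) + 1)]
    trace = []
    while len(trace) < n_impressions:
        current = rng.choices(publishers, weights=weights, k=1)[0]
        stay = int(rng.paretovariate(pareto_alpha))
        trace.extend([current] * stay)
    return trace[:n_impressions]


# Hypothetical publishers, listed from most to least popular.
publishers = [f"publisher{rank}.example" for rank in range(1, 51)]
trace = simulate_browsing(publishers, n_impressions=1000, seed=1)
print(len(trace), "impressions on", len(set(trace)), "unique publishers")
```

Fixing the seed makes a trace reproducible, which is convenient when the same trace has to be replayed under several blocking strategies.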

[Figure 6: x-axis "Termination Probability per Node" (0 to 1); y-axis "CDF" (0 to 1).]
Fig. 6. CDF of the termination probability for A&A nodes.

[Figure 7: x-axis "Mean Weight on Incoming Edges" (1 to 100K, log scale); y-axis "CDF" (0 to 1).]
Fig. 7. CDF of the weights on incoming edges for A&A nodes.

5.2.1 Impression Propagation

Our simulations must account for direct and indirect propagation of impressions. Direct flows occur when one A&A domain sells or redirects an impression to another A&A domain. We refer to these flows as "direct" because they are observable by the web browser, and are thus recorded in our dataset. Indirect flows occur when an ad exchange solicits bids on an impression. The advertisers in the auction learn about the impression, but this is not directly observable to the browser; only the winner is ultimately known.

Direct Propagation. To account for direct propagation, we assign a termination probability to each A&A node in the Inclusion graph that determines how often it serves an ad itself, versus selling the impression to a partner (and redirecting the user's browser accordingly). We derive the termination probability for each A&A node empirically from our dataset. When an impression is sold, we determine which neighboring node purchases the impression based on the weights of the outgoing edges. For a node a_i, we define its set of outgoing neighbors as N_o(a_i). The probability of selling to neighbor a_j ∈ N_o(a_i) is

$$\frac{w(a_i \to a_j)}{\sum_{a_y \in N_o(a_i)} w(a_i \to a_y)},$$

where w(a_i → a_j) is the weight of the given edge.

Figure 6 shows the termination probability for A&A nodes in the Inclusion graph. We see that 25% of the A&A nodes have a termination probability of one, meaning that they never sell impressions. The remaining 75% of A&A nodes exhibit a wide range of termination probabilities, corresponding to different business models and roles in the ad ecosystem. For example, DoubleClick, the most prominent ad exchange, has a termination probability of 0.35, whereas Criteo, a well-known advertiser specializing in retargeting, has a termination probability of 0.63.

Figure 7 shows the mean incoming edge weights for A&A nodes in the Inclusion graph. We observe that the distribution is highly skewed towards nodes with extremely high average incoming weights (note that the x-axis is in log scale). This demonstrates that heavy-hitters like DoubleClick, GoogleSyndication, OpenX, and Facebook are likely to purchase impressions that go up for auction in our simulations.

Indirect Propagation. Unfortunately, precisely accounting for indirect propagation is not currently possible, since it is not known exactly which A&A domains are ad exchanges, or which pairs of A&A domains share information. To compensate, we evaluate three different indirect impression propagation models:
– Cookie Matching-Only: As we note in § 3.2, the Bashir et al. [10] dataset includes 200 empirically validated pairs of A&A domains that match cookies. In this model, we treat these 200 edges as ground-truth and only indirectly disseminate impressions along these edges. Specifically, if a_i observes an impression, it will indirectly share with a_j iff a_i → a_j exists and is in the set of 200 known cookie matching edges. This is the most conservative model we evaluate, and it provides a lower-bound on impressions observed by A&A domains.
– RTB Relaxed: In this model, we assume that each A&A domain that observes an impression indirectly shares it with all A&A domains that it is connected to. Although this is the correct behavior for ad exchanges like Rubicon and DoubleClick, it is not correct for every A&A domain. This is the most liberal model we evaluate, and it provides an upper-bound on impressions observed by A&A domains.
– RTB Constrained: In this model, we select a subset of A&A domains E to act as ad exchanges. Whenever an A&A domain in E observes an impression, it shares it with all directly connected A&A domains, i.e., to solicit bids. This model represents a more realistic view of information diffusion than the Cookie Matching-Only and RTB Relaxed models because the graph contains few but extremely well connected exchanges.

For RTB Constrained, we select all A&A nodes with out-degree ≥ 50 and in/out degree ratio r in the range 0.7 ≤ r ≤ 1.7 to be in E. These thresholds were chosen after manually looking at the degrees and ratios for known ad exchanges and ad exchanges marked by Bashir et al. [10]. This results in |E| = 36 A&A nodes being chosen as ad exchanges (out of 1,032 total A&A domains in the Inclusion graph). We enforce restrictions on r because A&A nodes with disproportionately large amounts of incoming edges are likely to be trackers (in-

   Node Type                    Edge Type                Activation      a4 and a5 (i.e., it services their ad campaigns by bidding
     Publisher               Cookie Matched               Direct         on their behalf). Light grey edges capture cases where
     Exchange            Non-Cookie Matched
                                                         Indirect        the two endpoints have been observed cookie matching
DSP/Advertiser
                                                                         in the ground-truth data. Edge e2 → a3 is a false nega-
                                                 a1
tive because matching has not been observed along this edge in the data, but a3 must match with e2 to meaningfully participate in the auction.

Figure 8(b)–(d) show the flow of impressions under our three models. In all three examples, a user visits publishers p1 and p2, generating two impressions. Further, in all three examples a2 wins both auctions on behalf of a5; thus e1, e2, a2, and a5 are guaranteed to observe impressions. As shown in the figure, a2 and a5 observe both impressions, but other nodes may observe zero or more impressions depending on their position and the dissemination model. In Figure 8(b), a3 does not observe any impressions because its incoming edge has not been labeled as cookie matched; this is a false negative because a3 participates in e2’s auction. Conversely, in Figure 8(d), all nodes always share all impressions, thus a4 observes both impressions. However, these are false positives, since DSPs like a2 do not routinely share information amongst all their clients.

[Figure 8 panels: observed impression count per node. Panel (a) marks a false negative edge, (b) a false negative impression, and (d) false positive impressions.]

Panel                  e1   e2   a1   a2   a3   a4   a5
(a) Example Graph       0    0    0    0    0    0    0
(b) Cookie Matching     1    1    1    2    0    0    2
(c) RTB Constrained     1    1    1    2    1    0    2
(d) RTB Relaxed         1    1    1    2    1    2    2
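The per-node counts described for Figure 8(b)–(d) can be reproduced with a small simulation. The following is a minimal sketch, not the authors' simulator: the edge list, the set of cookie-matched edges, and the names `EDGES`, `COOKIE_MATCHED`, `forwards`, and `simulate` are assumptions read off the figure description, and indirect sharing is assumed to travel exactly one hop from nodes that were contacted directly.

```python
from collections import Counter

# Toy graph assumed from Figure 8(a): publishers -> exchanges -> bidders -> clients.
EDGES = {
    "p1": ["e1"], "p2": ["e2"],
    "e1": ["a1", "a2"], "e2": ["a2", "a3"],
    "a2": ["a4", "a5"],
}
# Assumed cookie-matching labels; e2 -> a3 is the unlabeled (false negative) edge.
COOKIE_MATCHED = {("e1", "a1"), ("e1", "a2"), ("e2", "a2"), ("a2", "a5")}
EXCHANGES = {"e1", "e2"}
# Direct inclusion chains: a2 wins both auctions on behalf of a5.
WINNING_PATHS = [["p1", "e1", "a2", "a5"], ["p2", "e2", "a2", "a5"]]


def forwards(src, dst, model):
    """Return True if src indirectly shares an impression with dst."""
    if model == "cookie_matching":
        return (src, dst) in COOKIE_MATCHED
    if model == "rtb_constrained":
        return src in EXCHANGES  # only exchanges share with their bidders
    if model == "rtb_relaxed":
        return True              # every node shares with every neighbor
    raise ValueError(f"unknown model: {model}")


def simulate(model):
    """Count impressions observed by each A&A node under one model."""
    counts = Counter()
    for path in WINNING_PATHS:
        direct = path[1:]            # A&A nodes contacted directly
        observers = set(direct)
        for src in direct:           # one hop of indirect sharing
            for dst in EDGES.get(src, []):
                if forwards(src, dst, model):
                    observers.add(dst)
        counts.update(observers)
    return counts


if __name__ == "__main__":
    for model in ("cookie_matching", "rtb_constrained", "rtb_relaxed"):
        print(model, dict(simulate(model)))
```

In this sketch a3 observes nothing under the cookie-matching model (the false negative), while a4 observes both impressions only under the relaxed model (the false positives).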
Fig. 8. Examples of our information diffusion simulations. The observed impression count for each A&A node is shown below its name. (a) shows an example graph with two publishers and two ad exchanges. Advertisers a1 and a3 participate in the RTB auctions, as well as DSP a2 that bids on behalf of a4 and a5. (b)–(d) show the flow of data (dark grey arrows) when a user generates impressions on p1 and p2 under three diffusion models. In all three examples, a2 purchases both impressions on behalf of a5, thus they both directly receive information. Other advertisers indirectly receive information by participating in the auctions.

formation enters but is not forwarded out), while those with disproportionately large amounts of outgoing edges are likely SSPs (they have too few incoming edges to be an ad exchange). Table 6 in the appendix shows the domains in E, including major, known ad exchanges like App Nexus, Advertising.com, Casale Media, DoubleClick, Google Syndication, OpenX, Rubicon, Turn, and Yahoo. 150 of the 200 known cookie matching edges in our dataset are covered by this list of 36 nodes.

Figure 8 shows hypothetical examples of how impressions disseminate under our indirect models. Figure 8(a) presents the scenario: a graph with two publishers connected to two ad exchanges and five advertisers. a2 is a bidder in both exchanges, and serves as a DSP for

5.2.2 Node Blocking

To answer our third question, we must simulate the effect of “blocking” A&A domains on the Inclusion graph. A simulated user that blocks A&A domain a_j will not make direct connections to it (the solid outlines in Figure 8). However, blocking a_j does not prevent a_j from tracking users indirectly: if the simulated user contacts ad exchange a_i, the impression may be forwarded to a_j during the bidding process (the dashed outlines in Figure 8). For example, an extension that blocks a2 in Figure 8 will prevent the user from seeing an ad, as well as prevent information flow to a4 and a5. However, blocking a2 does not stop information from flowing to e1, e2, a1, a3, and even a2!

We evaluate five different blocking strategies to compare their relative impact on user privacy under our three impression propagation models:
1. We randomly blocked 30% (310) of the A&A nodes from the Inclusion graph.¹⁰
2. We blocked the top 10% (103) of A&A nodes from the Inclusion graph, sorted by weighted PageRank.

¹⁰ We also randomly blocked 10% and 20% of A&A nodes, but the simulation results were very similar to that of random 30%.
3. We blocked all 594 A&A nodes from the Ghostery [25] blacklist.
4. We blocked all 412 A&A nodes from the Disconnect [18] blacklist.
5. We emulated the behavior of AdBlock Plus [2], which is a combination of whitelisting A&A nodes from the Acceptable Ads program [73], and blacklisting A&A nodes from EasyList [19]. After whitelisting, 634 A&A nodes are blocked.

We chose these methods to explore a range of graph theoretic and practical blocking strategies. Prior work has shown that the global connectivity of small-world graphs is resilient against random node removal [13], but we would like to empirically determine if this is true for ad network graphs as well. In contrast, prior work also shows that removing even a small fraction of top nodes from small-world graphs causes the graph to fracture into many subgraphs [50, 74]. Ghostery and Disconnect are two of the most widely-installed tracker blocking browser extensions, so evaluating their blacklists allows us to quantify how good they are at protecting users’ privacy. Finally, AdBlock Plus is the most popular ad blocking extension [45, 62], but contrary to its name, by default it whitelists A&A companies that pay to be part of its Acceptable Ads program [3]. Thus, we seek to understand how effective AdBlock Plus is at protecting user privacy under its default behavior.

5.3 Validation

To confirm that our simulations are representative of our ground-truth data, we perform some sanity checks. We simulate a single user in each model (who generates 5K impressions) and compare the resulting simulated inclusion trees to the original, real inclusion trees.

First, we look at the number of nodes that are activated by direct propagation in trees rooted at each publisher. Figure 9a shows that our models are conservative in that they generate smaller trees: the median original tree contains 48 nodes, versus 32, seven, and six from our models. One caveat to this is that publishers in our simulated trees have a wider range of fan-outs than in the original trees. The median publishers in the original and simulated trees have 11 and 12 neighbors, respectively, but the 75th percentile trees have 16 and 30 neighbors, respectively.

Second, we investigate the depth of the inclusion trees. As shown in Figure 9b, the median tree depth in the original trees is three, versus two in all our models. The 75th percentile tree depth in the original data is four, versus three in the RTB Relaxed and RTB Constrained models, and two in the most restrictive Cookie Matching-Only model. These results show that overall, our models are conservative in that they tend to generate slightly shorter inclusion trees than reality.

Fig. 9. Comparison of the original and simulated inclusion trees: (a) number of nodes activated and (b) tree depth, for the Original, RTB-R, RTB-C, and CM trees. Each bar shows the 5th, 25th, 50th (in black), 75th, and 95th percentile value.

Third, we look at the set of A&A domains that are included in trees rooted at each publisher. For a publisher p that contacts a set $A^o_p$ of A&A domains in our original data, we calculate $f_p = |A^s_p \cap A^o_p| / |A^o_p|$, where $A^s_p$ is the set of A&A domains contacted by p in simulation. Figure 10 plots the CDF of $f_p$ values for all publishers in our dataset, under our three models. We observe that for almost 80% of publishers, 90% of the A&A domains contacted in the original trees are also contacted in trees generated by the RTB Relaxed model. This falls to 60% and 16% of publishers as the models become more restrictive.

Fourth, we examine the number of ad exchanges that appear in the original and simulated trees. Examining the ad exchanges is critical, since they are responsible for all indirect dissemination of impressions. As shown in Figure 11, inclusion trees from our simulations contain an order of magnitude fewer ad exchanges than the original inclusion trees, regardless of model.¹¹ This suggests that indirect dissemination of impressions in our models will be conservative relative to reality.

Number of Selected Exchanges. Finally, we investigate the impact of exchanges in the RTB Constrained model. We select the top x A&A domains by out-degree to act as exchanges (subject to their in/out degree ratio r being in the range 0.7 ≤ r ≤ 1.7), then execute a simulation. As shown in Figure 12, with 20

¹¹ Because each of our models assumes that a different set of A&A nodes are ad exchanges, we must perform three corresponding counts of ad exchanges in our original trees.
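The exchange-selection rule used for the RTB Constrained model (top x nodes by out-degree, subject to 0.7 ≤ r ≤ 1.7) can be sketched as follows. This is an illustrative reading of that rule, not the authors' code: the function name `select_exchanges`, the edge-list input format, and the choice to apply the ratio filter before taking the top x are all assumptions.

```python
from collections import defaultdict


def select_exchanges(edges, x, r_min=0.7, r_max=1.7):
    """Pick up to x candidate exchanges from a list of (src, dst) edges.

    A node qualifies if its in/out degree ratio r satisfies
    r_min <= r <= r_max; qualifying nodes are ranked by out-degree.
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1

    # Only nodes with at least one outgoing edge can have a finite ratio.
    candidates = [
        node for node in out_deg
        if r_min <= in_deg[node] / out_deg[node] <= r_max
    ]
    # Highest out-degree first; ties are broken by name for determinism.
    candidates.sort(key=lambda node: (-out_deg[node], node))
    return candidates[:x]


if __name__ == "__main__":
    toy_edges = [
        ("p1", "hub"), ("p2", "hub"), ("p3", "hub"),
        ("hub", "a"), ("hub", "b"), ("hub", "c"),
        ("p1", "ssp"), ("ssp", "a"), ("ssp", "b"), ("ssp", "c"), ("ssp", "d"),
        ("p1", "mid"), ("p2", "mid"), ("mid", "a"), ("mid", "b"),
    ]
    print(select_exchanges(toy_edges, x=5))
```

In the toy input, the hypothetical node `ssp` has many outgoing edges but too few incoming ones (r = 0.25), so it is excluded, mirroring the intuition that such nodes are more likely SSPs than exchanges.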

Fig. 10. CDF of the fractions of A&A domains contacted by publishers in our original data that were also contacted in our three simulated models (CM, RTB-C, RTB-R). [x-axis: Frac. of A&A Contacted; y-axis: CDF (Frac. of Publishers)]

Fig. 11. Number of ad exchanges in our original (solid lines) and simulated (dashed lines) inclusion trees, for CM, RTB-C, and RTB-R. [x-axis: # of Ad Exchanges per Tree, 1 to 10000; y-axis: CDF]

Fig. 12. Fraction of impressions observed by A&A domains in the RTB-C model when the top x exchanges are selected, for x = 5, 10, 20, 30, 50, and 100. [x-axis: Fraction of Impressions; y-axis: CDF]
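The overlap fraction f_p summarized in Fig. 10, and the kind of statement drawn from its CDF ("for what share of publishers is f_p at least 0.9?"), can be computed as in the sketch below. The helper names `overlap_fraction` and `share_at_least` and the sample domain sets are hypothetical and chosen only for illustration.

```python
def overlap_fraction(original, simulated):
    """f_p: share of originally contacted A&A domains also seen in simulation."""
    original = set(original)
    if not original:
        raise ValueError("publisher contacted no A&A domains in the original data")
    return len(original & set(simulated)) / len(original)


def share_at_least(values, threshold):
    """Fraction of publishers whose f_p is at least `threshold`."""
    values = list(values)
    return sum(v >= threshold for v in values) / len(values)


if __name__ == "__main__":
    # Hypothetical per-publisher domain sets: (original, simulated).
    publishers = {
        "pub-a": ({"x", "y", "z", "w"}, {"x", "y", "q"}),
        "pub-b": ({"x", "y"}, {"x", "y", "z"}),
    }
    f_values = [overlap_fraction(o, s) for o, s in publishers.values()]
    print(f_values, share_at_least(f_values, 0.9))
```

Domains that appear only in the simulated tree do not affect f_p, since the denominator is the original set.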

Blocking        Cookie Matching-Only    RTB Constrained    RTB Relaxed
Scenarios       %E       %W             %E       %W        %E       %W
No Blocking     16.9     31.0           33.9     55.9      71.8     81.3
AdBlock Plus    12.3     28.0           25.6     50.3      48.4     68.6
Random 30%      12.1     21.8           22.1     34.2      48.7     54.8
Ghostery        3.52     9.87           6.82     18.2      13.5     21.9
Top 10%         6.03     5.01           8.18     5.52      26.8     13.4
Disconnect      2.98     3.66           4.72     6.01      16.3     11.6

Table 3. Percentage of Edges that are triggered in the Inclusion graph during our simulations under different propagation models and blocking scenarios. We also show the percentage of edge Weights covered via triggered edges.

Cookie Matching-Only         RTB Constrained              RTB Relaxed
doubleclick         90.1     google-analytics    97.1     pinterest           99.1
criteo              89.6     quantserve          92.0     doubleclick         99.1
quantserve          89.5     scorecardresearch   91.9     twitter             99.1
googlesyndication   89.0     youtube             91.8     googlesyndication   99.0
flashtalking        88.8     skimresources       91.6     scorecardresearch   99.0
mediaforge          88.8     twitter             91.3     moatads             99.0
adsrvr              88.6     pinterest           91.2     quantserve          99.0
dotomi              88.6     criteo              91.2     doubleverify        99.0
steelhousemedia     88.6     addthis             91.1     crwdcntrl           99.0
adroll              88.6     bluekai             91.1     adsrvr              99.0

Table 4. Top 10 nodes that observed the most impressions under our simulations with no blocking.
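The two quantities reported in Table 3 can be read as simple ratios over the A&A edges of the Inclusion graph. The sketch below shows one way to compute them, assuming each edge carries a numeric weight; the function name `triggered_percentages`, the dictionary input format, and the sample weights are illustrative assumptions rather than the authors' implementation.

```python
def triggered_percentages(weights, triggered):
    """Return (%E, %W) for a weighted edge set.

    weights:   dict mapping (src, dst) edges to a positive weight
    triggered: iterable of edges activated during a simulation
    """
    triggered = set(triggered)
    hit = [edge for edge in weights if edge in triggered]
    pct_edges = 100.0 * len(hit) / len(weights)
    pct_weight = 100.0 * sum(weights[e] for e in hit) / sum(weights.values())
    return pct_edges, pct_weight


if __name__ == "__main__":
    # Hypothetical weights: one heavy edge and two light ones.
    sample = {("e1", "a1"): 6, ("e1", "a2"): 3, ("e2", "a3"): 1}
    print(triggered_percentages(sample, {("e1", "a1")}))
```

Because weight is concentrated on a few edges in this toy input, triggering one edge out of three covers well over a third of the total weight, which is why %W can diverge from %E in either direction.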

or more exchanges the distribution of impressions observed by A&A domains stops growing, i.e., our RTB Constrained model is relatively insensitive to the number of exchanges. This is not surprising, given how dense the Inclusion graph is (see § 4). We observed similar results when we picked top nodes based on PageRank.

5.4 Results

We take our 200 simulated users and “play back” their browsing traces over the unmodified Inclusion graph, as well as graphs where nodes have been blocked using the strategies outlined above. We record the total number of impressions observed by each A&A domain, as well as the fraction of unique publishers observed by each A&A domain under different impression propagation models.

Triggered Edges. Table 3 shows the percentage of edges between A&A nodes that are triggered in the Inclusion graph under different combinations of impression propagation models and blocking strategies. No blocking/RTB Relaxed is the most permissive case; all other cases have fewer edges and less weight because (1) the propagation model prevents specific A&A edges from being activated and/or (2) the blocking scenario explicitly removes nodes. Interestingly, AdBlock Plus fails to have significant impact relative to the No Blocking baseline, in terms of removing edges or weight, under the Cookie Matching-Only and RTB Constrained models. Further, the top 10% blocking strategy removes fewer edges than Disconnect or Ghostery, but it reduces the remaining edge weight to roughly the same level as Disconnect, whereas Ghostery leaves more high-weight edges intact. These observations help to explain the outcomes of our simulations, which we discuss next.

No Blocking. First, we discuss the case where no A&A nodes are blocked in the graph. Figure 13 shows the fraction of total impressions (out of ∼5,300) and fraction of unique publishers (out of ∼190) observed by A&A domains under different propagation models. We find that the distribution of observed impressions under RTB Constrained is very similar to that of RTB Relaxed, whereas observed impressions drop dramatically under the Cookie Matching-Only model. Specifically, the top 10% of A&A nodes in the Inclusion graph (sorted by impression count) observe more than 97% of the impressions in RTB Relaxed, 90% in RTB Constrained, and 29% in Cookie Matching-Only. We observe similar patterns for fractions of publishers observed across the three indirect propagation models. Recall that the Cookie Matching-Only and RTB Relaxed models function as lower- and upper-bounds on observability; that

Fig. 13. Fraction of impressions (solid lines) and publishers (dashed lines) observed by A&A domains under our three models, without any blocking. [x-axis: Observed Fraction; y-axis: CDF]

Fig. 14. Fraction of impressions observed by A&A domains under the RTB Constrained (dashed lines) and RTB Relaxed (solid lines) models, with various blocking strategies: (a) Disconnect, Ghostery, AdBlock Plus; (b) Top 10% and Random 30% of nodes. Both panels include the No Blocking baseline. [x-axis: Fraction of Impressions; y-axis: CDF]
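The blocking semantics of § 5.2.2, which underlie the blocking results, can be illustrated on the toy graph of Figure 8. The sketch below is a simplified illustration under stated assumptions: the edge list and winning paths are read off the Figure 8 description, the name `simulate_with_blocking` is hypothetical, a blocked node cuts the direct inclusion chain at that point, and indirect sharing travels one hop from directly contacted nodes.

```python
from collections import Counter

# Toy graph assumed from Figure 8(a).
EDGES = {
    "p1": ["e1"], "p2": ["e2"],
    "e1": ["a1", "a2"], "e2": ["a2", "a3"],
    "a2": ["a4", "a5"],
}
EXCHANGES = {"e1", "e2"}
# Direct inclusion chains: a2 wins both auctions on behalf of a5.
WINNING_PATHS = [["p1", "e1", "a2", "a5"], ["p2", "e2", "a2", "a5"]]


def simulate_with_blocking(blocked, relaxed=False):
    """Count impressions per A&A node when the user blocks `blocked`.

    relaxed=False: only exchanges share indirectly (RTB Constrained style).
    relaxed=True:  every directly contacted node shares (RTB Relaxed style).
    """
    counts = Counter()
    for path in WINNING_PATHS:
        direct = []
        for node in path[1:]:
            if node in blocked:
                break  # the browser never contacts this node or anything after it
            direct.append(node)
        observers = set(direct)
        for src in direct:
            if relaxed or src in EXCHANGES:
                # Server-side sharing is invisible to the blocker, so
                # blocked nodes can still receive the impression.
                observers.update(EDGES.get(src, []))
        counts.update(observers)
    return counts


if __name__ == "__main__":
    print(dict(simulate_with_blocking({"a2"})))
```

With a2 blocked, the sketch reports no impressions for a4 and a5, while e1, e2, a1, a3, and a2 itself still observe them, matching the example given in § 5.2.2.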

the results from the RTB Constrained model are so similar to the RTB Relaxed model is striking, given that only 36 nodes in the former spread impressions indirectly, versus 1,032 in the latter.

Although the overall fraction of observed impressions drops significantly in the Cookie Matching-Only model, Table 4 shows that the top 10 A&A domains observe 99%, 96%, and 89% of impressions on average under RTB Relaxed, RTB Constrained, and Cookie Matching-Only respectively. Some of the top ranked nodes are expected, like DoubleClick, but other cases are more interesting. For example, Pinterest is connected to 178 publishers and 99 other A&A domains. In the Cookie Matching-Only model, it ranks 47 because it is directly embedded in relatively few publishers, but it ascends to rank seven and one, respectively, once indirect sharing is accounted for. This drives home the point that although Google is the most pervasively embedded advertiser around the web [15, 65], there are roughly 52 other A&A companies that also observe greater than 91% of users’ browsing behaviors (in the RTB Constrained model), due to their participation in major ad exchanges.

With Blocking. Next, we discuss the results when AdBlock Plus (i.e., the Acceptable Ads whitelist and EasyList blacklist) is used to block nodes. AdBlock Plus has essentially zero impact on the fraction of impressions observed by A&A domains: the results in Figure 14a under the RTB Constrained and RTB Relaxed models are almost coincident with those for the models when no blocking is applied at all. The problem is that the major ad networks and exchanges are all present in the Acceptable Ads whitelist, and thus all of their partners are also able to observe the impressions, even if they are (sometimes) prevented from actually showing ads to the user. Indeed, the top 10 nodes in Table 4 with no blocking and in Table 5 with AdBlock Plus are almost identical, save for some reordering.

Next, we examine Ghostery and Disconnect in Figure 14a. As expected, the amount of information seen by A&A domains decreases when we block domains from these blacklists. Disconnect’s blacklist does a much better job of protecting users’ privacy in our simulations: after blocking nodes using the Disconnect blacklist, 90% of the nodes see less than 40% of the impressions in the RTB Constrained model, and less than 53% in the RTB Relaxed model. In contrast, when using the Ghostery blacklist, 90% of the nodes see less than 75% of the impressions in both RTB models. Table 5 shows that the top 10 A&A domains are only able to observe at most 40–59% and 73–83% of impressions when the Disconnect and Ghostery blacklists are used, respectively, depending on the indirect propagation model.

As shown in Figure 14b, blocking the top 10% of A&A nodes from the Inclusion graph (sorted by weighted PageRank) causes almost as much reduction in observed impressions as Disconnect. Table 5 helps to orient the top 10% blocking strategy versus Disconnect and Ghostery in terms of overall reduction in impression observability and the impact on specific A&A domains. In contrast, blocking 30% of the A&A nodes at random has more impact than AdBlock Plus, but less than Disconnect and Ghostery. The top 10 nodes under the “no blocking” and “random 30%” (not shown) strategies observe similar impression fractions. Both of these results agree with the theoretical expectations for small-world graphs, i.e., their connectivity is resilient against random blocking, but not necessarily targeted blocking.

We do not show results for our most restrictive model (i.e., Cookie Matching-Only) in Figure 14, since the majority of A&A companies view almost zero impressions. Specifically, 90% of A&A companies view less