A Calculus of Tracking: Theory and Practice - Privacy ...

Proceedings on Privacy Enhancing Technologies ; 2021 (2):259–281

Giorgio Di Tizio* and Fabio Massacci

A Calculus of Tracking: Theory and Practice
*Corresponding Author: Giorgio Di Tizio: University of Trento, E-mail: giorgio.ditizio@unitn.it
Fabio Massacci: University of Trento, Vrije Universiteit Amsterdam, E-mail: fabio.massacci@unitn.it

Abstract: Online tracking techniques, the interactions among trackers, and the economic and social impact of these procedures in the advertising ecosystem have received increasing attention in recent years. This work proposes a novel formal model that describes the foundations on which the visible process of data sharing behaves in terms of the network configurations of the Internet (including CDNs, shared cookies, etc.). From our model, we define relations that can be used to evaluate the impact of different privacy mitigations and determine if websites should comply with privacy regulations. We show that the calculus, based on a fragment of intuitionistic logic, is tractable and constructive: any formal derivation in the model corresponds to an actual tracking practice that can be implemented given the current configuration of the Internet. We apply our model to a dataset obtained from OpenWPM to evaluate the effectiveness of tracking mitigations up to Alexa Top 100.

Keywords: online tracking, ad-blocker, formal model

DOI 10.2478/popets-2021-0027
Received 2020-08-31; revised 2020-12-15; accepted 2020-12-16.

1 Introduction

Online tracking of users for targeted advertising is the reality of today's Internet, and the extent of such tracking has been the subject of intense research activity [1]. Research studies span from fact-finding studies (e.g. [2, 3]) to technical analyses both of classical techniques (e.g. based on cookie syncing [4]) and of novel techniques (e.g. based on browser fingerprinting [5]). Economic analyses are also not uncommon (e.g. [6]). A number of mitigation tools have also emerged to (partly) block trackers (e.g. Ghostery, Disconnect, Adblock Plus, etc.), and researchers have investigated whether they are effective (e.g. [7, 8]) and what their side-effects are (e.g. [9, 10]).
     These robust research activities, which we sample in Tab. 1 and Tab. 2, generated a number of open databases that provide internet snapshots¹ and that can be used to experiment with tracking behavior and the effectiveness of tracking mitigations, as well as to derive metrics for a tracker's market share or for tracker concentration [11], at least for what is measurable from the Internet.²

¹ E.g. Web Census: http://webtransparency.cs.princeton.edu/webcensus
² Obviously, data exchange agreements between owners of seemingly different trackers do affect conclusions based on the internet. Detecting such agreements is hard given the present information asymmetry [12].

     The major privacy concern of users is with whom and for which purposes their personal information is shared, and not by which technology [13, 14]. In other words, users are not worried that an entity collects data when they interact with it, but that these data are shared with other entities. An interesting question, so far unanswered, is how we can provide a third-party independent verification of empirical tracking claims. A study might claim that tool A is more effective than tool B at mitigating trackers, but there is hardly any way for a third party to check why and how unless one re-runs the entire study and traces all results. Even the claim that a tracker burstnet.com can potentially know whoever visited the website amazon.com is hard to check unless one re-runs the entire study.
     We answer this question with a formally grounded mechanism: a calculus of tracking for the internet. We associate a formal relation with the various ways to exchange data (access and inclusion of web pages, cookie syncing, redirects, etc.) as they are measurable from the internet (and available in open datasets such as Web Census) and identify formal rules that capture how internet visits can be tracked.
     We can formally prove that a tracker can potentially know that a user visited a website. By using a fragment of intuitionistic logic we can extract from the proof the actual configuration of web pages that makes such tracking possible. The formal model allows one to determine if a website should comply with privacy laws (e.g. COPPA) or to compare different mitigations and conclude whether a mitigation is strictly better than another or at least quantitatively better along a Pareto frontier. For this approach to be a useful link between theory and practice, such calculus must be tractable and

the analysis can be performed on the fraction of the Internet visited by a user as available from open datasets (e.g. OpenWPM), and we do so up to Alexa Top 100.
     The overview of the key works in this area (§2), summarized in Tab. 1, shows that the majority of the research focused on large-scale analysis and technological analysis, while no attempt has been made to formally describe the sharing procedures. We fill this gap with the following contributions:

 – we present the first formal model (predicates and rules) that describes the passage of tracking information across websites that can be externally measured (§3 and §4);
 – we prove that inferences are tractable and that one can reconstruct the configuration responsible for a concrete tracking practice from the proof (§5);
 – we formalize some of the interesting tracking relations that can be captured by our system (§6);
 – we extend the model to consider the uncertainty of Internet interactions (§7);
 – we discuss scalability issues and challenges (§8) and we instantiate our model in a state-of-the-art theorem prover to determine tracking practices and websites that should comply with COPPA (§8.2);
 – we compare the effectiveness of different mitigations (Ghostery, Disconnect, Adblock Plus, and Privacy Badger) on the Top 5, Top 10, Top 50 and Top 100 visited domains (§9).

We conclude by discussing the added value and the limitations of our approach (§10) and future work (§11).
     Goals: we provide a framework that generates a third-party independent verification of tracking practices for individual cases, i.e. for single users that browse a limited number of websites. Large-scale studies over millions of domains are unrealistic and inappropriate for helping users determine the best countermeasure to enhance their privacy or determine which websites should comply with COPPA. Furthermore, these studies lack transparency and do not provide concrete evidence of the effectiveness of the mitigations analyzed (and in some cases provide contrasting results). Our framework fills the gap by providing an explanation, in the form of a formal proof, that can help users evaluate the effectiveness of different add-ons or that shows whether a website should comply with COPPA.
     Non-goals: we do not consider methods that rely on back-office data exchange for user identification. For example, collecting browser fingerprints and using them in back-office data sharing. As such, we could assert that certain websites perform fingerprinting but, in the absence of a publicly known relation between these websites on how fingerprints (or other personal information) are shared, this would be meager knowledge. Furthermore, the framework is not intended for large-scale analysis of the whole Internet, where formal reasoning hardly scales.


2 Analysis of Tracking Practices

Over the last years, researchers have identified different tracking techniques on the Internet. To make the paper self-contained we present an overview of the techniques and refer the reader to Tab. 8 in Appendix A.1 for additional information. Tab. 2 reports some research questions addressed by the state of the art. Most papers examined the effectiveness of tracker blocking tools, while others focused on trackers' pervasiveness and the techniques used to track Internet users. We considered works about the Online Advertising Ecosystem, privacy policies, and formal modeling of the Internet.

A Summary of User Identification
HTTP Cookies are IDs associated with a user and are set by websites through JavaScript code or HTTP responses. Cookies are automatically attached by the browser to all subsequent requests to the websites. The major difference compared to browser fingerprinting is that the ID is stored locally on the user's machine [7, 27].
     Browser Fingerprinting is used by websites to collect information from the browser to build a unique fingerprint [28]. For example, to personalize the content, a website can request device-specific information like user-agent, HTTP headers, plugins, fonts, screen resolution, OS, canvas and AudioContext [5, 29, 30] via HTTP headers or JavaScript code [22]. These attributes can be used to generate a unique fingerprint for tracking purposes. Other approaches exploit OS and hardware properties to generate device fingerprints that allow cross-browser tracking [31, 32].
     Other Browser Storage, for example HTML5 localStorage, Flash LSOs, and HTTP headers (e.g. ETag) [33, 34], is used by websites to store IDs and track users even if HTTP cookies are deleted.
     Other tracking techniques exploit browsing history [35] and the caching process of DNS records [36].
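The fingerprinting idea described above, deriving an identifier from attributes the browser exposes rather than from an ID stored on the user's machine, can be illustrated with a minimal sketch. It assumes the attributes have already been collected into a dictionary; the function name `fingerprint`, the attribute keys, and all values are hypothetical and chosen only for illustration. Real fingerprinting scripts combine many more signals and are not necessarily based on a plain hash.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Hash a dictionary of browser attributes into a stable hex identifier."""
    # Serialize with sorted keys so the same attributes always yield the same bytes.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute values, for illustration only.
browser_a = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080",
    "fonts": ["Arial", "Courier"],
    "os": "ExampleOS",
}
# Same browser profile except for the screen resolution.
browser_b = dict(browser_a, screen="1366x768")

same_device_again = fingerprint(dict(browser_a))
other_device = fingerprint(browser_b)
```

In this sketch the identifier is recomputed from the attributes on every visit, so deleting cookies does not change it, whereas any change in an attribute (here the screen resolution) produces a different identifier.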

Table 1. Research Topics Addressed by the State of the Art

Research       Research Topic                              [15][2][6][16][17][3][18][7][19][4][20][21][8][22][23][24][25][26]   Our work
               Analysis of the tracking ecosystem                     X X X X                             X                       X
               Tracking coverage                                X     X       X      X X X         X X                            X
Fact finding   Search context exposure to tracking                    X              X
               Detection of hidden flow among trackers                    X             X X        X
               Detection of privacy regulation violation                                                      X
               Detect duty compliance to privacy regul.                                                                           X
Economic       Cookie syncing incentives                     X                                    X                               X
analysis       Revenue with and without cookies              X X X                                X
               Effectiveness of blocking techniques                          X     X X   X            X                           X
Technical      Development of a detection mechanism                              X     X                  X X
analysis       Classification of Trackers                                        X X     X
               Analysis of Tracking techniques                                       X   X                X
               Formal expression of privacy policy                                                                X X X
Logic
               Formal analysis of tracking                                                                                        X

Data Sharing Across Websites
Cookie Syncing is an increasingly popular technique [4] employed by trackers to share the IDs associated with a user [17, 19, 30]. A common cookie syncing technique is to pass the IDs as parameters in an HTTP request. This procedure allows the websites to map different IDs to a single user and link information from different trackers.

Analysis of the Online Advertising Ecosystem
Ghosh et al. [15] analyzed the leakage of information in the Real-Time Bidding (RTB) protocol and modeled the revenue of advertisers with and without syncing. Gill et al. [2] employed HTTP traces to model the revenue earned by different trackers, whereas Marotta et al. [6] empirically estimated the value of targeted advertisement depending on the presence or absence of cookies. We aim to address an orthogonal problem, i.e. how we can formally prove that cookies are employed for tracking users, independently of how trackers utilize them.
     Iqbal et al. [37] developed a graph-based ML classifier of ads and trackers. The tool builds a graph representation of the HTML structure, the network requests, and the JavaScript in the web page to determine tracking practices from specific features. Gomer et al. [16] analyzed how the search context exposes users to tracking practices using directed network graphs based on referral header information. Bashir et al. [3, 17] detected flows of information between advertisers based on retargeted ads. They constructed an inclusion graph to model the advertising ecosystem, analyzed the graph properties, and simulated the impact of tracker blocking tools like Disconnect and Adblock Plus. Kalavri et al. [18] represented traffic logs through 1-mode and 2-mode graphs to highlight the connected communities of the trackers and proposed an automated tracker detection mechanism based on graph properties. Similarly, in our paper we (implicitly) employ graphs to represent the interactions among websites that are exploited in the tracking ecosystem. However, we also provide a formal description of the trackers' interactions to prove tracking practices and privacy compliance.

Formal Models for IT-Security
Speicher et al. [38] used a model based on AI planning with grounded predicates in the context of the email infrastructure. Simeonovski et al. [39] proposed a model based on property graphs in the context of Internet core services. We instead focus on tracking practices and present a stronger formal foundation compared to the previous works by proposing a formal Gentzen calculus.

Analysis of Children's Online Privacy Protection Act Compliance
Data collection and web tracking are regulated by data protection laws. For example, the General Data Protection Regulation (GDPR) [40] is currently in force in the EU. Still, it is not uncommon to observe violations [41].
     When it comes to collecting personal information from children, additional laws apply. In the U.S. the Children's Online Privacy Protection Rule (COPPA) [42] imposes requirements for websites that

collect personally identifiable information (PII)³ from children under 13 y/o. COPPA requires websites to post a privacy policy stating the personal information collected and with whom and for which goal this data is shared, to get verifiable parental consent (for example by calling a toll-free number), and to allow parents to review the PII collected and revoke the consent. Determining whether a website should comply with COPPA is not easy. There have been several violations in the past, for example by Playdom [43] and YouTube [44], with fines of several million dollars. Apart from U.S. FTC reports, some studies developed frameworks to analyze Android apps to determine violations [23, 45, 46]. However, there is not yet a tool for parents to determine such violations due to the complex interactions among websites.

³ E.g. first and last name, home address, SSN, persistent identifiers (e.g. cookies, fingerprinting), etc.

     Several papers tried to formally express privacy policies (e.g. [25, 26]). In the context of COPPA, Barth et al. [24] proposed a framework to formally describe this policy based on first-order temporal logic. Although similar to our work, [24] is a theoretical model that focuses on the description of privacy policies. In contrast, our model describes tracking behaviors for advertising and generates a proof that websites should comply with COPPA based on data from the Internet. For example, parents can determine which websites should comply based on their children's visited websites. The effective compliance can be manually verified by the parents, and the proof can prompt further investigations by the FTC.

Table 2. Research Questions From the State of the Art

Papers belong to some research classes: Fact finding (F), Economic analysis (E), Technical analysis (T), and Logic (L).

 Paper  Class  Research Questions
 [15]   E      What induces publishers to perform cookie sync?
 [2]    F,E    What is the revenue given different information?
 [6]    E      How much do cookies influence publishers' revenues?
 [16]   F      How does search context impact users' privacy?
               What are the tracking ecosystem characteristics?
 [17]   F      Can retargeted ads detect exchange of info?
 [3]    F,T    Can we capture interactions among publishers and
               ad companies based on web inclusions?
               How effective are tracking mitigations?
 [18]   F,T    Can we identify trackers via mode graph?
 [7]    F,T    How can we classify web tracking behaviors?
               How effective are tracking mitigations?
 [19]   F,T    How are trackers distributed in the Top 1M?
               How effective are tracking mitigations?
               Can we automatically collect tracking behaviors?
 [4]    F,T    What are the characteristics of cookie syncing?
               What is the impact on the privacy of the users?
 [20]   T      Are invisible pixels used extensively by trackers?
               Can we classify trackers based on invisible pixels?
               How effective are mitigations vs invisible pixels?
 [21]   F,E    What is the impact of cookie syncing and RTB?
               What is the price value of a user's private data?
 [8]    F,T    How effective are add-ons for desktop and mobile?
               What are the challenges of mobile tracking prot.?
 [22]   F,T    How to identify fingerprinting using font-probing?
               How effective are the tools against fingerprinting?
 [23]   F,T    How to detect 3rd-party privacy violations?
 [24]   L      How can we formally reason about norms of
               transmission of PII?
 [25]   L      How can we formalize and check privacy rules?
 [26]   L      How can we formalize privacy rules?


3 A Formal Model for Tracking

We define a privacy threat as a website being able to know that a user visited other websites as a result of a process of data sharing. The major concern for a user is not that a website knows this is a recurring user, but that unrelated websites gain knowledge of this activity thanks to the exchange of data. We did not model tracking techniques that have not been observed in the wild (e.g. [47]).


3.1 Predicates

Tab. 3 contains predicates that capture network interactions among websites and users (in our analysis we reduced the websites to their public suffix (PS)⁴ and PS+1 of the URL⁵) as well as the type of mitigations considered. IncludeContent(w, w′) indicates an inclusion of some content of website w′ in website w. The predicate IncludeContent_cookie(w, w′) describes, in addition, a transmission of cookies collected by w to w′. Redirect_cookie(w, w′) (Redirect(w, w′)) indicates an HTTP redirection from the website w to the website w′ with (without) a transfer of cookies collected by w.

⁴ https://publicsuffix.org/
⁵ For example, https://s.ytimg.com is reduced to ytimg.com.

     Block_request(w) indicates that the connection to the website w is blocked, for example by an add-on. Block_tp_cookie(w) indicates that the website w is not allowed to set cookies. These two mitigations protect users using different techniques. If Block_request(w)

Table 3. Ground Truth Network Interactions and Mitigations

The predicates are obtained from ground truth data (G) and are not derived from rules of the model.

IncludeContent(w, w′)          G    website w includes 3rd-party content from the website w′ (e.g. within an iframe tag).
IncludeContent_cookie(w, w′)   G    website w includes 3rd-party content from the website w′, sending its cookies.
Redirect(w, w′)                G    website w redirects visitors to the website w′. w does not append cookies in the redirection.
Redirect_cookie(w, w′)         G    website w redirects visitors to the website w′. w appends its cookies in the URL (or payload).
Visit(w)                       G    intentional access to website w by a user.
Block_request(w)               G    extension blocks connections directed to the website w based on filter lists (e.g. Disconnect).
Block_tp_cookie(w)             G    3rd-party cookie blocking for website w.

Table 4. Web Tracking Predicates

The predicates are derived (D) from one or more rules of the model.

Link(w, w′)                D       websites w and w′ have a possible path to share information without exchange of cookies.
Link_cookie(w, w′)         D       websites w and w′ have a possible path to share information via cookies from w to w′.
Access(w, w′)              D       website w forces access to a resource in w′ via a redirection or an inclusion.
Access_cookie(w, w′)       D       website w forces access to a resource in w′ via a redirection (inclusion) attaching w's cookies.
Cookie_sync(w, w′)         D       website w synchronizes its cookies with the website w′. The operation is unidirectional.
Knows(w, w′)               D       potential ability of website w to track users on a (possibly different) website w′.
req_COPPA(w)               D       website w should comply with COPPA.
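
As an illustration of how the ground-truth predicates of Tab. 3 could be held as data, the following minimal sketch stores instantiated predicates as tuples in two sets, one for observed interactions and one for a mitigation. The names `Fact`, `fact`, `snapshot`, `mitigation` and `surviving_interactions`, as well as the example domains, are assumptions made for this sketch and do not come from the paper. The filtering step is only one plausible reading of the Block_request description in Tab. 3 (connections directed to w are blocked), not a rule of the calculus.

```python
from typing import NamedTuple, Set, Tuple


class Fact(NamedTuple):
    predicate: str          # predicate name as listed in Tab. 3
    args: Tuple[str, ...]   # website arguments, e.g. (w, w')


def fact(predicate: str, *args: str) -> Fact:
    """Build an instantiated (ground) predicate."""
    return Fact(predicate, tuple(args))


# Hypothetical observations; the domains are placeholders.
snapshot: Set[Fact] = {
    fact("Visit", "news.example"),
    fact("IncludeContent_cookie", "news.example", "tracker.example"),
    fact("Redirect", "tracker.example", "ads.example"),
}

# Hypothetical mitigation: a filter list that blocks requests to ads.example.
mitigation: Set[Fact] = {fact("Block_request", "ads.example")}


def surviving_interactions(snapshot: Set[Fact], mitigation: Set[Fact]) -> Set[Fact]:
    """Drop two-website interactions whose target w' has its requests blocked."""
    blocked = {f.args[0] for f in mitigation if f.predicate == "Block_request"}
    return {
        f for f in snapshot
        if not (len(f.args) == 2 and f.args[1] in blocked)
    }


remaining = surviving_interactions(snapshot, mitigation)
```

With these placeholder facts, the redirection towards ads.example is filtered out, while the visit and the inclusion of tracker.example remain.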

is evaluated as true (e.g. Disconnect blocks the website w), then all requests to w are blocked. If Block_tp_cookie(w) is evaluated as true (e.g. 3rd-party cookie blocking protection is active), the browser does not allow w to set HTTP cookies; however, it does not block HTTP requests directed to w.
     Tab. 4 summarizes the predicates that describe a possible exchange of user information between websites. Link_cookie(w, w′) and Link(w, w′) identify a possible path, as a result of an inclusion or a redirection, between w and w′ that can be exploited for tracking. Access(w, w′) and Access_cookie(w, w′) capture a successful redirection or inclusion that forces a user to contact website w′ from the website w. Cookie_sync(w, w′) indicates cookie syncing between w and w′ to share IDs.
     The predicates in Tab. 3 are obtained from the collected data, as we will see in §8. The predicates in Tab. 4 are inferred from the rules of the model.


3.2 General Derivation Rules

The model includes the classical Gentzen rules for the quantifier-free fragment of intuitionistic first-order logic (IFOL). We use IFOL since its proofs are constructive and thus there is a pairing between proofs and attacks.

Definition 1 (Internet Snapshot). The symbol N, possibly with subscripts, denotes a finite (possibly empty) set of instantiated predicates from Tab. 3 and 4 that captures the interactions (inclusions, redirections, etc.) among websites observed on the Internet.

We denote the pure "status" of the internet with N and the set of predicates capturing a specific mitigation X on the internet snapshot N with N_X^*. For example, N = {IncludeContent(w, w′), Redirect(w′, w″), ...} and N_X^* = {Block_request(w′), Block_request(w″), ...}.
     The turnstile $\vdash$ separates the assumptions on the left from the propositions on the right. The formulas in the sequence on the left of $\vdash$ are in conjunction. The horizontal line separates preconditions from postconditions.
     The capital letters A, B and C, possibly with subscripts, denote formulae of the quantifier-free fragment of IFOL, whose predicates are drawn from Tab. 3 and 4. The variable w ∈ W, possibly with primes, is a variable over websites. Constants (e.g. facebook.com) denote websites. WL and WR stand for Weakening Left/Right, CL for Contraction Left. We have the following rules for intuitionistic logic:

\[
\frac{}{A \vdash A}\ (\mathrm{Ax})
\qquad
\frac{N \vdash A \qquad A, N \vdash B}{N \vdash B}\ (\mathrm{Cut})
\qquad
\frac{N, A, A \vdash B}{N, A \vdash B}\ (\mathrm{CL})
\]
\[
\frac{N \vdash B}{N, A \vdash B}\ (\mathrm{WL})
\qquad
\frac{N \vdash}{N \vdash B}\ (\mathrm{WR})
\qquad
\frac{N \vdash A}{N, \neg A \vdash}\ (\neg\mathrm{L})
\qquad
\frac{N, A \vdash}{N \vdash \neg A}\ (\neg\mathrm{R})
\]

                                                                     cess that brings to track users and no information on
          N , A, B ` C            N `A      N `B
                          (∧L)                      (∧R)             how fingerprints are shared is available. Furthermore, as
      N,A ∧ B ` C                   N `A∧B
                                                                     pointed out by several works ( [48, 49]), browser finger-
N,A ` C      N,B ` C
                         (∨L)
                                 N `A
                                         (∨R1 )
                                                   N `B
                                                            (∨R1 )
                                                                     printing is not as accurate as cookies to identify users.
    N,A ∨ B ` C                 N `A∨B            N `A∨B             Thus, ad transactions carried out without the presence
     N `A          N,B ` C               N,A ` B                     of cookies are not enough to produce targeted advertise-
                                 (→L)                (→R)            ments [6]. We further discuss this extension in §11.
          N,A → B ` C                   N `A→B

In our derivations, we do use neither ∨Ri nor ∨L rules as
we are only interested in deriving knowledge predicates.             Cookie Syncing
    We can also have domain-specific axioms of the form              The rule Sync in Fig. 1c shows the preconditions re-
A → B that can be added to a derivation with the rule:               quired to implement cookie syncing between websites w
      N,A → B ` C         A → B is Domain Axiom                      and w0 . Cookie syncing requires the exchange of cookies
                                                   DomAx
                          N `C                                       to link the IDs used by the two trackers. This tech-
                                                                     nique is also called First to Third-party Cookie Sync-
We represent a domain axiom A1∧...∧An→B as a rule                    ing [20]. Rule PropagateSync shows how to propagate
A1 , . . . , A n
                 and vice-versa as in IFOL, ` and → are in-          cookie syncing through a sequence of websites.
      B
terchangeable. Tracking specific rules in the next section
are indeed domain axiom.
                                                                     Tracking via Cookie Syncing
                                                                     In Fig. 1c, the rule SyncTracking describes how cookie
                                                                     syncing between websites w0 and w00 allows to track
4 Tracking Specific Rules                                            users even in websites where a tracker is not explic-
                                                                     itly present. This is known as Third to Third-party
Information Flow                                                     Cookie Syncing [20]. We did not define a rule to de-
In Fig. 1a the rules IncludeW and RedirectW show how                 scribe cookie forwarding because it is a special case of
a link between two websites w0 and w can be created.                 3rdpartyTracking where the tracker passively receives
Rule Redirect (Include) in Fig. 1a illustrates how the               the user’s history collected by a 3rd-party. The cookies
redirection to (inclusion of) another website can be em-             forwarded could be used for back-office exchange that is
ployed by trackers to pass information, for example                  outside of the scope of our model. All the rules assume
cookies. The rule ImpRed (ImpInc) in Fig. 1b shows that              the intention of the websites to track users.
the predicate Redirect_cookie (IncludeContent_cookie) implies
the predicate Redirect (IncludeContent).
                                                                     COPPA Compliance
                                                                     A website w must comply with COPPA if at least one
Network Interactions                                                 of these conditions hold [42] for w:
Rules AccessToW and AccessTo in Fig. 1a describe ac-
cess to resources with a possible propagation of infor-               1. is directed to children under 13 y/o and collects PII.
mation between two websites w and w0 (w/ or w/o an                    2. is directed to children under 13 y/o and allows an-
exchange of cookies). The rule PropagateAccess shows                     other website w0 to collect PII.
how the access can be propagated through websites.                    3. has a general audience, but it has actual knowledge
                                                                         that it collects PII from children under 13 y/o.
                                                                      4. collects PII from users of a website w0 directed to
Third-Party Tracking                                                     children under 13 y/o.
The rule 3rdpartyTracking in Fig. 1a shows that 3rd-
parties present on a website w can track users. This rule
can be applied recursively to describe complex interac-              However, websites that fall in conditions (1) and (3) and
tions among websites as shown in the Appendix A.2.                   collect only persistent identifiers (e.g. cookies) are not
This rule does not consider the possibility of browser               obliged to comply with COPPA if the persistent identi-
fingerprinting (not blocked by the Block_tp_cookie mit-              fier is used for internal operations only. It is important
igation). Since we are interested in the data sharing pro-           to underline that this exception does not allow behav-

\[ \frac{\mathit{IncludeContent}(w, w')}{\mathit{Link}(w, w')}\ \text{IncludeW} \qquad \frac{\mathit{Redirect}(w, w')}{\mathit{Link}(w, w')}\ \text{RedirectW} \]
If a website w includes content from (redirects to) site w′, then there is a link between w and w′ that allows an exchange of information.

\[ \frac{\mathit{Redirect}_{cookie}(w, w')}{\mathit{Link}_{cookie}(w, w')}\ \text{Redirect} \qquad \frac{\mathit{IncludeContent}_{cookie}(w, w')}{\mathit{Link}_{cookie}(w, w')}\ \text{Include} \]
During a redirection (inclusion) it is possible to append a cookie of w for w′.

\[ \frac{\mathit{Link}(w, w') \quad \neg\mathit{Block\_request}(w')}{\mathit{Access}(w, w')}\ \text{AccessToW} \qquad \frac{\mathit{Link}_{cookie}(w, w') \quad \neg\mathit{Block\_request}(w')}{\mathit{Access}_{cookie}(w, w')}\ \text{AccessTo} \]
If a website w includes content from (redirects to) a website w′ (this case includes connections exploiting social buttons) that is not blocked by any extension, then the user is forced to access the resources of w′ from w.

\[ \frac{\mathit{Access}(w, w') \quad \mathit{Access}(w', w'')}{\mathit{Access}(w, w'')}\ \text{PropagateAccess} \]
If a website w forces to access the resources of w′ and w′ forces to access the resources of w″, then the user that visits w is forced to access website w″.

\[ \frac{\mathit{Visits}(w) \quad \mathit{Access}(w, w') \quad \neg\mathit{Block\_tp\_cookie}(w')}{\mathit{Knows}(w', w)}\ \text{3rdpartyTracking} \]
If a user visits a website w that forces to access a website w′ not blocked by any mitigation, then w′ knows that the user visited w.

(a) Tracking Flow

\[ \frac{\mathit{IncludeContent}_{cookie}(w, w')}{\mathit{IncludeContent}(w, w')}\ \text{ImpInc} \qquad \frac{\mathit{Redirect}_{cookie}(w, w')}{\mathit{Redirect}(w, w')}\ \text{ImpRed} \]
Redirect_cookie and IncludeContent_cookie are a particular case of Redirect and Include respectively.

(b) Tracking Implications

\[ \frac{\mathit{Access}_{cookie}(w, w') \quad \neg\mathit{Block\_tp\_cookie}(w')}{\mathit{Cookie\_sync}(w, w')}\ \text{Sync} \qquad \frac{\mathit{Cookie\_sync}(w, w') \quad \mathit{Cookie\_sync}(w', w'')}{\mathit{Cookie\_sync}(w, w'')}\ \text{PropagateSync} \]
A website w redirects the user to a website w′ inserting cookies of w in the request. If the connection to w′ is not blocked by any mitigation and the browser allows w′ to set its cookies then w′ can receive w's cookies and synchronize them with its cookies. Cookie syncing can be propagated.

\[ \frac{\mathit{Knows}(w', w) \quad \mathit{Cookie\_sync}(w', w'')}{\mathit{Knows}(w'', w)}\ \text{SyncTracking} \]
The presence of cookie syncing with w′ allows a website w″ to track users on the website w even if it is not explicitly present.

(c) Tracking With Cookie Sharing

\[ \frac{\mathit{Knows}(w, w') \quad \mathit{Kids}(w') \quad w \neq w'}{\mathit{req\_COPPA}(w')}\ \text{COPPAcomplRelease} \]
If a website w tracks users on a children related website w′, then w′ should comply with COPPA. This rule covers case (2).

\[ \frac{\mathit{Knows}(w, w') \quad \mathit{Kids}(w') \quad w \neq w'}{\mathit{req\_COPPA}(w)}\ \text{COPPAcomplCollect} \]
If a website w tracks users on a children related website w′, then w should comply with COPPA. This rule covers case (4).

\[ \frac{\mathit{Kids}(w) \quad \mathit{Knows}(w, w') \quad \mathit{BehavioralAds}(w)}{\mathit{req\_COPPA}(w)}\ \text{COPPAcomplBehav} \]
If w is a children related website that collects PII on an external website w′ then it can perform behavioral advertising. This rule covers the cases (1) and (3).

\[ \frac{\mathit{Kids}(w) \quad \mathit{Cookie\_sync}(w', w)}{\mathit{req\_COPPA}(w)}\ \text{COPPAcomplCS} \]
It is a special case of COPPAcomplBehav. If w is a children related website and performs cookie syncing with w′ (i.e. it receives cookies from w′) then it can create profiles for its users for behavioral advertising.

(d) COPPA Compliant

Fig. 1. Tracking Derivation Rules
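The rules of Fig. 1 can be read as a Datalog-style program and evaluated by naive forward chaining. The following is only a minimal sketch under stated assumptions, not the authors' implementation (which relies on a theorem prover): facts are plain tuples, the negated premises ¬Block_request and ¬Block_tp_cookie are approximated by absence from two sets, and the function name `derive` as well as the two-site snapshot are hypothetical.

```python
from itertools import product


def derive(facts, block_request=frozenset(), block_tp_cookie=frozenset()):
    """Naive fixpoint over the Fig. 1 rules (illustrative sketch only).

    Assumption: a site counts as "not blocked" when it is absent from the
    corresponding block_* set (closed-world reading of the negated premises).
    """
    known = set(facts)
    while True:
        by_pred = {}
        for name, *args in known:
            by_pred.setdefault(name, set()).add(tuple(args))

        def get(name):
            return by_pred.get(name, set())

        new = set()
        # ImpInc / ImpRed
        for w, v in get("IncludeContent_cookie"):
            new.add(("IncludeContent", w, v))
        for w, v in get("Redirect_cookie"):
            new.add(("Redirect", w, v))
        # IncludeW / RedirectW and Include / Redirect
        for pred in ("IncludeContent", "Redirect"):
            for w, v in get(pred):
                new.add(("Link", w, v))
            for w, v in get(pred + "_cookie"):
                new.add(("Link_cookie", w, v))
        # AccessToW / AccessTo
        for w, v in get("Link"):
            if v not in block_request:
                new.add(("Access", w, v))
        for w, v in get("Link_cookie"):
            if v not in block_request:
                new.add(("Access_cookie", w, v))
        # PropagateAccess / PropagateSync
        for pred in ("Access", "Cookie_sync"):
            for (w, v), (v2, u) in product(get(pred), repeat=2):
                if v == v2:
                    new.add((pred, w, u))
        # 3rdpartyTracking
        visited = {w for (w,) in get("Visits")}
        for w, v in get("Access"):
            if w in visited and v not in block_tp_cookie:
                new.add(("Knows", v, w))
        # Sync
        for w, v in get("Access_cookie"):
            if v not in block_tp_cookie:
                new.add(("Cookie_sync", w, v))
        # SyncTracking: Knows(w', w), Cookie_sync(w', w'') gives Knows(w'', w)
        for (t, w), (t2, u) in product(get("Knows"), get("Cookie_sync")):
            if t == t2:
                new.add(("Knows", u, w))
        # COPPA rules of Fig. 1d
        kids = {w for (w,) in get("Kids")}
        ads = {w for (w,) in get("BehavioralAds")}
        for w, v in get("Knows"):
            if v in kids and w != v:
                new.add(("req_COPPA", v))  # COPPAcomplRelease
                new.add(("req_COPPA", w))  # COPPAcomplCollect
            if w in kids and w in ads:
                new.add(("req_COPPA", w))  # COPPAcomplBehav
        for _, w in get("Cookie_sync"):
            if w in kids:
                new.add(("req_COPPA", w))  # COPPAcomplCS
        if new <= known:
            return known
        known |= new


# Hypothetical snapshot: abc.com embeds content from a children-directed site.
snapshot = {
    ("Visits", "abc.com"),
    ("IncludeContent", "abc.com", "youtube.com"),
    ("Kids", "youtube.com"),
}
derived = derive(snapshot)
print(("Knows", "youtube.com", "abc.com") in derived)
print(("req_COPPA", "abc.com") in derived)
```

Passing `block_tp_cookie={"youtube.com"}` to `derive` removes the Knows fact, which is how a mitigation would be modelled in this sketch.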

ioral advertising. Fig. 1d shows the rules that describe           As previously stated, the predicate Knows describes
when a website should comply with COPPA. The pred-            the potential ability of a website to track users. Thus,
icate Kids(w) describes a website directed to children        the obtainment of the predicate req_COPPA is not by
under 13 y/o, req_COPPA(w) identifies a website that          itself a definitive proof of the need for compliance. How-
should comply to COPPA.                                       ever, it provides an explanation that can trigger further
     COPPAcomplRelease and COPPAcomplCollect de-              investigations by the FTC on which data are actually
scribe conditions (2) and (4): if a children related web-     sent (for example, due to a complaint of a parent). Web-
site w0 allows a website w to track its users then both       sites must then prove that the exchange either did not
websites must comply with COPPA. We impose w 6= w0            occur or did not contain children’s information.
to not fall in the conditions (1) and (3) where COPPA is
not mandatory if used for internal activities. It is impor-
tant to underline that in our model the Knows predicate
implies the employment of a persistent identifiers (e.g.
                                                              5 Decidability and Theorem
cookies). The scenario described in COPPAcomplCollect           Proving
is not always straightforward to be observed due to ex-
change of cookies (e.g. cookie syncing) among websites.       To show the decidability of our construction we rely on
     COPPAcomplBehav captures cases (1) and (3). Our          the relation between logic programs and a fragment of
model describes only personal identifiers, thus we need       intuitionistic logic (in particular Harrop formulae [51]).
to determine if a certain website uses this informa-
tion for external operations (e.g. behavioral advertis-       Theorem 1 (PTIME Knows Decidability). It is
ing). BehavioralAds(w) could be instantiated using the        possible to decide whether the internet snapshot N al-
approach presented by Liu et al. [50] and we leave            lows a website w∗ to know about the user’s visit to an-
for future work the integration with our model. Rule          other website w (N ` Knows(w ∗ , w)) in poly time in the
COPPAcomplCS shows a special case of COPPAcomplBehav          size of the snapshot N .
where a children related website w receives cookies from
another website. Cookie syncing is a known technique utilized for behavioral advertising [4]. It is important to underline that the opposite case (Cookie_sync(w, w′)), in which the children related website w sends cookies to an external website, is already covered by the rule COPPAcomplCollect since Cookie_sync(w, w′) generates a Knows(w′, w) that triggers the mentioned rule.
     Do we really need a formal approach? Consider Youtube and assume Kids(youtube.com) always holds (as sometimes it might be necessary to treat information according to COPPA). We might be tempted to conclude that any website that includes cookies from youtube.com should be COPPA compliant. This informal reasoning seems to imply that any website importing a social button or a video from Youtube should be COPPA compliant. However, by applying the set of rules we previously presented we can instead prove that this is actually incorrect. Suppose Kids(youtube.com), we have an IncludeContent(abc.com, youtube.com) due to the social gadget, and thus by applying rules IncludeW, AccessToW, and 3rdpartyTracking we have Knows(youtube.com, abc.com). At this point, none of the COPPA rules can produce req_COPPA(abc.com). Instead, it is possible to trigger rule COPPAcomplBehav by showing that youtube.com is performing behavioral advertising and thus should comply with COPPA.

Proof. We rely on embedding both snapshot and rules as Harrop formulae.

\begin{align}
G &::= A \mid G_1 \wedge G_2 \mid H \to G \tag{1} \\
  &\phantom{::=}\ \mid G_1 \vee G_2 \mid \forall w\, G \mid \exists w\, G \qquad \text{(not used here)} \notag \\
H &::= A \mid G \to A \mid \forall w\, H \mid H_1 \wedge H_2 \tag{2}
\end{align}

where A is a predicate, G is a goal formula and H is a Harrop formula. An internet snapshot N is encoded as a (large) conjunction which is a Harrop formula. Each rule from §4 is encoded as a goal formula. For example, 3rdpartyTracking can be coded as a Harrop formula:

\[
\overbrace{\forall w\, \forall w'\;
  \overbrace{
    \overbrace{
      \overbrace{\mathit{Visits}(w)}^{G ::= A}
      \wedge \overbrace{\mathit{Access}(w, w')}^{G ::= A}
      \wedge \overbrace{(\mathit{Block\_tp\_cookie}(w') \to \bot)}^{G ::= H \to A}
    }^{G ::= G_1 \wedge G_2 \wedge G_3}
    \to \overbrace{\mathit{Knows}(w', w)}^{A}
  }^{H ::= G \to A}
}^{H ::= \forall w \forall w' H}
\]

    From Theorem A in [51] the pair of the query and the rules LJ from §3.2 are a logic programming language. As we have no disjunction on the right of ⊢ for the query of interest, the rules (∨Ri) responsible for the PSPACE complexity of intuitionistic logic do not apply.
    Since N is finite, there are at most $O(|N|^2)$ different constants as we have at most two arguments for each predicate. Hence, the instantiation of all quantified formulae embedding the rules generates at most $O(|N|^6)$

ground propositional rules (we have at most three variables per rule), even if no optimization can be done (e.g. distinguishing between content delivery networks and actual websites). Thus, the ground instantiation of the rules is poly in the size of the snapshot and also the query of interest can be decided in poly time.

We do not claim that the calculus of tracking for arbitrary formulae including knowledge predicates is tractable. The presence of disjunction on the right would make decidability jump to PSPACE [52] as one could encode QBF as a decision problem in the formula on the right of ⊢. This decision problem could well use Knows(w∗, w) as predicates but they could be replaced with abstract ps and qs and would have no relation with the complexity of inferring visibility relations on the internet. As of now, we do not see a practical need to incorporate disjunction on the right.
    From Theorem 1 it follows that COPPA compliance rules can also be encoded as Hereditary Harrop formulas using the knowledge relations as basic atoms:

Corollary 1.1 (PTIME COPPA Compliance). It is possible to decide whether the internet snapshot N requires a website w to be COPPA Compliant (N ⊢ req_COPPA(w)) in poly time in the size of N.

Next, we show that from a derivation we can reconstruct the connections responsible for the tracking.

Theorem 2 (Map Proofs to Configurations). Given a derivation of Knows(w∗, w) from an internet snapshot N (N ⊢ Knows(w∗, w)), one can extract an essential subset of the configuration Nω ⊆ N such that N \ Nω ⊬ Knows(w∗, w).

Proof. This result follows from the existence of uniform proofs⁶ for the fragment of interest [53] and the existence of a feasible interpolation for intuitionistic logic [54, 55]. Given a derivation of N ⊢ Knows(w∗, w) one can construct a uniform proof and the existence of the interpolant guarantees that we have a set of formulae that only includes constants shared from the antecedent (the internet configuration) and the succedent (the knowledge predicate). Hence we can use the proof to reconstruct the tracking steps and data exchanges responsible for Knows(w∗, w) in a subset N1, as the predicates present in the proof and the interpolant. We can then eliminate N1 from N and try to derive N \ N1 ⊢ Knows(w∗, w). If we succeed, it means there is another way to exchange data, so we extract a new subset N2 and continue the process until, for some Ni (i = 1, 2, ...), no derivation is possible. As a single query is decidable in polynomial time (see Theorem 1), the process terminates after a polynomial number of iterations. The union of all sets Ni is the desired set Nω.

⁶ A finite constructive process applies uniformly to every formula, either producing an intuitionistic proof of the formula or demonstrating that no such proof can exist.

As is immediate from the proof above, one could also stop the search as soon as the first subset of the internet snapshot, N1, responsible for the tracking is identified. This is what we do with a theorem prover. There may be more than one proof because a prover can choose to apply one rule before another one according to a suitable heuristic that may lead to a faster proof search (see GAPT [57] for additional information). Different proofs may also come from the existence of different tracking possibilities on the Internet. The important thing is that one can be found in poly time (see Th. 1).

6 Using the Calculus for Tracking Relations

We formally define tracking relations that are of practical interest through our formal model. We illustrate some of these relations in the practical case of Alexa Top 5, 10, 50, and 100 websites later in §9.

Flow Propagation
Given the sharing of information through redirections, content inclusions, and cookie syncing and given a sequence of visited web sites, we can study how the knowledge about this sequence is distributed on the Internet. This is possible through a graph where we underline edges with predicate Knows(w′, w) to identify the websites that know if a user visited another website. We can map this representation to a Venn diagram where we identify which trackers are directly and indirectly included in the websites visited. We define a mapping between the predicate Knows and the set theory:

\[ \mathit{KnowsUser}(N, w) = \{ w^* \mid N \vdash \mathit{Knows}(w^*, w) \} \tag{3} \]

where KnowsUser(N, w) represents the set of websites w∗ that are able to track a user on the website w in an Internet snapshot N.
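The set KnowsUser of Eq. (3) and the iterative extraction used in the proof of Theorem 2 can be illustrated on a toy snapshot. The sketch below is hedged in several ways: it swaps the proof-theoretic machinery (uniform proofs, interpolation) for plain breadth-first search over Link facts, it ignores cookie syncing and mitigations, it keeps the Visits facts fixed and only removes Link facts, and all function names and `.example` domains are hypothetical.

```python
from collections import deque


def find_proof(links, visited, tracker, site):
    """Return the Link facts on one path site -> tracker, or None.

    Stand-in for a derivation of Knows(tracker, site): with only IncludeW/
    RedirectW, AccessToW, PropagateAccess and 3rdpartyTracking, the fact is
    derivable iff site is visited and tracker is reachable through links.
    """
    if site not in visited or tracker == site:
        return None
    parent = {site: None}
    queue = deque([site])
    while queue:
        cur = queue.popleft()
        if cur == tracker:
            path = []
            while parent[cur] is not None:
                path.append((parent[cur], cur))
                cur = parent[cur]
            return path[::-1]
        for a, b in sorted(links):  # sorted only to make the search deterministic
            if a == cur and b not in parent:
                parent[b] = cur
                queue.append(b)
    return None


def knows_user(links, visited, site):
    """KnowsUser(N, site): every website able to track the user on site."""
    sites = {s for edge in links for s in edge}
    return {t for t in sites if find_proof(links, visited, t, site) is not None}


def essential_subset(links, visited, tracker, site):
    """Union of the subsets N_1, N_2, ... removed until no derivation is left."""
    remaining = set(links)
    essential = set()
    while (proof := find_proof(remaining, visited, tracker, site)) is not None:
        essential |= set(proof)   # N_i: facts used by the current proof
        remaining -= set(proof)   # retry on N minus everything used so far
    return essential


# Hypothetical snapshot: Link(w, w') facts and the visited sites.
links = {
    ("news.example", "cdn.example"),
    ("cdn.example", "tracker.example"),
    ("news.example", "tracker.example"),
    ("news.example", "fonts.example"),
}
visited = {"news.example"}

print(sorted(knows_user(links, visited, "news.example")))
n_omega = essential_subset(links, visited, "tracker.example", "news.example")
print(sorted(n_omega))
```

Each round removes at least one Link fact, so the loop runs at most once per link; stopping after the first round returns only the subset N1, which corresponds to the early-exit variant described above.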

Lowest Tracking Coverage
Our formal model generates relations between websites through Knows predicates for a given N. We compare different mitigations to determine which produces the lowest tracking coverage. A mitigation X in an Internet snapshot N ($N^*_X$) disables some Knows predicates.

Definition 2 (Mitigation Subsumption). Let N be an Internet snapshot and $N^*_X$, $N^*_Y$ two mitigations. We say that the mitigation $N^*_X$ is more effective than $N^*_Y$ iff ∀ pairs (w, w′): $N, N^*_X \vdash \mathit{Knows}(w', w)$ implies $N, N^*_Y \vdash \mathit{Knows}(w', w)$.

Intuitively, any Knows predicate obtained from N applying the mitigation $N^*_X$ is also obtained from N applying $N^*_Y$. We developed a script that automatically generates TPTP input files for Slakje to obtain a proof for Mitigation subsumption. Unfortunately, this definition can rarely be applied. Indeed, if the mitigations modify different parts of the graph of Knows predicates, the results cannot be compared. Thus, we propose a quantitative analysis.
    Given an Internet snapshot N, a mitigation $N^*_X$ is better than a mitigation $N^*_Y$ if both conditions hold:

 – C1: The mitigation $N^*_X$ blocks access to a smaller number of websites compared to the mitigation $N^*_Y$
 – C2: The trackers obtained with $N^*_X$ are smaller in number compared to the trackers obtained with $N^*_Y$

Given an Internet snapshot N, we define its access size and knowledge size respectively as follows

\[ \|N\|_A = \sum_{w \in W} \bigl| \{ (w, w^*) \in W^2 \mid N \vdash \mathit{Access}(w, w^*) \} \bigr| \]
\[ \|N\|_K = \sum_{w \in W} \bigl| \{ (w, w^*) \in W^2 \mid N \vdash \mathit{Knows}(w, w^*) \} \bigr| \]

Site breakage, used to compare mitigations' performance [37], is directly influenced by the access size. We can now compare two mitigations $N^*_X$ and $N^*_Y$.

Definition 3 (Quantitative Mitig. Subsumption). Mitigation $N^*_X$ is quantitatively more effective than mitigation $N^*_Y$ iff the access size of X is larger than (or equal to) the access size of Y, $\|N^*_X\|_A \ge \|N^*_Y\|_A$, and the knowledge size of X is smaller than the knowledge size of Y, $\|N^*_X\|_K < \|N^*_Y\|_K$.

Intuitively the mitigation X reduces the number of trackers that know the user's visited websites more than Y does, while still keeping a larger (or equal) number of accessible sites than Y. The ideal performance would be to drop one accessed site per blocked accessed tracker (i.e. we lose only the tracker itself).
    However, one of the major concerns in terms of privacy is not that a high number of trackers knows about fragments of a user's activity but that few trackers can reconstruct (almost entirely) the activity of a user. For example, Google is present in roughly 80% of the Top 1 million domains [19] and thus has a high tracking coverage. We thus propose an additional definition:

Definition 4 (Mitigation Subsump. per Tracker). Mitigation $N^*_X$ is quantitatively more effective than mitigation $N^*_Y$ against the tracker w iff the access size of X is larger than (or equal to) the access size of Y, $\|N^*_X\|_A \ge \|N^*_Y\|_A$, and the knowledge size of X projected to w is smaller than the knowledge size of Y projected to w.

In other words, the mitigation $N^*_X$ produces a higher reduction of websites where the tracker w can control a user compared to the mitigation $N^*_Y$ while still keeping a larger (or equal) number of accessible sites than Y.
    Unfortunately, it is hard to fulfill both conditions (as we will see in §9.1, Fig. 7): being less tracked means losing more access.

7 Coping With Uncertainty
The model considers the possibility of tracking practices given a static description of the Internet (e.g. from OpenWPM). However, interactions among websites are often non-deterministic [56] and the same applies to the tracking behaviors [7]. For example, 3rd-parties embedded in a website can include different 3rd-parties depending on the results of RTB, thus changing the set of connections observed. Furthermore, trackers can have different behaviors on the same site: for example, a tracker can behave both as analytics (and thus cannot track the users over different websites) and as a 3rd-party tracker [7]. To handle this uncertainty, we extended the model by considering the likelihoods of deriving predicates over different snapshots:

 – A snapshot N is a picture of the Internet at some point in time and it is deterministic by construction (either an inclusion is there or it is not, same for mitigations). We intuitionistically derive either the predicate A, ¬A, or neither (if we do not have

   enough data, e.g. OpenWPM failed to detect an inclusion) but never both.
 – Uncertainty stems from the fact that snapshots change over time. So yesterday A held, the day before yesterday ¬A held, and three days ago neither was derivable. What can we conclude about today if we do not want to resample it?

The overall semantics for N snapshots is:

\[
\langle N_1, \ldots, N_N \rangle \vdash A\,(a, b) \iff |\{N_i : N_i \vdash A\}| = a \cdot N \;\wedge\; |\{N_i : N_i \vdash \neg A\}| = (1 - b) \cdot N
\]

To each derivation we associate a minimal and a maximal likelihood in the same way Ferson et al. [57] derived a probability-box:

 – a: likelihood that A is derivable from the predicates in the Internet snapshots.
 – 1 − b: likelihood that ¬A is derivable from the predicates in the Internet snapshots.

In other words, a is the minimum likelihood that A is true, while b is the maximum.
     The p-box captures the evidence across all snapshots, so it must potentially include evidence that A holds for sure (a · N snapshots) and that it does not hold for sure ((1 − b) · N snapshots where ¬A was found to hold, e.g. when mitigations were detected). The gap between a and b measures the uncertainty, i.e. what we cannot prove because the law of excluded middle does not hold and thus b ≠ (1 − a). To compute the uncertainty, a brute-force solution is to derive the proof for each snapshot and then aggregate the results.
     The likelihood provides information on the Internet as an evolving ecosystem. At any given time, of course, in the snapshot valid at that moment the probability collapses to 0 or 1, in the same way that a tossed coin is always heads or tails.
     To illustrate the extension we considered 17 snapshots from January 2016 to October 2017 obtained from WebCensus (see Section 8) and we computed the likelihood that the derivations responsible for the proof of Knows(revsci.net, qq.com) in Fig. 15 are obtained from the snapshots. Fig. 15 shows the likelihood derivation for a snapshot N_i. In Appendix A.3 we present the exhaustive set of rules used in the proof. The maximum likelihood b for some derivations is 1 because, as we cannot observe a ¬IncludeContent, we cannot exclude that the interaction happened but we failed to observe it.

8 Dataset and Scalability

8.1 Dataset for Internet Snapshot

We evaluate our model with the 10k Site ID Detection (1) 2016 dataset⁷ collected using a stateful instance of OpenWPM⁸. We summarize the tables and columns employed to instantiate the predicates of our model in Fig. 2. The table site_visits contains the list of the Top 10k Alexa visited domains. The table urls contains the set of URLs loaded during the crawling. The table http_responses contains the HTTP responses.

7 https://webtransparency.cs.princeton.edu/webcensus/
8 https://github.com/mozilla/OpenWPM

     From the dataset, we extracted the sequence of redirections and inclusions necessary to instantiate the predicates. From table http_responses we employed visit_id (ids for the top 10k websites), url_id (ids for URLs of the HTTP responses), response_status (HTTP status codes), location_id (in case of a redirection, the id of the new URL to visit; NULL otherwise), and time_stamp (timestamps of HTTP responses).
     We provide an example of the mapping into the model in Appendix A.5. Tab. 5 shows the number of HTTP responses received for the Top Alexa domains. The responses are used to find the sequence of redirections.
     In addition, Tab. 5 shows the number of predicates obtained by applying our model to the Top 10, 20, 30, 40, and 50 Alexa domains of the dataset without any mitigation. After visiting the Top 50 domains the users contacted 190 different websites with more than 6k connections. The number of redirections remains relatively small compared to the number of inclusions observed.
     The number of HTTP responses in Tab. 5 for the Top 30, 40 and 50 domains differs slightly from the values of Link(w, w′) because the crawler failed to collect some HTTP responses during a sequence of redirections, probably due to network problems. However, we can correctly close the sequence even if a response is missing.

Tables from OpenWPM used to instantiate the predicates Visits, IncludeContent, and Redirect of our model (continuous lines). It is also possible to instantiate the predicates IncludeContent_cookie and Redirect_cookie from the tables http_responses and urls (dashed lines), but in Section 9.1 we employed empirically validated pairs from [17].

Fig. 2. Mapping of the Predicates to the Dataset.

Table 5. # of Predicates and HTTP Responses for the Top Alexa

Variables vs Top Domains      10      20      30      40      50
HTTP responses               925    1957    2864    3618    4530
IncludeContent(w, w′)        824    1803    2681    3391    4272
Redirect(w, w′)              101     154     184     229     261
Link(w, w′)                  925    1957    2865    3620    4533
Link_cookie(w, w′)             3       3       3       5       6
Access(w, w′)                925    2272    3636    5024    6382
Access_cookie(w, w′)           3       3       3       5       6
Cookie_sync(w, w′)             3       3       3       7       8

Fig. 3. Fragment of the TPTP Input for Slakje to Prove Knows(fbcdn.net, facebook.com)

     To compute the sequence of redirections, we first extract the sequence of HTTP responses for each visit_id considered, then order the responses using the column time_stamp (to avoid considering intermediate redirections as the beginning of a new connection) and extract the redirections from the set of HTTP responses. Cookie syncing can be detected by analyzing URLs [20] and payloads in POST requests. For illustrative purposes, we use here 200 empirically validated domain pairs performing cookie syncing from [17].
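The redirection extraction described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes a simplified http_responses table holding only the five columns named in the text, integer timestamps, and that "closing" a sequence with a missing response means appending the announced location_id. The helper names (demo_connection, redirect_chains, redirect_pairs) are hypothetical.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory stand-in for the http_responses table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE http_responses ("
        "visit_id INTEGER, url_id INTEGER, response_status INTEGER, "
        "location_id INTEGER, time_stamp INTEGER)"
    )
    rows = [  # deliberately not in chronological order
        (1, 12, 200, None, 3),
        (1, 10, 302, 11, 1),
        (1, 30, 302, 31, 5),  # the response for url 31 was never collected
        (1, 11, 301, 12, 2),
        (1, 20, 200, None, 4),
    ]
    conn.executemany("INSERT INTO http_responses VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def redirect_chains(conn, visit_id):
    """Group the responses of one visit into redirect chains of url_ids."""
    rows = conn.execute(
        "SELECT url_id, response_status, location_id FROM http_responses "
        "WHERE visit_id = ? ORDER BY time_stamp",
        (visit_id,),
    ).fetchall()
    chains = []
    pending = {}  # url_id announced by a redirect -> chain waiting for it
    for url_id, status, location_id in rows:
        chain = pending.pop(url_id, None)
        if chain is None:
            # Not the target of an earlier redirect: a new connection starts.
            chain = [url_id]
            chains.append(chain)
        else:
            chain.append(url_id)
        if 300 <= status < 400 and location_id is not None:
            pending[location_id] = chain
    # Close chains whose final response is missing (assumed interpretation).
    for location_id, chain in pending.items():
        chain.append(location_id)
    return chains


def redirect_pairs(conn, visit_id):
    """Consecutive (source, target) url_id pairs, candidates for Redirect(w, w')."""
    return [
        pair
        for chain in redirect_chains(conn, visit_id)
        for pair in zip(chain, chain[1:])
    ]
```

Ordering by time_stamp is what lets url 11 extend the chain started at url 10 instead of being counted as a new connection. Mapping url_id values to domains would require a join with the urls table, whose columns are not listed in the text and are therefore left out here.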
8.2 Theorem Proving Implementation

We leverage the GAPT tool [58] to generate proofs for the Knows and the req_COPPA predicates. We use the intuitionistic prover Slakje [59] to produce formal proofs based on the rules in our model. We encode the model and the data using the TPTP syntax. The data used for the axioms is generated from actual data obtained using OpenWPM. We instantiated the Kids(w) predicate using the Top 50 Alexa domains in the Kids and Teens category. This approach is fully automated by a script that generates a sequence of axioms from the database, the model, and the conjecture to prove.

Table 6. Timing for Successful and Failed Proof Attempts

Run Slakje to prove Knows(fbcdn.net, facebook.com) and vice versa (not provable) with different visited domains

# visited domains   TPTP input [# axioms]   Successful proof time [sec]   Failed proof time [sec]
                5                      75                           1.4                       1.1
               10                     209                           1.8                       1.6
               50                     867                          10.5                      19.3
              100                   2,343                       1,469.8                    >3,600

Tab. 6 shows the performance with an Intel i7-8750H @ 2.20GHz and 2 GB RAM for the Java VM.
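As a rough idea of what such a generation script might look like, the sketch below emits ground facts and a conjecture in TPTP first-order form (fof). It is only an assumption-laden illustration: the lowercase predicate names (visits, include_content, knows) and the fact_N naming scheme are hypothetical, the actual encoding used in Fig. 3 is not reproduced here, and the rules of the model, which would also have to be emitted as axioms, are omitted.

```python
def tptp_atom(name):
    """Quote a domain name as a TPTP single-quoted constant."""
    escaped = name.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def tptp_formula(predicate, args):
    """Render a ground atom such as visits('facebook.com')."""
    return f"{predicate}({', '.join(tptp_atom(a) for a in args)})"


def build_problem(facts, conjecture):
    """Turn (predicate, arg, ...) tuples into a TPTP problem as one string.

    `facts` become axioms; `conjecture` is the single goal to prove.
    """
    lines = [
        f"fof(fact_{i}, axiom, {tptp_formula(pred, args)})."
        for i, (pred, *args) in enumerate(facts, start=1)
    ]
    pred, *args = conjecture
    lines.append(f"fof(goal, conjecture, {tptp_formula(pred, args)}).")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        build_problem(
            [
                ("visits", "facebook.com"),
                ("include_content", "facebook.com", "fbcdn.net"),
            ],
            ("knows", "fbcdn.net", "facebook.com"),
        )
    )
```

Domain names are single-quoted because an unquoted TPTP constant cannot contain a dot.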
     Fig. 3 shows a fragment of the TPTP input for the prover, where the model is encoded and the relevant data are inserted as axioms. Figs. 15 and 14 in Appendix A.4 show an example of the proof generated by Slakje for Knows(revsci.net, qq.com) and req_COPPA(flashtalking.com) respectively. We evaluated the performance of Slakje by assuming the Top 5, 10, 50, and 100 as visited domains to generate a proof for Knows(facebook.com, fbcdn.net) and vice versa.

8.3 Scalability

Our complexity analysis gives an upper bound of O(|N|^6), which is inadequate for the application of the approach beyond very compact domains. Indeed, our goal is not to provide Internet-scale analysis but third-party verifiable evidence for individual cases where numbers are manageable. For example, users rarely visit more than 100/120 websites [35, 60]⁹, the cookie duration is typically short [61], and cliques, important for COPPA, are relatively small [19]. Thus, scalability in this application is not a problem.

9 Skewed towards tech-savvy users, thus these values are likely upper bounds.

     As shown in Tab. 6, the time required to generate a proof increases with the number of axioms. This number depends on the interactions observed by the user on the visited websites. To improve performance we perform DBMS slicing, extracting only the interactions that are obtained from the user's visited websites (e.g. Top 5, Top 10, etc.) rather than the interactions of the entire Internet, and then perform proof reconstruction. Unsound search followed by proof reconstruction is a new trend in Automated Reasoning [62]. This is the minimum set of interactions (and thus axioms) that must be considered to avoid missing possible tracking practices. For example, if we extract only the interactions generated by visiting a website w and not those of all the other visited websites, we can miss interactions generated from other visits that reach w, as shown in Fig. 4.

Determining which websites w′ know about the visit to facebook.com (Knows(w′, facebook.com)) by analyzing only the interactions generated by the facebook.com visit misses the interactions of adobe.com (visited by the user) with facebook.com, and thus potential trackers.

Fig. 4. The Problem of Determining Knows(w′, facebook.com)

9 Analysis of Mitigations

9.1 Evaluation of Tracking Relations

We evaluated our approach on the dataset previously presented with the filter lists of three widely deployed extensions (Ghostery, Disconnect, Adblock Plus). We consider neither the Firefox third-party cookie blocking feature for unvisited websites¹⁰ nor other Firefox configurations that were either too restrictive (e.g. block all cookies) or overlapping (e.g. Firefox uses the Disconnect blacklist). We used the blacklists of Ghostery, Disconnect, and Adblock Plus from Bashir et al. [3, 17] (the data was collected in 2016 too). We then compared the effectiveness of some of the mitigations (Disconnect and Adblock Plus) in 2016 with their 2019 versions. We also extended the comparison with Privacy Badger and Adblock Plus (enforced with EasyList&EasyPrivacy) in the 2019 scenario.

10 This feature is unable to block Google in certain situations. Firefox employs by default the Google search engine and, thus, establishes connections with Google domains if the website is not accessed directly through its domain name (we assume non-tech-savvy users behave in this way). As a result, Google domains (and all their subdomains) are whitelisted by Firefox and can bypass the third-party cookie block.

Flow Propagation
Fig. 5 shows the graph of Access obtained by applying our rules to the Top 5 Alexa domains without any mitigation. Fig. 6 shows the Venn diagrams obtained by computing the Knows predicates of the model without any mitigation and with the Disconnect mitigation.

Lowest Tracking Coverage
Fig. 6a shows the Venn diagrams for the Top 5 domains without any mitigation ($N_B^* = \emptyset$), while Fig. 6b shows the Venn diagram with the Disconnect mitigation ($N_A^* = \text{Disconnect}$). From Def. 2 we have that $N_A^*$ is more effective than $N_B^*$.

Comparing Different Mitigations
We compared the effectiveness of the filter lists of Ghostery, Disconnect, Adblock Plus (based on EasyList) in 2016 and Disconnect, Adblock Plus (based on EasyList), Adblock Plus (enforced

with EasyList&EasyPrivacy), and Privacy Badger ("trained" on the Top 200 Alexa domains in December 2019) in 2019, based on Def. 3 presented in §6. We extracted the requests from the dataset and recursively applied the filter lists of the different mitigations to the connections established for the Top 5, 10, 50, and 100 Alexa domains. Except for Privacy Badger, which provides the list of domains it "learned" either to block completely (Block_request(w)) or to stop from setting cookies (Block_tp_cookie(w)), for all the other mitigations we rely on their blacklist of domains, i.e. only Block_request(w). The results are normalized with respect to N without any mitigation.

Access predicates obtained without any mitigation in the Top 5 Alexa domains. Several connections are made to different third-party domains. Understanding how many trackers can potentially know about your youtube.com visits is far from trivial (even ignoring any back-office sharing agreement).

Fig. 5. Access Graph Top 5 Alexa Domains

(a) KnowsUser(N, w) without mitigations. Each circle is a visited Top 5 Alexa site and includes trackers which can potentially know about this visit.
(b) KnowsUser(N, w) with Disconnect mitigation. Disconnect significantly limits potential trackers when visiting youtube.com (from 9 to 4) and yahoo.com (from 9 to 3).

Fig. 6. Comparing Tracking Knowledge for Alexa Top 5.

     We employed the filter lists from [3, 17] and the database previously presented to analyze the effectiveness of the mitigations in 2016. We then computed the current effectiveness of the filter lists of Adblock Plus (with and without the addition of the EasyPrivacy list), Disconnect, and Privacy Badger in 2019 with an up-to-date database¹¹ from June 2019. Fig. 7 shows the comparison of the mitigations. The dashed line and the dash-dotted line correspond to two different efficiency levels. The first is a 1-for-1 drop: for each connection that the mitigation blocks, it blocks one tracker. The second represents a 1-for-2 drop: for each connection that the mitigation blocks, it blocks two trackers. Fig. 7a shows that, among the filter lists in 2016, Disconnect is the most aggressive mitigation up to the Top 100 domains, where Ghostery behaves similarly. Adblock Plus is the most permissive mitigation in 2016. However, Adblock Plus shows a large increase in efficacy in its 2019 version. For example, in the Top 100, a 26% reduction of the accessed content generates a 66% decrease in trackers. It is worth mentioning that the filter list of Adblock Plus from [3] is also roughly 46 times smaller than the list in 2019 and that currently there is overlap between EasyList and EasyPrivacy [37]. In contrast, Disconnect does not significantly improve in 2019 with a more restrictive behavior. We found that the combination of EasyList and EasyPrivacy (EasyList&EasyPrivacy) achieves the highest protection at the cost of the most restrictive approach. Finally, Privacy Badger showed a similar level of protection compared to Disconnect and EasyList&EasyPrivacy but with a significantly higher number of connections allowed, due to the balancing of blocking connections and cookies.
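The kind of normalized comparison discussed above can be illustrated with a small sketch. It is a deliberate simplification and not the evaluation code behind Fig. 7: it treats a mitigation as a plain set of blocked domains (Block_request only), propagates the block through the connection graph by reachability from the visited sites, and counts every reached non-visited domain as a potential tracker instead of evaluating the Knows relation of Def. 3. All function names and the toy data are hypothetical.

```python
from collections import deque


def established_edges(visited, edges, blocked=frozenset()):
    """Connections still established when requests to `blocked` domains are dropped.

    A blocked request also removes everything that was only reachable
    through the blocked domain, hence the graph traversal.
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    kept, seen, queue = [], set(visited), deque(visited)
    while queue:
        src = queue.popleft()
        for dst in successors.get(src, []):
            if dst in blocked:
                continue  # the request to dst is never sent
            kept.append((src, dst))
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return kept


def third_parties(visited, edges):
    """Contacted domains that the user did not visit directly."""
    return {dst for _, dst in edges} - set(visited)


def normalized_reduction(visited, edges, blocked):
    """Fractions of connections and of third parties removed by the mitigation,
    both normalized with respect to the unmitigated case."""
    base = established_edges(visited, edges)
    kept = established_edges(visited, edges, blocked)
    connection_drop = 1 - len(kept) / len(base)
    tracker_drop = 1 - len(third_parties(visited, kept)) / len(
        third_parties(visited, base)
    )
    return connection_drop, tracker_drop


if __name__ == "__main__":
    visited = ["a.com", "b.com"]
    edges = [
        ("a.com", "t1.net"), ("a.com", "cdn.net"),
        ("b.com", "cdn.net"), ("b.com", "t1.net"),
        ("t1.net", "t2.net"),
    ]
    print(normalized_reduction(visited, edges, {"t1.net"}))
```

In the toy data, blocking t1.net removes three of five connections and two of three third parties, so the two normalized drops can be placed against reference efficiency lines such as those in Fig. 7.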

11 The database contains the same information as the 2016 database, with small differences in the structure; for example, the 2019 version has a table for the redirections.