Email Tracking: a Study on its Prevalence - KU Leuven (ESAT)

Email Tracking: a Study on its Prevalence

Shirin Kalantari

                                       Thesis submitted for the degree of
                                       Master of Science in Engineering:
                                       Computer Science, option Secure
                                                   Thesis supervisors:
                                                       Prof. dr. ir. C. Diaz
                                                   Prof. dr. ir. F. Piessens
                                                      Prof. dr. B. Berendt
                                                       Dr. J.T. Mühlberg
                                                               M. Juárez
                                                         T. van Goethem

                   Academic year 2018 – 2019
© Copyright KU Leuven

Without written permission of the thesis supervisors and the author it is forbidden
to reproduce or adapt in any form or by any means any part of this publication.
Requests for obtaining the right to reproduce or utilize parts of this publication
should be addressed to the Departement Computerwetenschappen, Celestijnenlaan
200A bus 2402, B-3001 Heverlee, +32-16-327700 or by email
A written permission of the thesis supervisors is also required to use the methods,
products, schematics and programs described in this work for industrial or commercial
use, and for submitting this publication in scientific contests.

Abstract
1 Introduction
  1.1 Structure of the report
2 Background and Literature Review
  2.1 Introduction
  2.2 Email Protocols
  2.3 HTML email
  2.4 HTTP request
  2.5 Rendering HTML emails
  2.6 Commercial Newsletter Emails
  2.7 Conclusion
3 Problem Statement and Methodology
  3.1 Introduction
  3.2 Email tracking for senders
  3.3 Email tracking for third parties
  3.4 Identifying Tracking Images
  3.5 Identifying HTTP resources in email
  3.6 Data
  3.7 Conclusion
4 Implementation and Results
  4.1 Introduction
  4.2 Data
  4.3 Read receipt
  4.4 Identifying personalized tokens
  4.5 Identifying tracking Images
  4.6 Remote Contents in email
5 Discussion
  5.1 Introduction
  5.2 Email read receipt
  5.3 Improving the defence
  5.4 Future work
  5.5 Conclusion
6 Conclusion
A Infrastructures
  A.1 Mail server
Bibliography


Abstract

Unlike web tracking, email tracking has attracted little academic interest. Email
tracking in its most common form is the result of an unwanted HTTP request. In
this thesis we measure the prevalence of different email tracking methods in a corpus
of commercial newsletter emails. We discover that a certain method of email tracking
can provide a persistent user identifier for online trackers. We find that 59% of
senders in our corpus can leak this persistent user identifier. Besides benefiting third
parties, email tracking enables the sender to obtain additional information about
recipients. While senders can be notified about user interactions with their email
through standardized protocols, using email tracking they can receive the same
information on a larger scale and without obtaining explicit user consent. We
discuss existing countermeasures and their effectiveness in resolving the concerns
raised by email tracking.

Chapter 1

Introduction

The Internet plays a vital role in our lives. Each internet service that we use
produces data, and a vast amount of data about users' online interactions is being
collected. Online trackers are interested in collecting this information and, as
previous studies have shown, web tracking is common on popular websites [27].
     While web traffic dominates our modern usage of the internet, electronic mail is
another unavoidable part of our online life. It is estimated that by the end of 2018
more than 280 billion emails were sent daily [3]. It is no exaggeration to say that
our email inboxes hold traces of almost every action we take online. Today,
email systems deliver commercial newsletters, news briefings, social media updates,
password recovery links and, of course, daily-life communications. Preserving the
privacy of our emails is paramount because of the many uses of email in our daily
lives.
    Given the persistence of trackers on the web, it is reasonable to expect trackers
in users' inboxes as well. Previous research on email tracking has shown different
methods that are used for tracking emails and the prevalence of these methods in
practice [26, 79]. Email tracking as discussed in both works is at its core
the result of HTTP requests that are made upon user interactions with an HTML
email. These HTTP requests leak information related to an already sent email.
Summarizing the tracking methods described in these papers, the information that
leads to tracking takes three forms: meta-data, HTTP headers and
personalized URL tokens. Meta-data is inherent to any HTTP request
and includes information such as the user's IP address and timezone. HTTP headers
leak information such as cookies, the referrer address, and the user agent. These
headers are sent as a result of misconfigurations or loose privacy settings of email
clients. Finally, personalized tokens are deliberately included by the sender inside
the URLs of an email. The recipient's email address as part of a URL is an example
of a personalized token that was studied in previous works [26, 79].
     The implications of email tracking as discussed in previous works are twofold. The
first issue, raised in both papers, is that by using HTTP requests the
sender can learn whether, when, and how a user has interacted with an email. Xu et
al. argued that such methods of email tracking can be used to launch long-term
surveillance attacks against the recipient [79]. The second privacy pitfall, discussed
only by Englehardt et al., is the leakage of personalized tokens
to unauthorized parties [26]. The personalized tokens that they studied were based on
the recipient's email address and were considered to be instances of Personally
Identifiable Information (PII).
    The existing countermeasures are not very effective in resolving these concerns.
The most effective (and yet practical) countermeasure can only prevent meta-data and
HTTP header leakage and cannot prevent tracking through personalized
URL tokens [26].
    In this thesis we further clarify the implications and methods of email tracking.
We want to understand the role that email tracking plays in online tracking, and
in particular whether it can provide additional information about a user that cannot
be obtained through well-studied web tracking methods. In a smaller scope, we want
to understand the role of email tracking for email senders. In either case, we are
interested in whether email tracking methods are transparent to end users and whether
users can have control over them.
    Based on our results, we demonstrate:

    • How personalized tokens in email can be used to build a persistent
      method of online tracking. To our knowledge, this method of tracking has
      not been described before. Compared with Englehardt et al., we use a
      broader definition of personalized tokens: they do not necessarily have
      to be based on the recipient's email address and are not considered PII.

    • How HTTP requests in email can inform the sender about user interactions
      with an email in an obscure way. We demonstrate how probable it is for the
      sender to obtain information about both the specific recipient and the specific
      email from an HTTP request.

    • What resources in email generate HTTP requests and whether these resources
      could be replaced by offline alternatives.

    • A novel method for identifying advertisements based on their HTML structure
      and URL query parameters rather than their domain. We used this method to
      identify advertisements that are not detected by existing countermeasures.

    We measure the prevalence of email tracking in a corpus of 343,998 newsletter
emails that were collected for this thesis. The emails came from 1,148 different
senders and were collected from 2018-03-31 to 2018-10-26. Based on the results of
this thesis we propose guidelines that could be used to improve the state
of defence against email tracking.

1.1     Structure of the report
In Chapter 2 we provide background information about email protocols and
review the cause of email tracking and the information that it leaks. We see how
commercial parties use this information to provide customized email services for
their clients. In Chapter 3 we elaborate on the consequences of email tracking and
propose methods for measuring the prevalence of different tracking methods. In
Chapter 4 we report and interpret the results of applying the proposed methods to
our corpus, along with their limitations. In Chapter 5 we assess our results and their
importance in practice. Finally, we summarize our findings and conclude in Chapter 6.

Chapter 2

Background and Literature Review

2.1    Introduction

HTML emails and the HTTP contents in them are the ground for email tracking. In this
chapter, we study HTML emails in more detail. We first discuss email protocols to
understand how HTML is encoded in an email. We describe the HTML rendering
process and identify the parts of an HTTP request that can lead to information leakage.
We outline the existing countermeasures and their effectiveness in preventing
unwanted information leakage. We close this chapter by describing the email tracking
services that are currently used in commercial newsletter emails.

2.2    Email Protocols

Sending an email involves several protocols and pieces of software. Figure 2.1 gives
an overview of the key software and protocols that are used in sending and retrieving
an email. To compose an email the sender uses Mail User Agent (MUA or UA) software.
MUAs can be categorized into two major types: web-mail clients like Gmail web-mail,
and local mail clients like Thunderbird and iOS Mail.
    To send an email the MUA submits the email over the Simple Mail Transfer Protocol
(SMTP) to the sender's mail server (also called Mail Transfer Agent (MTA)). The MTA
is in charge of transmitting emails and relaying them to the recipient's network via
SMTP. Some examples of mailbox providers are Gmail, Yahoo!, and Outlook. Before
routing an email each MTA might perform security checks on the email, such as spam
filtering and malware detection. When an email reaches its destination MTA, the
recipient's MUA can retrieve the newly arrived email over email access protocols. In
this section we elaborate on the details of these protocols.

    Figure 2.1: An overview of steps and protocols that are used to send emails.

2.2.1    SMTP: The Email Transport Protocol
SMTP is used to transmit email objects. The email object consists of two parts: the
envelope headers and the content [46]. The SMTP envelope headers store bookkeeping
information regarding the delivery and transportation of an email. The envelope
headers are intended for MTAs, which use them to transfer the email from the sender's
network to the recipient's network. The SMTP content itself consists of two parts:
the header section and the body. The content header section contains information
that is used by the email client, for instance the email subject. The content headers
are colon-separated key-value pairs. The body contains the email message. SMTP
can only carry email messages represented in US-ASCII. Multipurpose Internet Mail
Extensions (MIME) relax this restriction by defining algorithms that can be used to
encode the email message to US-ASCII.
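
The split of the SMTP content into a header section and a body can be illustrated with Python's standard email package. This is only a sketch: the addresses and the message text are made up for illustration, and the envelope information is not part of the parsed string, since it is exchanged separately between the mail servers.

```python
from email import message_from_string

# Hypothetical message content (header section, empty line, body).
raw_content = (
    "From: newsletter@shop.example\r\n"
    "To: alice@mail.example\r\n"
    "Subject: Weekly offers\r\n"
    "\r\n"
    "Hello Alice, here are this week's offers.\r\n"
)

msg = message_from_string(raw_content)

# Content headers are name/value pairs separated by a colon.
for name, value in msg.items():
    print(f"{name}: {value}")

# Everything after the first empty line is the body.
print(msg.get_payload())
```

The subject that an email client displays comes from the content header section, not from the envelope.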

2.2.2    MIME
MIME refines the email object carried by SMTP to allow for more practical
contents. Using MIME the email object can contain textual contents with character
sets other than US-ASCII, non-textual message contents like file attachments, and
multi-part message bodies [33]. The email object is organized in MIME parts.
Each MIME part has some headers that provide additional information about
the content it encloses. The Content-ID is an identifier for a MIME part that
can be used to reference this MIME part from other parts of the email [33]. To
provide the encoding information, each MIME part uses two mail headers: the
Content-Type header indicates the type of content that is being encoded and the
Content-Transfer-Encoding header indicates the encoding scheme. A MIME content type
is expressed by a type and a subtype. The MIME type is the general description of
the kind of data carried in the MIME enclosure. The subtype offers a more specific
description of the type of enclosed data [33]. Figure 2.2 is an example of a MIME
part that encodes an image.

          Figure 2.2: Encoding an image in email message using MIME.
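
As a rough textual counterpart to Figure 2.2, the following sketch builds a MIME part for an image with Python's standard email package and prints the three headers discussed above. The image bytes, the file name and the Content-ID value are placeholders chosen for illustration and are not taken from the thesis corpus.

```python
from email.mime.image import MIMEImage

# Placeholder bytes standing in for real PNG data (assumption for illustration).
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

# The subtype is given explicitly, so no image type detection is needed.
part = MIMEImage(fake_png, _subtype="png")
part.add_header("Content-ID", "<logo@shop.example>")            # hypothetical identifier
part.add_header("Content-Disposition", "inline", filename="logo.png")

print(part["Content-Type"])               # type and subtype
print(part["Content-Transfer-Encoding"])  # encoding scheme used for the body
print(part["Content-ID"])                 # identifier other parts can reference
print(part.as_string())                   # the full MIME part as US-ASCII text
```

The binary image data ends up as US-ASCII text in the serialized part, which is what allows it to be carried over SMTP.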

2.2.3    Email Access Protocols
Email access protocols are used to transfer email objects from recipients' mail servers
to their MUAs. Post Office Protocol (POP) and Internet Message Access Protocol
(IMAP) are standardized protocols that are deployed in most commercial mail servers
and MUAs. Some mailbox providers use a custom protocol for transferring mail
objects to MUAs. These protocols are often used by the native MUAs that the
mailbox provider has developed. For example, Microsoft previously used DeltaSync and
WebDAV as email retrieval protocols.

POP3 is a simple retrieval protocol described in RFC 1939 [59]. Using this protocol
the MUA communicates with the mail server through the commands defined in the RFC.
The main operations are authentication of the user to the mail
server, message retrieval and message deletion. Once a message is retrieved, the
POP session can be terminated and the MUA can operate on the email offline.

Defined by more than 10 RFCs, IMAP is a relatively complex email retrieval protocol.
Using IMAP each email can have an additional set of flags associated with it. These
flags communicate information about user interactions between the MUA and the mail
server. Table 2.1 lists the IMAP flags specified in RFC 3501 [21].

      Flag        Description
      \Seen       Message has been read
      \Answered   Message has been answered
      \Flagged    Message is "flagged" for urgent/special attention
      \Deleted    Message is "deleted" for later removal
      \Draft      Message has not completed composition (marked as a draft)
      \Recent     Message is "recently" arrived in this mailbox

Table 2.1: IMAP flags as described in the Flags Message Attribute section of RFC 3501 [21].
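
To make the role of these flags more concrete, the sketch below extracts the flags from a single IMAP FETCH response line using the ParseFlags helper of Python's imaplib module. The response line is a hand-written example and was not captured from a real server.

```python
import imaplib

# A FETCH response line as a server might return it (hypothetical example data).
response = rb"12 (FLAGS (\Seen \Answered) UID 4827)"

# ParseFlags returns the individual flags as a tuple of bytes objects.
flags = imaplib.ParseFlags(response)
print(flags)

# The \Seen flag tells the mail server that the MUA has displayed the message.
was_read = rb"\Seen" in flags
print("message was read:", was_read)
```

This information is exchanged between the MUA and the mail server of the recipient only.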

2.3     HTML email
Although the first motivation for using MIME was to support European characters
in email [61], its introduction also enabled sending emails with richer text formatting
like HTML. Using HTML, email messages are no longer restricted to textual content.
Emails could contain well-designed messages with integrated multimedia contents
that render consistently across different mail clients. HTML emails have reportedly
been sent since 1995 [72]. While the main motivation was to have graphics and styling,
HTML emails have also raised concerns about user privacy from early
on [72, 9]. Today the scope of these concerns has been reduced; nevertheless, HTTP
resources that are included in an email give the sender a means to obtain additional
information about an already sent email. HTTP contents are resources that are
hosted on remote servers and are loaded through HTTP requests. In this section
we identify general HTTP contents, different methods of including them and the
consequence of using them in email, and outline the HTML rendering process by

2.3.1    HTTP Contents:
General HTTP resources that could exist in an HTML page are: Cascading Style
Sheets (CSS), scripts, links, and images. CSS is used to attach style to HTML
documents. Scripts are code that run on the client’s machine when the HTML page
loads or upon certain user interaction [8]. Links are one of the prominent features of
HTML and connect one page to another HTML resource [7]. Images provide richer
contents in a page. Because of their basic role, images are the HTTP content most
commonly used for tracking purposes.

CSS: Style sheets are used for styling an HTML page. There are three alternatives
to include CSS in an HTML page: inline, internal and external CSS. Inline CSS is
expressed using the style attribute inside an HTML tag. Internal CSS is expressed
inside a <style> tag within the <head> of an HTML page. External CSS consists of
separate CSS files that are linked in the <head> of the HTML page using the
<link> tag. These three methods are illustrated in Figure 2.3. Among these three
methods only loading external CSS files results in an HTTP request. However, loading
external CSS in email opens an attack surface that can be exploited to change the
contents of an email. The exploit called Ropemaker enables a malicious attacker to
change the content of an email after it is sent, just by changing the content of the
external CSS that is used inside the email [36].

    Figure 2.3: Including CSS in HTML: inline, internal, and external CSS
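
The three ways of including CSS can be told apart mechanically. The following sketch uses Python's standard html.parser module to count inline and internal CSS and to list external style sheets in a small hand-written HTML document. The document and the class name CssFinder are illustrative assumptions, and the sketch deliberately ignores url() and @import references inside CSS rules.

```python
from html.parser import HTMLParser

# Hypothetical HTML email using all three ways of including CSS.
html_email = """
<html>
  <head>
    <link rel="stylesheet" href="https://cdn.shop.example/mail.css">
    <style>p { color: #333; }</style>
  </head>
  <body>
    <p style="font-weight: bold">Hello</p>
  </body>
</html>
"""


class CssFinder(HTMLParser):
    """Collects how CSS is included in an HTML document."""

    def __init__(self):
        super().__init__()
        self.inline = 0      # style="..." attributes
        self.internal = 0    # <style> elements
        self.external = []   # href values of <link rel="stylesheet">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "style" in attrs:
            self.inline += 1
        if tag == "style":
            self.internal += 1
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel:
            self.external.append(attrs.get("href"))


finder = CssFinder()
finder.feed(html_email)
print("inline:", finder.inline)
print("internal:", finder.internal)
print("external (loaded over HTTP):", finder.external)
```

Only the entries in the external list correspond to a file that the rendering engine has to fetch from a remote server.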

Scripts JavaScript is very popular on the web. In an HTML file, JavaScript code can
be internal, written inside a <script> tag, or external, referenced through the
src attribute of a <script> tag. While internal JavaScript code does not by itself
generate an HTTP request, having JavaScript inside email is known to be very
dangerous. Already in 1998, the Reaper vulnerability was found in HTML emails; it
enabled the sender of an email to wiretap the email messages when they were
forwarded by the recipient to another email address [78].

Links Links, which are expressed using the <a> tag, are also remote content. On the
web, links are the most commonly used HTML element [57]. The HTTP request for a link
is made when a user clicks on the link. Since links need an explicit user interaction
they are less hazardous. However, when clicking on links, users are moved to their
browsing context, which allows for all traditional web-tracking methods [26].

Images There are different methods for including an image inside an HTML page, for
example through HTML tags such as <img> or through the CSS background-image
property [6]. When expressed using the <img> tag, images inside an HTML email can be
of three different kinds: external, data URI, and Content-ID (CID). External images
have a remote URL in their src attribute. With a data URI, the src attribute includes
the 'immediate data' [54] directly embedded. CID images come as attachments to emails
and the image src attribute references the MIME Content-ID of the attachment [54].
Figure 2.4 demonstrates how the image in Figure 2.2 can be referenced within an
<img> tag inside an email.

Figure 2.4: Including a CID image: The image from Figure 2.2 is referenced within
                                an <img> tag.
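
The three kinds of <img> sources differ in the URL scheme of the src attribute, so they can be distinguished with a few lines of Python. The sketch below is an illustration under that assumption. The HTML snippet, the helper classify_src and the class ImageCollector are hypothetical and are not the tooling used in this thesis.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def classify_src(src):
    """Map an <img> src value to one of the three kinds described above."""
    scheme = urlsplit(src.strip()).scheme.lower()
    if scheme in ("http", "https"):
        return "external"   # fetched over HTTP when the email is rendered
    if scheme == "data":
        return "data URI"   # image bytes embedded directly in the attribute
    if scheme == "cid":
        return "CID"        # refers to a MIME part by its Content-ID
    return "other"


class ImageCollector(HTMLParser):
    """Collects (kind, src) pairs for every <img> tag that has a src."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append((classify_src(src), src))


# Hypothetical HTML part of an email.
html_part = """
<img src="https://tracker.example/open.gif?id=42" width="1" height="1">
<img src="cid:logo@shop.example">
<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=">
"""

collector = ImageCollector()
collector.feed(html_part)
for kind, src in collector.images:
    print(kind, src[:40])
```

Of the three images in this example, only the first one causes an HTTP request when the email is displayed.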

     Among these three methods, external images are the preferred method for
including images in email. Although CID and data URI images prevent HTTP-based
email tracking, the security pitfalls of these two methods make external images
the advisable choice. The data URI scheme can be used to launch phishing attacks in
email [47, 55]. CID images also have a downside, since they can negatively
affect the delivery of an email. With CID images included in an email message,
the chance of getting blocked by spam filters increases. Most spam filters use the
text of the email message to classify it as spam [64, 16]. To circumvent these textual
filters, spammers can place their messages inside images. This is called
image spam [16]. Figure 2.5 is an example of an image spam email. For this reason,
having a lot of embedded images in an email alerts spam filters that the email might
contain image spam. Gmail uses optical character recognition (OCR) techniques
to extract the text from an image and runs its spam filters on it [11]. Email
service providers advise against CID and data URI images and recommend external
images instead [53, 42, 58].

2.4     HTTP request
HTTP requests are the main sources of tracking in email. The privacy concerns of
HTTP requests are due to meta-data, HTTP headers and the personalized URL
tokens that they convey.

2.4.1   Meta-data and HTTP headers
HTTP requests in email can be generalized in the following form:

                      GET request-URL ∗ (request-header)

Figure 2.5: Some examples of image spams: Each email is structured in one image.
The textual contents are part of the image. The image is taken from the study by
         Ketari et al. A Study of Image Spam Filtering Techniques[44].

The GET method indicates a retrieval request, request-URL is the address of the
remote resource, and request-header stands for zero or more HTTP headers that the
MUA uses to include additional information. Some HTTP headers are:

   • User-Agent: A header containing operating system and MUA specifications
     like vendor and version.

   • Cookie: A header containing information previously set by the server.

   • Referer: A header carrying the address of the page from which the HTTP
     request was made.

   • Date: A header indicating the time and date at which the request was made
     (according to client’s machine).

HTTP is an application layer protocol which relies on lower-layer protocols
such as TCP and IP. These protocols also carry meta-data, for instance the IP address,
ports and packet size.
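
From the point of view of the receiving server, the request line and the headers described above can be read with a few lines of Python. The raw request below is hand-written for illustration: the host name, the query parameter and the header values are assumptions, and parse_request is a hypothetical helper rather than a complete HTTP parser.

```python
# A hand-written request as a tracking server might receive it (hypothetical).
raw_request = (
    "GET /pixel.gif?uid=12345 HTTP/1.1\r\n"
    "Host: tracker.example\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64) ExampleMail/1.0\r\n"
    "Referer: https://webmail.example/inbox/\r\n"
    "Cookie: session=abc123\r\n"
    "\r\n"
)


def parse_request(raw):
    """Split a raw HTTP request into method, request URL and headers."""
    head = raw.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    method, url, _version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        # Only the first colon separates the header name from its value.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, url, headers


method, url, headers = parse_request(raw_request)
print(method, url)
for name in ("user-agent", "referer", "cookie"):
    print(f"{name}: {headers.get(name)}")
```

The IP address and the ports are not part of this text. The server obtains them from the underlying network connection.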

Privacy considerations
HTTP headers and meta-data can be used to obtain additional information
about the recipient. At the TCP/IP level, the IP address conveys information
about the approximate location and timezone of the recipient [41, 74]. The HTTP
headers that are sent in the request reveal identifying information. The User-Agent
header, when combined with other meta-data, can contribute to uniquely identifying
the recipient [79]. When using a web-based MUA, Referer and Cookie headers
might be sent along with the request, which could compromise the privacy of the
recipient. The Referer header, as specified in RFC 7231, helps servers to identify the
source of their traffic and allows user agents to generate back links [32]. When using
a web MUA this header might point to the URL of the web mail, which might contain
session information [32, 26]. The Cookie header was originally designed to carry user
identification information. When it is sent along with an HTTP request, the server can
associate the request with previous requests of the same user on the web. Figure 2.6
contains an example of an HTTP request for an image made by a web MUA. The request
contains Cookie, User-Agent, and Referer headers.

Figure 2.6: Loading an image through the Outlook web client. The request contains
Cookie, User-Agent, and Referer headers that can be used by the server for identifying
the recipient.
Blocking remote contents Most email agents can be configured to block HTTP
contents in an email. In MUA terminology this is referred to as blocking
remote contents. This countermeasure blocks requests that are made by the rendering
engine upon loading a page. What is considered remote content varies
between MUAs. For example, Thunderbird's remote content blocking disables
automatic loading of external CSS files and images [17], but in order to disable links
other settings must be changed¹. Gmail does not have any setting for disabling
links, but it can be configured to block images, while external CSS is strictly
prohibited in Gmail [17]. When users explicitly decide to load HTTP contents, either
by clicking on links or by enabling remote contents, this countermeasure is no
longer effective.

¹ The value of network.protocol-handler.external-default must be set to false in the preference file.

Content proxies This is a countermeasure deployed by Google in Gmail [38].
With a proxy, the request for remote contents carries the proxy's properties instead
of those of the user's MUA. In this setting meta-data such as the IP address and
timezone is protected. The HTTP headers are also protected since the proxy does not
have access to the user's web browsing cookies. However, content proxies still leak
the time of opening an email since they do not pre-fetch contents. They do not cache
the images or change the caching policy of the response either, so each time a user
opens an email, a new request will potentially be made.

2.4.2   HTTP request: Personalized URL
URLs in email can have identifying tokens embedded in them. Based on these
tokens, the sender can relate the request to one specific recipient. The study by
Englehardt et al. [26] revealed that in their corpus of newsletter emails, 29% of emails
had at least one link with personalized tokens. They had a predefined set of values
for personalized tokens for each recipient and searched links for instances of such
values. These personalized tokens were considered to be PII and were limited to either
the recipient's email address or some hashing and encoding schemes applied to the
email address. Formula 2.1 lists the encoding schemes that were used in
their work. Figure 2.7 shows two examples of URLs with such tracking tokens. In
Figure 2.7 (a), the sender used the recipient's email address as the user-identifying
token, and in Figure 2.7 (b), as the name of the query parameter suggests, a hash of
the email address has been used (in this case MD5).

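\[
\begin{aligned}
PII_0 &= e \\
PII_1 &= E(e)    & \qquad e &= \text{recipient email address} \\
PII_2 &= H(e)    & \qquad H &= \{\text{MD5}, \text{SHA1}, \text{SHA256}, \text{SHA384}, \ldots\} \\
PII_3 &= H(E(e)) & \qquad E &= \{\text{URL encoding}, \text{Base64}, \text{Base32}, \ldots\} \\
PII_4 &= E(H(e))
\end{aligned}
\]

\[
PII \in \{PII_0, PII_1, PII_2, PII_3, PII_4\} \tag{2.1}
\]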

Figure 2.7: Examples of images with user identifying tracking tokens inside emails.
(a) The email address is used as a query parameter. (b) The MD5 digest of the email
                        address is used as a query parameter.

    Since the authors considered these tokens as instances of PII, their focus was on
leakage of these tokens to third parties.
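
The token set of Formula 2.1 can be enumerated programmatically for a given email address. The following sketch is an approximation of this idea and is not the implementation of Englehardt et al.: the selection of hash functions and encodings, the use of hexadecimal digests as hash output, the helper names candidate_tokens and find_tokens, and the example URL are all assumptions made for illustration.

```python
import base64
import hashlib
from urllib.parse import quote

# E: encodings (bytes -> bytes); H: hash function names known to hashlib.
ENCODERS = {
    "urlenc": lambda b: quote(b, safe="").encode("ascii"),
    "base64": base64.b64encode,
    "base32": base64.b32encode,
}
HASHES = ("md5", "sha1", "sha256", "sha384")


def hexdigest(name, data):
    # Assumption: hash outputs appear in URLs as lowercase hexadecimal strings.
    return hashlib.new(name, data).hexdigest().encode("ascii")


def candidate_tokens(email):
    """Enumerate PII_0 ... PII_4 for one email address."""
    e = email.encode("utf-8")
    tokens = {"e": e}                                        # PII_0
    for en, E in ENCODERS.items():
        tokens[f"{en}(e)"] = E(e)                            # PII_1
    for hn in HASHES:
        tokens[f"{hn}(e)"] = hexdigest(hn, e)                # PII_2
        for en, E in ENCODERS.items():
            tokens[f"{hn}({en}(e))"] = hexdigest(hn, E(e))   # PII_3
            tokens[f"{en}({hn}(e))"] = E(hexdigest(hn, e))   # PII_4
    return {name: tok.decode("utf-8") for name, tok in tokens.items()}


def find_tokens(url, email):
    """Return the names of all candidate tokens that occur in the URL."""
    return [name for name, tok in candidate_tokens(email).items() if tok in url]


# Hypothetical tracking URL carrying the MD5 digest of the address.
email = "alice@mail.example"
url = "https://shop.example/open.gif?md5=" + hashlib.md5(email.encode()).hexdigest()
print(find_tokens(url, email))
```

Because URL encoding leaves a hexadecimal digest unchanged, a single URL can match more than one candidate name. A search of this kind also only finds tokens that are derived from the email address in one of the predefined ways.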

Request blocking: In addition to blocking remote contents, another countermeasure
for preventing leakage of PII tokens is applying URL filtering methods.
Englehardt et al. demonstrated that by using ad-blocker and tracking-blocker
extensions the number of third parties receiving PII tokens is reduced roughly by
half [26].

2.5         Rendering HTML emails
HTTP requests in email can be categorized into two main types: explicit requests
that are made upon certain user interactions and implicit requests that are made
by the MUA. In terms of information leakage and privacy impact both kinds of request
are the same. However, implicit requests are more hazardous since they are made
without user involvement as soon as the client renders an email. In order to display
HTML emails, modern MUAs take two steps: preprocessing and HTML rendering [67, 13].
    In the preprocessing step, based on the MUA's policy, some HTML tags are removed
(HTML stripping) and certain elements are overwritten (HTML overwriting).
HTML stripping removes HTML tags that enable serious attacks in email. For
example, the <script> tag is removed because of the Reaper vulnerability, which leads
to wiretapping of emails. Data URIs are removed in some cases since they can be
used to launch phishing attacks [47]. Figure 2.8 demonstrates such a phishing
attack against Gmail in 2017 [55]. Using a data URI the attacker encodes the HTML code
of a phishing site into the email. When the user opens this email, the embedded code is
rendered and can be used to steal sensitive information. To prevent this attack,
mail clients such as Gmail and Yahoo! (web, mobile (iOS and Android)) strip data
URIs from email [17]. The HTML overwriting step implements the MUA's remote content
blocking functionality. The client overwrites the HTML properties of the contents
that it blocks so that the rendering engine does not request them (see Figure 2.9).
    In the rendering step the HTML part is translated into visual elements. The MUA's
choice of how to deploy the rendering engine affects the information leakage of HTTP
requests. Depending on this choice, MUAs can be categorized into local and web-based
types. A local MUA comes with its own rendering engine while a web-based MUA
uses the web browser's engine for rendering HTML emails.
When a web-based MUA like Gmail displays an email, the email becomes part of
Gmail’s web page. If the email has an HTML part, this HTML code is inserted inside
the webmail’s HTML code. The browser cannot distinguish that these two pieces of
HTML come from different sources. As a result, requests for content inside the email
include HTTP headers such as Cookie. Local MUAs use a separate rendering engine
which does not have access to the user’s web browsing information. Hence HTTP
requests made from a local MUA cannot include the user’s web cookies.


2.6. Commercial Newsletter Emails

Figure 2.8: Using a data URI, a fake Gmail login page is displayed when the
recipient opens the phishing email. Image is taken from a blog post by Mark Maunder

Figure 2.9: HTML overwriting: Blocking remote images in Outlook web, the src
                attribute of a remote image is overwritten.

2.6       Commercial Newsletter Emails
To better understand email tracking, we study the tracking features that are
currently offered as a service. Companies that send out newsletter emails often use
an Email Service Provider (ESP) for delivering emails and managing their mailing
lists. Mailchimp is an example of a well-known ESP. Besides offering services like
email templates, list management and email delivery, ESPs also track each campaign
and give reports to their customers. The following list contains the different
aspects of an email message that major ESPs track and report to their customers [28]:

     1. Email delivery and bounce rates.

     2. Open rate.

     3. Click-through rate.

     4. Opt-out rates.

     5. Spam complaints.

     6. Metadata (IP address, time zone and devices).

     7. Users who forward email.

Email delivery and bounce rate: Marketers monitor the delivery status of each
campaign and the placement of their emails in users’ inboxes [63]. ESPs report cases
where emails fail to reach their destination inboxes and divide the failures into
two categories: soft bounces and hard bounces [52, 77]. Hard bounces happen when
there are technical problems that prevent email delivery. The reasons include the
mail server being down, typos in the email address or network problems [52, 77]. In
such scenarios the sending mail server fails to deliver the email and might receive,
from the recipient mail server or from a transport system, a bounce message with a
status code that explains the problem, like those described in RFC 3463 [75].

Open rate and click-through rate: An HTTP read receipt is one of the implications
of email tracking. If the MUA loads remote resources as soon as the recipient reads
a commercial email, the resulting request can operate as a read receipt for the
sender. We elaborate on the properties of requests that can operate as a read
receipt in Chapter 3. ESPs use requests for remote images as a means to obtain open
and click-through rates. The click-through rate indicates how successful a campaign
is in terms of engaging its recipients [28]. ESPs store the links that each
recipient has clicked on, which enables senders to infer the user’s interests and
create personalized offers.

Opt-out rates: Anti-spam regulations and online authorities mandate that marketing
emails provide an opt-out option for recipients [15]. The CAN-SPAM Act is an example
of such a regulation that is currently in place in the US [35]. There are two
opt-out options in marketing emails: an unsubscribe link and the List-Unsubscribe
header. With an unsubscribe link, the user locates and clicks on a link that is
included in the newsletter email. The second method for opting out is the
List-Unsubscribe header as specified in RFC 2369 [15]. Both methods serve the same
purpose, but List-Unsubscribe is intended to be used by mail client software, which
uses this header to provide an unsubscribe button in its user interface. An example
of such a button is shown in Figure 2.10. Content providers track opt-out rates to
improve their future emails and have fewer users unsubscribe from their mailing
list.
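
To make the header-based opt-out concrete, the following minimal sketch shows how a mail client could extract the unsubscribe targets from a raw message. It assumes the angle-bracket, comma-separated URI syntax of RFC 2369; the function name and the sample addresses in the usage note are hypothetical.

```python
import re
from email import message_from_string


def unsubscribe_targets(raw_message: str) -> list[str]:
    """Return the URIs listed in the List-Unsubscribe header, if any."""
    msg = message_from_string(raw_message)
    header = msg.get("List-Unsubscribe", "")
    # RFC 2369 encloses each URI in angle brackets, separated by commas.
    return re.findall(r"<([^>]+)>", header)
```

For a message carrying `List-Unsubscribe: <mailto:unsub@example.com>, <https://example.com/unsub?u=123>`, the function returns both URIs in order. Note that such URIs can themselves carry personalized tokens.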

Figure 2.10: Gmail unsubscribe button: The image is taken from the official
        Google+ post, announcing the unsubscribe button in 2014 [39].

Spam complaints: MUAs often have a spam report button that users can use to
mark emails as spam. After such a report the mailbox provider learns of a
human-identified spam email that its spam filters failed to catch. To fight spam
globally, the mailbox provider sends this information to other ISPs and mailbox
providers through a so-called Complaint Feedback Loop [31]. As stated in RFC 6449:
“Senders of bulk, transactional, social, or other types of email can also use this
feedback to adjust their mailing practices, using Spam Complaints as an indicator of
whether the Recipient wishes to continue receiving email” [31]. So in addition to
ISPs, the senders might also receive this feedback. In this way they can learn what
kind of content a user identifies as spam or junk and improve their content. Sources
like [28] suggest that the ESP receives only aggregated information about the number
of complaints per campaign. About the information contained in the feedback report,
RFC 6449 states:
    “[...] the Recipient’s or reporter’s Email Address and IP address may be
    categorized as private data and removed from the feedback report that is provided
    to the Feedback Consumer. Privacy laws and corporate data classification standards
    should be consulted when determining what information should be considered
    private.”
    Looking at the feedback loop policies of mailbox providers, AOL and Microsoft
redact the user’s email address from the abuse reports that they send to ESPs or
registered email senders, but they keep the complaint message intact [25, 24]. As we
already discussed in the Introduction chapter, newsletter emails contain tracking
tokens, and hence in such cases ESPs can use these tokens to find the email address
of the user who made the spam complaint. Some mailbox providers send the report in
full detail, without modifications such as removing the email address [80]. Gmail
only sends feedback loops to a limited number of ESPs [37]. In addition, to receive
feedback loops from Gmail the email should contain a special feedback-id header.
This header contains a unique sender identifier and three other optional fields that
the sender can use to embed identifiers of their choice (see footnote 4). Figure 2.11
shows an example of a feedback-id header that was used in one of the emails in our
corpus. When a user reports an email as spam in Gmail, Gmail uses this header to
send the feedback report to the ESP that sent that email.
                 Figure 2.11: An example feedback-id header in our corpus
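
As a small illustration of how such a header can carry identifiers, the sketch below splits a feedback-id value into its optional fields and the sender identifier. It assumes a colon-separated layout with the sender identifier in the last position; this layout, the function name and the sample values are assumptions for this example and should be checked against Gmail’s documentation.

```python
def parse_feedback_id(value: str) -> dict:
    """Split a feedback-id value into optional fields and sender identifier.

    Assumed layout: up to three optional fields followed by the mandatory
    sender identifier, all separated by colons.
    """
    parts = [part.strip() for part in value.split(":")]
    *optional_fields, sender_id = parts
    return {"optional_fields": optional_fields, "sender_id": sender_id}
```

With the hypothetical value `campaign42:customer7:newsletter:acme`, the sender identifier is `acme` and the three optional fields are the campaign, customer and mail-type identifiers chosen by the sender.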

    Most users are not aware that when they hit the report-as-spam button, such
information is propagated to different parties, potentially with their email
address attached to it.

Metadata: ESPs use metadata to report the approximate location of each recipient
and the devices and software they use for reading emails. Marketers use this
information to customize their campaigns for different platforms. The time at which
each recipient reads the newsletter email is also monitored. ESPs provide services
to deliver emails according to each recipient’s time zone [12, 19].

Forward information: Marketers can use the ESP’s services to include a link
inside each email that the user can use to forward the email to friends. The
forwarded email is sent via the ESP and contains a link that points to the
web-hosted version of the same email. The ESP does not add the new recipient’s email
address to the mailing list, but it does report the users who forward the email
through the forward-to-a-friend link [28, 2]. Figure 2.12 shows the scenario of
forwarding an email using a forward-to-a-friend link.

Figure 2.12: Following the forward to a friend link for one of the emails in our

         4 Google’s example suggestions were campaign and customer identifiers.


2.6.1    Newsletters Analytic Features
Users register for newsletter emails by filling in a subscription form. Before receiving
newsletter emails, they receive an email with a confirmation link that has to be
clicked on for the subscription to be finalized. By clicking on this link subscribers
give their consent to receive subsequent emails from this sender.

URL parameters

Marketers use different mediums when they promote a campaign. They often want to
compare the effectiveness of each medium in attracting users to the promoted
campaign. Take the example of a website that wants to promote a specific product,
which has its own page on the website. The company promotes this product on its
Facebook page, in its weekly newsletter emails, in banners on its website, and
through online advertisements. Some users will purchase this product, and the
website wants to know which campaign resulted in each purchase. To collect this kind
of information, the company embeds a set of parameters in the URL that points to the
product page in each medium. There is a common set of query parameters for this
purpose, called Urchin Tracking Module or Urchin Traffic Monitor (UTM) parameters.
UTM parameters were introduced by and are used by Google Analytics [40]. Five query
parameters can be added to a URL; here we explain the three mandatory ones. When a
user lands on the promoted web page, the query parameters are sent to Google
Analytics for reporting purposes [40].

   1. utm_source: This parameter identifies the entity (advertisement) that initiated
      the click.

   2. utm_medium: This parameter identifies the medium.

   3. utm_campaign: This parameter identifies the name of the campaign.
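
The following minimal sketch shows how the three parameters above could be appended to a product link before it is placed in a newsletter. The function name, the product URL and the parameter values in the usage note are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_utm(url: str, source: str, medium: str, campaign: str) -> str:
    """Append the three mandatory UTM parameters to a URL."""
    parts = urlsplit(url)
    # Keep any query parameters that are already present.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Calling `add_utm("https://shop.example/product?id=7", "newsletter", "email", "spring_sale")` yields a link whose query string identifies the weekly newsletter as the origin of the visit.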

A/B Testing

Marketing emails are subject to A/B testing, in which senders build different
variations of a single campaign and compare users’ interaction with the different
versions to see which variation performs better. The motivation behind A/B testing
is relatively simple: marketers want to make sure that each email they send triggers
maximum user engagement. Some ESPs like MailChimp provide A/B testing as a service
to their customers [1]. The result of A/B testing on emails is that not all users
receive the same version of an email. As a result users subscribed to the same
newsletter might get emails with slightly different content, subject line or HTML


2.7    Conclusion
With HTML emails being the root cause of email tracking, in this chapter we
outline the email protocols that enable sending HTML emails. Email tracking is
tightly related to the HTTP requests that are made from HTML emails. We explain
the types of information that can be obtained from these requests and the existing
countermeasures. We list the different HTTP contents that can exist in an HTML
email and their alternatives, and discuss that remote images and links do not have
a straightforward offline alternative. We elaborate on the HTML rendering process
of MUAs and discuss how the choice of rendering engine can affect the information
leaked by HTTP requests. To expand our understanding of the email tracking methods
that are commonly used, we compile a list of the analytic services that ESPs provide
to their customers.

Chapter 3

Problem Statement and Methodology

3.1       Introduction
In the previous chapters, we recognized email tracking as the result of HTTP
requests. We broke down the information in an HTTP request into metadata, HTTP
headers and personalized URL tokens, and discussed the effectiveness of existing
countermeasures in blocking these parts.
    Summarizing the email tracking methods discussed in the previous chapter, we
notice two different trends in email tracking. On the one hand, commercial senders
actively monitor user engagement with their emails. For commercial reasons they are
interested in increasing user interaction with their emails, and they deploy email
tracking to obtain analytic metrics for them.
On the other hand, the paper by Englehardt et al. demonstrated how common it is
for third parties to receive PII about recipients through email tracking.
In this chapter we elaborate on these two privacy concerns of email tracking.

3.2       Email tracking for senders
Newsletter emails are reported to bring the highest return on investment for
senders. To increase engagement, senders tune their campaigns, for example by
sending emails at a time at which users are more likely to open them. One particular
company, Phrasee, offers machine-generated subject lines that are promised to
boost email open rates. Senders obtain engagement information through the HTTP
requests that are made for remote content in their emails. HTTP requests for
resources that are included in emails are assumed to provide information about user
interactions with emails. In this setting, HTTP requests for remote content can work
as a read receipt for the sender.
    However, an HTTP request can work as a read receipt only when it contains
information about both the recipient (who) and the email that was read (which email). While

3. Problem Statement and Methodology

Figure 3.1: Eve sends her email with personalized remote images, however she uses
                         the same image in her emails.

obtaining user-identifying information from HTTP requests has been discussed in
previous work, methods for obtaining email-identifying information from an HTTP
request have not been studied. To better illustrate the importance of email
identifiers for HTTP read receipts, we use an example scenario:
    Eve has sent two emails, e1 and e2 , to Alice and Bob. Eve is interested in
knowing whether and when Alice and Bob read her emails. For this reason she has
inserted a remote beacon image i in her emails, whose URL contains a user query
parameter (Figure 3.1, a). Eve personalizes the user query parameter of this URL
based on the recipient (Figure 3.1 b, c).
    For reading emails Alice and Bob use an email client like Gmail, which loads
external images by default but uses a content proxy. Alice opens e1 and her email
client makes a request for image i. Eve notices this request and wants to associate
it with one recipient (Alice or Bob) and one email (e1 or e2 ) (Figure 3.1 d):

     • (who) Eve uses the personalized token in the URL of the request, to find out
       that Alice has made the request (meta-data and HTTP headers cannot be used
       since the request is made by the proxy).

     • (which email) Eve cannot determine which of her emails has been read by
       Alice since Eve has used the same image in both emails. However, if Eve uses
       different images in e1 and e2 (Figure 3.2), she can identify the email based on
       the image that is being requested.


Figure 3.2: Eve sends her email with personalized remote images, this time she
                     uses different images in her emails.

     With this example we want to clarify that, among all HTTP requests, we consider
requests that can potentially identify both the user and the email to be a privacy
hazard. Senders can obtain message-identifying information from HTTP requests if
they include unique resources (or unique URLs) in their emails. The request for
these unique resources identifies the email that is being opened. User-identifying
information can be obtained from metadata (IP), HTTP headers (Cookie) or URL
tokens (recipient email address as a token).
If we consider the way images are loaded by MUAs, we can further relax the
constraint on the uniqueness of resources. We suggest that even with one unique
image per email the sender can obtain an email identifier. We leverage the
request-all or block-all policy of email clients with regard to remote images: an
email client either blocks all the images in an email or loads all of them. Under
this setting, with only one unique image in an email, the sender can learn that the
email has been read whenever the email client loads images. In the context of our
example, when Alice decides to load images in e1 , there is no way she can prevent
the request for i1 . In order to see how often senders can obtain email-identifying
information from HTTP requests we take the following steps:

   • For emails in one inbox, get all emails that are sent from the same sender.

   • For emails in this set, extract all external images that use the domain of the
     sender.

   • If for each email there is at least one image that has not been used in other
     emails, the sender can use the request for external images to obtain information
     about the email being read.
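
The steps above can be sketched as follows. The input format (a mapping from an email identifier to the image URLs found in that email, all from one sender in one inbox), the function name, and the rule that subdomains count as the sender’s domain are assumptions made for this illustration.

```python
from urllib.parse import urlsplit


def emails_with_unique_image(emails, sender_domain):
    """Return ids of emails with at least one sender-hosted image URL
    that appears in no other email of the same sender.

    emails: dict mapping an email id to the list of image URLs it embeds.
    """
    def on_sender_domain(url):
        host = urlsplit(url).hostname or ""
        # Assumption: subdomains of the sender domain also count.
        return host == sender_domain or host.endswith("." + sender_domain)

    images = {
        email_id: {url for url in urls if on_sender_domain(url)}
        for email_id, urls in emails.items()
    }
    identifiable = set()
    for email_id, own in images.items():
        others = set()
        for other_id, other_images in images.items():
            if other_id != email_id:
                others |= other_images
        if own - others:  # at least one image is unique to this email
            identifiable.add(email_id)
    return identifiable
```

If the returned set contains every email of the sender, the sender can use image requests as an email identifier for all of them.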


3.3         Email tracking for third parties
As we discussed in the previous chapter, Englehardt et al. claim that PII URL tokens
that reach third-party domains raise privacy concerns [26].

     1. While they claim that leakage of URL tokens to unauthorized parties happen
        when the request reaches a third party domain, we find a specific case, in
        which tokens in the URL reach an unauthorized party through the sender
        domain (first party).

     2. We apply this claim in this thesis more widely by:

           (a) Highlighting privacy concerns of personalized tokens in URLs that are not
               considered as PII.
           (b) Demonstrating that personalized URLs could impose a privacy risk even
               when they reach sender of an email.

3.3.1        Leaking through sender domain
LiveIntent is a Supply Side Platform (SSP) that enables publishers to earn revenue
by managing the advertising space inside their emails [51]. LiveIntent is among the
trackers that received the highest number of PII tokens in the paper by Englehardt
et al. [26]. Its advertisements are served through a so-called LiveTag, which is a
clickable remote image. The domain that is used in the URLs of a LiveTag belongs to
the first party (the email sender’s domain). To serve advertisements through a
LiveTag, senders have to dedicate a subdomain to LiveIntent through a DNS CNAME
record. This subdomain is hosted by LiveIntent and redirects to LiveIntent content.
Figure 3.3 shows a LiveIntent advertisement in an email and its LiveTag HTML, which
uses the sender domain.
     For LiveIntent advertisements the distinction between first party and third
party based on domain does not work: although the URL of the advertisement is on a
subdomain of the first party, it actually serves third-party content.

3.3.2        Generalizing Methods:
URL tokens that are not PII
In their paper, Englehardt et al. argued that hashing and encoding schemes applied
to an email address do not provide enough privacy protection. They still considered
these transformations of an email address to be PII. Indeed, we find a later work
supporting the claim that hashing PII is not enough to preserve its privacy [23].
However, whether an email address itself is PII is debatable. Some sources consider
an email address to be PII [56, 69], while other research mentions regulations
that explicitly exclude email addresses from being PII [60].


Figure 3.3: LiveIntent advertisement in an email and its HTML code snippet.
   The advertisement URLs use the sender domain.

    We argue that even if we do not consider these tokens as PII, the privacy risks
of personalized URLs remain the same. Personalized tokens provide additional
information to the third parties that are involved in loading the resource.
    In addition to being a transformation of the email address, personalized tokens
can have other forms. We find one marketing platform, Blueshift, that uses a
randomly generated string as the user identifier in its marketing emails. Figure 3.4
is an example of an image in an email that is sent through Blueshift. The URL is
personalized by a string in Universally Unique Identifier (UUID) format. RFC 4122
and X.667 describe guidelines and recommendations for UUID generation [49] [14]. A
UUID can be generated in three versions: name-based, time-based and pseudo-random
(also called random-number-based). The name-based version uses a globally
unambiguous name like an email address, the time-based version uses the system
clock, and the pseudo-random version uses a cryptographic random number generator to
generate a UUID value. The 13th hexadecimal digit of a UUID string (bits 7 to 4 of
octet 9) indicates to which version the string belongs. Table 3 in X.667 [14] maps
this value to a UUID version. In Figure 3.4, the 13th hexadecimal digit is 4, which
indicates that this UUID was generated in the pseudo-random format.
Although the token in Figure 3.4 is based on a random number, we claim that the
sender can associate this value with a particular recipient. To support this claim
we verified that the value of the uid parameter differs among distinct recipients
who receive the same email.
The tokens in Figure 2.7 in the previous chapter and in Figure 3.4 are used for the
same purpose: using the token, the sender can associate the HTTP request with a
particular recipient.
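
The version check described above can be reproduced with Python’s standard uuid module, as in the following sketch. The parameter name uid comes from the example above; the function name and the sample URL in the usage note are hypothetical.

```python
import uuid
from urllib.parse import parse_qs, urlsplit


def uuid_version_of_param(url: str, param: str = "uid"):
    """Return the UUID version of a query parameter value, or None if the
    parameter is missing or its value is not a UUID."""
    values = parse_qs(urlsplit(url).query).get(param, [])
    if not values:
        return None
    try:
        # UUID.version reads the version nibble (the 13th hex digit).
        return uuid.UUID(values[0]).version
    except ValueError:
        return None
```

For a hypothetical URL such as `https://t.example/o.gif?uid=f47ac10b-58cc-4372-a567-0e02b2c3d479` the function returns 4, the pseudo-random version.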


Figure 3.4: A URL with a UUID formatted string as its user identifier tracking
 token. The highlighted letter indicates that this token is a pseudo-random UUID

  When loading an HTTP request, third parties can receive the URL tokens in the
following ways:

     • Request for files (like CSS or image) in email goes through a chain of redirects
       to third parties and each redirect includes the previous URL in its Referer
       header [26].

     • When clicking on a link, in addition to third parties included by redirect chain,
       the landing web page could also embed third parties. As we have seen in
       the previous chapter in Section 2.6.1, these trackers could receive the URL
       parameters of a web page.

If these tokens have the properties of an identifier, the third party can use them
as a means for persistent tracking. We argue that while trackers (third parties)
might not be able to link user identifiers in a URL to an email address, they can
link these tokens to the tracking profile of the recipient. Again we illustrate this
argument with an example:
In this setting Eve, a powerful (widely spread) online tracker, has compiled a
profile of Alice through online tracking (Figure 3.5). Alice receives newsletter
emails from example.com. As a third party, Eve is present in all web pages of this
site. Alice clicks on a link with a personalized URL and lands on one of the web
pages of example.com. Eve receives information regarding this visit (through
third-party cookies or JavaScript) along with the URL of the page. Based on the
tracking profile she has for Alice, Eve identifies this request. Over time Eve can
obtain one additional piece of information about Alice: the fact that she is
identified by user=@|!$e on example.com (see footnote 4). Now let’s assume that
Alice wants to clear her tracking profile. She changes her device, IP address and
browsing software, but she still uses the same email address. She again clicks on a
personalized link from example.com. This time Eve cannot use her conventional online
tracking methods to identify Alice, but she can use the user=@lice token to retrieve
Alice’s profile (Figure 3.6).

URL tokens that only reach the sender
Even when the URL tokens only reach the first party, they can be used to identify
the user for an HTTP read receipt.

   Figure 3.5: Personalized URL token is added to the online profile of Alice.

Figure 3.6: Identifying Alice based on web tracking methods is not possible;
                 however, Eve can use the personalized token.

     4 If Eve has an online partnership with the site, even just to know which query
parameter it uses for user identification, she can extend her profile from the
first visit.

   We expect certain properties for personalized URL parameters. Personalized
tokens with these properties can be used by both the sender and third parties for
tracking purposes. Senders use these tokens to obtain user-identifying information
for an HTTP read receipt. Third parties that get involved upon loading the request
can extend their tracking profile for that user with the personalized token as a
persistent identifier. The properties that we expect for these tokens are:

   • Within a URL, query parameters can hold personalized tokens.

   • For emails coming from one sender to a specific recipient, personalized tokens
     remain the same in different emails.

   • For two distinct users receiving the same email, the personalized tokens are
     different.
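
A minimal sketch of how these properties could be tested on a set of collected URLs is given below. It assumes the observations are (recipient, URL) pairs taken from emails of one sender, and it treats a query parameter as personalized when its value is constant for each recipient and different between recipients. The function name and input format are assumptions for this example.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def personalized_params(observations):
    """Return the names of query parameters that behave like user tokens.

    observations: iterable of (recipient, url) pairs from one sender.
    """
    # parameter name -> recipient -> set of values seen for that recipient
    seen = defaultdict(lambda: defaultdict(set))
    for recipient, url in observations:
        for name, value in parse_qsl(urlsplit(url).query):
            seen[name][recipient].add(value)

    result = set()
    for name, per_recipient in seen.items():
        if len(per_recipient) < 2:
            continue  # cannot compare recipients
        # Property: the token stays the same across emails for one recipient.
        if any(len(values) != 1 for values in per_recipient.values()):
            continue
        # Property: the token differs between distinct recipients.
        tokens = [next(iter(values)) for values in per_recipient.values()]
        if len(set(tokens)) == len(tokens):
            result.add(name)
    return result
```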

3.4    Identifying Tracking Images
The existing countermeasures do not discriminate between requests for tracking
and non-tracking images. Once the user decides to load images, the MUA fetches
all images in an email. But the privacy risks and the functionality of tracking
images and non-tracking images are not comparable. Web beacons and advertisements
are the two types of tracking images that we focus on. Beacons are images a few
pixels in size that serve purely tracking purposes. They are not visible to the
human eye and hence offer no functionality other than tracking. It is reasonable to
assume that when users choose to load images, their intention is to load the visible
images that are part of the email message. Advertisements are another example of
images that we assume to serve tracking purposes in email. The question we want to
answer here is whether we can find methods for identifying these two types of
tracking images and blocking the requests for them.

Figure 3.7: Example of images of few pixels from left to right: 2 × 2, 3 × 3, 4 × 4,
                             5 × 5 and 10 × 10.

3.4.1        Identifying Beacons
For identifying beacons we can use their small size. It is a recommended best
practice to explicitly set the HTML image size attributes [6, 66]. In email this is
even more relevant: since some MUAs block images by default, the sender wants to
make sure that the email looks right and the template does not flicker when images
are loaded. One method of specifying the image size is through the height and width
attributes of the <img> element.
    With size properties present, it is possible to identify beacons before loading
an image, as any image with a size not recognizable by humans. Figure 3.7 shows
some images a few pixels in size, the largest being 10 × 10 pixels. Based on these
images we consider any image of less than 100 pixels (with height and width less
than 10 pixels) to be a web beacon.
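
The size rule can be sketched with Python’s standard HTML parser as follows. The sketch only looks at the width and height attributes of <img> elements and ignores sizes set through CSS; the class and function names are chosen for this example.

```python
from html.parser import HTMLParser

MAX_SIDE = 10  # pixels; width and height must both be below this value


class BeaconFinder(HTMLParser):
    """Collect the src of <img> elements whose declared size is tiny."""

    def __init__(self):
        super().__init__()
        self.beacons = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        try:
            width = int(attributes.get("width", ""))
            height = int(attributes.get("height", ""))
        except (TypeError, ValueError):
            return  # size missing or not a plain integer
        if width < MAX_SIDE and height < MAX_SIDE:
            self.beacons.append(attributes.get("src"))


def find_beacons(html: str) -> list:
    finder = BeaconFinder()
    finder.feed(html)
    finder.close()
    return finder.beacons
```

An MUA could apply such a check in its preprocessing step and overwrite the src attribute of the matching images before rendering.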

3.4.2        Identifying Advertisements
In Section 3.3.1, we introduced an email advertiser that uses a first-party domain
for its advertisements. The existing ad-blocking methods that work based on URL
filtering cannot detect such images and prevent them from loading. We propose using
the HTML structure of the advertisement element and its URL structure as a method
for identifying and blocking it. Figure 3.8 is an example of an advertisement block.
This is called a LiveTag, which is a placeholder for showing advertisements served
by LiveIntent in email. URLs that are used in a LiveTag should contain certain
query parameters. To identify the user they should carry the email address of the
recipient in the e query parameter or the MD5 hash of the recipient’s email address
in the m query parameter. Query parameter p is also a required parameter and
identifies the sender [50]. If we generalize the HTML structure of a LiveTag we
arrive at the following procedure:
Find all <a> elements that have an <img> tag as their immediate child in the HTML
structure. Check whether the URLs in the src and href attributes of these elements
contain the corresponding query parameters of a LiveTag, which are p and either e
or m.
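
A minimal sketch of this procedure, using Python’s standard HTML parser, is shown below. It keeps a stack of open elements so that only an <img> whose direct parent is an <a> is considered. The class and function names and the list of void elements are choices made for this example, not part of LiveIntent’s documentation.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit

# Elements without a closing tag; they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def has_livetag_params(url) -> bool:
    """True if the URL carries p and at least one of e or m."""
    if not url:
        return False
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return "p" in query and ("e" in query or "m" in query)


class LiveTagFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, attributes) of currently open elements
        self.matches = []  # (href, src) pairs that look like a LiveTag

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # An <img> is an immediate child of <a> if <a> is on top of the stack.
        if tag == "img" and self.stack and self.stack[-1][0] == "a":
            href = self.stack[-1][1].get("href")
            src = attributes.get("src")
            if has_livetag_params(href) and has_livetag_params(src):
                self.matches.append((href, src))
        if tag not in VOID:
            self.stack.append((tag, attributes))

    def handle_endtag(self, tag):
        # Pop up to and including the most recent matching open element.
        for index in range(len(self.stack) - 1, -1, -1):
            if self.stack[index][0] == tag:
                del self.stack[index:]
                break


def find_livetags(html: str) -> list:
    finder = LiveTagFinder()
    finder.feed(html)
    finder.close()
    return finder.matches
```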

Figure 3.8: A sample LiveTag which is used for serving advertisement in email.
The image is taken from the LiveIntent page Publisher Onboarding and Tag Imple-

3.5     Identifying HTTP resources in email
For identifying personalized URL tokens we narrowed our focus to links and external
images in email. However, there is no previous analysis of the different HTTP
resources that can exist in an HTML email. Once we know the different HTTP resources
and their prevalence in email, we can narrow our focus to minimizing the privacy
concerns of commonly used resources. To identify HTTP resources in an HTML email, we
need to determine which parts of an HTML element can potentially lead to an HTTP
request by the MUA. If we look at the structure of the example HTML element in
Figure 3.9, we can see that not every URL leads to an HTTP request. In this case,
clicking on the <a> link results in a request to the URL in its href attribute. For
all HTML tags we assume that if they include a URL as part of one of their attribute
values, then they are embedding an HTTP resource. However, there are two tags whose
text is also processed by the MUA, namely <script> and <style>. As we discussed in
the previous chapter, <script> tags are usually removed by the MUA in the
preprocessing step. However, many MUAs support the <style> element for including
internal CSS in an HTML page [17]. For this reason we also consider a URL in the
text of a <style> element as an HTTP resource. To summarize the method we used for
identifying HTTP contents, we traverse the HTML document and for each tag:
   • We search the values of all its attributes to find a URL.
   • If the tag is <style>, we also search its text for a URL.
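
The traversal can be sketched as follows with Python’s standard HTML parser. The regular expression used to recognize URLs and the (tag, location, URL) output format are simplifying assumptions for this illustration.

```python
import re
from html.parser import HTMLParser

# Simplified pattern for absolute HTTP(S) URLs (an assumption of this sketch).
URL_RE = re.compile(r"https?://[^\s\"'()<>]+", re.IGNORECASE)


class ResourceFinder(HTMLParser):
    """Collect URLs from attribute values of all tags and from <style> text."""

    def __init__(self):
        super().__init__()
        self.resources = []  # (tag, attribute name or "text", url)
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._in_style = True
        for name, value in attrs:
            for url in URL_RE.findall(value or ""):
                self.resources.append((tag, name, url))

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Text is only searched inside <style>; other text is not requested.
        if self._in_style:
            for url in URL_RE.findall(data):
                self.resources.append(("style", "text", url))


def find_http_resources(html: str) -> list:
    finder = ResourceFinder()
    finder.feed(html)
    finder.close()
    return finder.resources
```

In this sketch a URL that only appears as the visible text of a link is ignored, which matches the observation that not every URL in an element leads to an HTTP request.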

3.6     Data
One of the properties that we specified for personalized URL tokens is their
variability for distinct users receiving the same email. In our dataset we need to have multiple
