Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons

Proceedings on Privacy Enhancing Technologies ; 2021 (3):28–48

Kerem Ayoz, Erman Ayday, and A. Ercument Cicek

Abstract: Sharing genome data in a privacy-preserving way stands as a major bottleneck in front
of the scientific progress promised by the big data era in genomics. A community-driven protocol
named the genomic data-sharing beacon protocol has been widely adopted for sharing genomic data.
The system aims to provide a secure, easy to implement, and standardized interface for data
sharing by only allowing yes/no queries on the presence of specific alleles in the dataset.
However, the beacon protocol was recently shown to be vulnerable to membership inference attacks.
In this paper, we show that privacy threats against genomic data-sharing beacons are not limited
to membership inference. We identify and analyze a novel vulnerability of genomic data-sharing
beacons: genome reconstruction. We show that it is possible to successfully reconstruct a
substantial part of the genome of a victim when the attacker knows the victim has been added to
the beacon in a recent update. In particular, we show how an attacker can use the inherent
correlations in the genome and clustering techniques to run such an attack in an efficient and
accurate way. We also show that even if multiple individuals are added to the beacon during the
same update, it is possible to identify the victim’s genome with high confidence using traits
that are easily accessible by the attacker (e.g., eye color or hair type). Moreover, we show how
a genome reconstructed using a beacon that is not associated with a sensitive phenotype can be
used for membership inference attacks against beacons with sensitive phenotypes (e.g., HIV+). The
outcome of this work will guide beacon operators on when and how to update the content of the
beacon and help them (along with the beacon participants) make informed decisions.

Keywords: Privacy, Genome Reconstruction Attack, Genomic Data-Sharing Beacons, Genomics

DOI 10.2478/popets-2021-0036
Received 2020-11-30; revised 2021-03-15; accepted 2021-03-16.

Kerem Ayoz: Bilkent University, E-mail: kerem.ayoz@bilkent.edu.tr
Erman Ayday: Case Western Reserve University, E-mail: exa208@case.edu
A. Ercument Cicek: Bilkent University, Carnegie Mellon University, E-mail: cicek@cs.bilkent.edu.tr

1 Introduction
With plummeting sequencing costs, we look forward to reaching a capacity of sequencing one
billion individuals over the next 15-20 years, resulting in the availability of very large
genomic datasets [20, 49, 64]. Although such large datasets promise a revolution in medicine,
numerous studies have shown that it is not straightforward to ensure the anonymity of the
participants in such datasets [19, 36, 42, 63, 71].

The human genome is the ultimate personal identifier, and sharing genomic data for research
while preserving the privacy of individuals has long challenged many different fields (e.g.,
medicine, bioinformatics, computer science, law, and ethics), due to possibly dire ethical,
monetary, and legal consequences. To address this challenge and create frameworks and standards
to enable the responsible, voluntary, and secure sharing of genomic data, the Global Alliance
for Genomics and Health (GA4GH) was formed by the community [1]. The current genomic data
sharing standard of the GA4GH is called the genomic data-sharing beacon. Beacons are the
gateways that let users (researchers) and data owners exchange information without, in theory,
disclosing any personal information. A user who wants to apply for access to a dataset can learn
whether individuals with specific alleles (nucleotides) of interest are present in the beacon
through an online interface. That is, a user can submit a query, asking whether a genome exists
in the beacon with a certain nucleotide at a certain position, and the beacon answers “yes” or
“no”. If the dataset does not contain the desired genome, genomic data is not shared and
distributed unnecessarily. In addition, researchers do not have to go through the paperwork to
obtain a dataset that will not be helpful for their research. The GA4GH provides a shared beacon
interface [2] that, as of December 2020, provides access to 81 beacons and acts as a hub where
researchers and data owners meet.

Beacons are typically associated with a particular sensitive phenotype (e.g., the SFARI beacon
that hosts individuals with autism). Therefore, the presence of an individual in a particular
beacon is considered privacy-sensitive information, and the main aim of the beacons is to
protect this information. An attacker, using the responses of a beacon and genomic data of a
victim, may try to infer the membership of the victim in a particular beacon by running a
membership inference attack. The beacon framework sets a barrier against membership inference
attacks by allowing only presence/absence queries for variants and not tying any response to any
specific individual. In that sense, beacons are considered to have stronger privacy measures
compared to other statistical genomic databases. Despite these barriers, several works have
shown that beacons are not bulletproof and that they are vulnerable to membership inference
attacks [59, 65, 73].

However, threats against genomic data-sharing beacons are not limited to membership inference
attacks. In this paper, for the first time, we identify and analyze the vulnerability of genomic
data-sharing beacons to the “genome reconstruction” attack. We consider a scenario in which the
attacker knows the membership of a victim in a beacon that may not be associated with a
sensitive phenotype. Therefore, we consider a targeted attack, in which either (i) the attacker
knows that the victim donated their genome to take part in a study or (ii) the attacker infers
the membership of the victim from the beacon’s metadata (as done in [65]). Then, we show how the
attacker can accurately infer the genome of the victim by using the beacon responses. Such an
attack may result in serious consequences if the attacker uses the reconstructed genome to infer
sensitive information (e.g., disease diagnosis) about the victim or to infer the victim’s
membership in another statistical genomic database of interest (e.g., another beacon that is
associated with a sensitive phenotype). In particular, we show how the attacker can use the
inherent correlations in the genome to run such an attack in an efficient and accurate way
compared to a baseline approach. We also show how clustering techniques can be used to further
improve the accuracy of such an attack.

Previous works in the literature assume beacons are static and do not change over time. However,
beacons are dynamic datasets (donors join and leave) and this results in an increased risk for
the genome reconstruction attack. An attacker can monitor the number of newly added donors to
the beacon and the number of donors leaving the beacon from the meta-information of the beacon.
With this information, newly joined donors (or donors leaving the beacon) become more vulnerable
to genome reconstruction attacks. Thus, for the first time, we consider the beacons as dynamic
databases and formulate the genome reconstruction attack accordingly. Privacy vulnerabilities
due to dynamic changes in a system have recently been explored in the context of dynamic model
changes in machine learning models [61]. It has been shown that different model outputs can
constitute a new attack surface for an adversary to infer information about the dataset used to
perform a model update [61]. Here, rather than model updates, we focus on the changes in the
query responses to a dynamic database.

In a genome reconstruction attack, the attacker reconstructs all or a subset of the genomes in
the beacon. Among the reconstructed genomes, it is not trivial to infer which one belongs to the
victim. Therefore, we also show how the attacker can identify the victim’s genome among the set
of reconstructed genomes using moderate auxiliary information about the victim (i.e., a set of
visible physical characteristics of the victim, which is public information). Finally, to show
one of the consequences of the identified genome reconstruction attack, we show how the attacker
can utilize the outcome of this attack to initiate a membership inference attack against the
same victim in another beacon, which can be associated with a sensitive phenotype. To do this,
we combine the identified genome reconstruction attack with the membership inference attacks
against beacons from the literature.

We implement and evaluate the identified vulnerability using real genome data obtained from the
OpenSNP [32] and HapMap [21] datasets. We particularly evaluate the success of the attacker in
reconstructing a victim’s point mutations that include at least one rare nucleotide (i.e., minor
allele) since minor alleles (i) reveal sensitive attributes of individuals (e.g.,
predispositions to privacy-sensitive diseases); and (ii) provide rich information to the
attacker for membership inference attacks [59, 73]. We show that for a beacon with 50
individuals, precision and recall of the reconstruction reach up to 0.9 (each) when 3
individuals are added to the beacon and the victim is one of the newcomers. Even when 10 new
participants are added to the beacon (causing a 20% increase in beacon size), we show that the
attacker has a precision of 0.7 and a recall of 0.8. Furthermore, our results show that when
more than one individual is added to the beacon, the attacker can accurately pinpoint the
victim’s reconstructed genome by using moderate (and publicly available) auxiliary information
about the victim. For this, we show how the attacker can match the victim’s phenotypical
characteristics to the reconstructed genomes using machine learning algorithms. We also show via
experiments that
the outcome of the genome reconstruction attack can be accurately used for the membership
inference attack on another beacon and it helps an attacker infer the membership of a victim
with only a few queries.

Overall, we identify an important vulnerability and show how it can be exploited. We notably
show how dependencies between point mutations can be used in a clustering algorithm to achieve
high accuracy in a genome reconstruction attack. Furthermore, our methodology consists of a
complete pipeline, showing how an attacker can use the information it infers in the genome
reconstruction attack in a subsequent membership inference attack. Therefore, this study clearly
shows that privacy risks for genomic data-sharing beacons are much more severe than perceived.
This is particularly important since the number of beacon participants, and hence the privacy
risk of individuals, increases rapidly.

2 Related Work
Genomic privacy has recently been explored by many studies [11, 27, 56]. In the following
subsections, we will summarize existing work on privacy in statistical genomic databases,
inference attacks, and privacy of genomic data-sharing beacons.

2.1 Privacy in Statistical Genomic Databases and Inference Attacks on Genomic Privacy

Several works have shown that anonymization does not effectively protect the privacy of genomic
data [30, 33, 35, 45, 50, 53, 66]. It has been shown that the identity of a participant of a
genomic study can be revealed by using a second sample (e.g., part of the DNA information from
the individual) and the results of the clinical study [19, 37, 41, 75, 77]. The concept of
differential privacy (DP) [26] has been frequently used to mitigate membership inference attacks
when releasing summary statistics from genomic databases [28, 44, 68, 76]. Compared to
statistical databases, genomic data-sharing beacons have stronger privacy measures since they
only allow presence/absence (or yes/no) queries for variants.

Humbert et al. proposed an inference attack on kin genomic privacy using the family ties between
individuals, pairwise correlations between the SNPs, and publicly available statistics about
DNA [38]. Then, Deznabi et al. demonstrated that stronger inference techniques can be generated
by combining high-order correlations and family ties [25]. Furthermore, several studies have
examined phenotype prediction from genomic data, as a means of tracing identity [10, 18, 39, 46,
51, 52, 54, 58, 74, 78]. To mitigate such attribute inference attacks, cryptographic solutions
have been proposed for privacy-preserving processing and sharing of genomic data (e.g., to
outsource the computation to a public cloud or to conduct collaborative association studies).
Existing cryptographic solutions mainly focus on (i) private pattern-matching and the comparison
of genomic sequences [15, 24, 43, 55, 69] and (ii) privacy-preserving personalized medicine
[12, 13]. In this work, we identify and analyze a different type of attribute inference attack
particularly against genomic data-sharing beacons.

2.2 Privacy in Genomic Data Sharing Beacons

Researchers showed that the presence (membership) of an individual in a genome sharing beacon
can be inferred by repeatedly querying the beacon. Here, the attacker is assumed to be an active
(or authorized) user of the beacon. In practice, it can ask the beacon as many queries as it
wishes (there is no limit or cost for this in the current beacon protocol), and it can decide
which queries to ask. Furthermore, the attacker is assumed to have access to the set of SNPs of
the victim. Shringarpure and Bustamante introduced a likelihood-ratio test (LRT) that can
predict whether an individual is in the beacon by querying the beacon for multiple SNPs of a
victim [65]. Note that inferring the membership of an individual in a beacon that is associated
with a sensitive phenotype is equivalent to uncovering the sensitive phenotype of the victim.
Then, Raisaro et al. showed that if the attacker first queries the SNPs with low minor allele
frequency (MAF) values, it needs fewer queries for a successful attack [59]. In Section 6.5, we
use this attack when we show how the proposed genome reconstruction attack can be combined with
the membership inference attack. We provide further background information about this attack in
Appendix A. Later, von Thenen et al. showed that even if the attacker does not have the victim’s
low-MAF SNPs, it is still possible to infer membership by exploiting the correlations in the
genome [73]. Furthermore, they showed that beacon responses can also be inferred using such
correlations (via a query inference, or QI-attack). In an orthogonal work, Hagestedt et al.
hypothesized that while current beacon systems are limited to genomic data, in the near future,
the community is going to need a similar system for other biomedical data types. They proposed a
beacon system for sharing DNA methylation data (an epigenetic mechanism to regulate
transcriptional activity) and then showed
that it is possible to successfully launch a membership      they may carry sensitive information regarding individ-
inference attack against this system. They proposed a        uals’ health conditions. As discussed in Section 2, most
DP-based solution in their proposed MBeacon [34] sys-        existing works in genomic privacy literature focus on the
tem. The approach retains utility by adjusting the noise     protection of the SNPs to prevent the risk of genetic dis-
level for high risk methylation regions that might leak      crimination.
phenotypic information (i.e., regions which are related
to disease).                                                 4 System Model
Contribution of this paper. In this paper, we iden-
                                                             As shown in Figure 1, we consider a system between
tify and analyze a genome reconstruction attack against
                                                             the beacon participants (e.g., donors), the beacon, and
genomic data-sharing beacons by particularly exploit-
                                                             the beacon users (which may include the attacker). The
ing the information leaked due to beacon updates and
                                                             donor shares their genome with the beacon. It is pos-
the correlations between the point mutations. So far, all
                                                             sible that the donor may share their genome with mul-
works in the literature have focused on membership in-
                                                             tiple beacons that may or may not be associated with
ference attacks against genomic data-sharing beacons.
                                                             sensitive traits. Genome donor is not active during the
To the best of our knowledge, this is the first work that
                                                             protocol after they share their data with the beacon.
identifies, thoroughly analyzes, and shows the conse-
                                                             Also, beacon never publicly shares its dataset, but some
quences of the genome reconstruction attack against the
                                                             beacons may share metadata about (i) their content
beacons. Furthermore, as opposed to existing work (that
                                                             (e.g., size) or (ii) their donors (e.g., their gender, age,
only consider a snapshot of the beacon), we show the
                                                             or ethnicity). In general, we consider the beacon as a
privacy risk in dynamic beacons, in which new donors
                                                             dynamic dataset, in which new donors may join and ex-
may join or existing donors may leave.
                                                             isting donors may leave over time. Beacon users issue
                                                             queries to the beacon. As discussed, the beacon user
3 Genomics Background                                        can only ask the presence of a genome with a particu-
Approximately 99.9% of the all individuals’ DNA are          lar allele (nucleotide) at a particular position of a given
identical and the remaining 0.1% is responsible for our      chromosome and the beacon only responds as “yes” or
differences. Single nucleotide polymorphism (SNP) is         “no”. In this work, we assume beacon honestly reports
the most common source of variation in the human             the result of each query to the user (e.g., without in-
genome. SNP is a point mutation (e.g., substitution of       troducing intentional noise to the query results) and we
a single nucleotide in the genome - A,T,C, or G) and         do not consider a query limit for the users, as it is usu-
there are around 50 million known SNPs in the hu-            ally trivial to overcome such limits (e.g., by registering
man genome [3]. The alternative nucleotides for each         several times with different accounts).
locus (SNP position) are called alleles and each allele
of a SNP can be either the major or the minor allele         5 Threat Model
for that SNP. The major allele is the most frequently
                                                             Depending on the attacker’s objective, two attacks that
observed nucleotide for a SNP position and the minor
                                                             can be launched against genomic data-sharing beacons
allele is the rare nucleotide (i.e., the second most com-
                                                             are: (i) membership inference attack and (ii) genome
mon). The frequency (or probability) of observing the
                                                             reconstruction attack. In both attacks (including this
minor allele at a SNP position is called the minor allele
                                                             work), the attacker is assumed to be a registered bea-
frequency (MAF) of that SNP. Human genome has two
                                                             con user who can send unlimited number of queries to
copies for each locus (one per chromosome) and a SNP
                                                             the beacon. In this work, for the first time, we iden-
can be represented in terms of the number of its minor
                                                             tify and study the genome reconstruction attack. We
alleles (i.e., 0 for homozygous major, 1 for heterozygous,
                                                             assume that the attacker knows the membership of an
or 2 for homozygous minor).
                                                             individual to a beacon. Thus, we consider a targeted
     Particular SNPs in human population are inherently
                                                             attack, in which the attacker knows that the victim do-
correlated and this correlation model may change for
                                                             nated their genome (to take part in a study). Given
different populations. Linkage disequilibrium (LD) is
                                                             the current rise in personal genomics (people upload-
the non-random association of alleles at two or more
                                                             ing their genomes to public sites), this is feasible. Also,
loci. If two SNPs are in LD, they are correlated and co-
                                                             beacons with no sensitive-phenotype report metadata
occur more frequently than expected. Some SNPs are
                                                             about their donors. For instance, Shringapure and Bus-
pathogenic and cause genetic diseases [6] and hence,
Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons                                   32

                        2. Including the victim, m                                                                            6. Attacker may initiate a membership
                           donors join the beacon                                                                         inference attack against the victim on another
                                                                                                                               beacon (with a sensitive phenotype)
                                              n                                          n+m
                                        participants                                 participants
                                                                                                                                         Membership
                                                                                                                                                                   HIV+
                                                                                                                                          Inference
                                                                                                                                         no, no, yes, …

                                 Beacon’s                                      Beacon’s
                                 metadata                                      metadata
                                                                                                                 Time
                                             t                                         t+δ

                                                                                              yes, yes, yes, …
                              1. Attacker                              3. Attacker

                                                 yes, no, yes, …
                                takes the                                takes the                                                                           5. Attacker
                           snapshot and                              snapshot and                                 4. Attacker                                 identifies
                            metadata of                            metadata of the                               reconstructs                              the genome of
                             the beacon                              beacon again                                 m’ partial                              the victim using
                                                                                                                   genomes                                     auxiliary
                                                                                                                                                            information

                                        Beacon users                              Beacon users
                                  (including the attacker)                  (including the attacker)

Fig. 1. Proposed system model.

tamante [65] verified a specific person being in PGP                                                             around 79$ per month [70] and there is no other eco-
and Kaviar [31] beacons via metadata, and hence the                                                              nomic cost, as the system is publicly available at [2].
attacker can also identify the membership of the vic-                                                            Even though the number of SNPs in a complete snap-
tim using such metadata. Using the membership infor-                                                             shot is large, typically, only low-MAF SNPs are useful
mation, the goal of the attacker is to reconstruct the                                                           for the attacker (as they are typically the sensitive ones);
victim’s genome by issuing queries to the corresponding                                                          (iii) auxiliary information about the victim to identify
beacon.                                                                                                          victim’s genome among the reconstructed ones. For this
    Genome inference attack can be considered both for                                                           we assume the attacker has moderate information, such
static and dynamic beacons. In static beacons, knowing                                                           as a set of victim’s visible characteristics (phenotype);
that the victim is a member of the beacon, only the “no”                                                         and (iv) publicly available information about genomics,
responses would provide certain information about the                                                            such as minor allele frequencies (MAF values) of SNPs
victim’s genome to the attacker. “Yes” responses may                                                             and correlation between the SNPs in the population of
be due to any other participant of the beacon and as                                                             interest. Finally, we assume that the attacker does not
the size of the beacon increases, “yes” responses do not                                                         collude with the beacon.
provide much information to the attacker. However, in                                                                 In genome reconstruction attack, due to the nature
dynamic beacons, when the beacon is updated, using                                                               of beacon responses, the attacker can infer if a victim
the change in the responses of the beacon, the attacker                                                          has at least one minor allele at every SNP position. This
can learn more about the genomes of new participants.                                                            is because the response of the beacon only tells if there is
Thus, in this paper, we analyze this vulnerability for dy-                                                       an individual in the beacon with at least one minor allele
namic beacons and we assume that the victim is added                                                             at a given SNP position. Thus, for each SNP j of victim
between times t and t+δ along with other (m−1) newly                                                             v ($S_j^v$), the goal of the attacker is to infer $\Pr(S_j^v = 0)$
added donors to the beacon. As discussed before, the at-                                                         and $\Pr(S_j^v \neq 0)$ (i.e., $\Pr(S_j^v = 1)$ or $\Pr(S_j^v = 2)$). For
tacker can monitor the number of newly added donors                                                              simplicity, we define the event $\hat{S}_j^v = \mathbb{1}_{S_j^v = 1 \lor S_j^v = 2}$. Thus,
to the beacon and the number of donors leaving the                                                               $\hat{S}_j^v = 0$ if $S_j^v = 0$, and $\hat{S}_j^v = 1$ otherwise. Note that
beacon from the metadata of the beacon.                                                                          inferring this information for a victim results in a serious
    We assume that, along with the fact that the victim                                                          privacy concern. As we will discuss and show later, using
is among the newly joined participants to the beacon,                                                            this information, an attacker can associate the genotype
the attacker also knows (i) the number of other newly                                                            of the victim to related phenotypes (e.g., diseases) and
joined individuals that are added to the beacon along                                                            initiate a membership inference attack for the victim
with the victim; (ii) a snapshot of the beacon before                                                            by targeting another beacon that is associated with a
the victim is added (at time t). That is, responses to all                                                       sensitive phenotype (e.g., cancer or HIV+).
queries before the victim joins to the beacon. The bea-                                                               Our methodology consists of a complete pipeline,
con protocol does not bar someone from taking a com-                                                             showing how an attacker uses the information it infers
plete snapshot. Thus, querying a beacon to take a com-                                                           in the genome reconstruction attack in a subsequent
plete snapshot only requires a high-bandwidth internet                                                           membership inference attack. Therefore, we evaluate the
connection. Economic cost of such an internet service is                                                         success of the attacker using different metrics in differ-

ent parts of the pipeline as follows. For genome reconstruction (in Section 6.3), we use
precision and recall to quantify this inference power of the attacker. As we will show in
Section 7, the success of genome reconstruction mainly depends on the size of the beacon,
the number of newly added donors to the beacon between times t and t + δ, and the fraction
of attacker's snapshot at time t. In real life, sizes of beacons show a large variation.
The size of a beacon can be as small as 100, such as NBDC Human Database [4], or as large
as 100K, such as The Genome Aggregation Database (gnomAD) [5]. As discussed, these numbers
can be monitored from the metadata of such beacons. Thus, as we will show, for small-size
beacons, even if the size of the beacon is significantly increased (compared to its
original size), the attacker's success may be high. For large-size beacons, on the other
hand, the number of newly added donors should be a small fraction of the original size for
a successful attack. As a result of the genome reconstruction, the attacker potentially
reconstructs multiple genomes and among these, one belongs to the victim. For this part,
we show how the attacker can utilize machine learning techniques to identify the victim's
genome among the reconstructed ones (in Section 6.4) and we use the classification
accuracy of the attacker as its success metric. Finally, to quantify the success of the
membership inference (Section 6.5), we use a power analysis as the success metric. To
evaluate the success of the attacker in the membership inference attack, we first let the
attacker run the genome reconstruction attack and then use the proposed machine learning
technique to identify the victim's genome among the reconstructed ones. Thus, the success
metric for the membership inference considers the attacker's success in the entire
pipeline.

6 Genome Reconstruction Attack on Genomic Data-Sharing Beacons

As discussed, we define the genome reconstruction attack as inferring genomic data of a
genome donor (i.e., victim) given their membership information to the beacon. To show the
effect of genome reconstruction attack more clearly, we consider dynamic beacons and we
assume the victim is among the newly joined donors to the beacon. For clarity of the
discussion, we present the identified attack only considering newly joined donors.
Considering the donors that leave the beacon is symmetrical and trivial. We discuss this
case in Section 8.2.
    We consider a scenario in which the attacker has no information about the victim's
genome, but it knows that the victim is added to the beacon between times t and t + δ. Let
n and (n + m) represent the number of individuals in the beacon at times t and t + δ,
respectively. As discussed, for most real-life beacons, the attacker knows m (by
monitoring the changes in beacon using the metadata of the beacon). In all attack
scenarios, we assume that the attacker reconstructs m′ genomes (m′ can be different than m
and the selection of m′ affects the precision and recall of the attacker). Our goal is to
evaluate the performance for different m′ values to show the attack is robust even if the
attacker does not know how many people are added. When metadata of the beacon, and hence
m, is not available, the attacker can determine a potential upper bound (k) for the number
of newly added donors (m) by examining the number of flipped responses (from "no" to
"yes"). Then, for each i from 1 to k, it can reconstruct genomes using $R_{N \to Y}$
assuming m = i, and hence instead of m, the attacker ends up having $k(k+1)/2$ potential
genomes to identify the victim's best matching reconstructed genome.
    Using its auxiliary information (as discussed in Section 5), the attacker can
probabilistically infer the genome of the victim by utilizing the changes in beacon's
responses (at times t and t + δ) as follows: (i) if the previous response (at time t) was
"no" and the current response (at time t + δ) is "yes", the probability that the victim
has a minor allele at the corresponding query position increases depending on how many new
individuals are added to the beacon in this time interval; (ii) if the previous response
was "yes" and the current response is also "yes", the attacker cannot infer much about the
victim's genome, especially if the total size of the beacon is large; and (iii) if both
the previous and the current responses are "no", the attacker understands that the victim
does not have a minor allele at the corresponding query position.
    Here, the most important (or the most sensitive) information for the attacker can be
considered as the "no" responses at time t that turn to "yes" at time t + δ. This is
because such responses let the attacker infer the positions at which the victim has at
least one minor allele with a high probability (depending on how many new individuals are
added to the beacon in this time interval). Since minor alleles of individuals are
typically the indicators for privacy-sensitive information about them, in this work, we
focus on the success of the attacker based on its success in inferring the minor alleles
of a victim using the beacon responses that turn to "yes". Exhaustively
generating all potential solutions of this problem would

result in a total of $2^{\beta \cdot m'}$ genomes, where β is the total        6.2 Greedy Algorithm for Genome
number of responses that turn to “yes” at time t + δ               Reconstruction
(which can be on the order of tens of thousands), and
                                                               The above-mentioned baseline algorithm assumes every
hence it is intractable. In the following, we first describe
                                                               SNP is independent and the correlations among them
a baseline method that provides a tractable solution to
                                                               are disregarded. However, SNPs are inherently corre-
this problem. Next, we present a greedy approach to run
                                                               lated and considering such correlations in the genome
such an attack more accurately, and then we will detail
                                                               reconstruction attack may result in significantly more
a more sophisticated, clustering-based approach for the
                                                               accurate results. In the greedy algorithm discussed here,
genome reconstruction attack.
                                                               the attacker forms the bins considering the correlations
                                                               between the SNPs in set RN →Y . Using an iterative ap-
6.1 Baseline Approach for Genome                               proach, the attacker assigns each SNP (minor allele) to
    Reconstruction                                             an individual such that the probability of assignment is
Here, we describe a baseline approach, in which the at-        proportional to the average correlation of the new SNP
tacker, using the responses of the beacon, reconstructs        with the already assigned SNPs of the individual (i.e.,
the genomes (of the newly joined donors) by assigning          bin i). If no assignment is made this way, a random in-
them to m0 bins according to MAF values of the SNPs.           dividual is selected to make sure there is at least one
Genome reconstruction attack using the baseline algo-          person with the corresponding new SNP.
rithm for a particular victim v at time t + δ can be               Genome reconstruction attack using the greedy al-
described as follows. The input of the attacker is (i) re-     gorithm for a particular victim v at time t + δ can be
sponses of the beacon to all possible queries at time t        described as follows. The input of the attacker includes
(i.e., complete snapshot of the beacon at time t); (ii)        everything in the baseline approach and also a correla-
the fact that m new donors are added to the beacon             tion model between the SNPs that is consistent with
between times t and t + δ; (iii) the fact that the vic-        the population structure of the beacon (that can be
tim is among the newly added donors; and (iv) publicly         computed using publicly available genomic datasets).
available MAF values of the SNPs.                              Different correlation models have been explored for ge-
     First, the attacker identifies the set of SNPs for        nomic data before. In [62], authors showed how the cor-
which the response of the beacon was “no” at time t            relations in the genome can be modelled using a Markov
and it becomes “yes” at time t + δ. Thus, the attacker         chain model. We create our correlation model by consid-
constructs a set RN →Y , consisting of these SNPs. Then,       ering the pairwise correlations between all the SNPs in
the attacker creates m0 empty bins representing SNP            the beacon (which results in richer information for the
sets of newcomer donors. For each SNP j in set RN →Y ,         attacker). The attacker calculates the likelihood of the
the attacker retrieves its MAF value, M AFj . Next, the        victim v having at least one minor allele at a SNP posi-
attacker assigns the value of SNP j for each individual        tion j as Pk (Ŝjv ) = P (Ŝjv |Ŝkv ), where k may be any other
i (in each bin) consistent with the SNP’s MAF value as         position in the genome. We use Sokal-Michener distance
follows: (i) $\hat{S}_j^i = 0$ with probability $(1 - \mathrm{MAF}_j)^2$ and (ii)    to compute correlations between SNPs as follows:
$\hat{S}_j^i = 1$ with probability $\mathrm{MAF}_j^2 + 2\,\mathrm{MAF}_j (1 - \mathrm{MAF}_j)$.
                                                               $$A = 2\left(n_{\hat{S}_j^v = 1,\, \hat{S}_k^v = 0} + n_{\hat{S}_j^v = 0,\, \hat{S}_k^v = 1}\right)$$
                                                               $$B = n_{\hat{S}_j^v = 1,\, \hat{S}_k^v = 1} + n_{\hat{S}_j^v = 0,\, \hat{S}_k^v = 0}$$
                                                               $$D_{\text{Sokal-Michener}}(\hat{S}_j^v, \hat{S}_k^v) = \frac{A}{A + B}$$
Since the beacon's response for SNPs in RN →Y has
flipped from “no” to “yes”, for all SNPs in RN →Y , there
should be at least one bin (among m0 bins) with at least
one mutation (i.e., homozygous minor or heterozygous
SNP). Thus, once the values of the SNPs in RN →Y for
                                                                    In the greedy approach, first, the attacker con-
all m0 bins are determined, the attacker checks if there
                                                               structs set RN →Y . Then, it creates m0 empty bins (m0
is any SNP in set RN →Y that is not assigned to any
                                                               does not have to be equal to m) representing the num-
bin. If there is such a SNP, the attacker randomly picks
                                                               ber of rare SNPs in RN →Y . We assume that the SNPs
a bin and assigns the value of the corresponding SNP
                                                               with an MAF value below a threshold τ are categorized
as Ŝji = 1 for the corresponding bin. The details of this
                                                               as rare SNPs. Observing rare SNPs do not have corre-
baseline approach are also shown in Algorithm 2 (in
                                                               lations among each other, assigning the rare SNPs in
Appendix B).
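The baseline steps described in Section 6.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (Algorithm 2 in Appendix B remains the reference description). The function names, the dict-based snapshot format (SNP id mapped to True for a "yes" response), and the choice to apply the random fallback per SNP rather than in a separate final pass are all hypothetical.

```python
import random


def flipped_snps(snapshot_t, snapshot_t_delta):
    """Build R_{N->Y}: SNPs answered "no" at time t and "yes" at t + delta.

    Assumed format: each snapshot maps a SNP id to True ("yes") or False ("no").
    """
    return [snp for snp, resp in snapshot_t_delta.items()
            if resp and not snapshot_t.get(snp, False)]


def baseline_reconstruct(r_ny, maf, m_prime, rng=None):
    """Fill m' bins with minor-allele SNPs drawn independently from MAF values."""
    rng = rng or random.Random()
    bins = [set() for _ in range(m_prime)]
    for snp in r_ny:
        # P(at least one minor allele) = MAF^2 + 2*MAF*(1 - MAF) = 1 - (1 - MAF)^2
        p_minor = 1.0 - (1.0 - maf[snp]) ** 2
        holders = [b for b in bins if rng.random() < p_minor]
        if not holders:
            # The response flipped to "yes", so at least one newcomer must carry it.
            holders = [rng.choice(bins)]
        for b in holders:
            b.add(snp)
    return bins
```

Each returned bin is the set of positions at which one reconstructed genome is assumed to carry at least one minor allele; every SNP in the flipped set ends up in at least one bin, matching the consistency check described above.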
                                                               RN →Y to different bins as seeds is assumed to result in
                                                               an accurate initial separation of individuals. Next, for
                                                               each remaining SNP j in RN →Y , the attacker computes

the average correlation between that and all the previ-       response of the beacon was “no” at time t and it be-
ously assigned SNPs in bin i using the aforementioned         comes “yes” at time t + δ and constructs set RN →Y .
correlation model. This is done for each bin i. Let Ŝji      Then, the attacker builds a graph of SNPs using the
be a binary random variable for SNP j and bin i. The          correlation model, in which the vertices are the SNPs in
attacker assigns Ŝjc = 1 for bin c which has the highest     RN →Y and undirected edges are weighted by the corre-
average correlation value and Ŝji = 0, ∀i ∈ [1, m0 ] and     lation values between these SNPs. This graph represents
i 6= c. Eventually, the attacker constructs m0 potential      a pairwise similarity model for the SNPs and is used for
genomes (in m0 bins) belonging to m newcomer donors.          a quantitative assessment of the correlation of each SNP
                                                              pair in RN →Y .
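To make the greedy procedure of Section 6.2 concrete, the following Python sketch computes the Sokal-Michener distance and performs the seeded, highest-average-correlation assignment. It is an illustration under assumptions rather than the authors' code: we assume that the correlation between two SNPs is taken as one minus the Sokal-Michener distance, that `ref` maps each SNP to its binary indicator vector over a public reference panel, and that the rarest SNPs below the threshold `tau` serve as seeds. All names and the default threshold are hypothetical. The sketch uses the deterministic (arg-max) assignment from the step-by-step description, not a probabilistic draw.

```python
def sokal_michener(x, y):
    """Sokal-Michener distance A / (A + B) between two equal-length binary vectors."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    matches = len(x) - mismatches
    a_term = 2 * mismatches          # A = 2 * (n_10 + n_01)
    return a_term / (a_term + matches)   # B = n_11 + n_00


def greedy_reconstruct(r_ny, maf, ref, m_prime, tau=0.05):
    """Seed bins with rare SNPs, then add each remaining SNP to its best bin."""

    def similarity(s, t):
        # Assumption: correlation is modelled as 1 - distance.
        return 1.0 - sokal_michener(ref[s], ref[t])

    rare = sorted((s for s in r_ny if maf[s] < tau), key=lambda s: maf[s])
    seeds = rare[:m_prime]
    bins = [[s] for s in seeds] + [[] for _ in range(m_prime - len(seeds))]
    seeded = set(seeds)

    for snp in r_ny:
        if snp in seeded:
            continue
        # Average similarity to the SNPs already in each bin (0 for an empty bin).
        scores = [sum(similarity(snp, t) for t in b) / len(b) if b else 0.0
                  for b in bins]
        best = max(range(m_prime), key=scores.__getitem__)
        bins[best].append(snp)
    return bins
```

Because each SNP goes to exactly one bin, the output bins are disjoint; the order in which the remaining SNPs are processed can change the result, which is the limitation the clustering-based approach is meant to address.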
6.3 Clustering-Based Algorithm for                                 Next, the attacker applies either the spectral or
    Genome Reconstruction                                     fuzzy clustering algorithms on the constructed graph.
                                                              The outcome of spectral clustering is a set of disjoint
Greedy algorithm (in Section 6.2) reconstructs genomes
                                                              clusters. Fuzzy clustering results in groups of SNPs that
by following a particular order (determined based on
                                                              maximizes the similarity in a group while allowing a
the MAFs of the SNPs). Different orders may provide
                                                              SNP to be shared by multiple individuals. Thus, in fuzzy
different solutions. Thus, to consider all query responses
                                                              clustering, each SNP i is assigned to clusters for which
together in a collective way, we propose clustering-based
                                                              the algorithm returns a relatively high probability of as-
approaches for the genome reconstruction attack that
                                                              sociation. After clustering, the attacker obtains m0 dif-
cluster the identified minor alleles for the newly joined
                                                              ferent clusters which corresponds to m0 reconstructed
donors to the beacon. The proposed clustering tech-
                                                              genomes. The details are shown in Algorithm 1.
niques essentially use the correlations between the SNPs
(that are computed using the aforementioned correla-
                                                              6.4 Identifying the Victim Using
tion model) to distribute SNPs into different bins. We
                                                                  Genotype-Phenotype Associations
use two types of clustering techniques: (i) hard cluster-
ing to create non-overlapping bins and (ii) soft or fuzzy     In previous sections, for genome reconstruction, we as-
clustering to assign a SNP into multiple bins.                sumed that the attacker can correctly identify the vic-
     For (i), we employ spectral clustering, in which a       tim’s genome among several reconstructed bins. As-
standard clustering method (such as k-means cluster-          suming the attacker has some moderate auxiliary in-
ing) is applied on certain eigenvectors of the Laplacian      formation about the victim, here, we study and show
matrix of a graph [57]. In this graph, the SNPs cor-          how accurately the attacker can identify the victim’s
respond to vertices and correlations between the SNPs         genome among other candidates. For this, we assume
correspond to weights of edges. Spectral clustering is our    the attacker uses information about some phenotypic
method of choice as it has been shown to provide favor-       characteristics of the victim and it relies upon the fact
able results in many high dimensional feature spaces          that SNPs are intrinsically linked to phenotypic traits
like ours [60]. And, for (ii) we employ the fuzzy c-means     (such as eye color, hair color, etc.) This provides a com-
clustering (FCM) algorithm [14], which is a common            plete methodology for the genome reconstruction attack
choice for these types of tasks. The algorithm is similar     against beacons in real-life. As we will discuss later, the
to k-means clustering, but it also allows probabilistic as-   success of the attacker to correctly identify the victim’s
signments of samples to multiple clusters. Different from     genome among the reconstructed ones increases if the
k-means clustering, FCM assigns a membership value            attacker has access to more auxiliary information about
uij = P (Ŝji = 1) for each element j and for each cluster    the victim.
i. These membership values are used as weights in the ob-         Assume victim v is among the m new additions to
jective function. After convergence, these membership         the beacon (it is trivial to extend the methodology if
values are used as the probability of assignments of el-      there are more than one victim). The attacker is as-
ements to each cluster. The description of both cluster-      sumed to have access to two distinct sets: (i) a set
ing methods are similar except for the clustering steps.      S = {S  ~1 , S ~2 , . . . , S~m0 } of m0 reconstructed genotypes
Thus, in the following, we describe both methods to-          as a result of the genome reconstruction attack, where
gether.                                                       ~i = (Ŝ i , . . . , Ŝ i ) is a vector containing the SNP values
                                                              S       1              k
     The input of both clustering-based algorithms is the     of genotype i (or bin i); and (ii) a set Pv = (pv1 , . . . , pvt )
same as the input of the greedy algorithm. First, the at-     containing the values of t phenotypic traits of victim v.
tacker identifies the set of SNP positions for which the      Such phenotype information can be obtained from pub-

  Algorithm 1: Clustering-Based Algorithm                             In [40], Humbert et al. focused on the deanoymiza-
  for Genome Reconstruction Attack                               tion risk and modelled genotype-phenotype association
      Input: b: beacon; m: Number of added people to b;          as an assignment problem. They showed this risk by us-
             Population P that represents the composition        ing the Hungarian algorithm [47]. Different from [40],
             in b                                                here, we rely on machine learning for maximizing the
      Output: m0 reconstructed genomes
                                                                 matching likelihood and genotype-phenotype associa-
      // Step 1: Query Beacon
  1   snapshot1 ← queryBeacon(b, t)                              tions. We observe that such a formulation provides more
      // Including victim, m donors join Beacon                  accurate results. Also, rather than using SNP values (0,
         between time t and t + δ                                1 or 2), due to the nature of the proposed attack, we
  2   snapshot2 ← queryBeacon(b, t + δ)                          represent the state of each SNP j of individual i as Ŝji ,
  3
                                                                 which can be either 0 or 1, as discussed before.
      // Step 2: Obtain No-Yes SNPs
  4   NoYesResponses ← []
                                                                      For phenotype inference, we train a separate model
  5   for i ← 0 to snapshot1.length do                           for each of the considered phenotypes, where SNPs with
  6       if snapshot1[i] = "No" and snapshot2[i] = "Yes"        flipped responses (from “no” to “yes”) are used as fea-
            then                                                 tures. Since phenotype datasets are highly imbalanced,
  7           NoYesResponses.append(i)                           we apply Synthetic Minority Oversampling Technique
  8       end
                                                                 (SMOTE) [16] for each of these datasets to resolve this
  9   end
 10
                                                                 problem. In SMOTE, a minority class instance is se-
      // Step 3: Cluster No-Yes SNPs                             lected along with its nearest neighbors at random. Then,
 11   G ← Graph()                                                a new sample is generated as a combination of the orig-
 12   for i ← 0 to N oY esResponses.length − 1 do                inal instance and a random neighbor. Next, we train
 13       for j ← i + 1 to NoYesResponses.length do              a random forest model for each phenotype. We use re-
 14            ri ← NoYesResponses[i]
                                                                 peated stratified 5-fold cross validation to tune the hy-
 15            rj ← NoYesResponses[j]
 16            c ← corr(P,ri,rj)                                 perparameters. After training the phenotype models, we
 17            G.addEdge(ri,rj,c)                                form the ensemble classifier using the ones that have
 18       end                                                    better validation F1-macro score than random guess.
 19   end                                                        We discard the other models.
 20   clusters ← graphClustering(G, m0 )                              Ensemble classifier calculates the matching likeli-
 21
                                                                 hood of given genome and set of phenotypic traits.
      // Step 4: Reconstruct genomes
 22   S ← []
                                                                 Softmax output of each phenotype model correspond-
 23   for i ← 0 to m0 do                                         ing to a given phenotypic trait of the victim (i.e., prob-
 24       S[i] ← getRef erenceGenome(P )                         ability that a reconstructed genome having blue eye)
 25       foreach s in clusters[i] do                            are summed to calculate the matching likelihood. For
 26            S[i][s] ← getM inorAllele(P, s)                   single victim, this calculation is done for each recon-
 27       end
                                                                 structed genome and the victim is matched with the
 28   end
 29   return S
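Algorithm 1 leaves the graphClustering step abstract. As one possible instantiation of the soft (fuzzy) variant, the sketch below implements a plain fuzzy c-means loop in Python and turns the resulting memberships into overlapping bins. It is a hedged illustration, not the authors' implementation: we assume that each SNP is represented by a numeric feature vector (for instance, its row of pairwise correlations), and the deterministic initialization, the fuzzifier value, the iteration count, and the membership threshold are hypothetical choices.

```python
import math


def fuzzy_c_means(points, c, fuzz=2.0, iters=50):
    """Return memberships u[j][i] of point j in cluster i (each row sums to 1)."""
    n, dim = len(points), len(points[0])
    # Assumption: deterministic initial centers taken at evenly spaced indices.
    centers = [list(points[(i * n) // c]) for i in range(c)]
    u = []
    for _ in range(iters):
        # Membership update: u_ji = 1 / sum_k (d_ji / d_jk)^(2 / (fuzz - 1))
        u = []
        for x in points:
            d = [max(math.dist(x, ctr), 1e-12) for ctr in centers]
            u.append([1.0 / sum((d[i] / d[k]) ** (2.0 / (fuzz - 1.0))
                                for k in range(c))
                      for i in range(c)])
        # Center update: mean of the points weighted by u^fuzz.
        for i in range(c):
            w = [u[j][i] ** fuzz for j in range(n)]
            total = sum(w)
            centers[i] = [sum(w[j] * points[j][t] for j in range(n)) / total
                          for t in range(dim)]
    return u


def soft_bins(snp_ids, u, threshold=0.3):
    """Assign each SNP to every cluster whose membership reaches the threshold."""
    c = len(u[0])
    return [[s for s, row in zip(snp_ids, u) if row[i] >= threshold]
            for i in range(c)]
```

With a threshold below 0.5 a SNP can land in several bins, which mirrors the idea that a minor allele may be shared by multiple newly added donors; hard (spectral) clustering would instead give disjoint bins.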
                                                                 reconstructed genome with the highest matching likeli-
                                                                 hood score. Note that this matching does not need to be
licly available resources or using the physical traits of        one-to-one; a single reconstructed genome might match
the victim. For instance, the attacker can obtain such           with different set of phenotypic traits. We discuss the
information from victim’s social media accounts. The             performance of identification of victim’s reconstructed
goal of the attacker is to correctly match the victim’s          genome under different settings in Section 7.4.
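The matching step of Section 6.4 can be sketched as follows. This is an illustrative Python sketch under assumptions, not the authors' ensemble: each per-phenotype model is assumed to be a callable that maps a reconstructed genome to a dict of class probabilities (in the paper these would be trained random forests), and the function and parameter names are hypothetical.

```python
def matching_score(genome, victim_traits, models):
    """Sum, over the victim's known traits, the predicted probability of that trait value."""
    return sum(models[name](genome).get(value, 0.0)
               for name, value in victim_traits.items()
               if name in models)   # traits without a retained model are skipped


def best_match(genomes, victim_traits, models):
    """Return the index of the highest-scoring reconstructed genome and all scores."""
    scores = [matching_score(g, victim_traits, models) for g in genomes]
    best = max(range(len(genomes)), key=scores.__getitem__)
    return best, scores
```

Running `best_match` once per victim yields a matching that need not be one-to-one, consistent with the note above that a single reconstructed genome may match several sets of phenotypic traits.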
phenotype to the correct reconstructed genome (that
is the most similar to the victim’s) among all candi-            6.5 Using Genome Reconstruction in
date reconstructed genome sequences. In the test phase,              Membership Inference Attack
the attacker has m newly added donors and m0 re-
                                                                 To show one consequence of the proposed genome re-
constructed genomes. Attacker’s task is to match each
                                                                 construction attack, we also model and analyze how the
donor with the best matching reconstructed genome.
                                                                 proposed attack can be utilized for membership infer-
Thus, for each newly added donor, the attacker calcu-
                                                                 ence attack (introduced in Appendix A). We consider a
lates the likelihood scores of matching with all m0 re-
                                                                 scenario in which the attacker knows the membership of
constructed genomes.
                                                                 an individual to a beacon with which no sensitive associ-

ated phenotype (e.g., phenotype neutral). The attacker          uals that are not in B2 . In this work, in order to model
first utilizes the responses of this beacon to infer specific   the uncertainty of correctly matching the victim (using
parts of a victim’s genome (i.e., SNPs). Then, it uses          phenotype inference as in Section 6.4), we first experi-
these inferred SNPs to infer the membership of the vic-         mentally compute the error rate of the overall process.
tim to a beacon with a sensitive phenotype. This attack         For instance, if the accuracy of correctly matching the
is important and realistic, because knowing the mem-            phenotype of the victim to their reconstructed genome
bership of an individual to a phenotype neutral beacon          is p%, then p% of the 20 individuals are selected from
(e.g., Kaviar Beacon) may not seem to pose a privacy is-        correctly identified reconstructions and remaining indi-
sue. However, using the proposed genome reconstruction          viduals are selected from other new people added to the
attack and the membership information of the victim to          beacon along with the victim (incorrect identifications).
the beacon with non-sensitive phenotype, the attacker                When Λ is less than a threshold tα , the null hypoth-
can first infer the SNPs of the victim and then, infer          esis is rejected and we find tα from the null hypothesis
the membership of the victim to another beacon which            with α = 0.05 (corresponding to 5% false positive rate).
is associated with a sensitive phenotype (e.g., SFARI beacon    Then, we computed the power as proportion of the in-
which is associated with autism phenotype).                     dividuals in the alternate hypothesis (including 20 dif-
     To show this, first, we run the proposed genome re-        ferent individuals in B2 ) having a Λ value that is less
construction attack that is explained in Section 6.3 and        than tα . As before, p% of the 20 individuals are selected
infer the SNPs of the victim with at least one minor            from correctly identified reconstructions and remaining
allele on a beacon B1 . Using these inferred SNPs, we           people are selected from other new people added to the
then run the membership inference attack to infer the           beacon along with the victim.
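The power computation described in Section 6.5 can be illustrated with a short Python sketch. It assumes that the Λ statistics have already been computed for a set of null individuals (not in B2) and for alternate individuals (in B2); the empirical-quantile rule for the threshold, the sampling used to mix correct and incorrect identifications, and all function names are hypothetical choices, not the authors' procedure.

```python
import random


def threshold_at_fpr(null_scores, alpha=0.05):
    """Pick t_alpha so that at most a fraction alpha of null scores fall below it."""
    ordered = sorted(null_scores)
    return ordered[int(alpha * len(ordered))]


def power(alt_scores, t_alpha):
    """Fraction of alternate-hypothesis scores that reject the null (score < t_alpha)."""
    return sum(1 for s in alt_scores if s < t_alpha) / len(alt_scores)


def mix_alternate(correct_scores, incorrect_scores, p, size=20, rng=None):
    """Draw size scores: a share p from correct identifications, the rest from incorrect ones."""
    rng = rng or random.Random()
    n_correct = round(p * size)
    return (rng.sample(correct_scores, n_correct)
            + rng.sample(incorrect_scores, size - n_correct))
```

For example, with a matching accuracy of 75%, `mix_alternate(correct, incorrect, 0.75)` draws 15 scores from correctly identified reconstructions and 5 from the other newcomers before `power` is evaluated against the threshold obtained from the null scores.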
membership of the victim in another beacon B2 . For
membership inference attack, we use the optimal attack          7 Evaluation
in [59] (described in Appendix A), which is shown to
                                                                To evaluate the identified vulnerabilities, we evaluated
be an effective attack for membership inference (for our
                                                                our methods using real-life genomic datasets. Here, we
scenario, optimal attack in [59] and the QI-attack in [73]
                                                                describe the datasets and present the evaluation results.
perform similarly, so we choose to use the optimal at-
tack due to its simplicity). However, in contrast to the
                                                                7.1 Datasets
original optimal attack, in the null and alternate hy-
pothesis equations in (1) and (2), there is an additional       We used two different genome datasets for evalua-
error due to the inference error of the genome recon-           tion: (i) genome dataset of CEU population from
struction attack. This is because the attacker queries          the HapMap dataset [29] and (ii) OpenSNP genome
the alleles of the victim that it infers as a result of the     dataset [7]. Using the HapMap dataset, we created the
genome reconstruction attack and there is a degree of           beacons and victims from CEU population which con-
uncertainty. Thus, we first experimentally compute the          tains 164 donors and around 4 million SNPs for each
error rate of the genome reconstruction attack for a par-       donor. We created the correlation model (i.e., SNP-SNP
ticular scenario (e.g., for particular m and n values). We      relation network or similarity model) for this beacon us-
then include this additional error on the γ parameter in        ing individuals from the same HapMap dataset that are
(2), which represents the probability that the attacker’s       not in the constructed beacon and set of victims. Us-
copy of the victim’s genome does not match the beacon’s         ing the OpenSNP dataset, we created the beacons and
copy for a SNP. Furthermore, as opposed to original op-         victims from a random population which contains 2980
timal attack, here the attacker may not have access to          donors and around 2 million SNPs for each donor. We
the SNPs of the victim with the lowest MAF values;              created the correlation model using the rest of the Open-
instead the attacker only knows the SNPs that are in-           SNP dataset.
ferred as a result of the genome reconstruction attack.              For the OpenSNP dataset, we also collected the re-
     We evaluate the success of this attack in terms of         ported phenotypes of individuals. Since sample sizes
the power of the attacker in Section 7.5. Similar to Rais-      are small, we used the reported phenotypes in a bi-
aro et al. and von Thenen et al., we plot the power curve       nary form. From OpenSNP, we used the following com-
of the membership inference attack at 5% false positive         monly reported phenotypes: (i) eye color, 967 samples,
rate. We empirically build the null hypothesis (H0 in           (ii) hair type, 371 samples, (iii) hair color, 468 samples,
Appendix A). For every query, we determine the distri-          (iv) tan ability, 287 samples, (v) asthma, 226 samples,
bution of Λ under the null hypothesis using 20 individ-         (vi) lactose intolerance, 347 samples, (vii) earwax, 244

samples, (viii) tongue rolling, 434 samples, (ix) intol-

the ith query is calculated from the given set of l case people as $P_i = \left( \sum_{\Lambda_i \in \boldsymbol{\Lambda}_i} \mathbb{1}_{\Lambda_i < t_\alpha} \right) / l$
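
As a rough sketch of the power computation described in this part of the paper (reject the null hypothesis when Λ falls below tα, and report the fraction of case individuals for which this happens), the Python snippet below picks tα empirically from a list of null scores. The function names, the empirical-quantile rule used to pick the threshold, and the example scores are assumptions made for illustration; they are not taken from the evaluation.

```python
def threshold_at_fpr(null_scores, alpha=0.05):
    """Pick t_alpha so that at most a fraction alpha of null scores lie below it."""
    ordered = sorted(null_scores)
    k = int(alpha * len(ordered))  # number of null scores allowed below the threshold
    return ordered[k]


def power(case_scores, t_alpha):
    """Fraction of case individuals whose score is below t_alpha (null rejected)."""
    rejected = sum(1 for score in case_scores if score < t_alpha)
    return rejected / len(case_scores)


# Hypothetical scores: 20 control individuals (null) and 4 case individuals.
null_scores = [float(i) for i in range(20)]
case_scores = [-2.0, 0.5, 3.0, 7.0]

t_alpha = threshold_at_fpr(null_scores, alpha=0.05)
attack_power = power(case_scores, t_alpha)
```

With 20 null individuals and α = 0.05, this rule allows exactly one null score below the threshold, which matches the 5% false positive rate used in the text.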

Fig. 2. Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of newly
added donors.
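
For readers who want to reproduce this kind of plot, precision and recall can be computed as set-overlap measures between the SNPs the attacker infers for the victim and the SNPs the victim actually carries. The sketch below assumes that both sets contain identifiers of SNPs with at least one minor allele; the function name and the SNP identifiers are hypothetical.

```python
def precision_recall(inferred, actual):
    """Precision and recall of an inferred SNP set against the true SNP set."""
    inferred, actual = set(inferred), set(actual)
    true_positives = len(inferred & actual)
    precision = true_positives / len(inferred) if inferred else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical SNP identifiers, for illustration only.
actual_snps = {"rs1", "rs2", "rs3", "rs4"}
inferred_snps = {"rs1", "rs2", "rs9"}

precision, recall = precision_recall(inferred_snps, actual_snps)
```

Here two of the three inferred SNPs are correct and two of the four true SNPs are recovered, so precision is 2/3 and recall is 1/2.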

Section 6.1) performs substantially worse than the proposed clustering-based approach. The results also show that spectral clustering-based genome reconstruction is slightly better than the fuzzy clustering-based approach. We observed that allowing a SNP (that includes at least one minor allele) to be in multiple bins results in a high number of false positives. Therefore, in the remainder of this section, we use spectral clustering-based genome reconstruction for the evaluations.

To show the benefit of utilizing a beacon (and beacon update) in the genome reconstruction attack, we also computed the reconstruction accuracy of an attacker that only uses publicly available information (e.g., population statistics and the victim’s phenotype). As discussed, each victim we consider has a subset of the 21 phenotypes listed in Section 7.1. Using the associations of the victim’s phenotypes with the corresponding SNPs (extracted from SNPedia [8]), we assigned some SNP values of the victim. We observed that, on average, such a reconstruction achieves a precision of 18% and a recall of 47% on a total of 232 SNPs. Therefore, we conclude that having access to a beacon and knowing the membership of a victim in that beacon significantly increases the success of the genome reconstruction attack.

To show the effect of varying the number of bins (m′) in the genome reconstruction attack, in Figures 3 and 10 (in Appendix C), we show the attacker’s success when the number of newly added donors is m = 5 and the beacon size is n = 50 for the OpenSNP and HapMap beacons, respectively. We observed that for both beacons, precision increases and recall decreases with increasing m′. Also, as expected, precision and recall become balanced when m′ = m.

Next, in Figures 4 and 11 (in Appendix C), we show the effect of the beacon size (n) at time t when 5 new donors are added between times t and t + δ for the OpenSNP and HapMap beacons, respectively. Here, we assume that the number of bins (m′) is equal to the number of newly added donors (m). We observed that as the size of the beacon increases, both the precision and recall of the reconstruction attack remain almost the same (for a fixed number of newly added donors).

Even if the success of the genome reconstruction remains high, the number of flipped responses (from “no” to “yes”) may decrease when the beacon size is increased (as shown in Figure 4). In other words, the number of vulnerable SNPs of a victim (the ones that can be inferred using the change in the beacon responses) decreases, and this might result in lower performance in the phenotype inference and membership inference parts of the attack. However, as the beacon size increases, low-MAF SNPs of the victim (which typically provide the most valuable information for the membership inference attack) still remain vulnerable with high probability, since such SNPs are unlikely to be observed in other donors in the beacon. For example, in the previous experiment (in Figure 4), when the size of the beacon is increased from 50 to 400, the total number of vulnerable SNPs of a victim is reduced by 94%; however, the number of vulnerable SNPs of a victim with a MAF value smaller than 0.01 is only reduced by 52%.

Keeping the ratio of newly added donors fixed (to 5%), we also observed the change in the success of the attack with increasing beacon size when m′ = m in Figure 5 (we did this evaluation only for the OpenSNP beacon since the HapMap beacon did not have more than 100 donors). We observed that, when the beacon size increases beyond 100, although the recall of the attacker still remains high, its precision starts decreasing. This shows that the success of the identified attack mainly relies on the number of clusters the attacker needs to generate (in the proposed clustering-based algorithm). For small or mid-size beacons (e.g., NBDC Human Database [4] with slightly more than 100 individuals), even if the beacon update significantly increases the beacon’s size, the identified attack is still effective. On the other hand, for large size beacons (e.g., gno-

Fig. 3. Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon with varying number of bins/clusters (m′) in the genome reconstruction attack. Number of newly added donors (m) is 5.

Fig. 4. Precision and recall for the genome reconstruction of a newly added donor to OpenSNP beacon with varying beacon size (n). Number of newly added donors (m) is 5 and m′ = m for all plots.
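
The effect of beacon size on the number of vulnerable SNPs can be illustrated with a toy model. The sketch below assumes that each donor is represented by the set of SNPs at which they carry a minor allele, and that a beacon answers "yes" for a SNP if any donor carries it; a SNP is then vulnerable when its response flips from "no" to "yes" after the update. All names and SNP identifiers are hypothetical.

```python
def beacon_response(donors, snp):
    """A beacon answers yes (True) if any donor carries the minor allele at snp."""
    return any(snp in donor for donor in donors)


def flipped_snps(donors_before, new_donors, candidate_snps):
    """SNPs whose response changes from no to yes once new_donors are added."""
    donors_after = donors_before + new_donors
    return {
        snp
        for snp in candidate_snps
        if not beacon_response(donors_before, snp)
        and beacon_response(donors_after, snp)
    }


# Toy data: the victim carries three common SNPs and one rare SNP.
victim = {"rs1", "rs2", "rs3", "rs_rare"}
small_beacon = [{"rs1"}]
large_beacon = [{"rs1"}, {"rs2"}, {"rs3"}]

flips_small = flipped_snps(small_beacon, [victim], victim)
flips_large = flipped_snps(large_beacon, [victim], victim)
```

In the larger beacon the common SNPs already return "yes" before the update, so only the rare SNP flips. This mirrors the observation in the text that low-MAF SNPs tend to stay vulnerable as the beacon grows.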

mAD [5], with more than 100K individuals), the update size should be small for the vulnerability to exist.

Finally, we explored the scenario in which the attacker only has a partial snapshot of the beacon (instead of a full snapshot). In Figure 6, we show the success of the reconstruction attack when m = 5 donors are added (at time t + δ) into the OpenSNP beacon with size 50, when the attacker has varying snapshots of the beacon at time t and when m = m′. We observed that the success (precision and recall) of the reconstruction does not change with varying snapshots. However, the number of inferred SNPs (as a result of the genome reconstruction attack) decreases linearly with the decreasing size of the snapshot that is known by the attacker at time t.

7.4 Identifying the Victim’s Genome Using Phenotype Inference

Here, we evaluate the success of the attacker in identifying the reconstructed genome of the victim among all reconstructed genomes using the algorithm in Section 6.4. Since the HapMap dataset does not include phenotype information about the genome donors, we only use the OpenSNP beacon for this evaluation.

We employed and compared several machine learning models for genotype-phenotype associations, including Logistic Regression [23], SVM [22], Multi-layer Perceptron [72], Random Forest [67], and XGBoost [17]. Among these, we obtained the highest classifier accuracy with the Random Forest, and hence all reported results are based on this model.

In Figure 7, we show the ensemble classifier accuracy for varying number of newly added donors to the beacon (here, we assumed m′ = m and we observed similar patterns when m′ ≠ m as well). We used the original genomes of individuals in the training dataset when building the model. For testing, we used the reconstructed genomes of the victims (which may have noise due to reconstruction error). Beacon size is 50 in these experiments (i.e., n = 50).

We observed that the proposed algorithm provides 70% accuracy when the size of the beacon is increased by adding 2 individuals in the update, and the accuracy slightly decreases with increasing number of newly added donors. These results show that the attacker can identify the reconstructed genome of the victim among all m′ reconstructed genomes with high accuracy. As discussed before, in this experiment, we assumed the attacker has moderate auxiliary knowledge about the victim (i.e., phenotypic traits, which can be easily learnt from social network profiles of the victim). However, since genotype-phenotype associations are not strong yet, there is an accuracy bottleneck in the overall process due to this step. A stronger attacker (that has access to richer auxiliary information about the victim)