Datasets for Online Controlled Experiments

C. H. Bryan Liu*†    Ângelo Cardoso†    Paul Couturier*    Emma J. McCoy*
*Imperial College London, UK    †ASOS.com, UK
bryan.liu12@imperial.ac.uk

Abstract

Online Controlled Experiments (OCE) are the gold standard to measure impact and guide decisions for digital products and services. Despite many methodological advances in this area, the scarcity of public datasets and the lack of a systematic review and categorization hinder its development. We present the first survey and taxonomy for OCE datasets, which highlight the lack of a public dataset to support the design and running of experiments with adaptive stopping, an increasingly popular approach to enable quickly deploying improvements or rolling back degrading changes. We release the first such dataset, containing daily checkpoints of decision metrics from multiple, real experiments run on a global e-commerce platform. The dataset design is guided by a broader discussion on data requirements for common statistical tests used in digital experimentation. We demonstrate how to use the dataset in the adaptive stopping scenario using sequential and Bayesian hypothesis tests and learn the relevant parameters for each approach.

1   Introduction
Online controlled experiments (OCEs) have become popular among digital technology organisations in measuring the impact of their products and services, and in guiding business decisions [50, 53, 75]. Large tech companies including Google [41], LinkedIn [84], and Microsoft [49] report running thousands of experiments on any given day, and there are multiple companies established solely to manage OCEs for other businesses [10, 45]. OCEs are also considered a key step in the machine learning development lifecycle [8, 89].

OCEs are essentially randomized controlled trials run on the Web. The simplest example, commonly known as an A/B test, splits a group of entities (e.g. users of a website) randomly into two groups, where one group is exposed to some treatment (e.g. showing a “free delivery” banner on the website) while the other acts as the control (e.g. seeing the original website without any mention of free delivery). We calculate the decision metric(s) (e.g. the proportion of users who bought something) based on responses from both groups, and compare the metrics using a statistical test to draw causal statements about the treatment.
The ability to run experiments on the Web allows one to interact with a large number of subjects within a short time frame and collect a large number of responses. This, together with the scale of experimentation carried out by tech organizations, should lead to a wealth of datasets describing the results of experiments. However, there are not many publicly available OCE datasets, and we believe they have never been systematically reviewed nor categorized. This is in contrast to the machine learning field, which also enjoyed an application boom in the past decade yet already has established archives and detailed categorizations for its datasets [28, 77].
We argue the lack of relevant datasets arising from real experiments is hindering further development of OCE methods (e.g. new statistical tests, bias correction, and variance reduction methods). Many proposed statistical tests rely on simulated data that impose restrictive distributional assumptions and thus may not be representative of real-world scenarios. Moreover, it may be difficult to understand how two methods differ from each other, and to assess their relative strengths and weaknesses, without a common dataset to compare them on.
To address this problem, we present the first ever survey and taxonomy for OCE datasets. Our survey identified 12 datasets, including standalone experiment archives, accompanying datasets from scholarly works, and demo datasets from online courses on the design and analysis of experiments. We also categorize these datasets along dimensions such as the number of experiments each dataset contains, how granular each data point is time-wise and subject-wise, and whether a dataset includes results from real experiment(s).
The taxonomy enables us to engage in a discussion on the data requirements for an experiment by systematically mapping out which data dimension is required for which statistical test and/or for learning the hyperparameter(s) associated with the test. We also recognize that in practice data are often used for purposes beyond those for which they were originally collected [47]. Hence, we posit the mapping is equally useful in allowing one to understand the options available when choosing statistical tests given the format of the data they possess. Together with the survey, the taxonomy helps us to identify what types of dataset are required for commonly used statistical tests, yet are missing from the public domain.
One of the gaps the survey and taxonomy identify is datasets that can support the design and running of experiments with adaptive stopping (a.k.a. continuous monitoring / optional stopping), whose use we motivate below. Traditionally, experimenters analyze experiments using Null Hypothesis Statistical Tests (NHST, e.g. a Student's t-test). These tests require one to calculate and commit to a required sample size based on some expected treatment effect size, all prior to starting the experiment. Making extra decisions during the experiment, be it stopping the experiment early due to seeing favorable results, or extending the experiment as it teeters “on the edge of statistical significance” [61], is discouraged as it risks one having more false discoveries than intended [37, 60].
Clearly the restrictions above are incompatible with modern decision-making processes. Businesses operating online are incentivized to deploy any beneficial change and roll back any damaging change as quickly as possible. Using the “free delivery” banner example above, the business may have calculated that it requires four weeks to observe enough users based on an expected 1% change in the decision metric. If the experiment shows, two weeks in, that the banner is leading to a 2% improvement, it would be unwise not to deploy the banner to all users simply due to the need to run the experiment for another two weeks. Likewise, if the banner is shown to be causing a 2% loss, it makes every sense to immediately terminate the experiment and roll back the banner to stem further losses.
As a result, more experimenters are moving away from NHST and adopting adaptive stopping techniques. Experiments with adaptive stopping allow one to decide whether to stop an experiment at any interim point based on the sample responses observed so far, while retaining control of the false positive/discovery rate. To encourage further development in this area, both in methods and data, we release the ASOS Digital Experiments Dataset, which contains daily checkpoints of decision metrics from multiple, real OCEs run on the global online fashion retail platform.
The dataset design is guided by the requirements identified by the mapping between the taxonomy and statistical tests, and to the best of our knowledge, it is the first public dataset that can support the end-to-end design and running of online experiments with adaptive stopping. We demonstrate it can indeed do so by (1) running a sequential test and a Bayesian hypothesis test on all the experiments in the dataset, and (2) estimating the values of the hyperparameters associated with the tests. While the notion of ground truth does not exist in real OCEs, we show the dataset can also act as a quasi-benchmark for statistical tests by comparing results from the tests above with those of a t-test.
To summarize, our contributions are:
1. (Sections 2 & 3) We create, to the best of our knowledge, the first ever taxonomy of online controlled experiment datasets and apply it to datasets that are publicly available;
2. (Section 4) We map the relationship between the taxonomy and statistical tests commonly used in experiments by identifying the minimally sufficient set of statistics and dimensions required by each test. The mapping, which also applies to offline and non-randomized controlled experiments, enables experimenters to quickly identify the data collection requirements for their experiment design (and conversely the test options available given the data at hand); and
3. (Section 5) We make available, to the best of our knowledge, the first real, multi-experiment time series dataset, enabling the design and running of experimentation with adaptive stopping.

2   A Taxonomy for Online Controlled Experiment Datasets

We begin by presenting a taxonomy for OCE datasets, which is necessary to characterize and understand the results of a survey. To the best of our knowledge, there are no surveys nor taxonomies specifically on this topic prior to this work. While there is a large volume of work concerning the categorization of datasets in machine learning [28, 77], of research on online randomized controlled experiment methods [3, 4, 32, 68], and of general experiment design [43, 52], our searches on Google Scholar and Semantic Scholar using combinations of the keywords “online controlled experiment”/“A/B test”, “dataset”, and “taxonomy”/“categorization” yield no relevant results.

The taxonomy focuses on the following four main dimensions:
Experiment count   A dataset can contain data collected from a single experiment or from multiple experiments. Results from a single experiment are useful for demonstrating how a test works, though any learning should ideally involve multiple experiments. Two closely related but relatively minor dimensions are the variant count (the number of control/treatment groups in the experiment) and the metric count (the number of performance metrics the experiment is tracking). Having an experiment with multiple variants and metrics enables the demonstration of methods such as false discovery rate control procedures [7] and learning the correlation structure within an experiment.
Response granularity   Depending on the experiment analysis requirements and the constraints imposed by the online experimentation platform, a dataset may contain data aggregated to various levels. Consider the “free delivery” banner example in Section 1, where the website users are randomly allocated to the treatment (showing the banner) and control (not showing the banner) groups, with the goal of understanding whether the banner changes the proportion of users who bought something. In this case each individual user is considered a randomization unit [50].

A dataset may contain, for each experiment, only summary statistics at the group level, e.g. the proportion of users who have bought something in the control and treatment groups respectively. It can also record one response per randomization unit, with each row containing the user ID and whether the user bought something. A more detailed activity log at the sub-randomization-unit level can have each row containing information about a particular page view from a particular user.
Time granularity   An experiment can last anywhere from a week to many months [50], which provides many possibilities for recording the result. A dataset can opt to record the overall result only, showing the end state of an experiment; it may or may not come with a timestamp for each randomization unit or experiment if there are multiple instances of them. It can also record intermediate checkpoints for the decision metrics, ideally at regular intervals such as daily or hourly. These checkpoints can either be snapshots of each interval (recording activities between time t and t + 1, time t + 1 and t + 2, etc.) or cumulative from the start of the experiment (recording activities between time 0 and 1, time 0 and 2, etc.); the two forms carry the same information when the full series is recorded (see the sketch after the four dimensions).
Syntheticity   A dataset can record data generated from a real process. It can also be synthetic, i.e. generated via simulations with distributional assumptions applied. A dataset can also be semi-synthetic if it is generated from a real-life process and subsequently augmented with synthetic data.
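To illustrate the snapshot/cumulative distinction in the time granularity dimension, here is a minimal Python sketch (our own made-up counts, not from any surveyed dataset) showing that the two forms are interconvertible when the full checkpoint series is available:

```python
import numpy as np

# Hypothetical daily conversion counts for one experiment group.
snapshot = np.array([120, 95, 130, 110])   # activity between t and t+1

# Snapshot checkpoints -> checkpoints cumulative from experiment start.
cumulative = np.cumsum(snapshot)           # [120, 215, 345, 455]

# ...and back again.
recovered = np.diff(cumulative, prepend=0)
assert (recovered == snapshot).all()
```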
Note we can also describe datasets arising from any experiments (including offline and non-randomized controlled experiments) using these four dimensions. We will discuss in Section 4 how these dimensions map to common statistical tests used in online experimentation.
In addition, we also record the application domain, target demographics, and temporal coverage of the experiment(s) featured in a dataset. In an age when data are often reused, it is crucial to understand the underlying context, and to recognize that learnings from a dataset created under a certain context may not translate to another context. We also see the surfacing of such context as a way to promote considerations of fairness and transparency for experimenters as more experiment datasets become available [44]. For example, having target demographics information at a meta-level helps experimenters identify who were involved, or perhaps more importantly, who were not involved in experiments and could be adversely impacted by a treatment that is effectively untested.
Finally, two datasets can also differ in their medium of documentation and the presence/absence of a data management / long-term preservation plan. The latter includes the hosting location, the presence/absence of a DOI, and the type of license. We record these attributes for the datasets surveyed below for completeness.

3   Public Online Controlled Experiment Datasets
Here we discuss our approach to producing the first ever survey of OCE datasets and present its results. The survey was compiled via two search directions, which we describe below. For both directions, we conducted a first round of searches in May 2021, with a follow-up round in August 2021 to ensure we have the most up-to-date results.

We first search on the vanilla Google search engine using the keywords “Online controlled experiment "dataset"”, “A/B test "dataset"”, and “Multivariate test "dataset"”. For each keyword, we inspect the first 10 pages of the search results (top 100 results) for scholarly articles, web pages, blog posts, and documents that may host and/or describe a publicly available OCE dataset. The search term “dataset” is in double quotes to limit the search results to those with explicit mention of dataset(s). We also search on specialist data search engines/hosts, namely Google Dataset Search (GDS) and Kaggle, using the keywords “Online controlled experiment(s)” and “A/B test(s)”. We inspect the metadata and description of all the results returned (except for GDS, where we inspect the first 100 results for “A/B test(s)”) for relevant datasets as defined below.¹

¹ Searching for the keyword “Online controlled experiment” on GDS and Kaggle returned 42 and 7 results respectively, and “A/B test” returned “100+” and 286 results respectively. Curiously, replacing “experiment” and “test” in the keywords with their plural forms changes the number of results, with the former returning 6 and 10 results on GDS and Kaggle respectively, and the latter returning “100+” and 303 results respectively.
A dataset must record the result arising from a randomized controlled experiment run online to be included in the survey. The criterion excludes experimental data collected from offline experiments, e.g. those in agriculture [83], medicine [27], and economics [29]. It also excludes datasets used to perform quasi-experiments and observational studies, e.g. the LaLonde dataset used in econometrics [19] and datasets constructed for uplift modeling tasks [25, 40].²

² Uplift modeling (UM) tasks for online applications often start with an OCE [66], and thus we can consider UM datasets as OCE datasets with extra randomization-unit-level features. The nature of the tasks differs though: OCEs concern validating the average treatment effect across the population using a statistical test, whereas UM concerns modeling the conditional average treatment effect for each individual user, making it a more general causal inference task that is outside the scope of this survey.
The result is presented in Table 1. We place the 12 OCE datasets identified in this exercise along the four taxonomy dimensions defined in Section 2 and record the additional features. These datasets include two standalone archives for online media and education experiments respectively [56, 70], and three accompanying datasets for peer-reviewed research articles [23, 74, 86]. There are also tens of Kaggle datasets, blog posts, and code repositories that describe and/or duplicate one of the five example datasets used in four different massive open online courses on online controlled experiment design and analysis [9, 11, 38, 76]. Finally, we identify two standalone datasets hosted on Kaggle with relatively light documentation [31, 48].
From the table we observe a number of gaps in OCE dataset availability, the most obvious being the lack of datasets that record responses at a sub-randomization-unit level. In the sections below, we identify more of these gaps and discuss their implications for OCE analysis.

4   Matching Dataset Taxonomy with Statistical Tests
Specifying the data requirements (or structure) and performing statistical tests are perhaps two of the most common tasks carried out by data scientists. However, the link between the two processes is seldom mapped out explicitly. It is all too common to consider from scratch the question “I need to run this statistical test, how should I format my dataset?” (or more controversially, “I have this dataset, what statistical tests can I run?” [47]) for every new project/application, despite the list of possible dataset dimensions and statistical tests remaining largely the same.
We aim to speed up the process above by describing what summary statistics are required to perform common statistical tests in OCEs, and by linking the statistics back to the taxonomy dimensions defined in Section 2. The exercise is similar to identifying the sufficient statistic(s) for a statistical model [33], though the identification is done for the encapsulating statistical inference procedure, with a practical focus on data dimension requirements. We do so by stating the formulae used to calculate the corresponding effect sizes and test statistics and observing the summary statistics they require in common. The general approach enables one to also apply the resultant mapping to any experiment that involves a two-sample statistical test, including offline experiments and experiments without a randomized control. For brevity, we refrain from discussing the full model assumptions as well as their applicability; instead we point readers to the relevant work in the literature.

Upworthy Research Archive [56]
  Experiment count: Multiple (32,487) - 2-14 / 2 (C) | Response granularity: Group | Time granularity: Overall result only, timestamp per expt. | Syntheticity: Real | Application: Media & Ads, copy/creative | Documentation: Peer-reviewed data article | Preservation: Host: Open Science Framework; DOI: yes (see [56]); Licence: CC BY 4.0

ASSISTments Dataset from Multiple Randomized Controlled Experiments [70]
  Experiment count: Multiple (22) - 2 / 2 (BC) | Response granularity: Rand. Unit | Time granularity: Overall result only, no timestamp | Syntheticity: Real | Application: Education, teaching model | Documentation: Peer-reviewed data article | Preservation: Host: Author website

A/B Testing Web Analytics Data (from [87]) [86]
  Experiment count: Single - 5 / 2 (C) | Response granularity: Group | Time granularity: Overall result only, no timestamp | Syntheticity: Real | Application: Education, UX/UI change | Documentation: Accompanying dataset to peer-reviewed research article | Preservation: Host: University library; DOI: yes (see [86]); Licence: CC BY-SA 4.0

Dataset of two experiments of the application of gamified peer assessment model into online learning environment MeuTutor (from [73]) [74]
  Experiment count: Multiple (2) - 3+2 / 3+3 (R+C) | Response granularity: Rand. Unit | Time granularity: Overall result only, no timestamp | Syntheticity: Real | Application: Education, teaching model | Documentation: Peer-reviewed data article | Preservation: Host: Journal website; DOI: yes (see [74]); Licence: CC BY 4.0

The Effects of Change Decomposition on Code Review - A Controlled Experiment - Online appendix (from [24]) [23]
  Experiment count: Single - 2 / multi (CLR) | Response granularity: Rand. Unit | Time granularity: Overall result only, no timestamp | Syntheticity: Real | Application: Tech, dev. process | Documentation: Peer-reviewed data article | Preservation: Host: University library; DOI: yes (see [23]); Licence: CC BY-SA 4.0

Udacity Free Trial Screener Experiment (from Udacity A/B Testing Course - Final Project [38]); see e.g. [62, 71, 79]
  Experiment count: Single - 2 / 4 (C) | Response granularity: Group | Time granularity: Daily checkpoint, snapshot | Syntheticity: Real | Application: Education, UX/UI change | Documentation: Blog posts & Kaggle notebooks, e.g. [57, 69, 88] | Preservation: Host: Kaggle / GitHub (multiple); DOI: unknown; Licence: unknown

“Analyse A/B Test Results” Dataset (from Udacity Online Data Analyst Course - Project 3 [76]); see e.g. [2, 18, 64]
  Experiment count: Single - 2 / 1 (B) | Response granularity: Rand. Unit | Time granularity: Overall result only, timestamp per RU | Syntheticity: Unknown | Application: E-commerce, UX/UI change | Documentation: Blog posts & Kaggle notebooks, e.g. [15, 54, 67] | Preservation: Host: Kaggle / GitHub (multiple); DOI: unknown; Licence: unknown

Mobile Games A/B Testing with Cookie Cats (DataCamp project [9]); see e.g. [30, 72, 85]
  Experiment count: Single - 2 / 3 (BC) | Response granularity: Rand. Unit | Time granularity: Overall result only, no timestamp | Syntheticity: Real | Application: Gaming, design change | Documentation: Kaggle notebook | Preservation: Host: Kaggle / Course website; DOI: unknown; Licence: unknown

Experiment Dataset (from DataCamp A/B Testing in R Course [11]) [12]
  Experiment count: Single - 2 / 1 (B) | Response granularity: Rand. Unit | Time granularity: Overall result only, timestamp per RU | Syntheticity: Synthetic | Application: Tech, UX/UI change | Documentation: Course notes, blog posts | Preservation: Host: Course website; DOI: unknown; Licence: unknown

Data Visualization Website - April 2018 (from DataCamp A/B Testing in R Course [11]) [13]
  Experiment count: Single - 2 / 4 (BR) | Response granularity: Rand. Unit | Time granularity: Overall result only, timestamp per RU | Syntheticity: Synthetic | Application: Tech, UX/UI change | Documentation: Course notes | Preservation: Host: Course website; DOI: unknown; Licence: unknown

Grocery Website Data for AB Test [48]
  Experiment count: Single - 2 / 1 (B) | Response granularity: Rand. Unit | Time granularity: Overall result only, no timestamp | Syntheticity: Unknown | Application: E-commerce, UX/UI change | Documentation: Kaggle notebook | Preservation: Host: Kaggle; DOI: unknown; Licence: unknown

Ad A/B Testing (aka SmartAd AB Data) [31]
  Experiment count: Single - 2 / 2 (B) | Response granularity: Rand. Unit | Time granularity: Overall result only, timestamp per RU | Syntheticity: Unknown | Application: Media & Ads, display ads | Documentation: Kaggle notebook | Preservation: Host: Kaggle; DOI: N/A; Licence: CC BY-SA 3.0

Table 1: Results from the first ever survey of OCE datasets. The 12 datasets identified are placed on the four taxonomy dimensions defined in Section 2, together with the additional attributes recorded. Under Experiment Count, a Variant/Metric Count of “x / y (BCLR)” means the dataset features x variants and y metrics, with the metrics based on Binary, Count, Likert-scale, and Real-valued responses. Under Response and Time Granularity, Randomization Unit is abbreviated as RU or Rand. Unit. Note that a large proportion of resources are accessed via links to non-scholarly articles (blog posts, Kaggle dataset pages, and GitHub repositories); these resources may not persist over time.

4.1   Effect size and Welch's t-test

We consider a two-sample setting and let $X_1, \ldots, X_N$ and $Y_1, \ldots, Y_M$ be i.i.d. samples from the distributions $F_X(\cdot)$ and $F_Y(\cdot)$ respectively. We assume the first two moments exist for the distributions $F_X$ and $F_Y$, with their means and variances denoted $(\mu_X, \sigma_X^2)$ and $(\mu_Y, \sigma_Y^2)$ respectively. We also denote the sample means and variances of the two samples $(\bar{X}, s_X^2)$ and $(\bar{Y}, s_Y^2)$ respectively.

Often we are interested in the difference between the means of the two distributions, $\Delta = \mu_Y - \mu_X$, commonly known as the effect size (of the difference in means) or the average treatment effect. A standardized effect size enables us to compare the difference across many experiments and is thus useful in meta-analyses. One commonly used effect size is Cohen's d, defined as the difference in sample means divided by the pooled sample standard deviation [17]:

$$d = \left(\bar{Y} - \bar{X}\right) \Bigg/ \sqrt{\frac{(N-1)s_X^2 + (M-1)s_Y^2}{N + M - 2}}. \tag{1}$$
We are also interested in whether the samples carry sufficient evidence to indicate $\Delta$ is different from a prescribed value $\theta_0$. This can be done via a hypothesis test with $H_0: \Delta = \theta_0$ and $H_1: \Delta \neq \theta_0$.³ One of the most common statistical tests used in online controlled experiments is Welch's t-test [81], in which we calculate the test statistic $t$ as follows:

$$t = \left(\bar{Y} - \bar{X}\right) \Bigg/ \sqrt{\frac{s_X^2}{N} + \frac{s_Y^2}{M}}. \tag{2}$$

³ There are other ways to specify the hypotheses, such as those in superiority and non-inferiority tests [34], though they are unlikely to change the data requirements as long as the hypotheses remain anchored on $\theta_0$.
We observe that in order to calculate the two quantities stated above, we require six quantities: two means ($\bar{X}$, $\bar{Y}$), two (sample) variances ($s_X^2$, $s_Y^2$), and two counts ($N$, $M$). We call these quantities Dimension Zero (D0) quantities as they are the bare minimum required to run a statistical test; these quantities will be expanded along the taxonomy dimensions defined in Section 2.
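As a concrete illustration, the following Python sketch computes Cohen's $d$ and Welch's $t$ from the six D0 quantities alone, with no access to individual responses. The function name and example numbers are ours; the Welch-Satterthwaite degrees of freedom used for the p-value are standard but not spelled out in the text above.

```python
import math
from scipy import stats

def cohen_d_welch_t(x_bar, y_bar, s2_x, s2_y, n, m):
    """Compute Cohen's d (Eq. (1)) and Welch's t (Eq. (2)) from the
    six D0 quantities: two means, two sample variances, two counts."""
    # Pooled sample standard deviation for Cohen's d.
    pooled_sd = math.sqrt(((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2))
    d = (y_bar - x_bar) / pooled_sd

    # Welch's t statistic with Welch-Satterthwaite degrees of freedom.
    se2 = s2_x / n + s2_y / m
    t = (y_bar - x_bar) / math.sqrt(se2)
    df = se2**2 / ((s2_x / n) ** 2 / (n - 1) + (s2_y / m) ** 2 / (m - 1))
    p = 2 * stats.t.sf(abs(t), df)  # two-sided p-value
    return d, t, p

# Illustrative summary statistics for a conversion-style metric.
d, t, p = cohen_d_welch_t(x_bar=0.100, y_bar=0.102,
                          s2_x=0.0900, s2_y=0.0918,
                          n=100_000, m=100_000)
```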
Cluster randomization / dependent data   The sample variance estimates ($s_X^2$, $s_Y^2$) may be biased in the case where cluster randomization is involved. Using again the “free delivery” banner example, instead of randomly assigning each individual user to the control and treatment groups, the business may randomly assign postcodes to the two groups, with all users from the same postcode getting the same version of the website. In this case user responses may become correlated, which violates the independence assumptions in statistical tests. Common workarounds, including the use of the bootstrap [6] and the Delta method [22], generally require access to sub-randomization-unit responses.

4.2   Experiments with adaptive stopping

As discussed in Section 1, experiments with adaptive stopping are becoming increasingly popular in the OCE community. Here we motivate the data requirements for statistical tests in this domain by looking at the quantities required to calculate the test statistics for a mixture Sequential Probability Ratio Test (mSPRT) [45] and a Bayesian hypothesis test using the Bayes factor [21], two popular approaches in online experimentation. There are many other tests that support adaptive stopping [59, 78], though their data requirements, in terms of the dimensions defined in Section 2, should be largely identical.
We first observe that running an mSPRT with a normal mixing distribution $H = N(\theta_0, \tau^2)$ involves calculating the following test statistic upon observing the first $n$ $X_i$ and $Y_j$ (see Eq. (11) in [45]):

$$\tilde{\Lambda}_n^{H,\theta_0} = \sqrt{\frac{\sigma_X^2 + \sigma_Y^2}{\sigma_X^2 + \sigma_Y^2 + n\tau^2}} \, \exp\!\left(\frac{n^2 \tau^2 \left(\bar{Y}_n - \bar{X}_n - \theta_0\right)^2}{2\left(\sigma_X^2 + \sigma_Y^2\right)\left(\sigma_X^2 + \sigma_Y^2 + n\tau^2\right)}\right), \tag{3}$$

where $\bar{X}_n = n^{-1}\sum_{i=1}^{n} X_i$ and $\bar{Y}_n = n^{-1}\sum_{j=1}^{n} Y_j$ represent the sample means of the $X$s and $Y$s up to sample $n$ respectively, and $\tau^2$ is a hyperparameter to be specified or learnt from data.
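A minimal Python sketch of Eq. (3), together with the always-valid p-value process described in [45] (the running minimum of $1/\tilde{\Lambda}_n$, capped at one); function and variable names are ours for illustration:

```python
import numpy as np

def msprt_statistic(x_bar_n, y_bar_n, n, var_x, var_y, tau2, theta0=0.0):
    """Mixture SPRT statistic of Eq. (3) with a N(theta0, tau2) mixing
    distribution, assuming the response variances are known."""
    v = var_x + var_y
    scale = np.sqrt(v / (v + n * tau2))
    exponent = (n**2 * tau2 * (y_bar_n - x_bar_n - theta0) ** 2
                / (2 * v * (v + n * tau2)))
    return scale * np.exp(exponent)

def always_valid_p(lambda_sequence):
    """Always-valid p-value process of [45]: the running minimum of
    1 / Lambda_n, capped at one."""
    p, p_sequence = 1.0, []
    for lam in lambda_sequence:
        p = min(p, 1.0 / lam)
        p_sequence.append(p)
    return p_sequence
```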
For Bayesian hypothesis tests using the Bayes factor, we calculate the (square root of the) Wald test statistic upon seeing the first $n$ $X_i$ and first $m$ $Y_j$ [20, 21]:

$$W_{n,m} = \frac{\bar{Y}_m - \bar{X}_n}{\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}} = \underbrace{\frac{\bar{Y}_m - \bar{X}_n}{\sqrt{\left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right) \Big/ \left(\frac{1}{n} + \frac{1}{m}\right)}}}_{\delta_{n,m}} \; \underbrace{\sqrt{1 \Big/ \left(\tfrac{1}{n} + \tfrac{1}{m}\right)}}_{\sqrt{E_{n,m}}}\,, \tag{4}$$

where $\delta_{n,m}$ and $E_{n,m}$ are the effect size (standardized by the pooled variance) and the effective sample size of the test respectively. In OCEs it is common to appeal to the central limit theorem and assume a normal likelihood for the effect size, i.e. $\delta_{n,m} \sim N(\mu, 1/E_{n,m})$. We then compare the hypotheses $H_0: \mu = \theta_0$ and $H_1: \mu \sim N(\theta_0, V^2)$ by calculating the Bayes factor [46]:

$$\mathrm{BF}_{n,m} = \frac{f(\delta_{n,m} \mid H_1)}{f(\delta_{n,m} \mid H_0)} = \frac{\phi\!\left(\delta_{n,m};\, \theta_0,\, V^2 + 1/E_{n,m}\right)}{\phi\!\left(\delta_{n,m};\, \theta_0,\, 1/E_{n,m}\right)}, \tag{5}$$

where $\phi(\cdot\,; a, b)$ is the PDF of a normal distribution with mean $a$ and variance $b$, and $V^2$ is a hyperparameter that we specify or learn from data.
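A sketch of the corresponding computation (illustrative names, ours), pairing Eq. (5) with the posterior belief update $\pi(H_0 \mid \text{data}) = \pi(H_0) \big/ \left(\pi(H_0) + (1 - \pi(H_0))\,\mathrm{BF}_{n,m}\right)$, which follows directly from Bayes' rule and is used in Section 5.1:

```python
import numpy as np
from scipy.stats import norm

def bayes_factor(x_bar_n, y_bar_m, n, m, var_x, var_y, v2, theta0=0.0):
    """Bayes factor of Eq. (5), built from the standardized effect size
    delta_{n,m} and effective sample size E_{n,m} of Eq. (4)."""
    effective_n = 1.0 / (1.0 / n + 1.0 / m)                 # E_{n,m}
    delta = (y_bar_m - x_bar_n) / np.sqrt(
        (var_x / n + var_y / m) * effective_n)              # delta_{n,m}
    return (norm.pdf(delta, loc=theta0, scale=np.sqrt(v2 + 1 / effective_n))
            / norm.pdf(delta, loc=theta0, scale=np.sqrt(1 / effective_n)))

def posterior_h0(bf, prior_h0=0.75):
    """Posterior belief in H0 given BF = f(delta | H1) / f(delta | H0)."""
    prior_odds = prior_h0 / (1.0 - prior_h0)
    return prior_odds / (prior_odds + bf)
```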
During an experiment with adaptive stopping, we calculate the test statistics stated above many times for different $n$ and $m$. This means a dataset can only support the running of such experiments if it contains intermediate checkpoints for the counts ($n$, $m$) and the means ($\bar{X}_n$, $\bar{Y}_m$), ideally cumulative from the start of the experiment. Often one also requires the variances at the same time points (see below). The only exception to this dimensional requirement is the case where the dataset contains responses at a randomization unit or finer level of granularity and, despite recording the overall results only, has a timestamp per randomization unit. Under this special case, we can still construct the cumulative means ($\bar{X}_n$, $\bar{Y}_m$) for all relevant values of $n$ and $m$ by ordering the randomization units by their associated timestamps.
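For instance, given randomization-unit-level rows with timestamps (hypothetical column names below), the cumulative checkpoint series can be rebuilt with a sort and grouped cumulative sums; a sketch in Python/pandas:

```python
import pandas as pd

# Hypothetical randomization-unit-level log: one row per user, with the
# assigned group, an assignment/response timestamp, and the response.
log = pd.DataFrame({
    "group": ["control", "treatment", "control", "treatment"],
    "timestamp": pd.to_datetime(["2021-05-01", "2021-05-01",
                                 "2021-05-02", "2021-05-02"]),
    "response": [0, 1, 1, 1],
}).sort_values("timestamp")

# Cumulative counts (n, m) and means (X_bar_n, Y_bar_m) per group.
log["count"] = log.groupby("group").cumcount() + 1
log["cum_mean"] = log.groupby("group")["response"].cumsum() / log["count"]
```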
Learning the effect size distribution (hyper)parameters   The two tests introduced above feature hyperparameters ($\tau^2$ and $V^2$) that have to be specified or learnt from data. These parameters characterize the prior belief about the effect size distribution, which will be most effective if it “matches the distribution of true effects across the experiments a user runs” [45]. Common parameter estimation procedures [1, 5, 39] require results from multiple related experiments.
Estimating the response variance   In the equations above, the response variances of the two samples, $\sigma_X^2$ and $\sigma_Y^2$, are assumed to be known. In practice we often use the plug-in empirical estimates $(s_X^2)_n$ and $(s_Y^2)_m$, the sample variances for the first $n$ $X_i$ and first $m$ $Y_j$ respectively, and thus the data dimension requirement is identical to that of the counts and means discussed above. In the case where the plug-in estimate may be biased due to dependent data, we also require a sub-randomization-unit response granularity (see Section 4.1).

4.3   Non-parametric tests

We also briefly discuss the data requirements for non-parametric tests, where we do not impose any distributional assumptions on the responses but compare the hypotheses $H_0: F_X \equiv F_Y$ and $H_1: F_X \neq F_Y$, where we recall $F_X$ and $F_Y$ are the distributions of the two samples.
One of the most commonly used (frequentist) non-parametric tests in OCEs, the Mann-Whitney U-test [55], calculates the following test statistic:

$$U = \sum_{i=1}^{N} \sum_{j=1}^{M} S(X_i, Y_j), \quad \text{where } S(X, Y) = \begin{cases} 1 & \text{if } Y < X, \\ 1/2 & \text{if } Y = X, \\ 0 & \text{if } Y > X. \end{cases} \tag{6}$$
While a rank-based method is available for large $N$ and $M$, both methods require knowledge of all the $X_i$ and $Y_j$. The requirement is the same for other non-parametric tests, e.g. the Wilcoxon signed-rank [82], Kruskal-Wallis [51], and Kolmogorov-Smirnov tests [26]. This suggests a dataset can only support a non-parametric test if it provides responses at least at the randomization unit level.
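A brute-force sketch of Eq. (6) in Python makes the data requirement explicit: every individual $X_i$ and $Y_j$ must be available. Recent SciPy versions expose the same statistic via scipy.stats.mannwhitneyu (the example values are made up):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def u_statistic(x, y):
    """Brute-force U of Eq. (6): scores 1 when X beats Y, 1/2 for ties."""
    diff = np.asarray(x)[:, None] - np.asarray(y)[None, :]  # pairwise X_i - Y_j
    return (diff > 0).sum() + 0.5 * (diff == 0).sum()

x, y = [1.2, 3.4, 2.2], [0.7, 2.2, 5.1]
print(u_statistic(x, y))              # 4.5
print(mannwhitneyu(x, y).statistic)   # same U for x in recent SciPy versions
```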
We conclude by showing how we can combine the individual data requirements above to obtain the requirements for designing and/or running experiments with more complicated statistical tests. This is possible due to the orthogonal design of the taxonomy dimensions. Consider an experiment with adaptive stopping using Bayesian non-parametric tests (e.g. with a Pólya Tree prior [16, 42]). It involves a non-parametric test and hence requires responses at a randomization unit level. It computes multiple Bayes factors for adaptive stopping and hence requires intermediate checkpoints for the responses (or a timestamp for each randomization unit). Finally, to learn the hyperparameters of the Pólya Tree prior we require multiple related experiments. The substantial data requirements along three or more dimensions perhaps explain the lack of relevant OCE datasets and the tendency for experimenters to use simpler statistical tests for the day-to-day design and/or running of OCEs.

Figure 1: Distribution of p-values attained by the 99 OCEs in the ASOS Digital Experiments Dataset using Welch's t-tests, split by decision metric (four histograms, one per metric). The leftmost bar in each histogram represents experiments with p < 0.05. Here we treat OCEs with multiple variants as multiple independent OCEs.

5   A Novel Dataset for Experiments with Adaptive Stopping

We finally introduce the ASOS Digital Experiments Dataset, which we believe is the first public dataset that supports the end-to-end design and running of OCEs with adaptive stopping. We motivate why this is the case, provide a light description of the dataset (and a link to the more detailed accompanying datasheet⁵), and showcase the capabilities of the dataset via a series of experiments. We also discuss the ethical implications of releasing this dataset.
Recall from Section 4.2 that in order to support the end-to-end design and running of experiments with adaptive stopping, we require a dataset that (1) includes multiple related experiments; (2) is real, so that any parameters learnt are reflective of the real-world scenario; and either (3a) contains intermediate checkpoints for the summary statistics during each experiment (i.e. is time-granular), or (3b) contains responses at a randomization unit granularity with a timestamp for each randomization unit (i.e. is response-granular with timestamps).
None of the datasets surveyed in Section 3 meets all three criteria. While the Upworthy [56], ASSISTments [70], and MeuTutor [74] datasets meet the first two criteria, they all fail to meet the third.⁴ The Udacity Free Trial Screener Experiment dataset meets the last two criteria by having results from a real experiment with daily snapshots of the decision metrics (and hence being time-granular), which supports the running of an experiment with adaptive stopping. However, the dataset only contains a single experiment, which is not helpful for learning the effect size distribution (the design).
The ASOS Digital Experiments Dataset contains results from OCEs run by a business unit within ASOS.com, a global online fashion retail platform. In terms of the taxonomy defined in Section 2, the dataset contains multiple (78) real experiments, with two to five variants in each experiment and four decision metrics based on binary, count, and real-valued responses. The results are aggregated at a group level, with daily or 12-hourly checkpoints of the metric values cumulative from the start of the experiment. The dataset design meets all three criteria stated above, and hence differentiates itself from other public datasets.
We provide readers with an accompanying datasheet (based on [36]) that gives further information about the dataset, and we host the dataset on the Open Science Framework to ensure it is easily discoverable and can be preserved long-term.⁵ It is worth noting that the dataset is released with the intent to support development of the statistical methods required to run OCEs. The experiment results in the dataset are not representative of ASOS.com's overall business operations, product development, or experimentation program operations, and no conclusion of such should be drawn from this dataset.
⁴ All three report the overall results only and hence are not time-granular. The Upworthy dataset reports group-level statistics and hence is not response-granular. The ASSISTments and MeuTutor datasets are response-granular but lack the timestamps needed to order the samples.

⁵ Link to dataset and accompanying datasheet: [Private link, reviewers please refer to the OpenReview portal]

Figure 2: Change in (left) p-values in a mixture Sequential Probability Ratio Test (τ² = 5.92e-06) and (right) posterior belief in the null hypothesis (π(H₀ | data)) in a Bayesian hypothesis test (V² = 5.93e-06, π(H₀) = 0.75) during the experiment, for five experiments selected at random (Expt.-Variant IDs 08bcc2-1, 591c2c-1, a4386f-1, bac0d3-1, c56288-1). The experiment duration (x-axis) is normalized by each experiment's overall runtime. Only results for Metric 4 are shown.

5.1   Potential use cases
Meta-analyses   The multi-experiment nature of the dataset enables one to perform meta-analyses. A simple example is to characterize the distribution of p-values (under a Welch's t-test) across all experiments (see Figure 1). We observe that roughly a quarter of the experiments in this dataset attain p < 0.05, and attribute this to the fact that what we test in OCEs is often guided by what domain experts think may have an impact. That said, we invite external validation of whether there is evidence for data dredging using e.g. [58].
Design and running of experiments with adaptive stopping   We then demonstrate the dataset can indeed support OCEs with adaptive stopping by performing a mixture Sequential Probability Ratio Test (mSPRT) and a Bayesian hypothesis test via the Bayes factor for each experiment and metric. This requires learning the hyperparameters $\tau^2$ and $V^2$. We learn, for each metric, a naïve estimate for $V^2$ by collating the $\delta_{n,m}$ (see (4)) at the end of each experiment and taking their sample variance. This yields the estimates 1.30e-05, 1.07e-05, 6.49e-06, and 5.93e-06 for the four metrics respectively. For $\tau^2$, we learn near-identical naïve estimates by collating the values of Cohen's d (see (1)) instead. However, as $\tau^2$ captures the spread of unstandardized effect sizes, we specify in each test $\tau^2 = d \cdot (s_X^2)_n$, where $(s_X^2)_n$ is the sample variance of all responses up to the $n$th observation in that particular experiment. The Bayesian tests also require a prior belief in the null hypothesis being true ($\pi(H_0)$); we set it to 0.75 based on what we observed in the t-tests above.
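The naïve $V^2$ estimate can be reproduced along these lines (a sketch; the per-experiment checkpoint record structure is hypothetical):

```python
import numpy as np

def naive_v2_estimate(final_checkpoints):
    """Collate delta_{n,m} (Eq. (4)) at each experiment's final checkpoint
    and take the sample variance as a naive estimate of V^2."""
    deltas = []
    for c in final_checkpoints:  # dicts: x_bar, y_bar, var_x, var_y, n, m
        effective_n = 1.0 / (1.0 / c["n"] + 1.0 / c["m"])
        deltas.append((c["y_bar"] - c["x_bar"]) / np.sqrt(
            (c["var_x"] / c["n"] + c["var_y"] / c["m"]) * effective_n))
    return np.var(deltas, ddof=1)  # sample variance
```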
We then calculate the p-value in the mSPRT and the posterior belief in the null hypothesis ($\pi(H_0 \mid \text{data})$) in the Bayesian test for each experiment and metric at each daily/12-hourly checkpoint, following the procedures stated in [45] and [21, 46] respectively. We plot the results for five experiments selected at random in Figure 2, which shows the p-value for an mSPRT is monotonically non-increasing, while the posterior belief for a Bayesian test can fluctuate depending on the effect size observed so far.
A quasi-benchmark for adaptive stopping methods   Real online controlled experiments, unlike machine learning tasks, generally do not have a notion of ground truth. The use of a quasi-ground truth enables the comparison between two hyperparameter settings of the same adaptive stopping method, or between two adaptive stopping methods. We use the significant / not significant verdict from a Welch's t-test at the end of the experiment as the quasi-ground truth. We can then compare this “ground truth” to the significant / not significant verdict of an mSPRT at different stages of individual experiments. This yields many “confusion matrices” over different stages of individual experiments, where a “Type I error” corresponds to cases where the Welch's t-test gives a not significant result while the mSPRT reports a significant result; a confusion matrix for the end of each experiment can be seen in Table 2. As the dataset was collected without early stopping, it allows us to perform sensitivity analysis and optimization on the hyperparameters of the mSPRT under what can be construed as a “precision-recall” tradeoff for statistically significant treatments.
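Constructing such a confusion matrix amounts to cross-tabulating the two verdicts at a chosen checkpoint; a minimal sketch with made-up verdicts:

```python
import numpy as np
import pandas as pd

# Hypothetical end-of-experiment verdicts (True = significant at 5%).
t_test_sig = np.array([True, False, True, False, False])
msprt_sig = np.array([True, False, False, True, False])

# Cross-tabulate the quasi-ground truth (t-test) against the mSPRT
# verdict; the (t-test False, mSPRT True) cell plays the role of a
# "Type I error" under the quasi-benchmark.
confusion = pd.crosstab(pd.Series(t_test_sig, name="t-test"),
                        pd.Series(msprt_sig, name="mSPRT"))
print(confusion)
```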
Other use cases   The time series nature of this dataset enables one to detect bias (of the estimator) across time, e.g. bias caused by concept drift or feedback loops. In the context of OCEs, [14] described a number of methods to detect invalid experiments over time that may be run on this dataset. Moreover, being both a multi-experiment and a time series dataset, it also enables one to learn the correlation structure across experiments, decision metrics, and time [65, 80].

5.2   Ethical considerations

We finally discuss the ethical implications of releasing the dataset, touching on data protection and anonymization, potential misuses, and the ethical considerations for running OCEs in general.

                t-test significant               t-test not significant
                mSPRT sig.    mSPRT not sig.     mSPRT sig.    mSPRT not sig.
Metric 1            19              7                 2              71
Metric 2            20              7                18              49
Metric 3            16              9                 6              63
Metric 4            16             11                 4              63

Table 2: Comparing the number of statistically significant / not significant results reported by a Welch's t-test and an mSPRT at the end of an experiment for all four metrics.

Data protection and anonymization   The dataset records the aggregated activities of hundreds of thousands or millions of website users for business measurement purposes, and hence it is impossible to identify a particular user. Moreover, to minimize the risk of disclosing business-sensitive information, all experiment context is either removed or anonymized such that one should not be able to tell who is in an experiment, when it was run, what treatment it involves, and what decision metrics are used. We refer readers to the accompanying datasheet⁵ for further details in this area.
Potential misuses   An OCE dataset, no matter how anonymized, reflects the behavior of its participants under a certain application domain and time. We urge potential users of this dataset to exercise caution when attempting to generalize the learnings. It is important to emphasize that the learnings are different from the statistical methods and processes demonstrated on this dataset. We believe the latter are generalizable, i.e. they can be applied to other datasets with similar data dimensions regardless of the datasets' application domain, target demographics, and temporal coverage, and we appeal to potential users of the dataset to focus on these.
One example of generalizing the learnings is the use of this dataset as a full performance benchmark. As discussed above, this dataset does not have a notion of ground truth, and any quasi-ground truths constructed are themselves a source of bias to estimators. Thus, comparisons of experiment designs need to be considered at a theoretical level [52]. Another example would be directly applying the value of hyperparameter(s) obtained while training a model on this dataset to another dataset. While this may work for similar application domains, the less similar the domains are, the less likely the learnt hyperparameters will transfer, introducing a risk of incurring bias both in the estimator and in fairness.
372   Running OCEs in general The dataset is released with the aim of supporting experiments with
373   adaptive stopping, which will enable faster experimentation cycles. As we run more experiments, the
374   ethical concerns around running OCEs in general will naturally mount, as they are ultimately human
375   subjects research. We reiterate the importance of the following three principles when we design and
376   run experiments [35, 63]: respect for persons, beneficence (properly assessing and balancing the risks
377   and benefits), and justice (ensuring participants are not exploited), and refer readers to Chapter 9
378   of [50] and its references for further discussions in this area.

379   6   Conclusion

380   Online controlled experiments (OCE) are a powerful tool for online organizations to assess the
381   impact of their digital products and services. To safeguard future methodological development in the
382   area, it is vital to have access to, and a systematic understanding of, relevant datasets arising from
383   real experiments. We described the results of the first survey of publicly available OCE datasets, and
384   provided a dimensional taxonomy that links the data collection and statistical test requirements. We
385   also released the first dataset that can support OCEs with adaptive stopping, whose design is grounded
386   in a theoretical discussion linking the taxonomy and the statistical tests. Via extensive experiments,
387   we also showed that the dataset is capable of addressing the identified gap in the literature.
388   Our work on surveying, categorizing, and enriching the publicly available OCE datasets is just the
389   beginning, and we invite the community to join in the effort. As discussed above, we have yet to see a
390   dataset that can support methods dealing with correlated data due to cluster randomization, or the
391   end-to-end design and running of experiments with adaptive stopping using Bayesian non-parametric
392   tests. We also see ample opportunity to generalize the survey to cover datasets arising from uplift
393   modeling tasks, quasi-experiments, and observational studies. Finally, we can further expand the
394   taxonomy, which already supports datasets from all experiments, with extra dimensions (e.g. number
395   of features to support stratification, control variate, and uplift modeling methods) as the area matures.

396   Acknowledgments and Disclosure of Funding
397   CHBL is part-funded by the EPSRC CDT in Modern Statistics and Statistical Machine Learning at
398   Imperial College London and University of Oxford (StatML.IO) and ASOS.com.

399   References
400    [1] Christopher P. Adams. Empirical Bayesian estimation of treatment effects. SSRN Electronic Jour-
401        nal, 2018. doi: 10.2139/ssrn.3157819. URL https://doi.org/10.2139/ssrn.3157819.
402    [2] Sadiq Alreemi. Analyze AB test results, 2019. URL https://github.com/SadiqAlreemi/
403        Analyze-AB-Test-Results. Code repository.
404    [3] Florian Auer and Michael Felderer. Current state of research on continuous experimentation: A
405        systematic mapping study. In 2018 44th Euromicro Conference on Software Engineering and
406        Advanced Applications (SEAA), pages 335–344, 2018. doi: 10.1109/SEAA.2018.00062.
407    [4] Florian Auer, Rasmus Ros, Lukas Kaltenbrunner, Per Runeson, and Michael Felderer. Controlled
408        experimentation in continuous experimentation: Knowledge and challenges. Information and
409        Software Technology, 134:106551, 2021. ISSN 0950-5849. doi: https://doi.org/10.1016/
410        j.infsof.2021.106551. URL https://www.sciencedirect.com/science/article/pii/
411        S0950584921000367.
412    [5] Eduardo M. Azevedo, Alex Deng, José L. Montiel Olea, and E. Glen Weyl. Empirical Bayes
413        estimation of treatment effects with many A/B tests: An overview. AEA Papers and Proceedings,
414        109:43–47, May 2019. doi: 10.1257/pandp.20191003. URL https://www.aeaweb.org/
415        articles?id=10.1257/pandp.20191003.
416    [6] Eytan Bakshy and Dean Eckles. Uncertainty in online experiments with dependent data: An
417        evaluation of bootstrap methods. In Proceedings of the 19th ACM SIGKDD International
418        Conference on Knowledge Discovery and Data Mining, KDD ’13, page 1303–1311, New
419        York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450321747. doi:
420        10.1145/2487575.2488218. URL https://doi.org/10.1145/2487575.2488218.
421    [7] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and
422        powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B
423        (Methodological), 57(1):289–300, January 1995. doi: 10.1111/j.2517-6161.1995.tb02031.x.
424        URL https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.
425    [8] Lucas Bernardi, Themistoklis Mavridis, and Pablo Estevez. 150 successful machine learning
426        models: 6 lessons learned at Booking.com. In Proceedings of the 25th ACM SIGKDD Interna-
427        tional Conference on Knowledge Discovery & Data Mining, KDD ’19, page 1743–1751, New
428        York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362016. doi:
429        10.1145/3292500.3330744. URL https://doi.org/10.1145/3292500.3330744.
430    [9] Rasmus Bååth and Bret Romero. Mobile games A/B testing with Cookie Cats [MOOC]. URL
431        https://www.datacamp.com/projects/184.
432   [10] Will Browne and Mike Swarbrick Jones. What works in e-commerce - a meta-analysis
433        of 6700 online experiments, 2017. URL http://www.qubit.com/wp-content/uploads/
434        2017/12/qubit-research-meta-analysis.pdf. White paper.
436   [11] David Campos, Chester Ismay, and Shon Inouye. A/B testing in R | DataCamp [MOOC]. URL
437        https://www.datacamp.com/courses/ab-testing-in-r.
438   [12] David Campos, Chester Ismay, and Shon Inouye. Experiment dataset [Dataset], n.d.
439        URL https://assets.datacamp.com/production/repositories/2292/datasets/
440        52b52cb1ca28ce10f9a09689325c4d94d889a6da/experiment_data.csv.
441   [13] David Campos, Chester Ismay, and Shon Inouye. Data visualization website -
442        April 2018 [Dataset], n.d. URL https://assets.datacamp.com/production/
443        repositories/2292/datasets/b502094e5de478105cccea959d4f915a7c0afe35/
444        data_viz_website_2018_04.csv.

445   [14] Nanyu Chen, Min Liu, and Ya Xu. How A/B tests could go wrong: Automatic diagnosis
446        of invalid online experiments. In Proceedings of the Twelfth ACM International Conference
447        on Web Search and Data Mining, WSDM ’19, page 501–509, New York, NY, USA, 2019.
448        Association for Computing Machinery. ISBN 9781450359405. doi: 10.1145/3289600.3291000.
449        URL https://doi.org/10.1145/3289600.3291000.
450   [15] Sisi Chen. A/B tests project — Udacity DAND P2, 2020. URL https://rachelchen0104.
451        medium.com/a-b-tests-project-udacity-dand-p2-55faddd4d5e0. Blog post.
452   [16] Yuhui Chen and Timothy E. Hanson. Bayesian nonparametric k-sample tests for censored and
453        uncensored data. Computational Statistics & Data Analysis, 71:335–346, 2014. ISSN 0167-
454        9473. doi: https://doi.org/10.1016/j.csda.2012.11.003. URL https://www.sciencedirect.
455        com/science/article/pii/S0167947312003945.
456   [17] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Taylor & Francis, 1988.
457        ISBN 9781134742707.
458   [18] Ahmed Dawoud. E-commerce A/B testing, Version 1 [Dataset], 2021. URL https://www.
459        kaggle.com/ahmedmohameddawoud/ecommerce-ab-testing/version/1.
460   [19] Rajeev H. Dehejia and Sadek Wahba. Causal effects in nonexperimental studies: Reevaluating
461        the evaluation of training programs. Journal of the American Statistical Association, 94(448):
462        1053–1062, 1999. ISSN 01621459. URL http://www.jstor.org/stable/2669919.
463   [20] Alex Deng. Objective Bayesian two sample hypothesis testing for online controlled experi-
464        ments. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15
465        Companion, page 923–928, New York, NY, USA, 2015. Association for Computing Machinery.
466        ISBN 9781450334730. doi: 10.1145/2740908.2742563. URL https://doi.org/10.1145/
467        2740908.2742563.
468   [21] Alex Deng, Jiannan Lu, and Shouyuan Chen. Continuous monitoring of A/B tests without pain:
469        Optional stopping in Bayesian testing. In 2016 IEEE International Conference on Data Science
470        and Advanced Analytics (DSAA), pages 243–252, 2016. doi: 10.1109/DSAA.2016.33.
471   [22] Alex Deng, Ulf Knoblich, and Jiannan Lu. Applying the delta method in metric analytics:
472        A practical guide with novel ideas. In Proceedings of the 24th ACM SIGKDD International
473        Conference on Knowledge Discovery & Data Mining, KDD ’18, page 233–242, New York,
474        NY, USA, 2018. Association for Computing Machinery. ISBN 9781450355520. doi: 10.1145/
475        3219819.3219919. URL https://doi.org/10.1145/3219819.3219919.
476   [23] Marco di Biase. The effects of change decomposition on code review - A controlled experi-
477        ment - Online appendix, May 2018. URL https://data.4tu.nl/articles/dataset/
478        The_Effects_of_Change_Decomposition_on_Code_Review_-_A_Controlled_
479        Experiment_-_Online_appendix/12706046/1.
480   [24] Marco di Biase, Magiel Bruntink, Arie van Deursen, and Alberto Bacchelli. The effects of
481        change decomposition on code review—a controlled experiment. PeerJ Computer Science, 5:
482        e193, May 2019. doi: 10.7717/peerj-cs.193. URL https://doi.org/10.7717/peerj-cs.
483        193.
484   [25] Eustache Diemert, Artem Betlei, Christophe Renaudin, and Amini Massih-Reza. A large
485        scale benchmark for uplift modeling, 2018. URL http://ama.imag.fr/~amini/Publis/
486        large-scale-benchmark.pdf. 2018 AdKDD & TargetAd Workshop (in conjunction with
487        KDD ’18).
488   [26] Yadolah Dodge. Kolmogorov–Smirnov Test. In The Concise Encyclopedia of Statistics, pages
489        283–287. Springer New York, New York, NY, 2008. ISBN 978-0-387-32833-1. doi: 10.1007/
490        978-0-387-32833-1_214. URL https://doi.org/10.1007/978-0-387-32833-1_214.
491   [27] Joseph Doyle, Sarah Abraham, Laura Feeney, Sarah Reimer, and Amy Finkelstein. Clinical
492        decision support for high-cost imaging: a randomized clinical trial [Dataset], 2018. URL
493        https://doi.org/10.7910/DVN/BRKDVQ.

494   [28] Dheeru Dua and Casey Graff. UCI machine learning repository, 2019. URL http://archive.
495        ics.uci.edu/ml.

496   [29] Pascaline Dupas, Vivian Hoffman, Michael Kremer, and Alix Peterson Zwane. Targeting health
497        subsidies through a non-price mechanism: A randomized controlled trial in Kenya, 2016. URL
498        https://doi.org/10.7910/DVN/PBLJXJ.

499   [30] Arpit Dwivedi. Cookie Cats, Version 1 [Dataset], 2020. URL https://www.kaggle.com/
500        arpitdw/cokie-cats/version/1.

501   [31] Osuolale Emmanuel. Ad A/B testing, Version 1 [Dataset], 2020. URL https://www.kaggle.
502        com/osuolaleemmanuel/ad-ab-testing/version/1.

503   [32] Aleksander Fabijan, Pavel Dmitriev, Helena Holmstrom Olsson, and Jan Bosch. Online
504        controlled experimentation at scale: An empirical survey on the current state of A/B testing. In
505        2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA),
506        pages 68–72, 2018. doi: 10.1109/SEAA.2018.00021.

507   [33] Ronald A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical
508        Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or
509        Physical Character, 222(594-604):309–368, January 1922. doi: 10.1098/rsta.1922.0009. URL
510        https://doi.org/10.1098/rsta.1922.0009.

511   [34] Committee for Proprietary Medicinal Products. Points to consider on switching between
512        superiority and non-inferiority. British Journal of Clinical Pharmacology, 52(3):223–228, Sep
513        2001. ISSN 0306-5251. doi: 10.1046/j.0306-5251.2001.01397-3.x. URL https://pubmed.
514        ncbi.nlm.nih.gov/11560553.

515   [35] United States. National Commission for the Protection of Human Subjects of Biomedical and
516        Behavioral Research. The Belmont report: ethical principles and guidelines for the protection
517        of human subjects of research, volume 2. The Commission, 1978.

518   [36] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna
519        Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets, 2020.

520   [37] Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole,
521        Steven N. Goodman, and Douglas G. Altman. Statistical tests, P values, confidence intervals,
522        and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4):337–
523        350, April 2016. doi: 10.1007/s10654-016-0149-3. URL https://doi.org/10.1007/
524        s10654-016-0149-3.

525   [38] Carrie Grimes, Caroline Buckey, and Diane Tang. A/B testing | Udacity free courses [MOOC].
526        URL https://www.udacity.com/course/ab-testing--ud257.

527   [39] F. Richard Guo, James McQueen, and Thomas S. Richardson. Empirical Bayes for large-scale
528        randomized experiments: a spectral approach, 2020.

529   [40] Kevin Hillstrom. The minethatdata e-mail analytics and data mining challenge, 2008.

530   [41] Henning Hohnhold, Deirdre O’Brien, and Diane Tang. Focusing on the long-term: It’s good
531        for users and business. In Proceedings of the 21th ACM SIGKDD International Conference on
532        Knowledge Discovery and Data Mining, KDD ’15, pages 1849–1858, New York, NY, USA,
533        2015. ACM. ISBN 978-1-4503-3664-2.

534   [42] Chris C. Holmes, François Caron, Jim E. Griffin, and David A. Stephens. Two-sample Bayesian
535        nonparametric hypothesis testing. Bayesian Analysis, 10(2):297 – 320, 2015. doi: 10.1214/
536        14-BA914. URL https://doi.org/10.1214/14-BA914.

537   [43] S. Hopewell, S. Dutton, L.-M. Yu, A.-W. Chan, and D. G. Altman. The quality of reports of
538        randomised trials in 2000 and 2006: comparative study of articles indexed in PubMed. BMJ,
539        340(mar23 1):c723–c723, March 2010. doi: 10.1136/bmj.c723. URL https://doi.org/10.
540        1136/bmj.c723.
