Datasets for Online Controlled Experiments

C. H. Bryan Liu*†    Ângelo Cardoso†    Paul Couturier*    Emma J. McCoy*
*Imperial College London, UK    †ASOS.com, UK
bryan.liu12@imperial.ac.uk

Abstract

Online Controlled Experiments (OCE) are the gold standard to measure impact and guide decisions for digital products and services. Despite many methodological advances in this area, the scarcity of public datasets and the lack of a systematic review and categorization hinder its development. We present the first survey and taxonomy for OCE datasets, which highlight the lack of a public dataset to support the design and running of experiments with adaptive stopping, an increasingly popular approach to enable quickly deploying improvements or rolling back degrading changes. We release the first such dataset, containing daily checkpoints of decision metrics from multiple, real experiments run on a global e-commerce platform. The dataset design is guided by a broader discussion on data requirements for common statistical tests used in digital experimentation. We demonstrate how to use the dataset in the adaptive stopping scenario using sequential and Bayesian hypothesis tests and learn the relevant parameters for each approach.

1   Introduction
Online controlled experiments (OCEs) have become popular among digital technology organisations in measuring the impact of their products and services, and in guiding business decisions [50, 53, 75]. Large tech companies including Google [41], LinkedIn [84], and Microsoft [49] report running thousands of experiments on any given day, and there are multiple companies established solely to manage OCEs for other businesses [10, 45]. OCEs are also considered a key step in the machine learning development lifecycle [8, 89].

OCEs are essentially randomized controlled trials run on the Web. The simplest example, commonly known as an A/B test, splits a group of entities (e.g. users of a website) randomly into two groups, where one group is exposed to some treatment (e.g. showing a “free delivery” banner on the website) while the other acts as the control (e.g. seeing the original website without any mention of free delivery). We calculate the decision metric(s) (e.g. the proportion of users who bought something) based on responses from both groups, and compare the metrics using a statistical test to draw causal statements about the treatment.
The ability to run experiments on the Web allows one to interact with a large number of subjects within a short time frame and collect a large number of responses. This, together with the scale of experimentation carried out by tech organizations, should lead to a wealth of datasets describing the results of experiments. However, there are not many publicly available OCE datasets, and we believe they have never been systematically reviewed nor categorized. This is in contrast to the machine learning field, which also enjoyed an application boom in the past decade yet already has established archives and detailed categorizations for its datasets [28, 77].
We argue the lack of relevant datasets arising from real experiments is hindering further development of OCE methods (e.g. new statistical tests, bias correction, and variance reduction methods). Many proposed statistical tests rely on simulated data that impose restrictive distributional assumptions and thus may not be representative of real-world scenarios. Moreover, it may be difficult to understand how two methods differ from each other, and to assess their relative strengths and weaknesses, without a common dataset to compare them on.
To address this problem, we present the first ever survey and taxonomy for OCE datasets. Our survey identified 12 datasets, including standalone experiment archives, accompanying datasets from scholarly works, and demo datasets from online courses on the design and analysis of experiments. We also categorize these datasets along dimensions such as the number of experiments each dataset contains, how granular each data point is time-wise and subject-wise, and whether a dataset includes results from real experiment(s).
The taxonomy enables us to engage in a discussion on the data requirements for an experiment by systematically mapping out which data dimension is required for which statistical test and/or for learning the hyperparameter(s) associated with the test. We also recognize that in practice data are often used for purposes beyond those for which they were originally collected [47]. Hence, we posit the mapping is equally useful in allowing one to understand the options available when choosing statistical tests given the format of the data they possess. Together with the survey, the taxonomy helps us to identify what types of dataset are required for commonly used statistical tests, yet are missing from the public domain.
One of the gaps the survey and taxonomy identify is datasets that can support the design and running of experiments with adaptive stopping (a.k.a. continuous monitoring / optional stopping), whose use we motivate below. Traditionally, experimenters analyze experiments using Null Hypothesis Statistical Tests (NHST, e.g. a Student's t-test). These tests require one to calculate and commit to a required sample size based on some expected treatment effect size, all prior to starting the experiment. Making extra decisions during the experiment, be it stopping the experiment early due to seeing favorable results, or extending the experiment as it teeters “on the edge of statistical significance” [61], is discouraged as it risks one having more false discoveries than intended [37, 60].
Clearly the restrictions above are incompatible with modern decision-making processes. Businesses operating online are incentivized to deploy any beneficial change and roll back any damaging change as quickly as possible. Using the “free delivery” banner example above, the business may have calculated that it requires four weeks to observe enough users based on an expected 1% change in the decision metric. If the experiment shows, two weeks in, that the banner is leading to a 2% improvement, it would be unwise not to deploy the banner to all users simply due to the need to run the experiment for another two weeks. Likewise, if the banner is shown to be causing a 2% loss, it makes every sense to immediately terminate the experiment and roll back the banner to stem further losses.
As a result, more experimenters are moving away from NHST and adopting adaptive stopping techniques. Experiments with adaptive stopping allow one to decide whether to stop an experiment at any interim point based on the sample responses observed so far, while retaining control of the false positive/discovery rate. To encourage further development in this area, both in methods and data, we release the ASOS Digital Experiments Dataset, which contains daily checkpoints of decision metrics from multiple, real OCEs run on the global online fashion retail platform.
The dataset design is guided by the requirements identified by the mapping between the taxonomy and statistical tests, and to the best of our knowledge, it is the first public dataset that can support the end-to-end design and running of online experiments with adaptive stopping. We demonstrate it can indeed do so by (1) running a sequential test and a Bayesian hypothesis test on all the experiments in the dataset, and (2) estimating the values of the hyperparameters associated with the tests. While the notion of ground truth does not exist in real OCEs, we show the dataset can also act as a quasi-benchmark for statistical tests by comparing results from the tests above with those of a t-test.
To summarize, our contributions are:
1. (Sections 2 & 3) We create, to the best of our knowledge, the first ever taxonomy of online controlled experiment datasets and apply it to datasets that are publicly available;
2. (Section 4) We map the relationship between the taxonomy and statistical tests commonly used in experiments by identifying the minimally sufficient set of statistics and dimensions required by each test. The mapping, which also applies to offline and non-randomized controlled experiments, enables experimenters to quickly identify the data collection requirements for their experiment design (and conversely the test options available given the data at hand); and
3. (Section 5) We make available, to the best of our knowledge, the first real, multi-experiment time series dataset, enabling the design and running of experimentation with adaptive stopping.

2   A Taxonomy for Online Controlled Experiment Datasets

We begin by presenting a taxonomy for OCE datasets, which is necessary to characterize and understand the results of a survey. To the best of our knowledge, there are no surveys nor taxonomies specifically on this topic prior to this work. While there is a large volume of work concerning the categorization of datasets in machine learning [28, 77], of research on online randomized controlled experiment methods [3, 4, 32, 68], and of general experiment design [43, 52], our searches on Google Scholar and Semantic Scholar using combinations of the keywords “online controlled experiment”/“A/B test”, “dataset”, and “taxonomy”/“categorization” yield no relevant results.

The taxonomy focuses on the following four main dimensions:
Experiment count   A dataset can contain data collected from a single experiment or from multiple experiments. Results from a single experiment are useful for demonstrating how a test works, though any learning should ideally involve multiple experiments. Two closely related but relatively minor dimensions are the variant count (the number of control/treatment groups in the experiment) and the metric count (the number of performance metrics the experiment is tracking). Having an experiment with multiple variants and metrics enables the demonstration of methods such as false discovery rate control procedures [7] and learning the correlation structure within an experiment.
Response granularity   Depending on the experiment analysis requirements and the constraints imposed by the online experimentation platform, a dataset may contain data aggregated to various levels. Consider the “free delivery” banner example in Section 1, where the website users are randomly allocated to the treatment (showing the banner) and control (not showing the banner) groups, with the goal of understanding whether the banner changes the proportion of users who bought something. In this case each individual user is considered a randomization unit [50].

A dataset may contain, for each experiment, only summary statistics at the group level, e.g. the proportion of users who have bought something in the control and treatment groups respectively. It can also record one response per randomization unit, with each row containing the user ID and whether the user bought something. A more detailed activity log at the sub-randomization-unit level can have each row containing information about a particular page view from a particular user.
Time granularity   An experiment can last anywhere from a week to many months [50], which provides many possibilities for recording the result. A dataset can opt to record the overall result only, showing the end state of an experiment; it may or may not come with a timestamp for each randomization unit or experiment if there are multiple instances of them. It can also record intermediate checkpoints for the decision metrics, ideally at regular intervals such as daily or hourly. These checkpoints can either be snapshots of each interval (recording activities between time t and t + 1, time t + 1 and t + 2, etc.) or cumulative from the start of the experiment (recording activities between time 0 and 1, time 0 and 2, etc.); the two forms carry the same information when the full series is recorded (see the sketch after the four dimensions).
Syntheticity   A dataset can record data generated from a real process. It can also be synthetic, i.e. generated via simulations with distributional assumptions applied. A dataset can also be semi-synthetic if it is generated from a real-life process and subsequently augmented with synthetic data.
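To illustrate the snapshot/cumulative distinction in the time granularity dimension, here is a minimal Python sketch (our own made-up counts, not from any surveyed dataset) showing that the two forms are interconvertible when the full checkpoint series is available:

```python
import numpy as np

# Hypothetical daily conversion counts for one experiment group.
snapshot = np.array([120, 95, 130, 110])   # activity between t and t+1

# Snapshot checkpoints -> checkpoints cumulative from experiment start.
cumulative = np.cumsum(snapshot)           # [120, 215, 345, 455]

# ...and back again.
recovered = np.diff(cumulative, prepend=0)
assert (recovered == snapshot).all()
```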
Note we can also describe datasets arising from any experiments (including offline and non-randomized controlled experiments) using these four dimensions. We will discuss in Section 4 how these dimensions map to common statistical tests used in online experimentation.
In addition, we also record the application domain, target demographics, and temporal coverage of the experiment(s) featured in a dataset. In an age when data are often reused, it is crucial to understand the underlying context, and to recognize that learnings from a dataset created under a certain context may not translate to another context. We also see the surfacing of such context as a way to promote considerations of fairness and transparency for experimenters as more experiment datasets become available [44]. For example, having target demographics information at a meta-level helps experimenters identify who were involved, or perhaps more importantly, who were not involved in experiments and could be adversely impacted by a treatment that is effectively untested.
Finally, two datasets can also differ in their medium of documentation and the presence/absence of a data management / long-term preservation plan. The latter includes the hosting location, the presence/absence of a DOI, and the type of license. We record these attributes for the datasets surveyed below for completeness.

3   Public Online Controlled Experiment Datasets
Here we discuss our approach to producing the first ever survey of OCE datasets and present its results. The survey was compiled via two search directions, which we describe below. For both directions, we conducted a first round of searches in May 2021, with a follow-up round in August 2021 to ensure we have the most up-to-date results.

We first search on the vanilla Google search engine using the keywords “Online controlled experiment "dataset"”, “A/B test "dataset"”, and “Multivariate test "dataset"”. For each keyword, we inspect the first 10 pages of the search results (top 100 results) for scholarly articles, web pages, blog posts, and documents that may host and/or describe a publicly available OCE dataset. The search term “dataset” is in double quotes to limit the search results to those with explicit mention of dataset(s). We also search on specialist data search engines/hosts, namely Google Dataset Search (GDS) and Kaggle, using the keywords “Online controlled experiment(s)” and “A/B test(s)”. We inspect the metadata and description of all the results returned (except for GDS, where we inspect the first 100 results for “A/B test(s)”) for relevant datasets as defined below.¹

¹ Searching for the keyword “Online controlled experiment” on GDS and Kaggle returned 42 and 7 results respectively, and “A/B test” returned “100+” and 286 results respectively. Curiously, replacing “experiment” and “test” in the keywords with their plural forms changes the number of results, with the former returning 6 and 10 results on GDS and Kaggle respectively, and the latter returning “100+” and 303 results respectively.
A dataset must record the result arising from a randomized controlled experiment run online to be included in the survey. The criterion excludes experimental data collected from offline experiments, e.g. those in agriculture [83], medicine [27], and economics [29]. It also excludes datasets used to perform quasi-experiments and observational studies, e.g. the LaLonde dataset used in econometrics [19] and datasets constructed for uplift modeling tasks [25, 40].²

² Uplift modeling (UM) tasks for online applications often start with an OCE [66], and thus we can consider UM datasets as OCE datasets with extra randomization-unit-level features. The nature of the tasks differs though: OCEs concern validating the average treatment effect across the population using a statistical test, whereas UM concerns modeling the conditional average treatment effect for each individual user, making it a more general causal inference task that is outside the scope of this survey.
The result is presented in Table 1. We place the 12 OCE datasets identified in this exercise along the four taxonomy dimensions defined in Section 2 and record the additional features. These datasets include two standalone archives for online media and education experiments respectively [56, 70], and three accompanying datasets for peer-reviewed research articles [23, 74, 86]. There are also tens of Kaggle datasets, blog posts, and code repositories that describe and/or duplicate one of the five example datasets used in four different massive open online courses on online controlled experiment design and analysis [9, 11, 38, 76]. Finally, we identify two standalone datasets hosted on Kaggle with relatively light documentation [31, 48].
From the table we observe a number of gaps in OCE dataset availability, the most obvious being the lack of datasets that record responses at a sub-randomization-unit level. In the sections below, we identify more of these gaps and discuss their implications for OCE analysis.

4   Matching Dataset Taxonomy with Statistical Tests
Specifying the data requirements (or structure) and performing statistical tests are perhaps two of the most common tasks carried out by data scientists. However, the link between the two processes is seldom mapped out explicitly. It is all too common to consider from scratch the question “I need to run this statistical test, how should I format my dataset?” (or more controversially, “I have this dataset, what statistical tests can I run?” [47]) for every new project/application, despite the list of possible dataset dimensions and statistical tests remaining largely the same.
We aim to speed up the process above by describing what summary statistics are required to perform common statistical tests in OCEs, and by linking the statistics back to the taxonomy dimensions defined in Section 2. The exercise is similar to identifying the sufficient statistic(s) for a statistical model [33], though the identification is done for the encapsulating statistical inference procedure, with a practical focus on data dimension requirements. We do so by stating the formulae used to calculate the corresponding effect sizes and test statistics and observing the summary statistics they require in common. The general approach enables one to also apply the resultant mapping to any experiment that involves a two-sample statistical test, including offline experiments and experiments without a randomized control. For brevity, we refrain from discussing the full model assumptions as well as their applicability; instead we point readers to the relevant work in the literature.

Upworthy Research Archive [56]
  Experiment count: Multiple (32,487) - 2-14 / 2 (C) | Response granularity: Group | Time granularity: Overall result only, timestamp per expt. | Syntheticity: Real | Application: Media & Ads, copy/creative | Documentation: Peer-reviewed data article | Preservation: Host: Open Science Framework; DOI: yes (see [56]); Licence: CC BY 4.0

ASSISTments Dataset from Multiple Randomized Controlled Experiments [70]
  Experiment count: Multiple (22) - 2 / 2 (BC) | Response granularity: Rand. Unit | Time granularity: Overall result only, no timestamp | Syntheticity: Real | Application: Education, teaching model | Documentation: Peer-reviewed data article | Preservation: Host: Author website

A/B Testing Web Analytics Data (from [87]) [86]
  Experiment count: Single - 5 / 2 (C) | Response granularity: Group | Time granularity: Overall result only, no timestamp | Syntheticity: Real | Application: Education, UX/UI change | Documentation: Accompanying dataset to peer-reviewed research article | Preservation: Host: University library; DOI: yes (see [86]); Licence: CC BY-SA 4.0

Dataset of two experiments of the application of gamified peer assessment model into online learning environment MeuTutor (from [73]) [74]
  Experiment count: Multiple (2) - 3+2 / 3+3 (R+C) | Response granularity: Rand. Unit | Time granularity: Overall result only, no timestamp | Syntheticity: Real | Application: Education, teaching model | Documentation: Peer-reviewed data article | Preservation: Host: Journal website; DOI: yes (see [74]); Licence: CC BY 4.0

The Effects of Change Decomposition on Code Review - A Controlled Experiment - Online appendix (from [24]) [23]
  Experiment count: Single - 2 / multi (CLR) | Response granularity: Rand. Unit | Time granularity: Overall result only, no timestamp | Syntheticity: Real | Application: Tech, dev. process | Documentation: Peer-reviewed data article | Preservation: Host: University library; DOI: yes (see [23]); Licence: CC BY-SA 4.0

Udacity Free Trial Screener Experiment (from Udacity A/B Testing Course - Final Project [38]); see e.g. [62, 71, 79]
  Experiment count: Single - 2 / 4 (C) | Response granularity: Group | Time granularity: Daily checkpoint, snapshot | Syntheticity: Real | Application: Education, UX/UI change | Documentation: Blog posts & Kaggle notebooks, e.g. [57, 69, 88] | Preservation: Host: Kaggle / GitHub (multiple); DOI: unknown; Licence: unknown

“Analyse A/B Test Results” Dataset (from Udacity Online Data Analyst Course - Project 3 [76]); see e.g. [2, 18, 64]
  Experiment count: Single - 2 / 1 (B) | Response granularity: Rand. Unit | Time granularity: Overall result only, timestamp per RU | Syntheticity: Unknown | Application: E-commerce, UX/UI change | Documentation: Blog posts & Kaggle notebooks, e.g. [15, 54, 67] | Preservation: Host: Kaggle / GitHub (multiple); DOI: unknown; Licence: unknown

Mobile Games A/B Testing with Cookie Cats (DataCamp project [9]); see e.g. [30, 72, 85]
  Experiment count: Single - 2 / 3 (BC) | Response granularity: Rand. Unit | Time granularity: Overall result only, no timestamp | Syntheticity: Real | Application: Gaming, design change | Documentation: Kaggle notebook | Preservation: Host: Kaggle / Course website; DOI: unknown; Licence: unknown

Experiment Dataset (from DataCamp A/B Testing in R Course [11]) [12]
  Experiment count: Single - 2 / 1 (B) | Response granularity: Rand. Unit | Time granularity: Overall result only, timestamp per RU | Syntheticity: Synthetic | Application: Tech, UX/UI change | Documentation: Course notes, blog posts | Preservation: Host: Course website; DOI: unknown; Licence: unknown

Data Visualization Website - April 2018 (from DataCamp A/B Testing in R Course [11]) [13]
  Experiment count: Single - 2 / 4 (BR) | Response granularity: Rand. Unit | Time granularity: Overall result only, timestamp per RU | Syntheticity: Synthetic | Application: Tech, UX/UI change | Documentation: Course notes | Preservation: Host: Course website; DOI: unknown; Licence: unknown

Grocery Website Data for AB Test [48]
  Experiment count: Single - 2 / 1 (B) | Response granularity: Rand. Unit | Time granularity: Overall result only, no timestamp | Syntheticity: Unknown | Application: E-commerce, UX/UI change | Documentation: Kaggle notebook | Preservation: Host: Kaggle; DOI: unknown; Licence: unknown

Ad A/B Testing (aka SmartAd AB Data) [31]
  Experiment count: Single - 2 / 2 (B) | Response granularity: Rand. Unit | Time granularity: Overall result only, timestamp per RU | Syntheticity: Unknown | Application: Media & Ads, display ads | Documentation: Kaggle notebook | Preservation: Host: Kaggle; DOI: N/A; Licence: CC BY-SA 3.0

Table 1: Results from the first ever survey of OCE datasets. The 12 datasets identified are placed on the four taxonomy dimensions defined in Section 2, together with the additional attributes recorded. Under Experiment Count, a Variant/Metric Count of “x / y (BCLR)” means the dataset features x variants and y metrics, with the metrics based on Binary, Count, Likert-scale, and Real-valued responses. Under Response and Time Granularity, Randomization Unit is abbreviated as RU or Rand. Unit. Note that a large proportion of resources are accessed via links to non-scholarly articles (blog posts, Kaggle dataset pages, and GitHub repositories); these resources may not persist over time.

4.1   Effect size and Welch's t-test

We consider a two-sample setting and let $X_1, \ldots, X_N$ and $Y_1, \ldots, Y_M$ be i.i.d. samples from the distributions $F_X(\cdot)$ and $F_Y(\cdot)$ respectively. We assume the first two moments exist for the distributions $F_X$ and $F_Y$, with their means and variances denoted $(\mu_X, \sigma_X^2)$ and $(\mu_Y, \sigma_Y^2)$ respectively. We also denote the sample means and variances of the two samples $(\bar{X}, s_X^2)$ and $(\bar{Y}, s_Y^2)$ respectively.

Often we are interested in the difference between the means of the two distributions, $\Delta = \mu_Y - \mu_X$, commonly known as the effect size (of the difference in means) or the average treatment effect. A standardized effect size enables us to compare the difference across many experiments and is thus useful in meta-analyses. One commonly used effect size is Cohen's d, defined as the difference in sample means divided by the pooled sample standard deviation [17]:

$$d = \left(\bar{Y} - \bar{X}\right) \Bigg/ \sqrt{\frac{(N-1)s_X^2 + (M-1)s_Y^2}{N + M - 2}}. \tag{1}$$
We are also interested in whether the samples carry sufficient evidence to indicate $\Delta$ is different from a prescribed value $\theta_0$. This can be done via a hypothesis test with $H_0: \Delta = \theta_0$ and $H_1: \Delta \neq \theta_0$.³ One of the most common statistical tests used in online controlled experiments is Welch's t-test [81], in which we calculate the test statistic $t$ as follows:

$$t = \left(\bar{Y} - \bar{X}\right) \Bigg/ \sqrt{\frac{s_X^2}{N} + \frac{s_Y^2}{M}}. \tag{2}$$

³ There are other ways to specify the hypotheses, such as those in superiority and non-inferiority tests [34], though they are unlikely to change the data requirements as long as the hypotheses remain anchored on $\theta_0$.
We observe that in order to calculate the two quantities stated above, we require six quantities: two means ($\bar{X}$, $\bar{Y}$), two (sample) variances ($s_X^2$, $s_Y^2$), and two counts ($N$, $M$). We call these quantities Dimension Zero (D0) quantities as they are the bare minimum required to run a statistical test; these quantities will be expanded along the taxonomy dimensions defined in Section 2.
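As a concrete illustration, the following Python sketch computes Cohen's $d$ and Welch's $t$ from the six D0 quantities alone, with no access to individual responses. The function name and example numbers are ours; the Welch-Satterthwaite degrees of freedom used for the p-value are standard but not spelled out in the text above.

```python
import math
from scipy import stats

def cohen_d_welch_t(x_bar, y_bar, s2_x, s2_y, n, m):
    """Compute Cohen's d (Eq. (1)) and Welch's t (Eq. (2)) from the
    six D0 quantities: two means, two sample variances, two counts."""
    # Pooled sample standard deviation for Cohen's d.
    pooled_sd = math.sqrt(((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2))
    d = (y_bar - x_bar) / pooled_sd

    # Welch's t statistic with Welch-Satterthwaite degrees of freedom.
    se2 = s2_x / n + s2_y / m
    t = (y_bar - x_bar) / math.sqrt(se2)
    df = se2**2 / ((s2_x / n) ** 2 / (n - 1) + (s2_y / m) ** 2 / (m - 1))
    p = 2 * stats.t.sf(abs(t), df)  # two-sided p-value
    return d, t, p

# Illustrative summary statistics for a conversion-style metric.
d, t, p = cohen_d_welch_t(x_bar=0.100, y_bar=0.102,
                          s2_x=0.0900, s2_y=0.0918,
                          n=100_000, m=100_000)
```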
Cluster randomization / dependent data   The sample variance estimates ($s_X^2$, $s_Y^2$) may be biased in the case where cluster randomization is involved. Using again the “free delivery” banner example, instead of randomly assigning each individual user to the control and treatment groups, the business may randomly assign postcodes to the two groups, with all users from the same postcode getting the same version of the website. In this case user responses may become correlated, which violates the independence assumptions in statistical tests. Common workarounds, including the use of the bootstrap [6] and the Delta method [22], generally require access to sub-randomization-unit responses.

4.2   Experiments with adaptive stopping

As discussed in Section 1, experiments with adaptive stopping are becoming increasingly popular in the OCE community. Here we motivate the data requirements for statistical tests in this domain by looking at the quantities required to calculate the test statistics for a mixture Sequential Probability Ratio Test (mSPRT) [45] and a Bayesian hypothesis test using the Bayes factor [21], two popular approaches in online experimentation. There are many other tests that support adaptive stopping [59, 78], though their data requirements, in terms of the dimensions defined in Section 2, should be largely identical.
We first observe that running an mSPRT with a normal mixing distribution $H = N(\theta_0, \tau^2)$ involves calculating the following test statistic upon observing the first $n$ $X_i$ and $Y_j$ (see Eq. (11) in [45]):

$$\tilde{\Lambda}_n^{H,\theta_0} = \sqrt{\frac{\sigma_X^2 + \sigma_Y^2}{\sigma_X^2 + \sigma_Y^2 + n\tau^2}} \, \exp\!\left(\frac{n^2 \tau^2 \left(\bar{Y}_n - \bar{X}_n - \theta_0\right)^2}{2\left(\sigma_X^2 + \sigma_Y^2\right)\left(\sigma_X^2 + \sigma_Y^2 + n\tau^2\right)}\right), \tag{3}$$

where $\bar{X}_n = n^{-1}\sum_{i=1}^{n} X_i$ and $\bar{Y}_n = n^{-1}\sum_{j=1}^{n} Y_j$ represent the sample means of the $X$s and $Y$s up to sample $n$ respectively, and $\tau^2$ is a hyperparameter to be specified or learnt from data.
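A minimal Python sketch of Eq. (3), together with the always-valid p-value process described in [45] (the running minimum of $1/\tilde{\Lambda}_n$, capped at one); function and variable names are ours for illustration:

```python
import numpy as np

def msprt_statistic(x_bar_n, y_bar_n, n, var_x, var_y, tau2, theta0=0.0):
    """Mixture SPRT statistic of Eq. (3) with a N(theta0, tau2) mixing
    distribution, assuming the response variances are known."""
    v = var_x + var_y
    scale = np.sqrt(v / (v + n * tau2))
    exponent = (n**2 * tau2 * (y_bar_n - x_bar_n - theta0) ** 2
                / (2 * v * (v + n * tau2)))
    return scale * np.exp(exponent)

def always_valid_p(lambda_sequence):
    """Always-valid p-value process of [45]: the running minimum of
    1 / Lambda_n, capped at one."""
    p, p_sequence = 1.0, []
    for lam in lambda_sequence:
        p = min(p, 1.0 / lam)
        p_sequence.append(p)
    return p_sequence
```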
For Bayesian hypothesis tests using the Bayes factor, we calculate the (square root of the) Wald test statistic upon seeing the first $n$ $X_i$ and first $m$ $Y_j$ [20, 21]:

$$W_{n,m} = \frac{\bar{Y}_m - \bar{X}_n}{\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}} = \underbrace{\frac{\bar{Y}_m - \bar{X}_n}{\sqrt{\left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right) \Big/ \left(\frac{1}{n} + \frac{1}{m}\right)}}}_{\delta_{n,m}} \; \underbrace{\sqrt{1 \Big/ \left(\tfrac{1}{n} + \tfrac{1}{m}\right)}}_{\sqrt{E_{n,m}}}\,, \tag{4}$$

where $\delta_{n,m}$ and $E_{n,m}$ are the effect size (standardized by the pooled variance) and the effective sample size of the test respectively. In OCEs it is common to appeal to the central limit theorem and assume a normal likelihood for the effect size, i.e. $\delta_{n,m} \sim N(\mu, 1/E_{n,m})$. We then compare the hypotheses $H_0: \mu = \theta_0$ and $H_1: \mu \sim N(\theta_0, V^2)$ by calculating the Bayes factor [46]:

$$\mathrm{BF}_{n,m} = \frac{f(\delta_{n,m} \mid H_1)}{f(\delta_{n,m} \mid H_0)} = \frac{\phi\!\left(\delta_{n,m};\, \theta_0,\, V^2 + 1/E_{n,m}\right)}{\phi\!\left(\delta_{n,m};\, \theta_0,\, 1/E_{n,m}\right)}, \tag{5}$$

where $\phi(\cdot\,; a, b)$ is the PDF of a normal distribution with mean $a$ and variance $b$, and $V^2$ is a hyperparameter that we specify or learn from data.
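A sketch of the corresponding computation (illustrative names, ours), pairing Eq. (5) with the posterior belief update $\pi(H_0 \mid \text{data}) = \pi(H_0) \big/ \left(\pi(H_0) + (1 - \pi(H_0))\,\mathrm{BF}_{n,m}\right)$, which follows directly from Bayes' rule and is used in Section 5.1:

```python
import numpy as np
from scipy.stats import norm

def bayes_factor(x_bar_n, y_bar_m, n, m, var_x, var_y, v2, theta0=0.0):
    """Bayes factor of Eq. (5), built from the standardized effect size
    delta_{n,m} and effective sample size E_{n,m} of Eq. (4)."""
    effective_n = 1.0 / (1.0 / n + 1.0 / m)                 # E_{n,m}
    delta = (y_bar_m - x_bar_n) / np.sqrt(
        (var_x / n + var_y / m) * effective_n)              # delta_{n,m}
    return (norm.pdf(delta, loc=theta0, scale=np.sqrt(v2 + 1 / effective_n))
            / norm.pdf(delta, loc=theta0, scale=np.sqrt(1 / effective_n)))

def posterior_h0(bf, prior_h0=0.75):
    """Posterior belief in H0 given BF = f(delta | H1) / f(delta | H0)."""
    prior_odds = prior_h0 / (1.0 - prior_h0)
    return prior_odds / (prior_odds + bf)
```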
During an experiment with adaptive stopping, we calculate the test statistics stated above many times for different $n$ and $m$. This means a dataset can only support the running of such experiments if it contains intermediate checkpoints for the counts ($n$, $m$) and the means ($\bar{X}_n$, $\bar{Y}_m$), ideally cumulative from the start of the experiment. Often one also requires the variances at the same time points (see below). The only exception to this dimensional requirement is the case where the dataset contains responses at a randomization unit or finer level of granularity and, despite recording the overall results only, has a timestamp per randomization unit. Under this special case, we can still construct the cumulative means ($\bar{X}_n$, $\bar{Y}_m$) for all relevant values of $n$ and $m$ by ordering the randomization units by their associated timestamps.
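For instance, given randomization-unit-level rows with timestamps (hypothetical column names below), the cumulative checkpoint series can be rebuilt with a sort and grouped cumulative sums; a sketch in Python/pandas:

```python
import pandas as pd

# Hypothetical randomization-unit-level log: one row per user, with the
# assigned group, an assignment/response timestamp, and the response.
log = pd.DataFrame({
    "group": ["control", "treatment", "control", "treatment"],
    "timestamp": pd.to_datetime(["2021-05-01", "2021-05-01",
                                 "2021-05-02", "2021-05-02"]),
    "response": [0, 1, 1, 1],
}).sort_values("timestamp")

# Cumulative counts (n, m) and means (X_bar_n, Y_bar_m) per group.
log["count"] = log.groupby("group").cumcount() + 1
log["cum_mean"] = log.groupby("group")["response"].cumsum() / log["count"]
```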
Learning the effect size distribution (hyper)parameters   The two tests introduced above feature hyperparameters ($\tau^2$ and $V^2$) that have to be specified or learnt from data. These parameters characterize the prior belief about the effect size distribution, which will be most effective if it “matches the distribution of true effects across the experiments a user runs” [45]. Common parameter estimation procedures [1, 5, 39] require results from multiple related experiments.
Estimating the response variance   In the equations above, the response variances of the two samples, $\sigma_X^2$ and $\sigma_Y^2$, are assumed to be known. In practice we often use the plug-in empirical estimates $(s_X^2)_n$ and $(s_Y^2)_m$, the sample variances for the first $n$ $X_i$ and first $m$ $Y_j$ respectively, and thus the data dimension requirement is identical to that of the counts and means discussed above. In the case where the plug-in estimate may be biased due to dependent data, we also require a sub-randomization-unit response granularity (see Section 4.1).

4.3   Non-parametric tests

We also briefly discuss the data requirements for non-parametric tests, where we do not impose any distributional assumptions on the responses but compare the hypotheses $H_0: F_X \equiv F_Y$ and $H_1: F_X \neq F_Y$, where we recall $F_X$ and $F_Y$ are the distributions of the two samples.
One of the most commonly used (frequentist) non-parametric tests in OCEs, the Mann-Whitney U-test [55], calculates the following test statistic:

$$U = \sum_{i=1}^{N} \sum_{j=1}^{M} S(X_i, Y_j), \quad \text{where } S(X, Y) = \begin{cases} 1 & \text{if } Y < X, \\ 1/2 & \text{if } Y = X, \\ 0 & \text{if } Y > X. \end{cases} \tag{6}$$
While a rank-based method is available for large $N$ and $M$, both methods require knowledge of all the $X_i$ and $Y_j$. The requirement is the same for other non-parametric tests, e.g. the Wilcoxon signed-rank [82], Kruskal-Wallis [51], and Kolmogorov-Smirnov tests [26]. This suggests a dataset can only support a non-parametric test if it provides responses at least at the randomization unit level.
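A brute-force sketch of Eq. (6) in Python makes the data requirement explicit: every individual $X_i$ and $Y_j$ must be available. Recent SciPy versions expose the same statistic via scipy.stats.mannwhitneyu (the example values are made up):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def u_statistic(x, y):
    """Brute-force U of Eq. (6): scores 1 when X beats Y, 1/2 for ties."""
    diff = np.asarray(x)[:, None] - np.asarray(y)[None, :]  # pairwise X_i - Y_j
    return (diff > 0).sum() + 0.5 * (diff == 0).sum()

x, y = [1.2, 3.4, 2.2], [0.7, 2.2, 5.1]
print(u_statistic(x, y))              # 4.5
print(mannwhitneyu(x, y).statistic)   # same U for x in recent SciPy versions
```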
We conclude by showing how we can combine the individual data requirements above to obtain the requirements for designing and/or running experiments with more complicated statistical tests. This is possible due to the orthogonal design of the taxonomy dimensions. Consider an experiment with adaptive stopping using Bayesian non-parametric tests (e.g. with a Pólya Tree prior [16, 42]). It involves a non-parametric test and hence requires responses at a randomization unit level. It computes multiple Bayes factors for adaptive stopping and hence requires intermediate checkpoints for the responses (or a timestamp for each randomization unit). Finally, to learn the hyperparameters of the Pólya Tree prior we require multiple related experiments. The substantial data requirements along three or more dimensions perhaps explain the lack of relevant OCE datasets and the tendency for experimenters to use simpler statistical tests for the day-to-day design and/or running of OCEs.

Figure 1: Distribution of p-values attained by the 99 OCEs in the ASOS Digital Experiments Dataset using Welch's t-tests, split by decision metric (four histograms, one per metric). The leftmost bar in each histogram represents experiments with p < 0.05. Here we treat OCEs with multiple variants as multiple independent OCEs.

5   A Novel Dataset for Experiments with Adaptive Stopping

We finally introduce the ASOS Digital Experiments Dataset, which we believe is the first public dataset that supports the end-to-end design and running of OCEs with adaptive stopping. We motivate why this is the case, provide a light description of the dataset (and a link to the more detailed accompanying datasheet⁵), and showcase the capabilities of the dataset via a series of experiments. We also discuss the ethical implications of releasing this dataset.
Recall from Section 4.2 that in order to support the end-to-end design and running of experiments with adaptive stopping, we require a dataset that (1) includes multiple related experiments; (2) is real, so that any parameters learnt are reflective of the real-world scenario; and either (3a) contains intermediate checkpoints for the summary statistics during each experiment (i.e. is time-granular), or (3b) contains responses at a randomization unit granularity with a timestamp for each randomization unit (i.e. is response-granular with timestamps).
None of the datasets surveyed in Section 3 meets all three criteria. While the Upworthy [56], ASSISTments [70], and MeuTutor [74] datasets meet the first two criteria, they all fail to meet the third.⁴ The Udacity Free Trial Screener Experiment dataset meets the last two criteria by having results from a real experiment with daily snapshots of the decision metrics (and hence being time-granular), which supports the running of an experiment with adaptive stopping. However, the dataset only contains a single experiment, which is not helpful for learning the effect size distribution (the design).
The ASOS Digital Experiments Dataset contains results from OCEs run by a business unit within ASOS.com, a global online fashion retail platform. In terms of the taxonomy defined in Section 2, the dataset contains multiple (78) real experiments, with two to five variants in each experiment and four decision metrics based on binary, count, and real-valued responses. The results are aggregated at a group level, with daily or 12-hourly checkpoints of the metric values cumulative from the start of the experiment. The dataset design meets all three criteria stated above, and hence differentiates itself from other public datasets.
We provide readers with an accompanying datasheet (based on [36]) that gives further information about the dataset, and we host the dataset on the Open Science Framework to ensure it is easily discoverable and can be preserved long-term.⁵ It is worth noting that the dataset is released with the intent to support development of the statistical methods required to run OCEs. The experiment results in the dataset are not representative of ASOS.com's overall business operations, product development, or experimentation program operations, and no conclusion of such should be drawn from this dataset.
⁴ All three report the overall results only and hence are not time-granular. The Upworthy dataset reports group-level statistics and hence is not response-granular. The ASSISTments and MeuTutor datasets are response-granular but lack the timestamps needed to order the samples.

⁵ Link to dataset and accompanying datasheet: [Private link, reviewers please refer to the OpenReview portal]

Figure 2: Change in (left) p-values in a mixture Sequential Probability Ratio Test (τ² = 5.92e-06) and (right) posterior belief in the null hypothesis (π(H₀ | data)) in a Bayesian hypothesis test (V² = 5.93e-06, π(H₀) = 0.75) during the experiment, for five experiments selected at random (Expt.-Variant IDs 08bcc2-1, 591c2c-1, a4386f-1, bac0d3-1, c56288-1). The experiment duration (x-axis) is normalized by each experiment's overall runtime. Only results for Metric 4 are shown.

5.1   Potential use cases
Meta-analyses   The multi-experiment nature of the dataset enables one to perform meta-analyses. A simple example is to characterize the distribution of p-values (under a Welch's t-test) across all experiments (see Figure 1). We observe that roughly a quarter of the experiments in this dataset attain p < 0.05, and attribute this to the fact that what we test in OCEs is often guided by what domain experts think may have an impact. That said, we invite external validation of whether there is evidence for data dredging using e.g. [58].
Design and running of experiments with adaptive stopping   We then demonstrate the dataset can indeed support OCEs with adaptive stopping by performing a mixture Sequential Probability Ratio Test (mSPRT) and a Bayesian hypothesis test via the Bayes factor for each experiment and metric. This requires learning the hyperparameters $\tau^2$ and $V^2$. We learn, for each metric, a naïve estimate for $V^2$ by collating the $\delta_{n,m}$ (see (4)) at the end of each experiment and taking their sample variance. This yields the estimates 1.30e-05, 1.07e-05, 6.49e-06, and 5.93e-06 for the four metrics respectively. For $\tau^2$, we learn near-identical naïve estimates by collating the values of Cohen's d (see (1)) instead. However, as $\tau^2$ captures the spread of unstandardized effect sizes, we specify in each test $\tau^2 = d \cdot (s_X^2)_n$, where $(s_X^2)_n$ is the sample variance of all responses up to the $n$th observation in that particular experiment. The Bayesian tests also require a prior belief in the null hypothesis being true ($\pi(H_0)$); we set it to 0.75 based on what we observed in the t-tests above.
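The naïve $V^2$ estimate can be reproduced along these lines (a sketch; the per-experiment checkpoint record structure is hypothetical):

```python
import numpy as np

def naive_v2_estimate(final_checkpoints):
    """Collate delta_{n,m} (Eq. (4)) at each experiment's final checkpoint
    and take the sample variance as a naive estimate of V^2."""
    deltas = []
    for c in final_checkpoints:  # dicts: x_bar, y_bar, var_x, var_y, n, m
        effective_n = 1.0 / (1.0 / c["n"] + 1.0 / c["m"])
        deltas.append((c["y_bar"] - c["x_bar"]) / np.sqrt(
            (c["var_x"] / c["n"] + c["var_y"] / c["m"]) * effective_n))
    return np.var(deltas, ddof=1)  # sample variance
```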
We then calculate the p-value in the mSPRT and the posterior belief in the null hypothesis ($\pi(H_0 \mid \text{data})$) in the Bayesian test for each experiment and metric at each daily/12-hourly checkpoint, following the procedures stated in [45] and [21, 46] respectively. We plot the results for five experiments selected at random in Figure 2, which shows the p-value for an mSPRT is monotonically non-increasing, while the posterior belief for a Bayesian test can fluctuate depending on the effect size observed so far.
A quasi-benchmark for adaptive stopping methods   Real online controlled experiments, unlike machine learning tasks, generally do not have a notion of ground truth. The use of a quasi-ground truth enables the comparison between two hyperparameter settings of the same adaptive stopping method, or between two adaptive stopping methods. We use the significant / not significant verdict from a Welch's t-test at the end of the experiment as the quasi-ground truth. We can then compare this “ground truth” to the significant / not significant verdict of an mSPRT at different stages of individual experiments. This yields many “confusion matrices” over different stages of individual experiments, where a “Type I error” corresponds to cases where the Welch's t-test gives a not significant result while the mSPRT reports a significant result; a confusion matrix for the end of each experiment can be seen in Table 2. As the dataset was collected without early stopping, it allows us to perform sensitivity analysis and optimization on the hyperparameters of the mSPRT under what can be construed as a “precision-recall” tradeoff for statistically significant treatments.
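Constructing such a confusion matrix amounts to cross-tabulating the two verdicts at a chosen checkpoint; a minimal sketch with made-up verdicts:

```python
import numpy as np
import pandas as pd

# Hypothetical end-of-experiment verdicts (True = significant at 5%).
t_test_sig = np.array([True, False, True, False, False])
msprt_sig = np.array([True, False, False, True, False])

# Cross-tabulate the quasi-ground truth (t-test) against the mSPRT
# verdict; the (t-test False, mSPRT True) cell plays the role of a
# "Type I error" under the quasi-benchmark.
confusion = pd.crosstab(pd.Series(t_test_sig, name="t-test"),
                        pd.Series(msprt_sig, name="mSPRT"))
print(confusion)
```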
Other use cases   The time series nature of this dataset enables one to detect bias (of the estimator) across time, e.g. bias caused by concept drift or feedback loops. In the context of OCEs, [14] described a number of methods to detect invalid experiments over time that may be run on this dataset. Moreover, being both a multi-experiment and a time series dataset, it also enables one to learn the correlation structure across experiments, decision metrics, and time [65, 80].

5.2   Ethical considerations

We finally discuss the ethical implications of releasing the dataset, touching on data protection and anonymization, potential misuses, and the ethical considerations for running OCEs in general.

                t-test significant               t-test not significant
                mSPRT sig.    mSPRT not sig.     mSPRT sig.    mSPRT not sig.
Metric 1            19              7                 2              71
Metric 2            20              7                18              49
Metric 3            16              9                 6              63
Metric 4            16             11                 4              63

Table 2: Comparing the number of statistically significant / not significant results reported by a Welch's t-test and an mSPRT at the end of an experiment for all four metrics.

Data protection and anonymization   The dataset records the aggregated activities of hundreds of thousands or millions of website users for business measurement purposes, and hence it is impossible to identify a particular user. Moreover, to minimize the risk of disclosing business-sensitive information, all experiment context is either removed or anonymized such that one should not be able to tell who is in an experiment, when it was run, what treatment it involves, and what decision metrics are used. We refer readers to the accompanying datasheet⁵ for further details in this area.
Potential misuses   An OCE dataset, no matter how anonymized, reflects the behavior of its participants under a certain application domain and time. We urge potential users of this dataset to exercise caution when attempting to generalize the learnings. It is important to emphasize that the learnings are different from the statistical methods and processes demonstrated on this dataset. We believe the latter are generalizable, i.e. they can be applied to other datasets with similar data dimensions regardless of the datasets' application domain, target demographics, and temporal coverage, and we appeal to potential users of the dataset to focus on these.
One example of generalizing the learnings is the use of this dataset as a full performance benchmark. As discussed above, this dataset does not have a notion of ground truth, and any quasi-ground truths constructed are themselves a source of bias to estimators. Thus, comparisons of experiment designs need to be considered at a theoretical level [52]. Another example would be directly applying the value of hyperparameter(s) obtained while training a model on this dataset to another dataset. While this may work for similar application domains, the less similar the domains are, the less likely the learnt hyperparameters will transfer, introducing a risk of incurring bias both in the estimator and in fairness.
372   Running OCEs in general The dataset is released with the aim of supporting experiments with
373   adaptive stopping, which will enable faster experimentation cycles. As we run more experiments, the
374   ethical concerns around running OCEs in general will naturally mount, as they are ultimately human
375   subjects research. We reiterate the importance of the following three principles when we design and
376   run experiments [35, 63]: respect for persons, beneficence (properly assessing and balancing the risks
377   and benefits), and justice (ensuring participants are not exploited), and refer readers to Chapter 9
378   of [50] and its references for further discussions in this area.

379   6   Conclusion

380   Online controlled experiments (OCE) are a powerful tool for online organizations to assess the
381   impact of their digital products and services. To safeguard future methodological development in the
382   area, it is vital to have access to, and a systematic understanding of, relevant datasets arising from
383   real experiments. We described the results of the first survey of publicly available OCE datasets, and
384   provided a dimensional taxonomy that links the data collection and statistical test requirements. We
385   also released the first dataset that can support OCEs with adaptive stopping, whose design is grounded
386   in a theoretical discussion linking the taxonomy and the statistical tests. Via extensive experiments,
387   we also showed that the dataset is capable of addressing the identified gap in the literature.
388   Our work on surveying, categorizing, and enriching the publicly available OCE datasets is just the
389   beginning, and we invite the community to join in the effort. As discussed above, we have yet to see a
390   dataset that can support methods dealing with correlated data due to cluster randomization, or the
391   end-to-end design and running of experiments with adaptive stopping using Bayesian non-parametric
392   tests. We also see ample opportunity to generalize the survey to cover datasets arising from uplift
393   modeling tasks, quasi-experiments, and observational studies. Finally, we can further expand the
394   taxonomy, which already supports datasets from all experiments, with extra dimensions (e.g. number
395   of features to support stratification, control variate, and uplift modeling methods) as the area matures.

396   Acknowledgments and Disclosure of Funding
397   CHBL is part-funded by the EPSRC CDT in Modern Statistics and Statistical Machine Learning at
398   Imperial College London and University of Oxford (StatML.IO) and ASOS.com.

399   References
400    [1] Christopher P. Adams. Empirical Bayesian estimation of treatment effects. SSRN Electronic Jour-
401        nal, 2018. doi: 10.2139/ssrn.3157819. URL https://doi.org/10.2139/ssrn.3157819.
402    [2] Sadiq Alreemi. Analyze AB test results, 2019. URL https://github.com/SadiqAlreemi/
403        Analyze-AB-Test-Results. Code repository.
404    [3] Florian Auer and Michael Felderer. Current state of research on continuous experimentation: A
405        systematic mapping study. In 2018 44th Euromicro Conference on Software Engineering and
406        Advanced Applications (SEAA), pages 335–344, 2018. doi: 10.1109/SEAA.2018.00062.
407    [4] Florian Auer, Rasmus Ros, Lukas Kaltenbrunner, Per Runeson, and Michael Felderer. Controlled
408        experimentation in continuous experimentation: Knowledge and challenges. Information and
409        Software Technology, 134:106551, 2021. ISSN 0950-5849. doi: https://doi.org/10.1016/
410        j.infsof.2021.106551. URL https://www.sciencedirect.com/science/article/pii/
411        S0950584921000367.
412    [5] Eduardo M. Azevedo, Alex Deng, José L. Montiel Olea, and E. Glen Weyl. Empirical Bayes
413        estimation of treatment effects with many A/B tests: An overview. AEA Papers and Proceedings,
414        109:43–47, May 2019. doi: 10.1257/pandp.20191003. URL https://www.aeaweb.org/
415        articles?id=10.1257/pandp.20191003.
416    [6] Eytan Bakshy and Dean Eckles. Uncertainty in online experiments with dependent data: An
417        evaluation of bootstrap methods. In Proceedings of the 19th ACM SIGKDD International
418        Conference on Knowledge Discovery and Data Mining, KDD ’13, page 1303–1311, New
419        York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450321747. doi:
420        10.1145/2487575.2488218. URL https://doi.org/10.1145/2487575.2488218.
421    [7] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and
422        powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B
423        (Methodological), 57(1):289–300, January 1995. doi: 10.1111/j.2517-6161.1995.tb02031.x.
424        URL https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.
425    [8] Lucas Bernardi, Themistoklis Mavridis, and Pablo Estevez. 150 successful machine learning
426        models: 6 lessons learned at Booking.com. In Proceedings of the 25th ACM SIGKDD Interna-
427        tional Conference on Knowledge Discovery & Data Mining, KDD ’19, page 1743–1751, New
428        York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362016. doi:
429        10.1145/3292500.3330744. URL https://doi.org/10.1145/3292500.3330744.
430    [9] Rasmus Bååth and Bret Romero. Mobile games A/B testing with Cookie Cats [MOOC]. URL
431        https://www.datacamp.com/projects/184.
432   [10] Will Browne and Mike Swarbrick Jones. What works in e-commerce - a meta-analysis
433        of 6700 online experiments, 2017. URL http://www.qubit.com/wp-content/uploads/
434        2017/12/qubit-research-meta-analysis.pdf. White paper.
436   [11] David Campos, Chester Ismay, and Shon Inouye. A/B testing in R | DataCamp [MOOC]. URL
437        https://www.datacamp.com/courses/ab-testing-in-r.
438   [12] David Campos, Chester Ismay, and Shon Inouye. Experiment dataset [Dataset], n.d.
439        URL https://assets.datacamp.com/production/repositories/2292/datasets/
440        52b52cb1ca28ce10f9a09689325c4d94d889a6da/experiment_data.csv.
441   [13] David Campos, Chester Ismay, and Shon Inouye. Data visualization website -
442        April 2018 [Dataset], n.d. URL https://assets.datacamp.com/production/
443        repositories/2292/datasets/b502094e5de478105cccea959d4f915a7c0afe35/
444        data_viz_website_2018_04.csv.

445   [14] Nanyu Chen, Min Liu, and Ya Xu. How A/B tests could go wrong: Automatic diagnosis
446        of invalid online experiments. In Proceedings of the Twelfth ACM International Conference
447        on Web Search and Data Mining, WSDM ’19, page 501–509, New York, NY, USA, 2019.
448        Association for Computing Machinery. ISBN 9781450359405. doi: 10.1145/3289600.3291000.
449        URL https://doi.org/10.1145/3289600.3291000.
450   [15] Sisi Chen. A/B tests project — Udacity DAND P2, 2020. URL https://rachelchen0104.
451        medium.com/a-b-tests-project-udacity-dand-p2-55faddd4d5e0. Blog post.
452   [16] Yuhui Chen and Timothy E. Hanson. Bayesian nonparametric k-sample tests for censored and
453        uncensored data. Computational Statistics & Data Analysis, 71:335–346, 2014. ISSN 0167-
454        9473. doi: https://doi.org/10.1016/j.csda.2012.11.003. URL https://www.sciencedirect.
455        com/science/article/pii/S0167947312003945.
456   [17] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Taylor & Francis, 1988.
457        ISBN 9781134742707.
458   [18] Ahmed Dawoud. E-commerce A/B testing, Version 1 [Dataset], 2021. URL https://www.
459        kaggle.com/ahmedmohameddawoud/ecommerce-ab-testing/version/1.
460   [19] Rajeev H. Dehejia and Sadek Wahba. Causal effects in nonexperimental studies: Reevaluating
461        the evaluation of training programs. Journal of the American Statistical Association, 94(448):
462        1053–1062, 1999. ISSN 01621459. URL http://www.jstor.org/stable/2669919.
463   [20] Alex Deng. Objective Bayesian two sample hypothesis testing for online controlled experi-
464        ments. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15
465        Companion, page 923–928, New York, NY, USA, 2015. Association for Computing Machinery.
466        ISBN 9781450334730. doi: 10.1145/2740908.2742563. URL https://doi.org/10.1145/
467        2740908.2742563.
468   [21] Alex Deng, Jiannan Lu, and Shouyuan Chen. Continuous monitoring of A/B tests without pain:
469        Optional stopping in Bayesian testing. In 2016 IEEE International Conference on Data Science
470        and Advanced Analytics (DSAA), pages 243–252, 2016. doi: 10.1109/DSAA.2016.33.
471   [22] Alex Deng, Ulf Knoblich, and Jiannan Lu. Applying the delta method in metric analytics:
472        A practical guide with novel ideas. In Proceedings of the 24th ACM SIGKDD International
473        Conference on Knowledge Discovery & Data Mining, KDD ’18, page 233–242, New York,
474        NY, USA, 2018. Association for Computing Machinery. ISBN 9781450355520. doi: 10.1145/
475        3219819.3219919. URL https://doi.org/10.1145/3219819.3219919.
476   [23] Marco di Biase. The effects of change decomposition on code review - A controlled experi-
477        ment - Online appendix, May 2018. URL https://data.4tu.nl/articles/dataset/
478        The_Effects_of_Change_Decomposition_on_Code_Review_-_A_Controlled_
479        Experiment_-_Online_appendix/12706046/1.
480   [24] Marco di Biase, Magiel Bruntink, Arie van Deursen, and Alberto Bacchelli. The effects of
481        change decomposition on code review—a controlled experiment. PeerJ Computer Science, 5:
482        e193, May 2019. doi: 10.7717/peerj-cs.193. URL https://doi.org/10.7717/peerj-cs.
483        193.
484   [25] Eustache Diemert, Artem Betlei, Christophe Renaudin, and Amini Massih-Reza. A large
485        scale benchmark for uplift modeling, 2018. URL http://ama.imag.fr/~amini/Publis/
486        large-scale-benchmark.pdf. 2018 AdKDD & TargetAd Workshop (in conjunction with
487        KDD ’18).
488   [26] Yadolah Dodge. Kolmogorov–Smirnov Test. In The Concise Encyclopedia of Statistics, pages
489        283–287. Springer New York, New York, NY, 2008. ISBN 978-0-387-32833-1. doi: 10.1007/
490        978-0-387-32833-1_214. URL https://doi.org/10.1007/978-0-387-32833-1_214.
491   [27] Joseph Doyle, Sarah Abraham, Laura Feeney, Sarah Reimer, and Amy Finkelstein. Clinical
492        decision support for high-cost imaging: a randomized clinical trial [Dataset], 2018. URL
493        https://doi.org/10.7910/DVN/BRKDVQ.

494   [28] Dheeru Dua and Casey Graff. UCI machine learning repository, 2019. URL http://archive.
495        ics.uci.edu/ml.

496   [29] Pascaline Dupas, Vivian Hoffman, Michael Kremer, and Alix Peterson Zwane. Targeting health
497        subsidies through a non-price mechanism: A randomized controlled trial in Kenya, 2016. URL
498        https://doi.org/10.7910/DVN/PBLJXJ.

499   [30] Arpit Dwivedi. Cookie Cats, Version 1 [Dataset], 2020. URL https://www.kaggle.com/
500        arpitdw/cokie-cats/version/1.

501   [31] Osuolale Emmanuel. Ad A/B testing, Version 1 [Dataset], 2020. URL https://www.kaggle.
502        com/osuolaleemmanuel/ad-ab-testing/version/1.

503   [32] Aleksander Fabijan, Pavel Dmitriev, Helena Holmstrom Olsson, and Jan Bosch. Online
504        controlled experimentation at scale: An empirical survey on the current state of A/B testing. In
505        2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA),
506        pages 68–72, 2018. doi: 10.1109/SEAA.2018.00021.

507   [33] Ronald A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical
508        Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or
509        Physical Character, 222(594-604):309–368, January 1922. doi: 10.1098/rsta.1922.0009. URL
510        https://doi.org/10.1098/rsta.1922.0009.

511   [34] Committee for Proprietary Medicinal Products. Points to consider on switching between
512        superiority and non-inferiority. British Journal of Clinical Pharmacology, 52(3):223–228, Sep
513        2001. ISSN 0306-5251. doi: 10.1046/j.0306-5251.2001.01397-3.x. URL https://pubmed.
514        ncbi.nlm.nih.gov/11560553.

515   [35] United States. National Commission for the Protection of Human Subjects of Biomedical and
516        Behavioral Research. The Belmont report: ethical principles and guidelines for the protection
517        of human subjects of research, volume 2. The Commission, 1978.

518   [36] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna
519        Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets, 2020.

520   [37] Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole,
521        Steven N. Goodman, and Douglas G. Altman. Statistical tests, P values, confidence intervals,
522        and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4):337–
523        350, April 2016. doi: 10.1007/s10654-016-0149-3. URL https://doi.org/10.1007/
524        s10654-016-0149-3.

525   [38] Carrie Grimes, Caroline Buckey, and Diane Tang. A/B testing | Udacity free courses [MOOC].
526        URL https://www.udacity.com/course/ab-testing--ud257.

527   [39] F. Richard Guo, James McQueen, and Thomas S. Richardson. Empirical Bayes for large-scale
528        randomized experiments: a spectral approach, 2020.

529   [40] Kevin Hillstrom. The minethatdata e-mail analytics and data mining challenge, 2008.

530   [41] Henning Hohnhold, Deirdre O’Brien, and Diane Tang. Focusing on the long-term: It’s good
531        for users and business. In Proceedings of the 21th ACM SIGKDD International Conference on
532        Knowledge Discovery and Data Mining, KDD ’15, pages 1849–1858, New York, NY, USA,
533        2015. ACM. ISBN 978-1-4503-3664-2.

534   [42] Chris C. Holmes, François Caron, Jim E. Griffin, and David A. Stephens. Two-sample Bayesian
535        nonparametric hypothesis testing. Bayesian Analysis, 10(2):297 – 320, 2015. doi: 10.1214/
536        14-BA914. URL https://doi.org/10.1214/14-BA914.

537   [43] S. Hopewell, S. Dutton, L.-M. Yu, A.-W. Chan, and D. G. Altman. The quality of reports of
538        randomised trials in 2000 and 2006: comparative study of articles indexed in PubMed. BMJ,
539        340(mar23 1):c723–c723, March 2010. doi: 10.1136/bmj.c723. URL https://doi.org/10.
540        1136/bmj.c723.
