(CPA) Receiver operating characteristic (ROC) movies, universal ROC (UROC) curves, and coefficient of predictive ability

Page created by Ruben Ortega
 
CONTINUE READING
Receiver operating characteristic (ROC) movies, universal
                                               ROC (UROC) curves, and coefficient of predictive ability
                                                                        (CPA)

                                                                      Tilmann Gneiting1,2 and Eva-Maria Walz2,1
                                                          1
                                                           Heidelberg Institute for Theoretical Studies (HITS), Germany
arXiv:1912.01956v3 [stat.ML] 24 Jun 2021

                                               2
                                                   Institute for Stochastics, Karlsruhe Institute of Technology (KIT), Germany
                                                                                          June 25, 2021

                                                                                              Abstract
                                                        Throughout science and technology, receiver operating characteristic (ROC) curves and
                                                    associated area under the curve (AUC) measures constitute powerful tools for assessing the
                                                    predictive abilities of features, markers and tests in binary classification problems. Despite
                                                    its immense popularity, ROC analysis has been subject to a fundamental restriction, in that
                                                    it applies to dichotomous (yes or no) outcomes only. Here we introduce ROC movies and
                                                    universal ROC (UROC) curves that apply to just any linearly ordered outcome, along with
                                                    an associated coefficient of predictive ability (CPA) measure. CPA equals the area under the
                                                    UROC curve, and admits appealing interpretations in terms of probabilities and rank based
                                                    covariances. For binary outcomes CPA equals AUC, and for pairwise distinct outcomes CPA
                                                    relates linearly to Spearman’s coefficient, in the same way that the C index relates linearly to
                                                    Kendall’s coefficient. ROC movies, UROC curves, and CPA nest and generalize the tools of
                                                    classical ROC analysis, and are bound to supersede them in a wealth of applications. Their
                                                    usage is illustrated in data examples from biomedicine and meteorology, where rank based
                                                    measures yield new insights in the WeatherBench comparison of the predictive performance
                                                    of convolutional neural networks and physical-numerical models for weather prediction.

                                                    Keywords: C index; classification and regression; evaluation metric; rank correlation; coeffi-
                                                    cient; ROC analysis

                                           1       Introduction
                                           Originating from signal processing and psychology, popularized in the 1980s (Hanley and McNeil,
                                           1982; Swets, 1988), and witnessing a surge of usage in machine learning (Bradley, 1997; Huang
                                           and Ling, 2005; Fawcett, 2006; Flach, 2016), receiver operating characteristic or relative operating
                                           characteristic (ROC) curves and area under the ROC curve (AUC) measures belong to the most
                                           widely used quantitative tools in science and technology. Strikingly, a Web of Science topic search
                                           for the terms “receiver operating characteristic” or “ROC” yields well over 15,000 scientific papers
                                           published in calendar year 2019 alone. In a nutshell, the ROC curve quantifies the potential value
                                           of a real-valued classifier score, feature, marker, or test as a predictor of a binary outcome. To give
                                           a classical example, Fig. 1 illustrates the initial levels of two biomedical markers, serum albumin

                                                                                                   1
a                                                    b                                                                          c
                                                                         1
                                                                                                                                         20
         9       Survival                                                                                                                     Survival
                 Non−Survival                                           5/6                                                                   Non−Survival

                                                                                                                                         15
                                                                        2/3

                                                          Sensitivity
 Count

                                                                                                                                 Count
         6
                                                                        1/2                                                              10

                                                                        1/3
         3
                                                                                                                                          5
                                                                                                            AUC: 0.77
                                                                        1/6
                                                                                                            AUC: 0.73

         0                                                               0                                                                0
             2                  3                 4                           0     1/6    1/3       1/2   2/3      5/6      1                          20            10        0
                          Serum Albumin (mg/dl)                                            1 − Specificity                                            Serum Bilirubin (mg/dl)

                                      its a test                                  Marker         Albumin         Bilirubin                                        its a test
Figure 1: Traditional ROC curves for two biomedical markers, serum albumin and serum biliru-
bin, as predictors of patient survival beyond a threshold value of 1462 days (four years) in a Mayo
Clinic trial. (a, c) Bar plots of marker levels conditional on survival or non-survival. The stronger
shading results from overlap. For bilirubin, we reverse orientation, as is customary in the biomed-
ical literature. (b) ROC curves and AUC values. The crosses correspond to binary classifiers at
the feature thresholds indicated in the bar plots.

and serum bilirubin, in a Mayo Clinic trial on primary biliary cirrhosis (PBC), a chronic fatal
disease of the liver (Dickson et al., 1989). While patient records specify the duration of survival in
days, traditional ROC analysis mandates the reduction of the outcome to a binary event, which
here we take as survival beyond four years. Assuming that higher marker values are more indicative
of survival, we can take any threshold value to predict survival if the marker exceeds the threshold,
and non-survival otherwise. This type of binary classifier yields true positives, false positives
(erroneous predictions of survival), true negatives, and false negatives (erroneous predictions of
non-survival). The ROC curve is the piecewise linear curve that plots the true positive rate, or
sensitivity, versus the false positive rate, or one minus the specificity, as the threshold for the
classifier moves through all possible values.
    Despite its popularity, ROC analysis has been subject to a fundamental shortcoming, namely,
the restriction to binary outcomes. Real-valued outcomes are ubiquitous in scientific practice, and
investigators have been forced to artificially make them binary if the tools of ROC analysis are
to be applied. In this light, researchers have been seeking generalizations of ROC analysis that
apply to just any type of ordinal or real-valued outcomes in natural ways (Etzioni et al., 1999;
Heagerty et al., 2000; Bi and Bennett, 2003; Pencina and D’Agostino, 2004; Heagerty and Zheng,
2005; Rosset et al., 2005; Mason and Weigel, 2009; Hernández-Orallo, 2013). Still, notwithstanding
decades of scientific endeavor, a fully satisfactory generalization has been elusive.
    In this paper, we propose a powerful generalization of ROC analysis, which overcomes extant
shortcomings, and introduce data science tools in the form of the ROC movie, the universal ROC
(UROC) curve, and an associated, rank based coefficient of (potential) predictive ability (CPA)
measure — tools that apply to just any linearly ordered outcome, including both binary, ordinal,
mixed discrete-continuous, and continuous variables. The ROC movie comprises the sequence of
the traditional, static ROC curves as the linearly ordered outcome is converted to a binary variable
at successively higher thresholds. The UROC curve is a weighted average of the individual ROC
curves that constitute the ROC movie, with weights that depend on the class configuration, as
induced by the unique values of the outcome, in judiciously predicated, well-defined ways. CPA
is a weighted average of the individual AUC values in the very same way that the UROC curve

                                                                                                 2
Start ROC Movie

Figure 2: ROC movies, UROC curves, and CPA for two biomedical markers, serum albumin and
serum bilirubin, as predictors of patient survival (in days) in a Mayo Clinic trial. The ROC movies
show the traditional ROC curves for binary events that correspond to patient survival beyond
successively higher thresholds. The numbers at upper left show the current value of the threshold
in days, at upper middle the respective relative weight, and at bottom right the AUC values. The
threshold value of 1462 days recovers the traditional ROC curves in Fig. 1. The video ends in a
static screen with the UROC curves and CPA values for the two markers.

is a weighted average of the individual ROC curves that constitute the ROC movie. Hence, CPA
equals the area under the UROC curve. This set of generalized tools reduces to the standard ROC
curve and AUC when applied to binary outcomes. Moreover, key properties and relations from
conventional ROC theory extend to ROC movies, UROC curves, and CPA in meaningful ways,
to result in a coherent toolbox that properly extends the standard ROC concept. For a graphical
preview, we return to the survival data example from Fig. 1, where the outcome was artificially
made binary. Equipped with the new set of tools we no longer need to transform survival time
into a specific dichotomous outcome. Figure 2 shows the ROC movie, the UROC curve, and CPA
for the survival dataset.
    The remainder of the paper is organized as follows. Section 2 provides a brief review of conven-
tional ROC analysis for dichotomous outcomes. The key technical development is in Sections 3 and
4, where we introduce and study ROC movies, UROC curves, and the rank based CPA measure.
To illustrate practical usage and relevance, real data examples from survival analysis and weather
prediction are presented in Section 5. We monitor recent progress in numerical weather prediction
(NWP) and shed new light on a recent comparison of the predictive abilities of convolutional neural
networks (CNNs) vs. traditional NWP models. The paper closes with a discussion in Section 6.

2    Receiver operating characteristic (ROC) curves and area
     under the curve (AUC) for binary outcomes
Before we introduce ROC movies, UROC curves, and CPA, it is essential that we establish notation
and review the classical case of ROC analysis for binary outcomes, as described in review articles

                                                 3
and monographs by Hanley and McNeil (1982), Swets (1988), Bradley (1997), Pepe (2003), Fawcett
(2006), and Flach (2016), among others.

2.1    Binary setting
Throughout this section we consider bivariate data of the form

                                   (x1 , y1 ), . . . , (xn , yn ) ∈ R × {0, 1},                           (1)

where xi ∈ R is a real-valued classifier score, feature, marker, or covariate value, and yi ∈ {0, 1} is
a binary outcome, for i = 1, . . . , n. Following the extant literature, we refer to y = 1 as the positive
outcome and to y = 0 as the negative outcome, and we assume that higher values of the feature
are indicative of stronger support for the positive outcome. Throughout we assume that there is
at least one index i ∈ {1, . . . , n} with yi = 0, and a further index j ∈ {1, . . . , n} with yj = 1.

2.2    Receiver operating characteristic (ROC) curves
We can use any threshold value x ∈ R to obtain a hard classifier, by predicting a positive outcome
for a feature value > x, and predicting a negative outcome for a feature value ≤ x. If we compare
to the actual outcome, four possibilities arise. True positive and true negative cases correspond
to correctly classified instances from class 1 and class 0, respectively. Similarly, false positive and
false negative cases are misclassified instances from class 1 and class 0, respectively.
    Considering the data (1), we obtain the respective true positive rate, hit rate or sensitivity (se),

                                               i=1 1{xi > x, yi = 1}
                                           1
                                             Pn
                                           n
                                  se(x) =                            ,
                                                   i=1 1{yi = 1}
                                              1
                                                Pn
                                              n

and the false negative rate, false alarm rate or one minus the specificity (sp),
                                                   1{xi > x, yi = 0}
                                            1
                                              Pn
                               1 − sp(x) = n 1i=1                    ,
                                                    i=1 1{yi = 0}
                                                  Pn
                                                n

at the threshold value x ∈ R, where the indicator 1{A} equals one if the event A is true and zero
otherwise.
    Evidently, it suffices to consider threshold values x equal to any of the unique values of x1 , . . . , xn
or some x0 < x1 . For every x of this form, we obtain a point

                                              (1 − sp(x), se(x))

in the unit square. Linear interpolation of the respective discrete point set results in a piecewise
linear curve from (0, 0) to (1, 1) that is called the receiver operating characteristic (ROC) curve.
For a mathematically oriented, detailed discussion of the construction see Section 2 of Gneiting
and Vogel (2018).

2.3    Area under the curve (AUC)
The area under the ROC curve is a widely used measure of the predictive potential of a feature
and generally referred to as the area under the curve (AUC).
   In what follows, a well-known interpretation of AUC in terms of probabilities will be useful.
To this end, we define the function
                                                                1
                                   s(x, x0 ) = 1{x < x0 } +       1{x = x0 },                             (2)
                                                                2

                                                        4
where x, x0 ∈ R. For subsequent use, note that if x and x0 are ranked within a list, and ties are
resolved by assigning equal ranks within tied groups, then s(x, x0 ) = s(rk(x), rk(x0 )), where rk(x)
and rk(x0 ) are the ranks of x and x0 .
    We now change
             Pn notation and refer toP            the feature values in class i ∈ {0, 1} as xik for k = 1, . . . , ni ,
where n0 = i=1 1{yi = 0} and n1 = i=1 1{yi = 1}, respectively. Thus, we have rewritten (1)
                                                    n

as
                      (x01 , 0), . . . , (x0n0 , 0), (x11 , 1), . . . , (x1n1 , 1) ∈ R × {0, 1}.                   (3)
Using the new notation, Result 4.10 of Pepe (2003) states that
                                                         n0 Xn1
                                                   1 X
                                       AUC =                    s(x0i , x1j ).                                     (4)
                                                  n0 n1 i=1 j=1

In words, AUC equals the probability that under random sampling a feature value from a positive
instance is greater than a feature value from a negative instance, with any ties resolved at random.
Expressed differently, AUC equals the tie-adjusted probability of concordance in feature–outcome
pairs, where we define instances (x, y) ∈ R2 and (x0 , y 0 ) ∈ R2 with y 6= y 0 to be concordant if either
x > x0 and y > y 0 , or x < x0 and y < y 0 . Similarly, instances (x, y) and (x0 , y 0 ) with y 6= y 0 are
discordant if either x > x0 and y < y 0 , or x < x0 and y > y 0 .
    Further investigation reveals a close connection to Somers’ D, a classical measure of ordinal
association (Somers, 1962). This measure is defined as
                                                         nc − nd
                                                   D=            ,
                                                          n0 n1
where n0 n1 is the total number of pairs with distinct outcomes that arise from the data in (3), nc
is the number of concordant pairs, and nd is the number of discordant pairs. Finally, let ne be the
number of pairs for which the feature values are equal. The relationship (4) yields
                                                        nc     1 ne
                                            AUC =            +         ,
                                                       n0 n1   2 n0 n1
and as n0 n1 = nc + nd + ne , it follows that
                                                          1
                                                AUC =       (D + 1)                                                (5)
                                                          2
relates linearly to Somers’ D.
     To give an example, suppose that the real-valued outcome Y and the features X, X 0 and
X are jointly Gaussian. Specifically, we assume that the joint distribution of (Y, X, X 0 , X 00 ) is
  00

multivariate normal with covariance matrix
                                                      
                                       1 0.8 0.5 0.2
                                    0.8 1 0.8 0.5
                                    0.5 0.8 1 0.8 .                                            (6)
                                                      

                                      0.2 0.5 0.8 1

In order to apply classical ROC analysis, the real-valued outcome Y needs to be converted to a
binary variable, namely, an event of the type Yθ = 1{Y ≥ θ} of Y being greater than or equal to
a threshold value θ. Figure 3 shows ROC curves for the features X, X 0 and X 00 as a predictor of
the binary variable Y1 , based on a sample of size n = 400. The AUC values for X, X 0 and X 00 as
a predictor of Y1 are .91, .72 and .61, respectively.

                                                          5
ROC Curves
                                    1

                                   5/6

                                   2/3
                     Sensitivity

                                                                                            Features
                                   1/2                                                            X
                                                                                                  X'
                                                                                                  X''
                                   1/3

                                                                         AUC: 0.92
                                   1/6                                   AUC: 0.76
                                                                         AUC: 0.62

                                    0

                                          0   1/6     1/3    1/2       2/3     5/6   1
                                                       1 − Specificity

Figure 3: Traditional ROC curves and AUC values for the features X, X 0 and X 00 as predictors
of the binary outcome Y1 = 1{Y ≥ 1} in the simulation example of Section 2.3, based on a sample
of size n = 400.

2.4    Key properties
A key requirement for a persuasive generalization of classical ROC analysis is the reduction to ROC
curves and AUC if the outcomes are binary. Furthermore, well established desirable properties from
ROC analysis ought to be retained. To facilitate judging whether the generalization in Sections
3 and 4 satisfies these desiderata, we summarize key properties of ROC curves and AUC in the
following (slightly informal) listing.
 (1) The ROC curve and AUC are straightforward to compute and interpret, in the (rough) sense
     of the larger the better.
 (2) AUC attains values between 0 and 1 and relates linearly to Somers’ D. For a perfect feature,
     AUC = 1 and D = 1; for a feature that is independent of the binary outcome, AUC = 12 und
     D = 0.
 (3) The numerical value of AUC admits an interpretation as the probability of concordance for
     feature–outcome pairs.
 (4) The ROC curve and AUC are purely rank based and, therefore, invariant under strictly
     increasing transformations. Specifically, if φ : R → R is a strictly increasing function, then
     the ROC curve and AUC computed from
                                              (φ(x1 ), y1 ), . . . , (φ(xn ), yn ) ∈ R × {0, 1}         (7)
      are the same as the ROC curve and AUC computed from (1).
As an immediate consequence of the latter property, ROC curves and AUC assess the discrimination
ability or potential predictive ability of a classifier, feature, marker, or test (Wilks, 2019). Distinctly
different methods are called for if one seeks to evaluate a classifier’s actual value in any given applied
setting (Adams and Hands, 1999; Hernández-Orallo et al., 2012; Ehm et al., 2016).

                                                                   6
3     ROC movies and universal ROC (UROC) curves for real-
      valued outcomes
As noted, traditional ROC analysis applies to binary outcomes only. Thus, researchers working
with real-valued outcomes, and desiring to apply ROC analysis, need to convert and reduce to
binary outcomes, by thresholding artificially at a cut-off value. Here we propose a powerful gen-
eralization of ROC analysis, which overcomes extant shortcomings, and introduce data analytic
tools in the form of the ROC movie, the universal ROC (UROC) curve, and an associated rank
based coefficient of (potential) predictive ability (CPA) measure — tools that apply to just any lin-
early ordered outcome, including both binary, ordinal, mixed discrete-continuous, and continuous
variables.

3.1    General real-valued setting
Generalizing the binary setting in (1), we now consider bivariate data of the form

                                    (x1 , y1 ), . . . , (xn , yn ) ∈ R × R,                            (8)

where xi is a real-valued point forecast, regression output, feature, marker, or covariate value, and
yi is a real-valued outcome, for i = 1, . . . , n. Throughout we assume that there are at least two
unique values among the outcomes y1 , . . . , yn .
    The crux of the subsequent development lies in a conversion to a sequence of binary problems.
To this end, we let
                                              z1 < · · · < zm
denote the m ≤ n unique values of y1 , . . . , yn , and we define
                                                 n
                                                       1{yi = zc }
                                                 X
                                          nc =
                                                 i=1

as the number of instances among the outcomes y1 , . . . , yn that equal zc , for c = 1, . . . , m, so that
n1 + · · · + nm = n. We refer to the respective groups of instances as classes.
    Next we transform the real-valued outcomes y1 , . . . , yn into binary outcomes 1{y1 ≥
θ}, . . . , 1{yn ≥ θ} relative to a threshold value θ ∈ R. Thus, instead of analysing the original
problem in (8), we consider a series of binary problems. By construction, only values of θ equal
to z2 , . . . , zm result in nontrivial, unique sets of binary outcomes. Therefore, we consider m − 1
derived classification problems with binary data of the form

                      (x1 , 1{y1 ≥ zc+1 }), . . . , (xn , 1{yn ≥ zc+1 }) ∈ R × {0, 1},                 (9)

where c = 1, . . . , m − 1. As the derived problems are binary, all the tools of traditional ROC
analysis apply.
    In the remainder of the section we describe our generalization of ROC curves for binary data
to ROC movies and universal ROC (UROC) curves for real-valued data. First, we argue that the
m − 1 classical ROC curves for the derived data in (9) can be merged into a single dynamical
display, to which we refer as a ROC movie (Definition 1). Then we define the UROC curve as
a judiciously weighted average of the classical ROC curves of which the ROC movie is composed
(Definition 2).
    Finally, we introduce a general measure of potential predictive ability for features, termed the
coefficient of predictive ability (CPA). CPA is a weighted average of the AUC values for the derived

                                                       7
binary problems in the very same way that the UROC curve is a weighted average of the (classical)
ROC curves that constitute the ROC movie. Hence, CPA equals the area under the UROC curve
(Definition 3). Alternatively, CPA can be interpreted as a weighted probability of concordance
(Theorem 1) or in terms of rank based covariances (Theorem 2). CPA reduces to AUC if the
outcomes are binary, and relates linearly to Spearman’s rank correlation coefficient if the outcomes
are continuous (Theorems 3 and 4).

3.2    ROC movies
We consider the sequence of m − 1 classification problems for the derived binary data in (9). For
c = 1, . . . , m−1, we let ROCc denote the associated ROC curve, and we let AUCc be the respective
AUC value.
Definition 1. For data of the form (8), the ROC movie is the sequence (ROCc )c=1,...,m−1 of the
ROC curves for the induced binary data in (9).
    If the original problem is binary there are m = 2 classes only, and the ROC movie reduces
to the classical ROC curve. In case the outcome attains m ≥ 3 distinct values the ROC movie
can be visualized by displaying the associated sequence of m − 1 ROC curves. In medical survival
analysis, the outcomes y1 , . . . , yn in data of the form (8) are survival times, and the analysis is
frequently hampered by censoring, as patients drop out of studies. In this setting, Etzioni et al.
(1999) and Heagerty et al. (2000) introduced the notion of time-dependent ROC curves, which are
classical ROC curves for the binary indicator 1{yi ≥ t} of survival through (follow-up) time t, with
censoring being handled efficiently. For an example see Fig. 2 of Heagerty et al. (2000), where the
ROC curves concern survival beyond follow-up times of 40, 60, and 100 months, respectively. If
the thresholds considered correspond to the unique values of the outcomes, the sequence of time-
dependent ROC curves becomes a ROC movie in the sense of Definition 1, save for the handling
of censored data. When the number m ≤ n of classes is small or modest, the generation of the
ROC movie is straightforward. Adaptations might be required as m grows, and we tend to this
question in Section 5.2.
    We have implemented ROC movies, UROC curves, and CPA within the uroc package for
the statistical programming language R (R Core Team, 2021) where the animation package of
Xie (2013) provides functionality for converting R images into a GIF animation, based on the
external software ImageMagick. The uroc package can be downloaded from https://github.c
om/evwalz/uroc. In addition, a Python (Python, 2021) implementation is available at https:
//github.com/evwalz/urocc. Returning to the example of Section 2.3, Fig. 4 compares the
features X, X 0 and X 00 as predictors of the real-valued outcome Y in a joint display of the three
ROC movies and UROC curves, based on the same sample of size n = 400 as in Fig. 3. In the
ROC movies, the threshold z = 1.00 recovers the traditional ROC curves in Fig. 3.

3.3    Universal ROC (UROC) curves
Next we propose a simple and efficient way of subsuming a ROC movie for data of the form (8)
into a single, static P   graphical display. As before, let z1 < · · · < zm denote the unique values of
y1 , . . . , yn , let nc = i=1 1{yi = zc }, and let ROCc denote the (classical) ROC curve associated
                            n

with the binary problem in (9), for c = 1, . . . , m − 1.
     By Theorem 5 of Gneiting and Vogel (2018), there is a natural bijection between the class
of the ROC curves and the class of the cumulative distribution functions (CDFs) of Borel prob-
ability measures on the unit interval. In particular, any ROC curve can be associated with a
non-decreasing, right-continuous function R : [0, 1] → [0, 1] such that R(0) = 0 and R(1) = 1.

                                                  8
Start ROC Movie

Figure 4: ROC movies and UROC curves for the features X, X 0 and X 00 as predictors of the
real-valued outcome Y in the simulation example of Section 2.3, based on the same sample as in
Fig. 3. In the ROC movies, the number at upper left shows the threshold under consideration, the
number at upper center the relative weight wc / maxl=1,...,m−1 wl from (11), and the numbers at
bottom right the respective AUC values.

Hence, any convex combination of the ROC curves ROC1 , . . . , ROCm−1 can also be associated
with a non-decreasing, right-continuous function on the unit interval. It is in this sense that we
define the following; in a nutshell, the UROC curve averages the traditional ROC curves of which
the ROC movie is composed.

Definition 2. For data of the form (8), the universal receiver operating characteristic (UROC)
curve is the curve associated with the function
                                                    m−1
                                                    X
                                                              wc ROCc                           (10)
                                                        c=1

on the unit interval, with weights

                                     c          m
                                                           !, m−1 m              
                                     X          X               X X
                            wc =           ni           ni          (j − i)ni nj              (11)
                                     i=1        i=c+1                i=1 j=i+1

for c = 1, . . . , m − 1.
   Importantly, the weights in (11) depend on the data in (8) via the outcomes y1 , . . . , yn only.
Thus, they are independent of the feature values and can be used meaningfully in order to compare
and rank features. Their specific choice is justified in Theorems 1 and 2 below. Clearly, the weights
are nonnegative and sum to one. If m = n then n1 = · · · = nm = 1, and (11) reduces to

                                             c(n − c)
                                   wc = 6                      for   c = 1, . . . , n − 1;      (12)
                                            n(n2 − 1)

                                                               9
so the weights are quadratic in the rank c and symmetric about the inner most rank(s), at which
they attain a maximum. As we will see, our choice of weights has the effect that in this setting
the area under the UROC curve, to which we refer as a general coefficient of predictive ability
(CPA), relates linearly to Spearman’s rank correlation coefficient, in the same way that AUC
relates linearly to Somers’ D.
    In Fig. 4 the UROC curves appear in the final static screen, subsequent to the ROC movies.
Within each ROC movie, the individual frames show the ROC curve ROCc for the feature consid-
ered. Furthermore, we display the threshold zc , the relative weight from (11) (the actual weight
normalized to the unit interval, i.e., we show wc / maxl=1,...,m−1 wl ), and AUCc , respectively, for
c = 1, . . . , m − 1. Once more we emphasize that the use of ROC movies, UROC curves, and CPA
frees researchers from the need to select — typically, arbitrary — threshold values and binarize,
as mandated by classical ROC analysis.
    Of course, if specific threshold values are of particular substantive interest, the respective ROC
curves can be extracted from the ROC movie, and it can be useful to plot AUCc versus the
associated threshold value zc . Displays of this type have been introduced and studied by Rosset
et al. (2005).

4     Coefficient of predictive ability (CPA)
We proceed to define the coefficient of predictive ability (CPA) as a general measure of potential
predictive ability, based on notation introduced in Sections 3.2 and 3.3.
Definition 3. For data of the form (8) and weights w1 , . . . , wm−1 as in (11), the coefficient of
predictive ability (CPA) is defined as
                                                   m−1
                                                   X
                                          CPA =           wc AUCc .                                      (13)
                                                    c=1

In words, CPA equals the area under the UROC curve.
    Importantly, ROC movies, UROC curves, and CPA satisfy a fundamental requirement on any
generalization of ROC curves and AUC, in that they reduce to the classical notions when applied
to a binary problem, whence m = 2 in (10) and (13), respectively. Another requirement that we
consider essential is that, when both the feature values x1 , . . . , xn and the outcomes y1 , . . . , yn are
pairwise distinct, the value of a performance measure remains unchanged if we transpose the roles
of the feature and the outcome. As we will see, this is true under our specific choice (11) of the
weights wc in the defining formula (13) for CPA, but is not true under other choices, such as in
the case of equal weights.

4.1    Interpretation as a weighted probability
We now express CPA in terms of pairwise comparisons via the function s in (2). To this end, we
usefully change notation for the data in (8) and refer to the feature values in class c ∈ {1, . . . , m}
as xck , for k = 1, . . . , nc . Thus, we rewrite (8) as

                  (x11 , z1 ), . . . , (x1n1 , z1 ), . . . , (xm1 , zm ), . . . , (xmnm , zm ) ∈ R × R,  (14)

where z1 < · · · < zm are the unique values of y1 , . . . , yn and nc = i=1 1{yi = zc }, for c = 1, . . . , m.
                                                                                     Pn

                                                     10
Theorem 1. For data of the form (14),
                                 Pm−1 Pm           Pni Pnj
                                   i=1        j=i+1     k=1     l=1 (j   − i) s(xik , xjl )
                       CPA =                  Pm−1    Pm                                      .   (15)
                                                i=1     j=i+1 (j   − i) ni nj

Proof. By (4), the individual AUC values satisfy
                                                            c m X  nj
                                                                ni X
                                              1             X X
                      AUCc = Pc               Pm                                 s(xik , xjl )
                                   i=1   ni    i=c+1   ni   i=1 j=c+1 k=1 l=1

for c = 1, . . . , m − 1. In view of (11) and (13), summation yields
                                 m−1
                                 X
                       CPA =         wc AUCc
                                 c=1
                                 Pm−1 Pc Pm          Pni Pnj
                                   c=1   i=1   j=c+1    k=1    l=1 s(xik , xjl )
                             =           Pm−1 Pm
                                           i=1   j=i+1 (j − i) ni nj
                                 Pm−1 Pm       Pni Pnj
                                   i=1   j=i+1   k=1    l=1 (j − i) s(xik , xjl )
                             =           Pm−1 Pm                                  ,
                                           i=1    j=i+1 (j − i) ni nj

as claimed.
    Thus, CPA is based on pairwise comparisons of feature values, counting the number of con-
cordant pairs in (14), adjusting to a count of 12 if feature values are tied, and weighting a pair’s
contribution by a class based distance, j − i, between the respective outcomes, zj > zi . In other
words, CPA equals a weighted probability of concordance, with weights that grow linearly in the
class based distance between outcomes.
    The specific form of CPA in (15) invites comparison to a widely used measure of discrimination
in biomedical applications, namely, the C index (Harrell et al., 1996; Pencina and D’Agostino,
2004)
                                 Pm−1 Pm         Pni Pnj
                                   i=1     j=i+1   k=1    l=1 s(xik , xjl )
                            C=            Pm−1 Pm                           .                  (16)
                                             i=1    j=i+1 ni nj

If the outcomes are binary, both the C index and CPA reduce to AUC. While CPA can be in-
terpreted as a weighted probability of concordance, C admits an interpretation as an unweighted
probability, whence Mason and Weigel (2009) recommend its use for administrative purposes.
However, the weighting in (15) may be more meaningful, as concordance between feature–outcome
pairs with outcomes that differ substantially in rank tends to be of greater practical relevance than
concordance between pairs with alike outcomes. While CPA admits the appealing, equivalent in-
terpretation (13) in terms of binary AUC values and the area under the UROC curve, relationships
of this type are unavailable for the C index.
    Subject to conditions, the C index relates linearly to Kendall’s rank correlation coefficient
(Somers, 1962; Pencina and D’Agostino, 2004; Mason and Weigel, 2009). In Section 4.3 we demon-
strate the same type of relationship for CPA and Spearman’s rank correlation coefficient, thereby
resolving a problem raised by Heagerty and Zheng (2005, p. 95). Just as the C index bridges and
generalizes AUC and Kendall’s coefficient, CPA bridges and nests AUC and Spearman’s coefficient,
with the added benefit of appealing interpretations in terms of the area under the UROC curve
and rank based covariances.

                                                       11
4.2    Representation in terms of covariances
The key result in this section represents CPA in terms of the covariance between the class of
the outcome and the mid rank of the feature, relative to the covariance between the class of the
outcome and the mid rank of the outcome itself.
    The mid rank method handles ties by assigning the arithmetic average of the ranks involved
(Woodbury, 1940; Kruskal, 1958). For instance, if the third to seventh positions in a list are tied,
their shared mid rank is 15 (3 + 4 + 5 + 6 + 7) = 5. This approach treats equal values alike and
guarantees that the sum of the ranks in any tied group is unchanged from the case of no ties. As
before, if yi = zj , where z1 < · · · < zm are the unique values of y1 , . . . , yn in (8), we say that the
class of yi is j. In brief, we express this as cl(yi ) = j. Similarly, we refer to the mid rank of xi
within x1 , . . . , xn as rk(xi ).
Theorem 2. Let the random vector (X, Y ) be drawn from the empirical distribution of the data
in (8) or (14). Then
                                                          
                                  1 cov(cl(Y ), rk(X))
                          CPA =                         +1 .                             (17)
                                  2 cov(cl(Y ), rk(Y ))
Proof. Suppose that the law of the random vector (X, Y ) is the empirical distribution of the data
in (8). Based on the equivalent representation in (14), we find that
                                       Pm Pni                               Pm
           cov(cl(Y ), rk(X))             i=1       i rk(xik ) − 21 (n + 1) i=1 ini
                                =P         P k=1                                          ,
            cov(cl(Y ), rk(Y ))    m          i−1        1               1
                                                                                   Pm
                                   i=1 ini    j=0 n j +  2 (n i + 1)   − 2 (n + 1)  i=1 ini

where n0 = 0. Consequently, we can rewrite (17) as
                     Pm Pni                    Pm         P                         
                                                               i−1       1         1
                       i=1   k=1 i rk(xik ) +     i=1 ini      j=0 nj + 2 ni − n − 2
              CPA =                Pm           P                                    .                            (18)
                                                     i−1
                                      i=1 in i  2    j=0  n j + ni − n

   We proceed to demonstrate that the numerator and denominator in (15) equal the numerator
and denominator in (18), respectively. To this end, we first compare feature values within classes
and note that
                  m X ni Xni                   m    ni                  m
                 X                            X     X            1     1X 2
                             is(xil , xik ) =     i     ni − k +     =      in ;
                 i=1                          i=1
                                                                 2     2 i=1 i
                        k=1 l=1                             k=1
for if the feature values in class i are all distinct, the largest one exceeds ni − 1 others, the second
largest exceeds ni − 2 others, and so on, and analogously in case of ties. We now show the equality
of the numerators in (15) and (18), in that
            m−1   m X  nj
                    ni X
            X     X
                          (j − i) s(xik , xjl )
             i=1 j=i+1 k=1 l=1
                m−1   m X  nj
                        ni X                             m−1     m X  nj
                                                                   ni X
                X     X                                  X       X
            =                        j s(xik , xjl ) −                           is(xik , xjl )
                i=1 j=i+1 k=1 l=1                        i=1 j=i+1 k=1 l=1
                               m−1      nj ni
                                      m X                                  m−1     nj ni
                                                                                 m X
                               X      X    X                               X     X    X
                           +                           j s(xik , xjl ) −                          j s(xik , xjl )
                               j=1 i=j+1 k=1 l=1                           j=1 i=j+1 k=1 l=1
                m X
                  m X  nj
                    ni X                              m−1   m X  nj
                                                              ni X
                X                                     X     X
            =                     j s(xik , xjl ) −                         i (s(xjl , xik ) + s(xik , xjl ))
                i=1 j=1 k=1 l=1                       i=1 j=i+1 k=1 l=1
                    j6=i

                                                            12
m Xnj                X m Xni X
                                                ni                  m−1   m
                X                   1                               X X
            =          j rk(xjl ) −    −           is(xil , xik ) −           ini nj
                j=1
                                    2    i=1                        i=1 j=i+1
                     l=1                                       k=1 l=1
                  ni
                m X                              m                m          m−1       m−1      i
                X                            1   X             1X 2          X         X       X
            =              irk(xik) −                  ini −         ini − n     ini +     ini     nj
                i=1 k=1
                                             2   i=1
                                                               2 i=1         i=1       i=1     j=0
                  ni
                m X                 m            m           m         m       i
                X                1X           1X 2          X         X       X
            =              irk(xik) −  ini −        ini − n     ini +     ini     nj
              i=1 k=1
                                 2 i=1        2 i=1         i=1       i=1     j=0
                                                                 
               m X ni             m        i−1
              X                  X         X         1          1
            =         irk(xik) +     ini      nj + ni − n −  .
              i=1                i=1       j=0
                                                     2          2
                     k=1

As for the denominators,
           m−1
           X        m
                    X                             m−1
                                                  X      m
                                                         X                     m−1
                                                                               X         m
                                                                                         X
                          (j − i) ni nj =                        j ni nj −                   ini nj
            i=1 j=i+1                             i=1 j=i+1                    i=1 j=i+1
                    m
                    X           i−1
                                X                m−1
                                                 X             m−1
                                                               X              i
                                                                              X
                =         ini         nk − n           ini +           ini          nk
                    i=1         k=0              i=1             i=1          k=1
                     m
                     X           i−1
                                 X                m−1
                                                  X              m−1
                                                                 X                  m−1
                                                                                    X           i−1
                                                                                                X            m
                                                                                                             X           i−1
                                                                                                                         X
                =2         ini         nk − n            ini +          in2i +            ini         nk −         ini         nk
                     i=1         k=0               i=1            i=1               i=1         k=0          i=1         k=0
                     m
                     X           i−1
                                 X                m−1
                                                  X               m−1
                                                                  X
                =2         ini         nk − n            ini +          in2i − nmnm + mn2m
                     i=1         k=0               i=1            i=1
                     m
                     X           i−1
                                 X                m
                                                  X              m
                                                                 X
                =2         ini         nk − n            ini +         in2i
                     i=1         k=0              i=1            i=1
                                                          
                    m
                    X                 i−1
                                      X
                =         ini 2            nj + ni − n ,
                    i=1               j=0

whence the proof is complete.
    Interestingly, the representation (17) in terms of rank and class based covariances appears to
be new even in the special case when the outcomes are binary, so that CPA reduces to AUC. The
representation also sheds new light on the asymmetry of CPA, in that, in general, the value of CPA
changes if we transpose the roles of the feature and the outcome. In contrast to customarily used
measures of bivariate association and dependence, which are necessarily symmetric (Nešlehová,
2007; Reshef et al., 2011; Weihs et al., 2018), CPA is directed when the outcome is binary or
ordinal. Thus, CPA avoids a technical issue with the use of rank-based correlation coefficients in
discrete settings, namely, that perfect classifiers do not reach the optimal values of the respective
performance measures (Nešlehová, 2007, p. 565). However, in the case of no ties at all, to which we
tend now, CPA becomes symmetric, as one would expect, given that the feature and the outcome
are on equal footing then.

                                                                   13
4.3     Relationship to Spearman’s rank correlation coefficient
Spearman’s rank correlation coefficient ρS for data of the form (8) is generally understood as
Pearson’s correlation coefficient applied to the respective ranks (Spearman, 1904). In case there are
no ties in either x1 , . . . , xn nor y1 , . . . , yn , the concept is unambiguous, and Spearman’s coefficient
can be computed as
                                                               n
                                                        6     X                     2
                                   ρS = 1 −                      (rk(xi ) − rk(yi )) ,                    (19)
                                                 n(n2 − 1) i=1
where rk(xi ) denotes the rank of xi within x1 , . . . , xn , and rk(yi ) the rank of yi within y1 , . . . , yn ,
   In this setting CPA relates linearly to Spearman’s rank correlation coefficient ρS , in the very
same way that AUC relates to Somers’ D in (5).
Theorem 3. In the case of no ties,
                                                      1
                                             CPA =      (ρS + 1) .                                          (20)
                                                      2
    Indeed, in case there are no ties, both mid ranks and classes reduce to ranks proper, and then
(20) is readily identified as a special case of (17). For an alternative proof, in the absence of ties
the weights wc in (11) are of the form (12). The stated result then follows upon combining the
defining equation (10), the equality stated at the bottom of the left column of page 4 in Rosset et
al. (2005), and equation (5) in the same reference.
    Note that CPA becomes symmetric in this case, as its value remains unchanged if we transpose
the roles of the feature and the outcome. Furthermore, if the joint distribution of a bivariate
random vector (X, Y ) is continuous, and we think of the data in (8) as a sample from the respective
population, then, by applying Definition 3 and Theorem 3 in the large sample limit, and taking
(12) into account, we (informally) obtain a population version of CPA, namely,
                                      Z 1
                                                                1
                            CPA = 6       α(1 − α) AUCα dα = (ρS + 1) ,                          (21)
                                       0                        2
where AUCα is the population version of AUC for (X, 1{Y ≥ qα }), with qα denoting the α-quantile
of the marginal law of Y . We defer a rigorous derivation of (21) to future work and stress that, as
both X and Y are continuous here, their roles can be interchanged.
    Under the assumption of multivariate normality, the population version of Spearman’s ρS relates
to Pearson’s correlation coefficient r as
                                                      6       r
                                               ρS =     arcsin ;                                            (22)
                                                      π       2
see, e.g., Kruskal (1958). Returning to the example in Section 2.3, where (Y, X, X 0 , X 00 ) is jointly
Gaussian with covariance matrix (6), Table 1 states, for each feature, the population values of
Pearson’s correlation coefficient r, CPA, and the C index relative to the real-valued outcome Y ,
as derived from (21) and (22) and the respective relationships for the C index and Kendall’s rank
correlation coefficient τ K , namely
                                               1
                                          C = (τ K + 1)                                             (23)
                                               2
and
                                                2
                                         τ K = arcsin r.                                            (24)
                                                π
These results imply that for a bivariate Gaussian population with Pearson correlation coefficient
r ∈ (0, 1) it is true that τ K > ρS > 0 and CPA > C > 1/2. In fact, under positive dependence

                                                       14
Table 1: Population values of Pearson’s correlation coefficient r, CPA, and the C index for the
features X, X 0 , and X 00 relative to the real-valued outcome Y , where (Y, X, X 0 , X 00 ) is Gaussian
with covariance matrix (6).

                                    Feature        r         CPA        C
                                       X          0.800      0.893    0.795
                                      X0          0.500      0.741    0.667
                                          00
                                      X           0.200      0.596    0.564

it always holds that τ K ≥ ρS ≥ 0, as demonstrated by Capéraà and Genest (1993), whence
CPA ≥ C ≥ 1/2. However, there are also settings where these inequalities get violated (Schreyer
et al, 2017). In Fig. 4 the CPA values for the features appear along with the UROC curves in
the final static screen, subsequent to the ROC movie. The empirical values show the expected
approximate agreement with the population quantities in the table.
     Suppose now that the values y1 , . . . , yn of the outcomes are unique, whereas the feature values
x1 , . . . , xn might involves ties. Let p ≥ 0 denote the number of tied groups within x1 , . . . , xn . If
p = 0 let V = 0. If p ≥ 1, let vj be the number of equal values in the jth group, for j = 1, . . . , p,
and let
                                                     p
                                                  1 X 3         
                                           V =           vj − vj .
                                                 12 j=1

Then Spearman’s mid rank adjusted coefficient ρM is defined as
                                                   n
                                                                                          !
                                      6            X                            2
                         ρM   =1−                           rk(xi ) − rk(yi )        +V       ,       (25)
                                  n(n2 − 1)         i=1

where rk is the aforementioned mid rank. As shown by Woodbury (1940), if one assigns all
possible combinations of integer ranks within tied sets, computes Spearman’s ρS in (19) on every
such combination and averages over the respective values, one obtains the formula for ρM in (25).
   The following result reduces to the statement of Theorem 3 in the case p = 0 when there are
no ties in x1 , . . . , xn either.
Theorem 4. In case there are no ties within y1 , . . . , yn ,
                                                       1
                                               CPA =     (ρM + 1) .                                   (26)
                                                       2
Proof. As noted, ρM arises from ρS if one assigns all possible combinations of integer ranks within
tied sets, computes ρS on every such combination and averages over the respective values. In
view of (18), if there are no ties in y1 , . . . , yn , averaging 21 (ρS + 1) over the combinations yields
1
2 (ρM + 1), which equals CPA by (17).

    The relationships (5), (20) and (26) constitute but special cases of the general, covariance based
representation (17). In this light, CPA provides a unified way of quantifying potential predictive
ability for the full gamut of dichotomous, categorical, mixed discrete-continuous and continuous
types of outcomes. In particular, CPA bridges and generalizes AUC, Somers’ D and Spearman’s
rank correlation coefficient, up to a common linear relationship.

                                                       15
4.4     Comparison of CPA to the C index and related measures
We proceed to a more detailed comparison of the CPA measure (13) to the C index (16) and
measures studied by Waegeman et al. (2008).1 As noted, both CPA and the C index are rank-
based, reduce to AUC when the outcome is binary, and become symmetric when both the features
and the outcomes are pairwise distinct. We relax these conditions slightly and restrict attention
to measures that use ranks only, reduce to AUC when the outcome is binary and there are no ties
in the feature values, and become symmetric when there are no ties at all. This excludes measures
based on the receiver error characteristic (REC, Bi and Bennett, 2003) and the regression receiver
operating characteristic (RROC, Hernández-Orallo, 2013) curve, which are neither rank based nor
reduce to AUC. The Ucons measure of Waegeman et al. (2008) averages consecutive AUC values
in the same fashion as CPA in (13), but uses constant weights, as opposed to the class dependent
weights (11) for CPA, and does not become symmetric when there are no ties at all.2 The Upairs
and Uovo measures of Waegeman et al. (2008) satisfy our criteria, relate closely to the C index,
and in the simulation setting of Fig. 5 it holds that Uovo = Upairs = C.3
    In view of the above requirements and properties, we restrict the subsequent comparison to
CPA, the C index, and the U measure introduced by Waegeman et al. (2008). For a dataset with m
classes U equals the proportion of sequences of m instances, one of each class, that align correctly
with the feature values. As noted, these measures are rank based and reduce to AUC when the
outcome is binary and there are no ties in the feature values. In the continuous case with no ties
in the feature values nor in the outcomes, they become symmetric, U attains the value 1 under a
perfect ranking and the value 0 otherwise, C = 21 (1 − τ K ), and CPA = 21 (1 − ρS ).
    In Fig. 5 we report on a simulation experiment where we draw samples of 220 instances from the
joint Gaussian distribution of the random vector (Y, X, X 0 , X 00 ) with covariance matrix (6), so that
the features have Pearson correlation coefficient r = 0.8, 0.5, and 0.2 with the continuous outcome
Y . By discretizing the outcome into 2k consecutive blocks of size 220−k each, where k = 1, . . . , 20,
and computing CPA, the C index and the U measure as a function of k, all discretization levels
are considered, ranging from a binary variable for k = 1 to continuous outcomes for k = 20. When
k = 1 the three measures coincide and equal AUC, essentially at the population value of
                                                          2         r  1
                                            AUC1/2 =        arcsin √ + ,                                              (27)
                                                          π          2 2
in the sense stated subsequent to (21). The U measure is tailored to ordinal outcomes with a few
classes only and degenerates rapidly with k. When k = 20, CPA and the C index are rescaled
versions of Spearman’s ρS and Kendall’s τ K , essentially at the population values in Table 1.
    Throughout, the measures lie in between their common value for k = 1, which equals AUC, and
the respective values for k = 20. For all features and all k > 1, the C index is smaller than CPA,
   1 We denote the measures U    b, U
                                    bpairs , U
                                             bovo , and U
                                                        bcons in equations (8), (16), (17), and (18) of Waegeman et al.
(2008) by U , Upairs , Uovo , and Ucons , respectively.
   2 To see that U
                   cons does not become symmetric when there are no ties in x1 , . . . , xn nor y1 , . . . , yn , consider a
dataset of size n ≥ 4, where y1 < · · · < yn and x3 < x1 < x2 < x4 < · · · < xn . Then AUC1 = (n − 3)/(n − 1),
AUC2 = (2n − 5)/(2n − 4), and AUCc = 1 for c = 3, . . . , n − 1, whereas if we interchange the roles of the feature
and the outcome, then AUC1 = (n − 2)/(n − 1), AUC2 = (2n − 6)/(2n − 4), and AUCc = 1 for c = 3, . . . , n − 1,
resulting in distinct unweighted sums.
   3 The U
           pairs measure corresponds to a performance criterion proposed by Herbrich (2000, equation (7.11)) and
equals the proportion of correctly ranked pairs of instances. Except for the treatment of ties in the feature, Upairs
equals the C index. In particular, if the feature values are pairwise       distinct then Upairs = C. The measure Uovo
represents the Hand and Till (2001) approach of averaging the m
                                                                      
                                                                    2
                                                                        one-versus-one AUC values in an m-class problem.
It has been compared to Upairs by Waegeman et al. (2008) and relates to the C index as well. In particular, if the
feature values are pairwise distinct and the dataset furthermore is balanced with class memberships n1 = · · · = nm ,
as in the simulation setting that we report on in Fig. 5, then Uovo = Upairs = C.

                                                            16
a                              b               0.90
           1.00

                                                      0.85

           0.75
                                                      0.80

                                                                                                                   CPA

                                      C index / CPA
                                                      0.75                                                         C index
       U

           0.50
                                                                                                                   X'', r = 0.8
                                                      0.70                                                         X'', r = 0.5
                                                                                                                   X'', r = 0.2

                                                      0.65
           0.25

                                                      0.60

           0.00
                                                      0.55
                  1   2   3   4   5                          1   3   5        7   9       11   13   15   17   19
                          k                                                           k

Figure 5: Rank based performance measures for the features X, X 0 and X 00 as predictors of the
real-valued outcome Y in the simulation example of Section 2.3, with Pearson correlation coefficient
r = 0.8, 0.5 and 0.2, respectively, based on a sample of size n = 220 . We discretize the continuous
outcome into 2k consecutive blocks of size 220−k each, and plot (a) U , and (b) CPA and the C
index as functions of the discretization level k = 1, . . . , 20. Note that k = 1 yields a binary outcome
and k = 20 a continuous outcome.

and CPA varies considerably less with the discretization level than the C index. To supplement
these experiments with an analytic demonstration, suppose that X and Y are bivariate Gaussian
with nonnegative Pearson correlation r. If we convert Y to a balanced binary outcome, then both
CPA and the C index reduce to a common value, namely, AUC1/2 in (27). As a function of r, the
ratio of the C index for the continuous vs. the balanced binary outcome attains values between
0.8996 and 1, whereas for CPA the respective ratio remains between 1 and 1.0156, as illustrated in
Fig. 6. These findings along with results in Capéraà and Genest (1993) and Schreyer et al (2017)
suggest that, quite generally, CPA and the C index yield qualitatively similar results in practice,
with CPA being less sensitive to quantization effects, and the value of CPA typically being larger
than for the C index.

4.5    Computational issues
We turn to a discussion of the computational costs of generalized ROC analysis for a dataset of
the form (8) or (14) with n instances and m ≤ n classes.
    It is well known that a traditional ROC curve can be generated from a dataset with n instances
in O(n log n) operations (Fawcett, 2006, Algorithm 1). A ROC movie comprises m − 1 traditional
ROC curves, so in a naı̈ve approach, ROC movies can be computed in O(mn log n) operations.
However, our implementation takes advantage of recursive relations between consecutive compo-
nent curves ROCi−1 and ROCi . While a formal analysis will need to be left to future work, we
believe that our algorithm has computational costs of O(n log n) operations only. If the number
m of unique values of the outcome is large, then for all practical purposes the ROC movie can be
shown at a modest number m0 of distinct values only, at a computational cost of O(m0 n log n)
operations. For example, in the setting of Fig. 8 in the meteorological case study in Section 5.2
there are m = 35, 993 unique values of the outcome, whereas the ROC movie uses m0 = 401 frames

                                                                         17
1.01556

                   0.99000
           Ratio

                   0.96000                                                            CPA
                                                                                      C index

                   0.93000

                   0.89960

                             0.00   0.25        0.50          0.75          1.00
                                                 r

Figure 6: Ratio of CPA (blue curve) respectively the C index (green curve) for the feature X
as a predictor of the continuous outcome Y over AUC for X and the balanced binary outcome
1{Y ≥ 0}, where X and Y are bivariate Gaussian with Pearson correlation r ∈ [0, 1]. The solid
horizontal line is at a ratio of 1, which is attained when r = 0 and r = 1.

only. For the vertical averaging of the component curves in the construction of UROC curves, we
partition the unit interval into 1,000 equally sized subintervals.
   Importantly, CPA can be computed in O(n log n) operations, without any need to invoke ROC
analysis, by sorting x1 , . . . , xn and y1 , . . . , yn , computing the respective mid ranks and classes,
and plugging into the rank based representation (18). Similarly, there are algorithms for the
computation of the C index in O(n log n) operations (Knight, 1966; Christensen, 2005).

4.6    Key properties: Comparison to traditional ROC analysis
We are now in a position to judge whether the proposed toolbox of ROC movies, UROC curves, and
CPA constitutes a proper generalization of traditional ROC analysis. To facilitate the assessment,
the subsequent statements admit immediate comparison with the key insights of classical ROC
analysis, as summarized in Section 2.4.
   We start with the trivial but important observation that the new tools nest the notions of
traditional ROC analysis. This is not to be taken for granted, as extant generalizations do not
necessarily share this property.
 (0) In the case of a binary outcome, both the ROC movie and the UROC curve reduce to the
     ROC curve, and CPA reduces to AUC.
 (1) ROC movies, the UROC curve and CPA are straightforward to compute and interpret, in
     the (rough) sense of the larger the better.
 (2) CPA attains values between 0 and 1 and relates linearly to the covariance between the class
     of the outcome and the mid rank of the feature, relative to the covariance between the
     class and the mid rank of the outcome. In particular, if the outcomes are pairwise distinct,
     then CPA = 21 (ρM + 1), where ρM is Spearman’s mid rank adjusted coefficient (25). If the
     outcomes are binary, then CPA = 12 (D + 1) in terms of Somers’ D. For a perfect feature,
     CPA = 1, ρM = 1 under pairwise distinct and D = 1 under binary outcomes. For a feature

                                                       18
that is independent of the outcome, CPA = 21 , ρM = 0 under pairwise distinct and D = 0
      under binary outcomes.
 (3) The numerical value of CPA admits an interpretation as a weighted probability of concordance
     for feature–outcome pairs, with weights that grow linearly in the class based distance between
     outcomes.

 (4) ROC movies, UROC curves, and CPA are purely rank based and, therefore, invariant under
     strictly increasing transformations. Specifically, if φ : R → R and ψ : R → R are strictly
     increasing, then the ROC movie, UROC curve, and CPA computed from

                               (φ(x1 ), ψ(y1 )), . . . , (φ(xn ), ψ(yn )) ∈ R × R                (28)

      are the same as the ROC movie, UROC curve, and CPA computed from the data in (8).
We iterate and emphasize that, as an immediate consequence of the final property, ROC movies,
UROC curves, and CPA assess the discrimination ability or potential predictive ability of a point
forecast, regression output, feature, marker, or test. Markedly different techniques are called for if
one seeks to assess a forecast’s actual value in any given applied problem (Ben Bouallègue et al.,
2015; Ehm et al., 2016).

5     Real data examples
In the following examples from survival analysis and numerical weather prediction the usage of
ROC movies, UROC curves, and CPA is demonstrated. We start by returning to the survival
example from Section 1, where the new set of tools frees researchers form the need to artificially
binarize the outcome. Then the use of CPA is highlighted in a study of recent progress in numerical
weather prediction (NWP), and in a comparison of the predictive performance of NWP models
and convolutional neural networks.

5.1    Survival data from Mayo Clinic trial
In the introduction, Figs. 1 and 2 serve to illustrate and contrast traditional ROC curves, ROC
movies and UROC curves. They are based on a classical dataset from a Mayo Clinic trial on
primary biliary cirrhosis (PBC), a chronic fatal disease of the liver, that was conducted between
1974 and 1984 (Dickson et al., 1989). The data are provided by various R packages, such as
SMPracticals and survival, and have been analyzed in textbooks (Fleming and Harrington,
1991; Davison, 2003). The outcome of interest is survival time past entry into the study. Patients
were randomly assigned to either a placebo or treatment with the drug D-penicillamine. However,
extant analyses do not show treatment effects (Dickson et al., 1989), and so we follow previous
practice and study treatment and placebo groups jointly.
    We consider two biochemical markers, namely, serum albumin and serum bilirubin concentra-
tion in mg/dl, for which higher and lower levels, respectively, are known to be indicative of earlier
disease stages, thus supporting survival. Hence, for the purposes of ROC analysis we reverse the
orientation of the serum bilirubin values. Given our goal of illustration, we avoid complications
and remove patient records with censored survival times, to obtain a dataset with n = 161 patient
records and m = 156 unique survival times. The proper handling of censoring is beyond the scope
of our study, and we leave this task to subsequent work. For a discussion and comparison of extant
approaches to handling censored data in the context of time-dependent ROC curves see Blanche
et al. (2013).

                                                   19
a              2m Temperature                          b                Wind speed                                           c              Precipitation

                                                                    0.95

           0.99                                                                                                                          0.9

                                                                    0.90
  CPA

                                                          CPA

                                                                                                                               CPA
           0.98                                                                                                                          0.8
                                                                    0.85

           0.97                                                     0.80                                                                 0.7

                     2008     2011          2014   2017                       2008         2011          2014          2017                        2008        2011          2014   2017
                                     Year                                                         Year                                                                Year

   d              2m Temperature                          e                Wind speed                                           f              Precipitation

                                                                    0.90                                                                 0.9

          0.950
                                                                    0.85

                                                                                                                                         0.8
C index

                                                          C index

                                                                                                                               C index
                                                                    0.80
          0.925

                                                                    0.75
                                                                                                                                         0.7
          0.900

                                                                    0.70

                     2008     2011          2014   2017                       2008         2011           2014         2017                        2008        2011          2014   2017
                                     Year                                                         Year                                                                Year

                                                                    HRES             D+1      D+2        D+3     D+4     D+5

                                                                    Persistence      D+1      D+2        D+3     D+4     D+5

Figure 7: Temporal evolution of CPA and the C index for forecasts from the ECMWF high-
resolution model at lead times of one to five days in comparison to the simplistic persistence
forecast in terms of CPA (a, b, c) and the C index (d, e, f). The weather variables considered are
(a, d) surface (2-meter) temperature, (b, e) surface wind speed and (c, f) 24-hour precipitation
accumulation. The measures refer to a domain that covers Europe and twelve-month periods that
correspond to January–December (solid and dotted lines), April–March, July–June and October–
September (dotted lines only), based on gridded forecast and observational data from January
2007 through December 2018.

    The traditional ROC curves in Fig. 1 are obtained by binarizing survival time at a threshold
of 1462 days, which is the survival time in the data record that gets closest to four years. The
ROC movies and UROC curves in Fig. 2 are generated directly from the survival times, without
any need to artificially pick a threshold. The CPA values for serum albumin and serum bilirubin
are 0.73 and 0.77, respectively, and contrary to the ranking in Fig. 1, where bilirubin was deemed
superior, based on outcomes that were artificially made binary. Our tools free researchers from
the need to binarize, and still they allow for an assessment at the binary level, if desired. For
example, the ROC curves and AUC values from Fig. 1 appear in the ROC movie at a threshold
value of 1462 days. In line with current uses of AUC in a gamut of applied settings, CPA is
particularly well suited to the purposes of feature screening and variable selection in statistical and
machine learning models (Guyon and Elisseeff, 2003). Here, AUC and CPA demonstrate that both
albumin and bilirubin contribute to prognostic models for survival (Dickson et al., 1989; Fleming
and Harrington, 1991).

                                                                                             20
Start ROC Movie

Figure 8: ROC movies, UROC curves, and CPA for ECMWF high-resolution (HRES) and per-
sistence forecasts of 24-hour precipitation accumulation over Europe at a lead time of five days in
calendar year 2018. In the ROC movies, the number at upper left shows the threshold at hand in
the unit of millimeter, the number at upper center the relative weight wc / maxl=1,...,m−1 wl from
(11), and the numbers at bottom right the respective AUC values.

5.2    Monitoring progress in numerical weather prediction (NWP)
Here we illustrate the usage of CPA in the assessment of recent progress in numerical weather pre-
diction (NWP), which has experienced tremendous advance over the past few decades (Bauer et al.,
2015; Alley et al., 2019; Ben Bouallègue et al., 2019). Specifically, we consider forecasts of surface
(2-meter) temperature, surface (10-meter) wind speed and 24-hour precipitation accumulation ini-
tialized at 00:00 UTC at lead times from a single day (24 hours) to five days (120 hours) ahead from
the high-resolution model operated by the European Centre for Medium-Range Weather Forecasts
(ECMWF Directorate, 2012), which is generally considered the leading global NWP model. The
forecast data are available at https://confluence.ecmwf.int/display/TIGGE. As observational
reference we take the ERA5 reanalysis product (Hersbach et al., 2018). We use forecasts and
observations from 279 × 199 = 55, 521 model grid boxes of size 0.25◦ × 0.25◦ each in a geographic
region that covers Europe from 25.0◦ W to 44.5◦ E in latitude and 25.0◦ N to 74.5◦ N in longitude.
The time period considered ranges from January 2007 to December 2018.
    In Fig. 7 we apply CPA and the C index to compare forecasts from the ECMWF high-resolution
run to a reference technique, namely, the persistence forecast. The persistence forecast is simply
the most recent available observation for the weather quantity of interest; as such, the forecast value
does not depend on the lead time. CPA and the C index are computed on rolling twelve-month
periods that correspond to January–December, April–March, July–June or October–September,
typically comprising n = 365 × 55, 521 = 20, 265, 165 individual forecast cases. The ECMWF
forecast has considerably higher CPA and C index than the persistence forecast for all lead times
and variables considered. For the persistence forecast the measures fluctuate around a constant
level; for the ECMWF forecast they improve steadily, attesting to continuing progress in NWP
(Bauer et al., 2015; Alley et al., 2019; Ben Bouallègue et al., 2019; Haiden et al., 2021).

                                                  21
To place these findings further into context, recall that CPA is a weighted average of AUC
values for binarized outcomes at individual threshold values, as have been used for performance
monitoring by weather centers (Ben Bouallègue et al., 2019; Haiden et al., 2021). The CPA measure
preserves the spirit and power of classical ROC analysis, and frees researchers from the need to
binarize real-valued outcomes. Results in terms of the C index are qualitatively similar, with the
numerical value of CPA being higher than for the C index.
    The ROC movies, UROC curves, and CPA values in Fig. 8 compare the ECMWF high-
resolution forecast to the persistence forecast for 24-hour precipitation accumulation at a lead
time of five days in calendar year 2018. As noted, this record comprises more than 20 million
individual forecast cases, and there are m = 35, 993 unique values of the outcome. We certainly
lack the patience to watch the full sequence of m − 1 screens in the ROC movie. A pragmatic
solution is to consider a subset C ⊆ {1, . . . , m − 1} of indices, so that ROCc is included in the ROC
movie (if and) only if c ∈ C. Specifically, we set positive integer parameters a ≤ m − 1 and b such
that the ROC movie comprises at least a and at most a + b curves. Let the integer s be defined
such that 1 + (a − 1)s ≤ m − 1 < 1 + as, and let Ca = {1, 1 + s, . . . , 1 + (a − 1)s}, so that |Ca | = a.
Let Cb = {c : nc ≥ n/b}; evidently, |Cb | ≤ b. Finally, let C = Ca ∪ Cb so that a ≤ |C| ≤ a + b. We
have made good experiences with choices of a = 400 and b = 100, which in Fig. 8 yield a ROC
movie with 401 screens.

5.3    WeatherBench: Convolutional neural networks (CNNs) vs. NWP
       models
As noted, operational weather forecasts are based on the output of global NWP models that
represent the physics of the atmosphere. However, the grid resolution of NWP models remains
limited due to finite computing resources (Bauer et al., 2015). Spurred by the ever increasing
popularity and successes of machine learning models, alternative, data-driven approaches are in
vigorous development, with convolutional neural networks (CNNs; LeCun et al., 2015) being a
particularly attractive starting point, due to their ease of adaptation to spatio-temporal data.
Rasp et al. (2020) introduce WeatherBench, a ready-to use benchmark dataset for the comparison
of data-driven approaches, such as CNNs and a classical linear regression (LR) based technique, to
NWP models, such as the aforementioned HRES model and simplified versions thereof, T63 and
T42, which run at successively coarser resolutions. Furthermore, WeatherBench supplies baseline
methods, including both the persistence forecast and climatological forecasts.
    As evaluation measure for the various types of point forecasts, WeatherBench uses the root mean
squared error (RMSE). In related studies, the RMSE is accompanied by the anomaly correlation
coefficient (ACC), i.e., the normalized product moment between the difference of the forecast at
hand and the climatological forecast, and the difference between the outcome and the climatological
forecast (Weyn et al., 2020). However, as noted by Rasp et al. (2020), results in terms of RMSE
and ACC tend to be very similar. Here we argue that a rank based measure, such as CPA or the
C index, would be a more suitable companion measure to RMSE than ACC.
    Figure 9 compares WeatherBench forecasts three days ahead for temperature at 850 hPa pres-
sure, which is at around 1.5 km height, in terms of RMSE (in Kelvin), CPA, and the C index. With
reference to Table 2 of Rasp et al. (2020), we consider the persistence forecast, the (direct) linear
regression (LR) forecast, the (direct) CNN forecast, the Operational IFS (HRES) forecast, and
successively coarser versions thereof (T63 and T42). The panels display the performance measures
as functions of latitude bands, from the South Pole at 90◦ S to the equator at 0◦ and the North Pole
at 90◦ N, for the WeatherBench final evaluation period of the years 2017 and 2018. The measures
are initially computed grid cell by grid cell, and then averaged across the grid cells in a latitude
band, which is compatible with the latitude based weighting that is employed in WeatherBench.

                                                   22
You can also read