Advanced Course in Statistics:
        an overview

            Antonio Cuevas
    Departamento de Matemáticas
   Universidad Autónoma de Madrid
             January, 2020
Prerequisites
     -   I will assume that attendees have taken at least an
         introductory course in mathematical statistics and a basic course in
         probability. In any case, I will do my best to make the course as
         self-contained as possible. If you need to review some basic
         notions of mathematical statistics, please have a look at the slides
         of my undergraduate courses Statistics I and Statistics II. Many
         other resources are freely available on the internet.
     -   Some familiarity with basic notions of measure theory,
         functional analysis (Banach and Hilbert spaces, operator theory, $L^p$
         spaces, ...) and stochastic processes is desirable.
     -   We will use the statistical software R for illustrations and
         practical examples. Some of the proposed exercises will require
         R. While some familiarity with this software is
         highly recommended, it is not strictly necessary in order to follow
         this course. Please see the course web page for some additional
         information on R. A very basic introduction to R
         can also be found in the slides of Statistics I.
The data

  In general terms, the aim of statistics is to obtain information from
  a data set (or sample)
                               $x_1, \ldots, x_n$.
  These data come from the repeated observation of a phenomenon
  of interest.
  The sample space is defined as the set of all possible values of the
  observed quantity $x$:
                       $\mathcal{X}$ = sample space
  In classical statistics, $\mathcal{X} = \mathbb{R}$.
  In so-called multivariate analysis, $\mathcal{X} = \mathbb{R}^d$.
Descriptive statistics/Statistical Inference
     -   Descriptive statistics (Exploratory Data Analysis): the
         aim is to summarize (e.g., via the mean, median and mode) and
         visualize a data set.
     -   Statistical inference: the data $X_1, \ldots, X_n$ are independent,
         identically distributed observations drawn from a random
         variable $X$,
                  $X : (\Omega, \mathcal{A}, \mathbb{P}) \to (\mathcal{X}, \mathcal{B})$.
         We will sometimes say that $X$ represents the underlying
         population. The distribution of $X$ (defined by
         $P(B) = \mathbb{P}(X \in B)$ for $B \in \mathcal{B}$) is often assumed to depend on
         an unknown parameter $\theta$ taking values in a known parameter
         space $\Theta$. We will sometimes write $P = P_\theta$.
                  $\Theta$ = parameter space
         The general purpose is to use the random sample $X_1, \ldots, X_n$
         to make inferences (hypothesis testing, point estimation, confidence
         intervals, ...) about the unknown "true" value of $\theta \in \Theta$.
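
   The two tasks above can be illustrated with a minimal sketch in plain Python (standard library only; the course itself uses R). It assumes the parameter of interest is $\theta = \mathbb{E}(X)$ and uses a large-sample normal approximation for the confidence interval. The sample values and the function names `summarize` and `mean_confidence_interval` are made up for illustration.

```python
import math
import statistics


def summarize(sample):
    """Descriptive summaries mentioned above: mean, median and mode."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "mode": statistics.mode(sample),
    }


def mean_confidence_interval(sample, level=0.95):
    """Approximate (normal-based) confidence interval for theta = E(X)."""
    n = len(sample)
    centre = statistics.mean(sample)
    # Estimated standard error of the sample mean.
    std_error = statistics.stdev(sample) / math.sqrt(n)
    # Standard normal quantile, e.g. about 1.96 for level = 0.95.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return centre - z * std_error, centre + z * std_error


data = [2.1, 3.4, 2.9, 3.4, 4.0, 2.5, 3.1, 3.4]  # made-up illustrative sample
print(summarize(data))
print(mean_confidence_interval(data))
```

   With a sample this small, a Student-t quantile would normally replace the normal one; the normal quantile is used here only to keep the sketch short.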
The evolution of statistical theory

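    Statistical theory              Sample space $\mathcal{X}$      Parameter space $\Theta$                     Time
    Classical inference             $\mathbb{R}$                    $\Theta \subset \mathbb{R}$                  1920s
    Multivariate analysis           $\mathbb{R}^d$ ($n \gg d$)      $\Theta \subset \mathbb{R}^k$ ($n \gg k$)    1940s
    Nonparametrics                  $\mathbb{R}^d$ ($n \gg d$)      A function space                             1960s
    High-dimensional problems       $\mathbb{R}^d$ ($n < d$)        $\Theta \subset \mathbb{R}^k$                2000s
    Functional Data Analysis        A function space                $\mathbb{R}^k$ or a function space           1990s
    Object Oriented Data Analysis   A space of images               $\mathbb{R}^k$ or a space of images          2000s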
General structure of the course

   Two parts:
     -   Statistics with functional data: the sample data are real
         functions $x_i = x_i(t)$ defined on a compact interval.
     -   Nonparametric functional estimation: the data are real
         numbers (or vectors in $\mathbb{R}^d$) but the target of the estimation is a
         function, for example a density or a regression function.
Statistics with functional data

   This area is sometimes called Functional Data Analysis (FDA).
   The data
                  $x_1 = x_1(t), \ldots, x_n = x_n(t), \quad t \in [0, 1]$,
   are functions defined on some compact interval (say $[0, 1]$). The
   argument $t$ often (but not necessarily) corresponds to the time
   at which the quantity $x(t)$ is measured.
   The functional data can be considered as random observations
   drawn from a stochastic process. The distribution of a stochastic
   process is a probability measure on the space of trajectories. So,
   we will need to use some probability theory on function spaces.
   In informal terms,

        $$\frac{\text{Random variables}}{\text{Classical statistics}} = \frac{\text{Stochastic processes}}{\text{FDA}}$$
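
   To make the idea of "observations drawn from a stochastic process" concrete, here is a small sketch in plain Python that simulates discretized trajectories on a grid of $[0, 1]$. The choice of process (a Gaussian random walk approximating Brownian motion) and the name `simulate_brownian_paths` are assumptions made only for illustration; real functional data are of course observed, not simulated.

```python
import math
import random


def simulate_brownian_paths(n_curves, n_steps, seed=0):
    """Return a grid on [0, 1] and n_curves discretized trajectories."""
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    grid = [j * dt for j in range(n_steps + 1)]
    paths = []
    for _ in range(n_curves):
        path = [0.0]  # every trajectory starts at x(0) = 0
        for _ in range(n_steps):
            # Independent Gaussian increments with variance dt.
            path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
        paths.append(path)
    return grid, paths


grid, paths = simulate_brownian_paths(n_curves=5, n_steps=100)
print(len(paths), "curves observed at", len(grid), "grid points")
```

   Each element of `paths` plays the role of one functional datum $x_i(t)$, stored through its values on the common grid.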
Functional data: an example in cardiology
   Figure: ECG data. 2026 electrocardiograms: 1506 correspond to the control
   group (in blue) and 520 correspond to ischemia patients (in red).

   A possible application here would be as follows: given the ECG
   curve of a newly arrived patient (not yet diagnosed with respect to
   ischemia), could we obtain, in view of that ECG curve, a
   quick, preliminary diagnosis for the patient?
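
   One very simple way to approach this question is a nearest-centroid rule: assign the new curve to the group whose mean curve is closest in (discretized) $L^2$ distance. The sketch below is only an illustration under that assumption, not the method used in the course. The toy curves are made up and are not the ECG data, and all function names are hypothetical.

```python
import math


def l2_distance(x, y, dt):
    """Riemann-sum approximation of the L2 distance between two curves."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) * dt)


def mean_curve(curves):
    """Pointwise average of curves observed on a common grid."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


def classify(new_curve, control_curves, patient_curves, dt):
    """Assign new_curve to the group whose mean curve is closest in L2."""
    d_control = l2_distance(new_curve, mean_curve(control_curves), dt)
    d_patient = l2_distance(new_curve, mean_curve(patient_curves), dt)
    return "control" if d_control <= d_patient else "patient"


# Toy curves on a 5-point grid (made up; not the ECG data).
control = [[0.0, 1.0, 0.0, -1.0, 0.0], [0.1, 1.1, 0.1, -0.9, 0.1]]
patients = [[0.0, 2.0, 1.0, 0.0, 0.0], [0.0, 2.2, 1.2, 0.1, 0.0]]
print(classify([0.0, 1.9, 0.9, 0.0, 0.0], control, patients, dt=0.25))
```

   The result depends on the chosen distance; a different norm could lead to a different classification.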
Functional data: an example in climate studies (I)
Functional data: an example in climate studies (II)
   In the figure above, the blue line is the average of 38 curves; each
   curve is obtained (via linear interpolation) from the daily maximum
   temperatures (365 values per year) recorded at Barcelona Airport (El
   Prat) during the period 1944-1981. The red line is the analogous average
   obtained from the 38 curves corresponding to the period 1982-2019. The
   February 29 data (in leap years) have been omitted. Missing
   values have been imputed by linear interpolation.

   Some interesting questions:
     -   Assume that the temperatures in the first (resp. second)
         period are a sample of a process $X(t)$ (resp. $Y(t)$), and
         denote the respective mean functions by $m_1(t) = \mathbb{E}(X(t))$ and
         $m_2(t) = \mathbb{E}(Y(t))$. Is there enough statistical evidence (in
         view of the previous data) to conclude that $m_1 \neq m_2$? In other
         words, we would like to test the null hypothesis $H_0 : m_1 = m_2$
         versus the alternative $H_1 : m_1 \neq m_2$.
     -   Is there some useful information in the derivatives of the
         curves?
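
   As a rough illustration of the first question, the sketch below runs a Monte Carlo permutation test based on the discretized $L^2$ distance between the two sample mean curves. This is one possible approach, chosen here for simplicity, and not necessarily the test discussed in the course. It assumes that the curves are independent and observed on a common grid, and, strictly speaking, it tests exchangeability of the two samples with a statistic that is sensitive to differences in the means. The data are synthetic and the function names are hypothetical.

```python
import math
import random


def mean_curve(curves):
    """Pointwise average of curves observed on a common grid."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


def l2_gap(curves_a, curves_b, dt):
    """Discretized L2 distance between the two sample mean curves."""
    ma, mb = mean_curve(curves_a), mean_curve(curves_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ma, mb)) * dt)


def permutation_p_value(curves_a, curves_b, dt, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for H0: m1 = m2."""
    rng = random.Random(seed)
    observed = l2_gap(curves_a, curves_b, dt)
    pooled = list(curves_a) + list(curves_b)
    n_a = len(curves_a)
    exceed = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled curves to the two groups.
        rng.shuffle(pooled)
        if l2_gap(pooled[:n_a], pooled[n_a:], dt) >= observed:
            exceed += 1
    # The +1 terms count the observed split as one of the permutations.
    return (exceed + 1) / (n_perm + 1)


# Synthetic curves on a 12-point grid: the second group is shifted by 0.5.
rng = random.Random(42)
first = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(10)]
second = [[0.5 + rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(10)]
print(permutation_p_value(first, second, dt=1 / 12, n_perm=199))
```

   A small p-value would suggest that the observed gap between the mean curves is unlikely under exchangeability; for yearly temperature curves the independence assumption itself would need to be checked.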
Functional data: an example in climate studies (III)
   The graph below (courtesy of J.E. Chacón) corresponds to temperatures
   recorded at Pittsburgh.
The troubles with FDA (I)
   To some extent, the progress of statistics has consisted in
   conquering more sophisticated sample and parameter spaces: from
   subsets of $\mathbb{R}$ or $\mathbb{R}^d$ to function or shape spaces.
   This increase in generality entails some problems:
     -   Lack of a natural order in the sample space: no distribution
         function is available to characterize the distributions.
     -   Multiplicity of choices for the distance $d(x_1, x_2)$ between two
         elements (or for the norm $\|x_1 - x_2\|$ in the case of normed
         spaces).
     -   How should we define the "population mean" $\mu$ so that it properly
         captures the notion of "average" and satisfies

                  $$\mathbb{E}\|X - \mu\|^2 = \min_{a} \mathbb{E}\|X - a\|^2 \,?$$

     -   How should we define basic notions such as median, mode,
         quantiles and outliers so that they properly generalize the
         analogous notions for numerical data?
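
   For the question about the population mean, the Hilbert-space case has a short answer, sketched below. The assumptions are not stated in the slides and are added here: $X$ takes values in a separable Hilbert space with inner product $\langle \cdot, \cdot \rangle$, $\mathbb{E}\|X\|^2 < \infty$, and $\mu = \mathbb{E}(X)$ is understood as a Bochner integral. For any fixed $a$:

```latex
% Assumptions (added for illustration): separable Hilbert space, E||X||^2 finite,
% mu = E(X) defined as a Bochner integral.
\begin{align*}
\mathbb{E}\|X - a\|^2
  &= \mathbb{E}\|X - \mu\|^2
     + 2\,\mathbb{E}\langle X - \mu,\ \mu - a \rangle
     + \|\mu - a\|^2 \\
  &= \mathbb{E}\|X - \mu\|^2 + \|\mu - a\|^2
  \;\ge\; \mathbb{E}\|X - \mu\|^2 .
\end{align*}
```

   The cross term vanishes because $\mathbb{E}\langle X - \mu, h \rangle = \langle \mathbb{E}(X) - \mu, h \rangle = 0$ for every fixed $h$. In a general Banach or metric space no inner product is available, so this argument does not carry over directly.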
The troubles with FDA (II)
    -   Closed, bounded sets are not in general compact in
        infinite-dimensional spaces. This has some serious
        theoretical and practical consequences.
    -   There are some difficulties in defining simple, easy-to-handle
        regression models for the case of general data.
    -   The need to pre-process the data, which usually come in a
        discretized fashion.
    -   Lack of a natural translation-invariant measure (analogous
        to the Lebesgue measure). So, no general notion of density
        function (similar to that of the finite-dimensional case) is
        available. Some other important tools in classical statistics,
        such as characteristic functions, can only be partially used,
        under severe limitations.
    -   In FDA, the non-invertibility of the covariance operators
        leads to some essential differences with the classical treatment
        of regression and classification models.
Some typical tools in FDA

    -   The Karhunen-Loève expansion: $X(t)$ can be expressed in the
        form

                 $$X(t) = \sum_{k=1}^{\infty} Z_k e_k(t),$$

        where the $e_k$ are the eigenfunctions of the covariance operator
        of the process $X(t)$.
    -   Depth measures.
    -   Dimension reduction procedures.
    -   Bochner integral.
    -   Exponential inequalities.
    -   Regularization and smoothing methods.
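
   For the Karhunen-Loève expansion listed first above, the objects involved can be written out as follows. This is a sketch under assumptions not stated in the slides: the process is centered, $\mathbb{E}(X(t)) = 0$, its covariance function $K(s, t) = \mathbb{E}(X(s) X(t))$ is continuous on $[0, 1]^2$, and the $e_k$ are taken orthonormal in $L^2[0, 1]$. The symbols $\Gamma$, $K$ and $\lambda_k$ are notation introduced here.

```latex
% Assumptions (added for illustration): E X(t) = 0, K continuous on [0,1]^2,
% (e_k) orthonormal in L^2[0,1].
\begin{align*}
(\Gamma f)(t) &= \int_0^1 K(s, t)\, f(s)\, ds, &
\Gamma e_k &= \lambda_k e_k, \\
Z_k &= \int_0^1 X(t)\, e_k(t)\, dt, &
\mathbb{E}(Z_k) &= 0, \qquad \mathbb{E}(Z_j Z_k) = \lambda_k\, \delta_{jk}.
\end{align*}
```

   In words, the random coefficients $Z_k$ are uncorrelated with variances given by the eigenvalues $\lambda_k$ of the covariance operator $\Gamma$, which is what makes truncating the series a natural dimension reduction device.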
Some problems in FDA we will consider

    -   Some probability background: probability theory in
        infinite-dimensional spaces.
    -   Definition of measures of central tendency: mean, median and mode.
        Estimation of these values.
    -   Practical use of functional data.
    -   Dimension reduction methods.
    -   Depth notions.
    -   Supervised (or discrimination) and unsupervised (clustering)
        classification based on functional data.
    -   Functional regression models.
    -   ANOVA models for functional data.
Nonparametric functional statistics (I)

   The general aim is to estimate, from a sample $X_1, \ldots, X_n$, some
   function of interest which depends on the distribution of the $X_i$'s.
   The term "nonparametric" comes from the fact that we do not
   assume that the target function belongs to any family
   indexed by a finite-dimensional parameter.
Nonparametric functional statistics (II)
   Some typical examples:
     -   Estimation of the distribution function:
         Let $X_1, X_2, \ldots, X_n, \ldots$ be iid observations drawn from a real
         random variable $X$ with distribution function $F$. The empirical
         distribution function

                  $$F_n(t) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, t]}(X_i)$$

         is a natural estimator of $F$. The study of the properties of this
         estimator is a major topic in statistics.
     -   Density estimation:
         Here the aim is to estimate the common density $f$ of the $X_i$'s. This
         topic has received a lot of attention since the 1960s. Most density
         estimators are constructed from "smoothed" versions of $F_n$. They
         can be seen as "sophisticated versions" of the familiar histograms.
     -   Estimation of the regression function $\mathbb{E}(Y \mid X = x)$:
         In this case we need a sample of pairs $(X_i, Y_i)$.
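
   The empirical distribution function in the first example above is simple enough to compute directly. The following is a minimal sketch in plain Python; the helper name `make_ecdf` and the sample values are made up for illustration.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return the empirical distribution function F_n of the sample."""
    ordered = sorted(sample)
    n = len(ordered)

    def F_n(t):
        # Number of observations X_i <= t, divided by n.
        return bisect_right(ordered, t) / n

    return F_n


F_n = make_ecdf([0.3, 1.2, 0.7, 2.5, 1.2])  # made-up sample
print(F_n(0.0), F_n(1.2), F_n(3.0))  # prints: 0.0 0.8 1.0
```

   Sorting once and using binary search gives each evaluation of $F_n(t)$ in logarithmic time, which matters when $F_n$ is evaluated on a fine grid.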
An example in Medicine

   Figure: In fetal cardiology studies, the analysis of the so-called "(average) short term
   variability" is often of interest. This variable is called ASTV in the data set cardio included in the R package ks; see also the website
   of the book by Chacón and Duong and the UCI Machine Learning Repository for further analysis and details on
   these data. The above graph shows a nonparametric density estimator of the ASTV based on a sample of 2126
   foetuses. The shape of this estimated density reveals some features (e.g., related to multimodality) of the
   distribution which would necessarily be hidden if we fitted a usual parametric model (e.g., the normal model).
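
   A kernel density estimator is one common nonparametric density estimator. The slides do not say which estimator or smoothing parameter was used for the figure, so the sketch below, with a Gaussian kernel and a hand-picked bandwidth, is only an assumption-laden illustration in plain Python. The bimodal sample is made up and is not the ASTV data.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return the kernel density estimator f_n with a Gaussian kernel."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(sample)
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)

    def f_n(x):
        # Average of Gaussian bumps of width `bandwidth` centred at the X_i.
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm

    return f_n


# Made-up bimodal sample: two clusters, around 20 and around 60.
sample = [18.0, 19.5, 20.0, 21.0, 22.5, 58.0, 59.0, 60.5, 61.0, 63.0]
f_n = gaussian_kde(sample, bandwidth=3.0)
for x in (20.0, 40.0, 60.0):
    print(x, round(f_n(x), 4))
```

   The estimate is high near both clusters and low in between, a shape that a single normal density cannot reproduce. The bandwidth controls the amount of smoothing and its choice is a central issue in practice.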
An example in Food Science

   Figure: The above graph shows the estimated regression function $m(x) = \mathbb{E}(Y \mid X = x)$ for the variables X =
   weight in kg of a fish, Y = concentration of mercury in the meat of the fish. The regression curve has been
   obtained from a sample of 171 fish captured in the rivers Lumber and Waccamaw (North Carolina, USA). Again,
   nonparametric procedures provide more flexibility than the standard parametric methods (based on
   linear or polynomial regression).
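
   One classical nonparametric estimator of $m(x) = \mathbb{E}(Y \mid X = x)$ is the Nadaraya-Watson estimator, a locally weighted average of the $Y_i$. The slides do not state which estimator produced the figure, so this plain-Python sketch (Gaussian kernel, hand-picked bandwidth) is only an illustration. The (weight, mercury) pairs are made up and are not the fish data.

```python
import math


def nadaraya_watson(xs, ys, bandwidth):
    """Return the Nadaraya-Watson regression estimator with a Gaussian kernel."""

    def m_hat(x):
        # Kernel weights: observations with X_i close to x count more.
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs]
        total = sum(weights)
        if total == 0.0:
            raise ValueError("x is too far from the data for this bandwidth")
        return sum(w * y for w, y in zip(weights, ys)) / total

    return m_hat


# Made-up (weight in kg, mercury concentration) pairs, for illustration only.
weights_kg = [0.2, 0.5, 0.8, 1.1, 1.5, 2.0, 2.6, 3.1]
mercury = [0.4, 0.6, 0.9, 1.0, 1.3, 1.4, 1.9, 2.1]
m_hat = nadaraya_watson(weights_kg, mercury, bandwidth=0.5)
for x in (0.5, 1.5, 2.5):
    print(x, round(m_hat(x), 3))
```

   No linear or polynomial form is imposed: the shape of the fitted curve is driven by the data and by the bandwidth.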
Nonparametric functional statistics (III)

     -   Set estimation:
         The aim is to estimate the (compact) support of a random
         variable $X$, with values in $\mathbb{R}^d$, from a sample $X_1, \ldots, X_n$.
         In other cases, the target of the estimation is a level set of
         type $\{x : f(x) \geq c\}$, where $f$ is the underlying density of the
         $X_i$'s.
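
   A simple support estimator is the union of closed balls of a fixed radius centred at the sample points. The sketch below, in plain Python, implements a membership test for that set; the radius is a tuning parameter chosen by hand, and the sample points and the name `support_estimator` are made up for illustration.

```python
import math


def support_estimator(sample, radius):
    """Indicator of S_n, the union of closed balls B(X_i, radius)."""

    def contains(x):
        # x belongs to S_n if it is within `radius` of some sample point.
        return any(math.dist(x, xi) <= radius for xi in sample)

    return contains


# Made-up sample in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.8)]
in_support = support_estimator(points, radius=0.5)
print(in_support((0.3, 0.0)), in_support((3.0, 3.0)))  # prints: True False
```

   A level set $\{x : f(x) \geq c\}$ could be estimated in a similar spirit by plugging a density estimator in place of $f$.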
The R packages we will use
   The R software is free and easy to install. It is available for the
   usual platforms (Windows, Linux, Mac) from

                     http://www.r-project.org/

   This software consists of a “basic version” plus many additional
   packages which can be downloaded when needed. In particular, the
   package fda.usc is very useful for functional data analysis.

   This link provides fairly complete information on the
   R packages currently available for FDA.

   The packages KernSmooth and ks include some standard
   procedures in nonparametric statistics.

   More details on software can be found on the course web page.