Advanced Course in Statistics:
an overview
Antonio Cuevas
Departamento de Matemáticas
Universidad Autónoma de Madrid
January, 2020

Prerequisites
- I will assume that course attendants have taken at least an
  introductory course in mathematical statistics and a basic course on
  probability. In any case, I will do my best to make the course as
  self-contained as possible. If you need to review some basic
  notions of mathematical statistics, please have a look at the slides
  of my undergraduate courses Statistics I and Statistics II. Many
  other resources are freely available on the internet.
- Some familiarity with basic notions of measure theory,
  functional analysis (Banach and Hilbert spaces, operator theory, $L^p$
  spaces, ...) and stochastic processes is desirable.
- We will use the statistical software R for illustration purposes and
  practical examples. Some of the proposed exercises will require the
  use of R. While some familiarity with this software is
  highly recommended, it is not strictly necessary in order to follow
  this course. Please see the course web page for additional
  information on R. A very basic introduction to R
  can also be found in the slides of Statistics I.

The data
In general terms, the aim of statistics is to obtain information from
a data set (or sample)
$x_1, \dots, x_n$.
These data come from the repeated observation of a phenomenon
of interest.
The sample space is defined as the set of all possible values of the
magnitude $x$:
$\mathcal{X}$ = sample space.
In classical statistics, $\mathcal{X} = \mathbb{R}$.
In so-called multivariate analysis, $\mathcal{X} = \mathbb{R}^d$.

Descriptive statistics/Statistical Inference
- Descriptive statistics (Exploratory Data Analysis): the
  aim is to summarize (e.g., via mean, median and mode) and
  visualize a data set.
- Statistical inference: the data $X_1, \dots, X_n$ are independent
  identically distributed observations drawn from a random
  variable $X$,
  $X : (\Omega, \mathcal{A}, P) \to (\mathcal{X}, \mathcal{B})$.
  We will sometimes say that $X$ represents the underlying
  population. The distribution of $X$ (defined by
  $P(B) = P(X \in B)$ for $B \in \mathcal{B}$) is often assumed to depend on
  an unknown parameter $\theta$ taking values in a known parameter
  space $\Theta$. We will sometimes denote $P = P_\theta$.
  $\Theta$ = parameter space.
The general purpose is to use the random sample $X_1, \dots, X_n$ in order to
make inference (hypothesis testing, point estimation, confidence
intervals, ...) about the (unknown) “true” value of $\theta \in \Theta$.

The evolution of statistical theory
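Statistical theory            | $\mathcal{X}$                | $\Theta$                                     | Time
------------------------------+------------------------------+----------------------------------------------+-------
Classical inference           | $\mathbb{R}$                 | $\Theta \subset \mathbb{R}$                  | 1920’s
Multivariate analysis         | $\mathbb{R}^d$ ($n \gg d$)   | $\Theta \subset \mathbb{R}^k$ ($n \gg k$)    | 1940’s
Nonparametrics                | $\mathbb{R}^d$ ($n \gg d$)   | A function space                             | 1960’s
High dimensional problems     | $\mathbb{R}^d$ ($n < d$)     | $\Theta \subset \mathbb{R}^k$                | 2000’s
Functional Data Analysis      | A function space             | $\mathbb{R}^k$ or a function space           | 1990’s
Object Oriented Data Analysis | A space of images            | $\mathbb{R}^k$, or a space of images         | 2000’s

General structure of the course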
Two parts:
- Statistics with functional data: the sample data are real
  functions $x_i = x_i(t)$ defined on a compact interval.
- Nonparametric functional estimation: the data are real
  numbers (or vectors in $\mathbb{R}^d$) but the target of the estimation is a
  function, for example a density or a regression function.

Statistics with functional data
It is sometimes called Functional Data Analysis (FDA).
The data
$x_1 = x_1(t), \dots, x_n = x_n(t)$, $t \in [0, 1]$,
are functions defined on some compact interval (say $[0, 1]$). The
argument $t$ often (but not necessarily) corresponds to the time
instant at which the magnitude $x(t)$ is measured.
The functional data can be considered as random observations
drawn from a stochastic process. The distribution of a stochastic
process is a probability measure on the space of trajectories. So,
we will need to use some probability theory on function spaces.
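
To make the idea of "random observations drawn from a stochastic process" concrete, the following minimal sketch simulates a few discretized trajectories on $[0, 1]$. The choice of Brownian motion, the grid size and the function name `simulate_brownian_paths` are assumptions made only for illustration; they are not taken from the course material, which uses R.

```python
import random

def simulate_brownian_paths(n_curves, n_grid, seed=0):
    """Return (grid, curves): n_curves discretized Brownian trajectories on [0, 1].

    Each curve is a list of n_grid values observed on an equispaced grid.
    """
    rng = random.Random(seed)
    dt = 1.0 / (n_grid - 1)
    grid = [j * dt for j in range(n_grid)]
    curves = []
    for _ in range(n_curves):
        path = [0.0]  # Brownian motion starts at 0
        for _ in range(n_grid - 1):
            # Independent Gaussian increments with variance dt
            path.append(path[-1] + rng.gauss(0.0, dt ** 0.5))
        curves.append(path)
    return grid, curves

# A toy functional sample x_1(t), ..., x_5(t), each observed at 101 grid points
grid, curves = simulate_brownian_paths(n_curves=5, n_grid=101)
```

Each element of `curves` plays the role of one functional datum $x_i(t)$; in practice only such discretized versions of the trajectories are available.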
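In informal terms,
$\frac{\text{Random variables}}{\text{Classical statistics}} = \frac{\text{Stochastic processes}}{\text{FDA}}$
that is, stochastic processes play in FDA the role that random variables play in classical statistics.

Functional data: an example in cardiology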
Figure (ECG data): 2026 electrocardiograms. 1506 correspond to the control group (in
blue) and 520 correspond to ischemia patients (in red).
A possible application here would be as follows: given the ECG
curve of a new patient (not yet diagnosed regarding the
ischemia condition), might we get, in view of this ECG curve, a
quick, preliminary diagnosis for the patient?

Functional data: an example in climate studies (I)
Functional data: an example in climate studies (II)
In the figure above, the blue line corresponds to the average of 38 curves; each
curve is obtained (via linear interpolation) from the maximum daily
temperatures (365 values per year) recorded at the Barcelona Airport (El
Prat) during the period 1944-1981. The red line is the analogous average
obtained from the 38 curves corresponding to the period 1982-2019. The
February 29 data (corresponding to leap years) have been omitted. The missing
values have been imputed by linear interpolation.
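
The pre-processing described above (imputing missing values by linear interpolation and then averaging the curves pointwise) can be sketched as follows. This is only an illustrative sketch: the toy values are invented, the helper names `impute_linear` and `mean_curve` are hypothetical, and filling gaps at either end of a curve with the nearest observed value is an assumption not stated in the slides.

```python
def impute_linear(values):
    """Fill None entries by linear interpolation between observed neighbours.

    Gaps at the beginning or end are filled with the nearest observed
    value (an assumption; the slides do not say how such gaps are handled).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observed values to interpolate from")
    filled = list(values)
    for i in range(len(values)):
        if filled[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = (1 - w) * values[left] + w * values[right]
    return filled

def mean_curve(curves):
    """Pointwise average of curves observed on a common grid."""
    n = len(curves)
    return [sum(c[j] for c in curves) / n for j in range(len(curves[0]))]

# Toy example: two "years" with four "days" each (invented numbers)
toy_years = [
    [10.0, None, 14.0, 15.0],
    [12.0, 13.0, None, None],
]
completed = [impute_linear(year) for year in toy_years]
average = mean_curve(completed)
print(average)  # [11.0, 12.5, 13.5, 14.0]
```

With the real data, each inner list would hold the 365 daily maxima of one year, and `mean_curve` would be applied separately to the 38 curves of each period.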
Some interesting questions:
- Assume that the temperatures in the first (resp. second)
  period are a sample of a process $X(t)$ (resp. $Y(t)$), and
  denote the respective mean functions by $m_1(t) = E(X(t))$ and
  $m_2(t) = E(Y(t))$. Is there enough statistical evidence (in
  view of the previous data) to conclude $m_1 \neq m_2$? In other
  words, we would like to test the null hypothesis $H_0: m_1 = m_2$
  versus the alternative $H_1: m_1 \neq m_2$.
- Is there some useful information in the derivatives of the
  curves?

Functional data: an example in climate studies (III)
The graph below (courtesy of J.E. Chacón) corresponds to temperatures recorded at Pittsburgh.
The troubles with FDA (I)
To some extent, the progress of statistics has consisted in
conquering more sophisticated sample and parameter spaces: from
subsets of $\mathbb{R}$ or $\mathbb{R}^d$ to function or shape spaces.
This increase in generality entails some problems:
- Lack of a natural order in the sample space: no distribution
  function is available to characterize the distributions.
- Multiplicity of choices for the distance $d(x_1, x_2)$ between two
  elements (or for the norm $\|x_1 - x_2\|$ in the case of normed
  spaces).
- How to define the “population mean” $\mu$ in order to properly
  respond to the notion of “average” and to satisfy
  $E\|X - \mu\|^2 = \min_a E\|X - a\|^2$?
- How to define some basic notions such as median, mode,
  quantiles and outliers in order to properly generalize the
  analogous notions for the case of numerical data?

The troubles with FDA (II)
- Closed, bounded sets are not in general compact in
  infinite-dimensional spaces. This entails some serious
  theoretical and practical consequences.
- It is difficult to define simple, easy-to-handle
  regression models for the case of general data.
- The need for pre-processing data, which usually come in a
  discretized fashion.
- Lack of a natural translation-invariant measure (analogous
  to the Lebesgue measure). So, no general notion of density
  function (similar to that of the finite-dimensional cases) is
  available. Some other important tools in classical statistics,
  such as characteristic functions, can only be partially used,
  under severe limitations.
- In FDA, the non-invertibility of the covariance operators
  leads to some essential differences with the classical treatment
  of regression and classification models.

Some typical tools in FDA
- The Karhunen-Loève expansion: $X(t)$ can be expressed in the
  form
  $X(t) = \sum_{k=1}^{\infty} Z_k e_k(t)$,
  where the $e_k$ are the eigenfunctions of the covariance operator
  of the process $X(t)$.
- Depth measures.
- Dimension reduction procedures.
- Bochner integral.
- Exponential inequalities.
- Regularization and smoothing methods.

Some problems in FDA we will consider
- Some probability background: probability theory in
  infinite-dimensional spaces.
- Definition of centralization values: mean, median and mode.
  Estimation of these values.
- Practical use of functional data.
- Dimension reduction methods.
- Depth notions.
- Supervised (or discrimination) and unsupervised (clustering)
  classification based on functional data.
- Functional regression models.
- ANOVA models for functional data.

Nonparametric functional statistics (I)
The general aim is to estimate from a sample $X_1, \dots, X_n$ some function of interest which depends on the distribution of the $X_i$’s. The term “nonparametric” comes from the fact that we do not assume that the target function belongs to any family indexed by a finite-dimensional parameter.
Nonparametric functional statistics (II)
Some typical examples:
- Estimation of the distribution function:
  Let $X_1, X_2, \dots, X_n, \dots$ be (iid) observations drawn from a real
  random variable $X$ with distribution $F$. The empirical distribution
  function
  $F_n(t) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, t]}(X_i)$
  is a natural estimator of $F$. The study of the properties of this
  estimator is a major topic in statistics.
- Density estimation:
  Here the aim is to estimate the common density $f$ of the $X_i$’s. This
  topic has received a lot of attention since the 1960’s. Most density
  estimators are constructed from “smoothed” versions of $F_n$. They
  could be seen as “sophisticated versions” of the familiar histograms.
- Estimation of the regression function $E(Y | X = x)$:
  In this case we must have a sample of type $(X_i, Y_i)$.

An example in Medicine
Figure: In fetal cardiology studies, the analysis of the so-called “(average) short term variability” is often of interest. This variable is called ASTV in the data set cardio included in the R-package ks; see also the website of the book by Chacón and Duong and the UCI Machine Learning Repository for further analysis and details on these data. The above graph shows a nonparametric density estimator of the ASTV based on a sample of 2126 foetuses. The shape of this estimated density reveals some features (e.g., related to multimodality) of the distribution which would necessarily be hidden if we fitted a usual parametric model (e.g., one based on the normal model).
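
The empirical distribution function $F_n$ and a smoothed density estimator of the kind mentioned above can be sketched in a few lines. This is a minimal illustration, not the procedure implemented in the ks package: the Gaussian kernel, the bandwidth `h`, the toy sample and the function names `ecdf` and `gaussian_kde` are all assumptions introduced here.

```python
import math

def ecdf(sample, t):
    """Empirical distribution function F_n(t): proportion of X_i <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)

def gaussian_kde(sample, x, h):
    """Kernel density estimate at x with a Gaussian kernel and bandwidth h.

    f_hat(x) = (1 / (n h)) * sum_i K((x - X_i) / h), K = standard normal density.
    """
    n = len(sample)
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return total / (n * h * math.sqrt(2.0 * math.pi))

# Toy sample (invented values, not the ASTV data)
sample = [1.0, 2.0, 2.5, 4.0]

print(ecdf(sample, 2.0))               # 0.5: two of the four points are <= 2.0
print(gaussian_kde(sample, 2.0, 0.5))  # estimated density at x = 2.0
```

The bandwidth `h` controls the amount of smoothing: small values give a rough, multimodal estimate, while large values may hide features such as the multimodality discussed in the figure caption.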
An example in Food Science
Figure: The above graph shows the estimated regression function $m(x) = E(Y | X = x)$ for the variables $X$ = weight (in kg) of a fish, $Y$ = concentration of mercury in the meat of the fish. The regression curve has been obtained from a sample of 171 fish captured in the rivers Lumber and Waccamaw (North Carolina, USA). Again, nonparametric procedures provide more flexibility than the standard parametric methods (based on linear or polynomial regression).
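
One simple nonparametric estimator of a regression function $m(x) = E(Y | X = x)$ is the Nadaraya-Watson kernel estimator, sketched below. The slides do not say which estimator produced the figure, so this is a stand-in technique chosen for illustration; the Gaussian kernel, the bandwidth `h`, the function name `nadaraya_watson` and the toy data are assumptions.

```python
import math

def nadaraya_watson(xs, ys, x0, h):
    """Nadaraya-Watson estimate of m(x0) = E(Y | X = x0).

    A locally weighted average of the Y_i, with Gaussian kernel weights
    that decrease with the distance between X_i and x0.
    """
    weights = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; use a larger h")
    return sum(w * y for w, y in zip(weights, ys)) / total

# Toy data (invented, not the fish data): Y grows with X in a nonlinear way
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0.2, 0.5, 0.9, 1.0, 1.1, 1.1]

# Evaluate the estimated regression curve on a small grid
curve = [nadaraya_watson(xs, ys, x0, h=0.5) for x0 in (1.0, 2.0, 3.0)]
```

No linear or polynomial form is imposed on $m$: the shape of the estimated curve is driven by the data, and the bandwidth `h` plays the same smoothing role as in density estimation.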
Nonparametric functional statistics (III)
- Set estimation:
  The aim is to estimate the (compact) support of a random
  variable $X$, with values in $\mathbb{R}^d$, from a sample $X_1, \dots, X_n$.
  In other cases, the target of the estimation is a level set of
  type $\{x : f(x) \geq c\}$, where $f$ is the underlying density of the
  $X_i$’s.

The R packages we will use
The R software is free and easy to install. It is available for the
usual platforms (Windows, Linux, Mac) from
http://www.r-project.org/
This software consists of a “basic version” plus many additional
packages which can be downloaded when needed. In particular, the
package fda.usc is very useful for functional data analysis.
In this link you can find fairly complete information on the
R-packages currently available for FDA.
The packages KernSmooth and ks include some standard
procedures in nonparametric statistics.
More details on software can be found on the course web page.