One Millisecond Face Alignment with an Ensemble of Regression Trees


                                        Vahid Kazemi and Josephine Sullivan
                                         KTH, Royal Institute of Technology
                                      Computer Vision and Active Perception Lab
                                        Teknikringen 14, Stockholm, Sweden
                                              {vahidk,sullivan}@csc.kth.se

                         Abstract
This paper addresses the problem of Face Alignment for a single image. We show how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high quality predictions. We present a general framework based on gradient boosting for learning an ensemble of regression trees that optimizes the sum of square error loss and naturally handles missing or partially labelled data. We show how using appropriate priors exploiting the structure of image data helps with efficient feature selection. Different regularization strategies and their importance in combating overfitting are also investigated. In addition, we analyse the effect of the quantity of training data on the accuracy of the predictions and explore the effect of data augmentation using synthesized data.
Figure 1. Selected results on the HELEN dataset. An ensemble of randomized regression trees is used to detect 194 landmarks on a face from a single image in a millisecond.

1. Introduction

In this paper we present a new algorithm that performs face alignment in milliseconds and achieves accuracy superior or comparable to state-of-the-art methods on standard datasets. The speed gains over previous methods are a consequence of identifying the essential components of prior face alignment algorithms and then incorporating them in a streamlined formulation into a cascade of high capacity regression functions learnt via gradient boosting.

We show, as others have [8, 2], that face alignment can be solved with a cascade of regression functions. In our case each regression function in the cascade efficiently estimates the shape from an initial estimate and the intensities of a sparse set of pixels indexed relative to this initial estimate. Our work builds on the large amount of research over the last decade that has resulted in significant progress for face alignment [9, 4, 13, 7, 15, 1, 16, 18, 3, 6, 19]. In particular, we incorporate into our learnt regression functions two key elements that are present in several of the successful algorithms cited, and we detail these elements now.

The first revolves around the indexing of pixel intensities relative to the current estimate of the shape. The extracted features in the vector representation of a face image can vary greatly due to both shape deformation and nuisance factors such as changes in illumination conditions. This makes accurate shape estimation using these features difficult. The dilemma is that we need reliable features to accurately predict the shape, and on the other hand we need an accurate estimate of the shape to extract reliable features. Previous work [4, 9, 5, 8], as well as this work, uses an iterative approach (the cascade) to deal with this problem. Instead of regressing the shape parameters based on features extracted in the global coordinate system of the image, the image is transformed to a normalized coordinate system based on a current estimate of the shape, and then the features are extracted to predict an update vector for the shape parameters. This process is usually repeated several times until convergence.

The second considers how to combat the difficulty of the
inference/prediction problem. At test time, an alignment algorithm has to estimate the shape, a high dimensional vector, that best agrees with the image data and our model of shape. The problem is non-convex with many local optima. Successful algorithms [4, 9] handle this problem by assuming the estimated shape must lie in a linear subspace, which can be discovered, for example, by finding the principal components of the training shapes. This assumption greatly reduces the number of potential shapes considered during inference and can help to avoid local optima. Recent work [8, 11, 2] uses the fact that a certain class of regressors are guaranteed to produce predictions that lie in a linear subspace defined by the training shapes and there is no need for additional constraints. Crucially, our regression functions have these two elements.

Allied to these two factors is our efficient regression function learning. We optimize an appropriate loss function and perform feature selection in a data-driven manner. In particular, we learn each regressor via gradient boosting [10] with a squared error loss function, the same loss function we want to minimize at test time. The sparse pixel set, used as the regressor's input, is selected via a combination of the gradient boosting algorithm and a prior probability on the distance between pairs of input pixels. The prior distribution allows the boosting algorithm to efficiently explore a large number of relevant features. The result is a cascade of regressors that can localize the facial landmarks when initialized with the mean face pose.

The major contributions of this paper are

1. A novel method for alignment based on an ensemble of regression trees that performs shape invariant feature selection while minimizing the same loss function during training time as we want to minimize at test time.

2. We present a natural extension of our method that handles missing or uncertain labels.

3. Quantitative and qualitative results are presented that confirm that our method produces high quality predictions while being much more efficient than the best previous method (Figure 1).

4. The effect of the quantity of training data, the use of partially labeled data and synthesized data on the quality of predictions is analyzed.

2. Method

This paper presents an algorithm to precisely estimate the position of facial landmarks in a computationally efficient way. Similar to previous works [8, 2] our proposed method utilizes a cascade of regressors. In the rest of this section we describe the details of the form of the individual components of the cascade and how we perform training.

2.1. The cascade of regressors

To begin we introduce some notation. Let x_i ∈ R^2 be the x, y-coordinates of the ith facial landmark in an image I. Then the vector S = (x_1^T, x_2^T, . . . , x_p^T)^T ∈ R^{2p} denotes the coordinates of all the p facial landmarks in I. Frequently, in this paper we refer to the vector S as the shape. We use Ŝ^(t) to denote our current estimate of S. Each regressor, r_t(·, ·), in the cascade predicts an update vector from the image and Ŝ^(t) that is added to the current shape estimate to improve it:

    \hat{S}^{(t+1)} = \hat{S}^{(t)} + r_t(I, \hat{S}^{(t)})    (1)

The critical point of the cascade is that the regressor r_t makes its predictions based on features, such as pixel intensity values, computed from I and indexed relative to the current shape estimate Ŝ^(t). This introduces some form of geometric invariance into the process and as the cascade proceeds one can be more certain that a precise semantic location on the face is being indexed. Later we describe how this indexing is performed.

Note that the range of outputs spanned by the ensemble is ensured to lie in a linear subspace of the training data if the initial estimate Ŝ^(0) belongs to this space. We therefore do not need to enforce additional constraints on the predictions, which greatly simplifies our method. The initial shape can simply be chosen as the mean shape of the training data centered and scaled according to the bounding box output of a generic face detector.

To train each r_t we use the gradient tree boosting algorithm with a sum of square error loss as described in [10]. We now give the explicit details of this process.
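Before turning to training, the following is a minimal sketch of the test-time cascade of equation (1), assuming the learnt regressors are available as Python callables and shapes are stored as NumPy vectors of the 2p landmark coordinates (an illustrative sketch, not the authors' implementation):

    import numpy as np

    def align_face(image, regressors, initial_shape):
        """Apply equation (1): add each regressor's predicted update to the shape."""
        shape = np.asarray(initial_shape, dtype=float).copy()   # S^(0), e.g. the mean shape
        for r_t in regressors:                                  # the T strong regressors
            shape = shape + r_t(image, shape)                   # S^(t+1) = S^(t) + r_t(I, S^(t))
        return shape

Since each update returned by r_t lies in the span of the training shapes, as noted above, the loop applies no extra shape constraints.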
2.2. Learning each regressor in the cascade

Assume we have training data (I_1, S_1), . . . , (I_n, S_n) where each I_i is a face image and S_i its shape vector. To learn the first regression function r_0 in the cascade we create from our training data triplets of a face image, an initial shape estimate and the target update step, that is, (I_{π_i}, Ŝ_i^(0), ΔS_i^(0)) where

    \pi_i \in \{1, \dots, n\}                                            (2)
    \hat{S}_i^{(0)} \in \{S_1, \dots, S_n\} \setminus \{S_{\pi_i}\}      (3)
    \Delta S_i^{(0)} = S_{\pi_i} - \hat{S}_i^{(0)}                       (4)

for i = 1, . . . , N. We set the total number of these triplets to N = nR where R is the number of initializations used per image I_i. Each initial shape estimate for an image is sampled uniformly from {S_1, . . . , S_n} without replacement.
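As an illustration of this sampling scheme, the sketch below builds the initial triplets by pairing every training image with R other annotated shapes; the function name and array layout (each shape a NumPy vector of landmark coordinates, R ≤ n − 1) are assumptions for the example:

    import numpy as np

    def make_triplets(shapes, R=20, seed=0):
        """Create (pi_i, S_i^(0), Delta S_i^(0)) triplets as in equations (2)-(4)."""
        rng = np.random.default_rng(seed)
        n = len(shapes)
        triplets = []
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for j in rng.choice(others, size=R, replace=False):   # initial shapes, no replacement
                s0 = shapes[j].copy()                             # initial estimate S_i^(0)
                triplets.append((i, s0, shapes[i] - s0))          # target update Delta S_i^(0)
        return triplets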
From this data we learn the regression function r_0 (see Algorithm 1), using gradient tree boosting with a sum of square error loss. The set of training triplets is then updated
to provide the training data, (I_{π_i}, Ŝ_i^(1), ΔS_i^(1)), for the next regressor r_1 in the cascade by setting (with t = 0)

    \hat{S}_i^{(t+1)} = \hat{S}_i^{(t)} + r_t(I_{\pi_i}, \hat{S}_i^{(t)})    (5)
    \Delta S_i^{(t+1)} = S_{\pi_i} - \hat{S}_i^{(t+1)}                       (6)

This process is iterated until a cascade of T regressors r_0, r_1, . . . , r_{T−1} are learnt which when combined give a sufficient level of accuracy.

As stated each regressor r_t is learned using the gradient boosting tree algorithm. It should be remembered that a square error loss is used and the residuals computed in the innermost loop correspond to the gradient of this loss function evaluated at each training sample. Included in the statement of the algorithm is a learning rate parameter 0 < ν ≤ 1, also known as the shrinkage factor. Setting ν < 1 helps combat over-fitting and usually results in regressors which generalize much better than those learnt with ν = 1 [10].

Algorithm 1 Learning r_t in the cascade

Have training data {(I_{π_i}, Ŝ_i^(t), ΔS_i^(t))}_{i=1}^{N} and the learning rate (shrinkage factor) 0 < ν < 1

1. Initialise

       f_0(I, \hat{S}^{(t)}) = \arg\min_{\gamma \in \mathbb{R}^{2p}} \sum_{i=1}^{N} \| \Delta S_i^{(t)} - \gamma \|^2

2. for k = 1, . . . , K:

   (a) Set for i = 1, . . . , N

           r_{ik} = \Delta S_i^{(t)} - f_{k-1}(I_{\pi_i}, \hat{S}_i^{(t)})

   (b) Fit a regression tree to the targets r_{ik} giving a weak regression function g_k(I, Ŝ^(t)).

   (c) Update

           f_k(I, \hat{S}^{(t)}) = f_{k-1}(I, \hat{S}^{(t)}) + \nu \, g_k(I, \hat{S}^{(t)})

3. Output r_t(I, Ŝ^(t)) = f_K(I, Ŝ^(t))
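Expressed as code, one level of this boosting procedure is just the loop below. It is only a sketch: fit_regression_tree stands in for any routine that fits a vector-valued regression tree to the residuals, and features/targets are assumed to be (N, ·) and (N, 2p) NumPy arrays.

    import numpy as np

    def learn_regressor(features, targets, fit_regression_tree, K=500, nu=0.1):
        """Algorithm 1: gradient boosting of regression trees under a squared error loss."""
        f0 = targets.mean(axis=0)                    # step 1: the minimising constant
        preds = np.tile(f0, (len(targets), 1))       # current ensemble predictions f_{k-1}
        trees = []
        for _ in range(K):
            residuals = targets - preds              # step 2(a): squared-loss gradients
            tree = fit_regression_tree(features, residuals)    # step 2(b): weak learner g_k
            preds = preds + nu * tree.predict(features)         # step 2(c): shrunken update
            trees.append(tree)
        return lambda x: f0 + nu * sum(t.predict(x) for t in trees)   # step 3: r_t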
2.3. Tree based regressor

The core of each regression function r_t is the tree based regressors fit to the residual targets during the gradient boosting algorithm. We now review the most important implementation details for training each regression tree.

2.3.1 Shape invariant split tests

At each split node in the regression tree we make a decision based on thresholding the difference between the intensities of two pixels. The pixels used in the test are at positions u and v when defined in the coordinate system of the mean shape. For a face image with an arbitrary shape, we would like to index the points that have the same position relative to its shape as u and v have to the mean shape. To achieve this, the image can be warped to the mean shape based on the current shape estimate before extracting the features. Since we only use a very sparse representation of the image, it is much more efficient to warp the location of points as opposed to the whole image. Furthermore, a crude approximation of warping can be done using only a global similarity transform in addition to local translations as suggested by [2].

The precise details are as follows. Let k_u be the index of the facial landmark in the mean shape that is closest to u and define its offset from u as

    \delta x_u = u - \bar{x}_{k_u}

Then for a shape S_i defined in image I_i, the position in I_i that is qualitatively similar to u in the mean shape image is given by

    u' = x_{i, k_u} + \frac{1}{s_i} R_i^T \, \delta x_u    (7)

where s_i and R_i are the scale and rotation matrix of the similarity transform which transforms S_i to S̄, the mean shape. The scale and rotation are found to minimize

    \sum_{j=1}^{p} \| \bar{x}_j - (s_i R_i x_{i,j} + t_i) \|^2    (8)

the sum of squares between the mean shape's facial landmark points, the x̄_j's, and those of the warped shape. v' is similarly defined. Formally each split is a decision involving 3 parameters θ = (τ, u, v) and is applied to each training and test example as

    h(I_{\pi_i}, \hat{S}_i^{(t)}, \theta) = \begin{cases} 1 & \text{if } I_{\pi_i}(u') - I_{\pi_i}(v') > \tau \\ 0 & \text{otherwise} \end{cases}    (9)

where u' and v' are defined using the scale and rotation matrix which best warp Ŝ_i^(t) to S̄ according to equation (7).

In practice the assignments and local translations are determined during the training phase. Calculating the similarity transform, at test time the most computationally expensive part of this process, is only done once at each level of the cascade.
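As a sketch of how such a split can be evaluated, the snippet below warps the two reference pixels u and v from the mean-shape frame into the image via equation (7) and applies the threshold of equation (9). The similarity transform (s, R) is assumed to have been estimated already (e.g. by a standard least-squares fit of equation (8)); the function names and array layouts are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def warp_point(u, mean_shape, shape, s, R):
        """Equation (7): map a mean-shape point into the image via its nearest landmark."""
        k = np.argmin(np.linalg.norm(mean_shape - u, axis=1))   # closest landmark index k_u
        delta = u - mean_shape[k]                                # offset delta x_u
        return shape[k] + (1.0 / s) * R.T @ delta

    def split_test(image, u, v, tau, mean_shape, shape, s, R):
        """Equation (9): threshold the intensity difference of the two warped pixels.
        image is a 2D grayscale array indexed as image[row, col]."""
        ux, uy = np.round(warp_point(u, mean_shape, shape, s, R)).astype(int)
        vx, vy = np.round(warp_point(v, mean_shape, shape, s, R)).astype(int)
        return 1 if int(image[uy, ux]) - int(image[vy, vx]) > tau else 0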
2.3.2 Choosing the node splits

For each regression tree, we approximate the underlying function with a piecewise constant function where a constant vector is fit to each leaf node. To train the regression tree we randomly generate a set of candidate splits, that is θ's, at each node. We then greedily choose the θ*, from these candidates, which minimizes the sum of square error. If Q is the set of the indices of the training examples at a node, this corresponds to minimizing

    E(Q, \theta) = \sum_{s \in \{l, r\}} \sum_{i \in Q_{\theta,s}} \| r_i - \mu_{\theta,s} \|^2    (10)

where Q_{θ,l} is the indices of the examples that are sent to the left node due to the decision induced by θ, r_i is the vector of all the residuals computed for image i in the gradient boosting algorithm and

    \mu_{\theta,s} = \frac{1}{|Q_{\theta,s}|} \sum_{i \in Q_{\theta,s}} r_i, \quad \text{for } s \in \{l, r\}    (11)

The optimal split can be found very efficiently because if one rearranges equation (10) and omits the factors not dependent on θ then one can see that

    \arg\min_{\theta} E(Q, \theta) = \arg\max_{\theta} \sum_{s \in \{l, r\}} |Q_{\theta,s}| \, \mu_{\theta,s}^T \mu_{\theta,s}

Here we only need to compute µ_{θ,l} when evaluating different θ's, as µ_{θ,r} can be calculated from the average of the targets at the parent node µ and µ_{θ,l} as follows

    \mu_{\theta,r} = \frac{|Q| \mu - |Q_{\theta,l}| \mu_{\theta,l}}{|Q_{\theta,r}|}
2.3.3 Feature selection

The decision at each node is based on thresholding the difference of intensity values at a pair of pixels. This is a rather simple test, but it is much more powerful than single intensity thresholding because of its relative insensitivity to changes in global lighting. Unfortunately, the drawback of using pixel differences is that the number of potential split (feature) candidates is quadratic in the number of pixels in the mean image. This makes it difficult to find good θ's without searching over a very large number of them. However, this limiting factor can be eased, to some extent, by taking the structure of image data into account. We introduce an exponential prior

    P(u, v) \propto e^{-\lambda \| u - v \|}    (12)

over the distance between the pixels used in a split to encourage closer pixel pairs to be chosen.
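One simple way to realise such a prior is rejection sampling over a pool of candidate pixel positions; the sketch below is one possible reading of equation (12), not necessarily how it is implemented here, and the argument layout (pixels as an array of 2D positions) is an assumption:

    import numpy as np

    def sample_pixel_pair(pixels, lam, rng=None):
        """Draw a pair (u, v) with probability proportional to exp(-lam * ||u - v||)."""
        rng = rng or np.random.default_rng()
        while True:
            i, j = rng.integers(0, len(pixels), size=2)
            if i == j:
                continue
            dist = np.linalg.norm(pixels[i] - pixels[j])
            if rng.random() < np.exp(-lam * dist):   # accept with the prior's probability
                return pixels[i], pixels[j]

Uniformly proposed pairs are accepted with probability e^{-λ||u−v||}, so accepted pairs follow the prior and nearby pixels are chosen more often.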
We found that using this simple prior reduces the prediction error on a number of face datasets. Figure 4 compares the features selected with and without this prior, where the size of the feature pool is fixed to 20 in both cases.

2.4. Handling missing labels

The objective of equation (10) can be easily extended to handle the case where some of the landmarks are not labeled in some of the training images (or we have a measure of uncertainty for each landmark). Introduce variables w_{i,j} ∈ [0, 1] for each training image i and each landmark j. Setting w_{i,j} to 0 indicates that the landmark j is not labeled in the ith image while setting it to 1 indicates that it is. Then equation (10) can be updated to

    E(Q, \theta) = \sum_{s \in \{l, r\}} \sum_{i \in Q_{\theta,s}} (r_i - \mu_{\theta,s})^T W_i (r_i - \mu_{\theta,s})

where W_i is a diagonal matrix with the vector (w_{i1}, w_{i1}, w_{i2}, w_{i2}, . . . , w_{ip}, w_{ip})^T on its diagonal and

    \mu_{\theta,s} = \Big( \sum_{i \in Q_{\theta,s}} W_i \Big)^{-1} \sum_{i \in Q_{\theta,s}} W_i r_i, \quad \text{for } s \in \{l, r\}    (13)

The gradient boosting algorithm must also be modified to account for these weight factors. This can be done simply by initializing the ensemble model with the weighted average of targets, and fitting regression trees to the weighted residuals in Algorithm 1 as follows

    r_{ik} = W_i (\Delta S_i^{(t)} - f_{k-1}(I_{\pi_i}, \hat{S}_i^{(t)}))    (14)
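Since each W_i is diagonal, equations (13) and (14) reduce to elementwise weighted operations. A minimal sketch (NumPy, with an assumed layout where each landmark's weight is repeated for its x and y coordinates):

    import numpy as np

    def weighted_node_mean(residuals, weights):
        """Equation (13) with diagonal W_i: weighted average of the residual vectors at a node."""
        total = weights.sum(axis=0)                        # sum of W_i over the node
        weighted_sum = (weights * residuals).sum(axis=0)   # sum of W_i r_i
        # guard against landmarks unlabelled in every example reaching this node
        return np.where(total > 0, weighted_sum / np.maximum(total, 1e-12), 0.0)

    def weighted_residuals(delta_s, preds, weights):
        """Equation (14): weight the boosting residuals by each example's W_i."""
        return weights * (delta_s - preds)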
3. Experiments

Baselines: To accurately benchmark the performance of our proposed method, an ensemble of regression trees (ERT), we created two more baselines. The first is based on randomized ferns with random feature selection (EF) and the other is a more advanced version of this with correlation based feature selection (EF+CB), which is our re-implementation of [2]. All the parameters are fixed for all three approaches.

EF uses a straightforward implementation of randomized ferns as the weak regressors within the ensemble and is the fastest to train. We use the same shrinkage method as suggested by [2] to regularize the ferns.

EF+CB uses a correlation based feature selection method that projects the target outputs, the r_i's, onto a random direction, w, and chooses the pairs of features (u, v) such that I_i(u') − I_i(v') has the highest sample correlation over the training data with the projected targets w^T r_i.

(a) T = 0   (b) T = 1   (c) T = 2   (d) T = 3   (e) T = 10   (f) Ground truth
Figure 2. Landmark estimates at different levels of the cascade initialized with the mean shape centered at the output of a basic Viola & Jones [17] face detector. After the first level of the cascade, the error is already greatly reduced.

Parameters: Unless specified, all the experiments are performed with the following fixed parameter settings. The number of strong regressors, r_t, in the cascade is T = 10 and each r_t comprises K = 500 weak regressors g_k. The depth of the trees (or ferns) used to represent g_k is set to F = 5. At each level of the cascade P = 400 pixel locations are sampled from the image. To train the weak regressors, we randomly sample a pair of these P pixel locations
according to our prior and choose a random threshold to create a potential split as described in equation (9). The best split is then found by repeating this process S = 20 times, and choosing the one that optimizes our objective. To create the training data to learn our model we use R = 20 different initializations for each training example.

Performance: The runtime complexity of the algorithm on a single image is constant, O(TKF). The complexity of the training time depends linearly on the number of training data, O(NDTKFS), where N is the number of training data and D is the dimension of the targets. In practice, with a single CPU our algorithm takes about an hour to train on the HELEN [12] dataset and at runtime it only takes about one millisecond per image.

Database: Most of the experimental results reported are for the HELEN [12] face database which we found to be the most challenging publicly available dataset. It consists of a total of 2330 images, each of which is annotated with 194 landmarks. As suggested by the authors we use 2000 images for training data and the rest for testing.

We also report final results on the popular LFPW [1] database which consists of 1432 images. Unfortunately, we could only download 778 training images and 216 valid test images, which makes our results not directly comparable to those previously reported on this dataset.

Comparison: Table 1 is a summary of our results compared to previous algorithms. In addition to our baselines, we have also compared our results with two variations of Active Shape Models, STASM [14] and CompASM [12].

             [14]   [12]   EF     EF+CB   EF+CB (5)   EF+CB (10)   ERT
    Error    .111   .091   .069   .062    .059        .055         .049

Table 1. A summary of the results of different algorithms on the HELEN dataset. The error is the average normalized distance of each landmark to its ground truth position. The distances are normalized by dividing by the interocular distance. The number within the bracket represents the number of times the regression algorithm was run with a random initialization. If no number is displayed then the method was initialized with the mean shape. In the case of multiple estimations the median of the estimates was chosen as the final estimate for the landmark.

The ensemble of regression trees described in this paper significantly improves the results over the ensemble of ferns. Figure 3 shows the average error at different levels of the cascade, which shows that ERT can reduce the error much faster than the other baselines. Note that we have also provided the results of running EF+CB multiple times and taking the median of the final predictions. The results show that a similar error rate to EF+CB can be achieved by our method with an order of magnitude less computation.

Figure 3. A comparison of different methods on the HELEN (a) and LFPW (b) datasets (average error plotted against cascade level). EF is the ensemble of randomized ferns and EF+CB is the ensemble of ferns with correlation based feature selection initialized with the mean shape. We also provide the results of taking the median of the results of various initializations (5 and 10) as suggested by [2]. The results show that the proposed ensemble of regression trees (ERT) initialized with only the mean shape consistently outperforms the ensemble of ferns baseline and it can reach the same error rate with much less computation.

We have also provided results for the widely used LFPW [1] dataset (Table 2). With our EF+CB baseline we could not replicate the numbers reported by [2]. (This could be due to the fact that we could not obtain the whole dataset.) Nevertheless our method surpasses most of the previously reported results on this dataset, taking only a fraction of the computational time needed by any other method.

             [1]    [2]    EF     EF+CB   EF+CB (5)   EF+CB (10)   ERT
    Error    .040   .034   .051   .046    .043        .041         .038

Table 2. A comparison of the different methods when applied to the LFPW dataset. Please see the caption of Table 1 for an explanation of the numbers.

Feature Selection: Table 3 shows the effect of using equation (12) as a prior on the distance between pixels used in a split instead of a uniform prior on the final results. The parameter λ determines the effective maximum distance between the two pixels in our features and was set to 0.1 in our experiments. Selecting this parameter by cross validation when learning each strong regressor, r_t, in the cascade could potentially lead to a more significant improvement. Figure 4 is a visualization of the selected pairs of features when the different priors are used.

             Uniform   Exponential
    Error    .053      .049

Table 3. The effect of using different priors for feature selection on the final average error. An exponential prior is applied on the Euclidean distance between the two pixels defining a feature, see equation (12).

(a) Uniform prior   (b) Exponential prior
Figure 4. Different features are selected if different priors are used. The exponential prior biases the selection towards pairs of pixels which are closer together.

Regularization: When using the gradient boosting algorithm one needs to be careful to avoid overfitting. To obtain lower test errors it is necessary to perform some form of regularization. The simplest approach is shrinkage. This involves setting the learning rate ν in the gradient boosting algorithm to less than 1 (here we set ν = 0.1). Regularization can also be achieved by averaging the predictions of multiple regression trees. In this case the g_k correspond to a random forest as opposed to one tree and we set ν = 1. Therefore, at each iteration of the gradient boosting algorithm, instead of fitting one regression tree to the residuals, we fit multiple trees (10 in our experiments) and average the results. (The total number of trees is fixed in all the cases.)
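A sketch of this averaging variant of a single boosting iteration, with fit_regression_tree again standing in for any vector-output tree-fitting routine (illustrative only, to contrast with the shrinkage loop sketched after Algorithm 1):

    import numpy as np

    def averaged_boosting_step(features, residuals, fit_regression_tree, n_trees=10):
        """Fit several trees to the same residuals and use their averaged prediction (nu = 1)."""
        forest = [fit_regression_tree(features, residuals) for _ in range(n_trees)]
        update = np.mean([t.predict(features) for t in forest], axis=0)
        return forest, update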
In terms of the bias and variance trade off, the gradient boosting algorithm always decreases the bias but increases the variance. But regularizing by shrinkage or averaging effectively reduces the variance by learning multiple overlapping models.

             Unregularized   Shrinkage   Averaging
    Error    .103            .049        .049

Table 4. A comparison of the results on the HELEN dataset when different forms of regularization are applied. We found similar results when using either shrinkage or averaging given the same total number of trees in the ensemble.

We achieved similar results using the averaging regularization compared to the more standard shrinkage method. However, regularization by averaging has the advantage of being more scalable, as it enables parallelization during training time which is especially important for solving large scale problems.

Cascade: At each level of the cascade the second level regressors can only observe a fixed and sparse subset of the shape indexed features. Indexing the features based on the current estimate is a crude way of warping the image with a small cost. Table 5 shows the final error rate with and without using the cascade. We found significant improvement by using this iterative mechanism which is in line with previously reported results [8, 2]. (For a fair comparison here we fixed the total number of observed features to 10 × 400 points.)

    # Trees    1 × 500   1 × 5000   10 × 500
    Error      .085      .074       .049

Table 5. The above results show the importance of using a cascade of regressors as opposed to a single level ensemble.

Training Data: To test the performance of our method with respect to the number of training images, we trained different models from differently sized subsets of the training data. Table 6 summarizes the final results and Figure 5 is a plot of the error at each level of the cascade. Using many levels of regressors is most useful when we have a large number of training examples.

We repeated the same experiments with the total number of augmented examples fixed but varied the combination of initial shapes used to generate a training example from one labelled face example and the number of annotated images used to learn the cascade (Table 7).

Augmenting the training data using different initial
shapes expands the dataset in terms of shape. Our results show this type of augmentation does not fully compensate for a lack of annotated training images, though the rate of improvement gained by increasing the number of training images quickly slows after the first few hundred images.

    # Examples   100    200    500    1000   2000
    Error        .090   .074   .059   .054   .049

Table 6. Final error rate with respect to the number of training examples. When creating training data for learning the cascade regressors each labelled face image generated 20 training examples by using 20 different labelled faces as the initial guess for the face's shape.

Figure 5. The average error at each level of the cascade is plotted with respect to the number of training examples used. Using many levels of regressors is most useful when the number of training examples is large.

    # Examples         100    200    500    1000   2000
    # Initial Shapes   400    200    80     40     20
    Error              .062   .057   .054   .052   .049

Table 7. Here the effective number of training examples is fixed but we use different combinations of the number of training images and the number of initial shapes used for each labelled face image.

Partial annotations: Table 8 shows the results of using partially annotated data. 200 training examples are fully annotated and the rest are only partially annotated.

    # Examples   200    200+1800 (25%)   200+1800 (50%)   2000
    Error        .074   .067             .061             .049

Table 8. Results of using partially labelled data. 200 examples are always fully annotated. The values inside the parentheses show the percentage of landmarks observed.

The results show that we can gain a substantial improvement by using partially labelled data. Yet the improvement displayed may not be saturated because we know that the underlying dimension of the shape parameters is much lower than the dimension of the landmarks (194 × 2). There is, therefore, potential for a more significant improvement with partial labels by taking explicit advantage of the correlation between the positions of landmarks. Note that the gradient boosting procedure described in this paper does not take advantage of the correlation between landmarks. This issue could be addressed in future work.

4. Conclusion

We described how an ensemble of regression trees can be used to regress the location of facial landmarks from a sparse subset of intensity values extracted from an input image. The presented framework is faster in reducing the error compared to previous work and can also handle partial or uncertain labels. While major components of our algorithm treat different target dimensions as independent variables, a natural extension of this work would be to take advantage of the correlation of shape parameters for more efficient training and a better use of partial labels.

Acknowledgements: This work has been funded by the Swedish Foundation for Strategic Research within the project VINST.

Figure 6. Final results on the HELEN database.

References

[1] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In CVPR, pages 545-552, 2011.
[2] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In CVPR, pages 2887-2894, 2012.
[3] T. F. Cootes, M. Ionita, C. Lindner, and P. Sauer. Robust and accurate shape model fitting using random forest regression voting. In ECCV, 2012.
[4] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: their training and application. Computer Vision and Image Understanding, 61(1):38-59, 1995.
[5] D. Cristinacce and T. F. Cootes. Boosted regression active shape models. In BMVC, pages 79.1-79.10, 2007.
[6] M. Dantone, J. Gall, G. Fanelli, and L. V. Gool. Real-time facial feature detection using conditional regression forests. In CVPR, 2012.
[7] L. Ding and A. M. Martínez. Precise detailed detection of faces and facial features. In CVPR, 2008.
[8] P. Dollár, P. Welinder, and P. Perona. Cascaded pose regression. In CVPR, pages 1078-1085, 2010.
[9] G. J. Edwards, T. F. Cootes, and C. J. Taylor. Advances in active appearance models. In ICCV, pages 137-142, 1999.
[10] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
[11] V. Kazemi and J. Sullivan. Face alignment with part-based modeling. In BMVC, pages 27.1-27.10, 2011.
[12] V. Le, J. Brandt, Z. Lin, L. D. Bourdev, and T. S. Huang. Interactive facial feature localization. In ECCV, pages 679-692, 2012.
[13] L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In ECCV, pages 72-85, 2008.
[14] S. Milborrow and F. Nicolls. Locating facial features with an extended active shape model. In ECCV, pages 504-513, 2008.
[15] J. Saragih, S. Lucey, and J. Cohn. Deformable model fitting by regularized landmark mean-shifts. International Journal of Computer Vision, 91:200-215, 2010.
[16] B. M. Smith and L. Zhang. Joint face alignment with non-parametric shape models. In ECCV, pages 43-56, 2012.
[17] P. A. Viola and M. J. Jones. Robust real-time face detection. In ICCV, page 747, 2001.
[18] X. Zhao, X. Chai, and S. Shan. Joint face alignment: Rescue bad alignments with good ones by regularized re-fitting. In ECCV, 2012.
[19] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, pages 2879-2886, 2012.