Quickly Boosting Decision Trees - Pruning Underachieving Features Early


Ron Appel (appel@caltech.edu), Caltech, Pasadena, CA 91125 USA
Thomas Fuchs (fuchs@caltech.edu), Caltech, Pasadena, CA 91125 USA
Piotr Dollár (pdollar@microsoft.com), Microsoft Research, Redmond, WA 98052 USA
Pietro Perona (perona@caltech.edu), Caltech, Pasadena, CA 91125 USA

Abstract

Boosted decision trees are among the most popular learning techniques in use today. While exhibiting fast speeds at test time, relatively slow training renders them impractical for applications with real-time learning requirements. We propose a principled approach to overcome this drawback. We prove a bound on the error of a decision stump given its preliminary error on a subset of the training data; the bound may be used to prune unpromising features early in the training process. We propose a fast training algorithm that exploits this bound, yielding speedups of an order of magnitude at no cost in the final performance of the classifier. Our method is not a new variant of Boosting; rather, it is used in conjunction with existing Boosting algorithms and other sampling methods to achieve even greater speedups.

1. Introduction

Boosting is one of the most popular learning techniques in use today, combining many weak learners to form a single strong one (Schapire, 1990; Freund, 1995; Freund & Schapire, 1996). Shallow decision trees are commonly used as weak learners due to their simplicity and robustness in practice (Quinlan, 1996; Breiman, 1998; Ridgeway, 1999; Kotsiantis, 2007). This powerful combination (of Boosting and decision trees) is the learning backbone behind many state-of-the-art methods across a variety of domains such as computer vision, behavior analysis, and document ranking, to name a few (Dollár et al., 2012; Burgos-Artizzu et al., 2012; Asadi & Lin, 2013), with the added benefit of exhibiting fast speed at test time.

Learning speed is important as well. In active or real-time learning situations, such as for human-in-the-loop processes or when dealing with data streams, classifiers must learn quickly to be practical. This is our motivation: fast training without sacrificing accuracy. To this end, we propose a principled approach. Our method offers a speedup of an order of magnitude over prior approaches while maintaining identical performance.

The contributions of our work are the following:

1. Given the performance on a subset of data, we prove a bound on a stump's classification error, information gain, Gini impurity, and variance.
2. Based on this bound, we propose an algorithm guaranteed to produce identical trees as classical algorithms, and experimentally show it to be one order of magnitude faster for classification tasks.
3. We outline an algorithm for quickly Boosting decision trees using our quick tree-training method, applicable to any variant of Boosting.

In the following sections, we discuss related work, inspect the tree-boosting process, describe our algorithm, prove our bound, and conclude with experiments on several datasets, demonstrating our gains.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).
2. Related Work

Many variants of Boosting (Freund & Schapire, 1996) have proven to be competitive in terms of prediction accuracy in a variety of applications (Bühlmann & Hothorn, 2007); however, the slow training speed of boosted trees remains a practical drawback. Accordingly, a large body of literature is devoted to speeding up Boosting, mostly categorizable as: methods that subsample features or data points, and methods that speed up training of the trees themselves.

In many situations, groups of features are highly correlated. By carefully choosing exemplars, an entire set of features can be pruned based on the performance of its exemplar. (Dollár et al., 2007) propose clustering features based on their performances in previous stages of boosting. (Kégl & Busa-Fekete, 2009) partition features into many subsets, deciding which ones to inspect at each stage using adversarial multi-armed bandits. (Paul et al., 2009) use random projections to reduce the dimensionality of the data, in essence merging correlated features.

Other approaches subsample the data. In weight-trimming (Friedman, 2000), all samples with weights smaller than a certain threshold are ignored. With Stochastic Boosting (Friedman, 2002), each weak learner is trained on a random subset of the data. For very large datasets or in the case of on-line learning, elaborate sampling methods have been proposed, e.g. Hoeffding trees (Domingos & Hulten, 2000) and FilterBoost (Domingo & Watanabe, 2000; Bradley & Schapire, 2007). To this end, probabilistic bounds can be computed on the error rates given the number of samples used (Mnih et al., 2008). More recently, Laminating (Dubout & Fleuret, 2011) trades off the number of features for the number of samples considered as training progresses, enabling constant-time Boosting.

Although all these methods can be made to work in practice, they provide no performance guarantees.

A third line of work focuses on speeding up training of decision trees. Building upon the C4.5 tree-training algorithm of (Quinlan, 1993), using shallow (depth-D) trees and quantizing feature values into B ≪ N bins leads to an efficient O(D×K×N) implementation, where K is the number of features and N is the number of samples (Wu et al., 2008; Sharp, 2008).

Orthogonal to all of the above methods is the use of parallelization, i.e. multiple cores or GPUs. Recently, (Svore & Burges, 2011) distributed computation over cluster nodes for the application of ranking, however reporting lower accuracies as a result. GPU implementations of GentleBoost exist for object detection (Coates et al., 2009) and for medical imaging using Probabilistic Boosting Trees (PBT) (Birkbeck et al., 2011). Although these methods offer speedups in their own right, we focus on the single-core paradigm.

Regardless of the subsampling heuristic used, once a subset of features or data points is obtained, weak learners are trained on that subset in its entirety. Consequently, each of the aforementioned strategies can be viewed as a two-stage process: in the first, a smaller set of features or data points is collected, and in the second, decision trees are trained on that entire subset.

We propose a method for speeding up this second stage; thus, our approach can be used in conjunction with all the prior work mentioned above for even greater speedup. Unlike the aforementioned methods, our approach provides a performance guarantee: the boosted tree offers identical performance to one with classical training.

3. Boosting Trees

A boosted classifier (or regressor) having the form $H(x) = \sum_t \alpha_t h_t(x)$ can be trained by greedily minimizing a loss function $L$, i.e. by optimizing the scalar $\alpha_t$ and the weak learner $h_t(x)$ at each iteration $t$. Before training begins, each data sample $x_i$ is assigned a non-negative weight $w_i$ (which is derived from $L$). After each iteration, misclassified samples are weighted more heavily, thereby increasing the severity of misclassifying them in following iterations. Regardless of the type of Boosting used (i.e. AdaBoost, LogitBoost, L2Boost, etc.), each iteration requires training a new weak learner given the sample weights. We focus on the case when the weak learners are shallow trees.

3.1. Training Decision Trees

A decision tree $h_{\mathrm{TREE}}(x)$ is composed of a stump $h_j(x)$ at every non-leaf node $j$. Trees are commonly grown using a greedy procedure as described in (Breiman et al., 1984; Quinlan, 1993), recursively setting one stump at a time, starting at the root and working through to the lower nodes. Each stump produces a binary decision; it is given an input $x \in \mathbb{R}^K$, and is parametrized with a polarity $p \in \{\pm 1\}$, a threshold $\tau \in \mathbb{R}$, and a feature index $k \in \{1, 2, ..., K\}$:

$$h_j(x) \equiv p_j \, \mathrm{sign}\big(x[k_j] - \tau_j\big)$$

[where $x[k]$ indicates the $k$th feature (or dimension) of $x$]

We note that even in the multi-class case, stumps are trained in a similar fashion, with the binary decision discriminating between subsets of the classes. However, this is transparent to the training routine; thus, we only discuss binary stumps.
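To make the notation concrete, the following is a minimal sketch (ours, not the authors' code) of a binary stump $h(x) = p\,\mathrm{sign}(x[k]-\tau)$ and of evaluating a boosted classifier $H(x)=\sum_t \alpha_t h_t(x)$; the class and function names are illustrative only.

```python
import numpy as np

class Stump:
    """Binary decision stump h(x) = p * sign(x[k] - tau)."""
    def __init__(self, k, tau, p):
        self.k, self.tau, self.p = k, tau, p      # feature index, threshold, polarity

    def predict(self, X):
        # X: (n_samples, n_features); predicts +p where x[k] > tau and -p where
        # x[k] <= tau, i.e. samples at the threshold fall on the "<=" side.
        return self.p * np.where(X[:, self.k] > self.tau, 1.0, -1.0)

def boosted_predict(stumps, alphas, X):
    """Evaluate H(x) = sum_t alpha_t * h_t(x) and return its sign as the label."""
    H = np.zeros(len(X))
    for h, alpha in zip(stumps, alphas):
        H += alpha * h.predict(X)
    return np.sign(H)
```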
                                                                  includes more samples, and accordingly, its mass Zm
In the context of classification, the goal in each stage of stump training is to find the optimal parameters that minimize $\varepsilon$, the weighted classification error:

$$\varepsilon \equiv \frac{1}{Z} \sum_i w_i \, \mathbf{1}_{\{h(x_i) \neq y_i\}}, \qquad Z \equiv \sum_i w_i$$

[where $\mathbf{1}_{\{...\}}$ is the indicator function]

In this paper, we focus on classification error; however, other types of split criteria (i.e. information gain, Gini impurity, or variance) can be derived in a similar form; please see the supplementary materials for details. For binary stumps, we can rewrite the error as:

$$\varepsilon(k) = \frac{1}{Z} \Big[ \sum_{x_i[k] \leq \tau} w_i \, \mathbf{1}_{\{y_i = +p\}} + \sum_{x_i[k] > \tau} w_i \, \mathbf{1}_{\{y_i = -p\}} \Big]$$

In practice, this error is minimized by selecting the single best feature $k^*$ from all of the features:

$$\{p^*, \tau^*, k^*\} = \operatorname*{argmin}_{p,\tau,k} \ \varepsilon(k), \qquad \varepsilon^* \equiv \varepsilon(k^*)$$

In current implementations of Boosting (Sharp, 2008), feature values can first be quantized into B bins by linearly distributing them in [1, B] (outer bins corresponding to the min/max, or to a fixed number of standard deviations from the mean), or by any other quantization method. Not surprisingly, using too few bins reduces threshold precision and hence overall performance. We find that B = 256 is large enough to incur no loss in practice.

Determining the optimal threshold $\tau^*$ requires accumulating each sample's weight into discrete bins corresponding to that sample's feature value $x_i[k]$. This procedure turns out to still be quite costly: for each of the K features, the weight of each of the N samples has to be added to a bin, making the stump training operation O(K×N) – the very bottleneck of training boosted decision trees.
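As a rough illustration of this O(K×N) bottleneck (a sketch under the quantization scheme described above, not the authors' implementation; names are illustrative), a stump can be trained exhaustively by accumulating per-class weight histograms with B bins for each feature:

```python
import numpy as np

def train_stump(Xq, y, w, B=256):
    """Exhaustive stump training on quantized features.

    Xq : (N, K) integer array, feature values quantized into bins 0..B-1
    y  : (N,) labels in {-1, +1};  w : (N,) non-negative sample weights
    Returns (error, k, tau, p) of the best stump over all features.
    """
    N, K = Xq.shape
    Z = w.sum()
    best = (np.inf, None, None, None)
    for k in range(K):                                       # K features ...
        # ... and one O(N) pass per feature to fill the weight histograms
        hist_pos = np.bincount(Xq[:, k], weights=w * (y > 0), minlength=B)
        hist_neg = np.bincount(Xq[:, k], weights=w * (y < 0), minlength=B)
        cum_pos, cum_neg = np.cumsum(hist_pos), np.cumsum(hist_neg)
        # polarity p = +1 predicts -1 on the side x[k] <= tau, so its weighted
        # error is (positives on the left) + (negatives on the right)
        err_pos = cum_pos + (cum_neg[-1] - cum_neg)
        err_neg = Z - err_pos                                # polarity p = -1
        for p, err in ((+1, err_pos), (-1, err_neg)):
            tau = int(np.argmin(err))
            if err[tau] / Z < best[0]:
                best = (err[tau] / Z, k, tau, p)
    return best
```

Filling the two histograms costs one pass over all N samples for each of the K features, which is exactly the K×N weight accumulation identified above as the training bottleneck.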
In the following sections, we examine the stump-training process in greater detail and develop an intuition for how we can reduce computational costs.

3.2. Progressively Increasing Subsets

Let us assume that at the start of each Boosting iteration, the data samples are sorted in order of decreasing weight, i.e. $w_i \geq w_j \;\; \forall \, i < j$. Consequently, we define $Z_m$, the mass of the heaviest subset of $m$ data points:

$$Z_m \equiv \sum_{i \leq m} w_i \qquad [\text{note: } Z_N \equiv Z]$$

Clearly, $Z_m$ is greater or equal to the sum of any $m$ other sample weights. As we increase $m$, the $m$-subset includes more samples, and accordingly, its mass $Z_m$ increases (although at a diminishing rate).

In Figure 1, we plot $Z_m/Z$ for progressively increasing m-subsets, averaged over multiple iterations. An interesting empirical observation can be made about Boosting: a large fraction of the overall weight is accounted for by just a few samples. Since training time is dependent on the number of samples and not on their cumulative weight, we should be able to leverage this fact and train using only a subset of the data. The more non-uniform the weight distribution, the more we can leverage; Boosting is an ideal situation where few samples are responsible for most of the weight.

Figure 1. Progressively increasing m-subset mass using different variants of Boosting (AdaBoost, LogitBoost, L2Boost). Relative subset mass Zm/Z exceeds 90% after only m ≈ 7% of the total number of samples N in the case of AdaBoost, or m/N ≈ 20% in the case of L2Boost.

The same observation was made by (Friedman et al., 2000), giving rise to the idea of weight-trimming, which proceeds as follows. At the start of each Boosting iteration, samples are sorted in order of weight (from largest to smallest). For that iteration, only the samples belonging to the smallest m-subset such that $Z_m/Z \geq \eta$ are used for training (where η is some predefined threshold); all the other samples are temporarily ignored – or trimmed. Friedman et al. claimed this to "dramatically reduce computation for boosted models without sacrificing accuracy." In particular, they prescribed η = 90% to 99% "typically", but they left open the question of how to choose η from the statistics of the data (Friedman et al., 2000).
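A minimal sketch of the trimming rule (illustrative code, not from the paper): with the weights sorted in decreasing order, the smallest m-subset whose mass reaches a fraction η of Z is found with a single cumulative sum.

```python
import numpy as np

def smallest_subset(w_sorted, eta):
    """Smallest m such that Z_m / Z >= eta, for weights sorted in decreasing order."""
    Zm = np.cumsum(w_sorted)                     # Z_1, Z_2, ..., Z_N
    m = int(np.searchsorted(Zm, eta * Zm[-1])) + 1
    return min(m, len(w_sorted))

# Example: exponentially distributed weights, a crude stand-in for the skewed
# weight distributions that Boosting produces after a few iterations.
w = np.sort(np.random.exponential(size=1000))[::-1]
print(smallest_subset(w, eta=0.90), "of", len(w), "samples carry 90% of the weight")
```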
Their work raises an interesting question: Is there a principled way to determine (and train on only) the smallest subset after which including more samples does not alter the final stump? Indeed, we can prove that a relatively small m-subset contains enough information to set the optimal stump, and thereby save a lot of computation.
3.3. Preliminary Errors

Definition: given a feature k and an m-subset, the best preliminary error $\varepsilon^{(k)}_m$ is the lowest achievable training error if only the data points in that subset were considered. Equivalently, this is the reported error if all samples not in that m-subset are trimmed:

$$\varepsilon^{(k)}_m \equiv \frac{1}{Z_m} \Big[ \sum_{\substack{i \leq m \\ x_i[k] \leq \tau^{(k)}_m}} w_i \, \mathbf{1}_{\{y_i = +p^{(k)}_m\}} + \sum_{\substack{i \leq m \\ x_i[k] > \tau^{(k)}_m}} w_i \, \mathbf{1}_{\{y_i = -p^{(k)}_m\}} \Big]$$

[where $p^{(k)}_m$ and $\tau^{(k)}_m$ are the optimal preliminary parameters]

The emphasis on preliminary indicates that the subset does not contain all data points, i.e. m < N, and the emphasis on best indicates that no choice of polarity p or threshold τ can lead to a lower preliminary error using that feature.

As previously described, given a feature k, a stump is trained by accumulating sample weights into bins. We can view this as a progression: initially, only a few samples are binned, and as training continues, more and more samples are accounted for; hence, $\varepsilon^{(k)}_m$ may be computed incrementally, which is evident in the smoothness of the traces in Figure 2.
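To make the definition concrete, here is a small illustrative sketch (ours, not the authors' code) that computes $\varepsilon^{(k)}_m$ for a single feature on the heaviest m-subset. It assumes quantized feature values and samples already sorted by decreasing weight; in a real implementation the histograms would be updated incrementally as m grows rather than rebuilt from scratch.

```python
import numpy as np

def preliminary_error(xq_k, y, w, m, B=256):
    """Best preliminary error eps_m^(k) of one feature on the heaviest m-subset.

    xq_k : (N,) quantized values of feature k (bins 0..B-1),
           with samples sorted by decreasing weight;  m <= N
    Returns (eps_m, tau_m, p_m), the best preliminary error and parameters.
    """
    xs, ys, ws = xq_k[:m], y[:m], w[:m]           # only the m heaviest samples
    Zm = ws.sum()
    hist_pos = np.bincount(xs, weights=ws * (ys > 0), minlength=B)
    hist_neg = np.bincount(xs, weights=ws * (ys < 0), minlength=B)
    cum_pos, cum_neg = np.cumsum(hist_pos), np.cumsum(hist_neg)
    err_pos = cum_pos + (cum_neg[-1] - cum_neg)   # polarity p = +1
    err_neg = Zm - err_pos                        # polarity p = -1
    if err_pos.min() <= err_neg.min():
        tau = int(np.argmin(err_pos))
        return err_pos[tau] / Zm, tau, +1
    tau = int(np.argmin(err_neg))
    return err_neg[tau] / Zm, tau, -1
```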
Figure 2. Traces of the best preliminary errors over increasing Zm/Z. Each curve corresponds to a different feature. The dashed orange curve corresponds to a feature that turns out to be misleading (i.e. its final error drastically worsens when all the sample weights are accumulated). The thick green curve corresponds to the optimal feature; note that it is among the best performing features even when training with relatively few samples.

Figure 2 shows the best preliminary error for each of the features in a typical experiment. By examining Figure 2, we can make four observations:

1. Most of the overall weight is accounted for by the first few samples. (This can be seen by comparing the top and bottom axes in the figure.)
2. Feature errors increasingly stabilize as m → N.
3. The best performing features at smaller m-subsets (i.e. Zm/Z < 80%) can end up performing quite poorly once all of the weights are accumulated (dashed orange curve).
4. The optimal feature is among the best features for smaller m-subsets as well (solid green curve).

From the full training run shown in Figure 2, we note (in retrospect) that using an m-subset with m ≈ 0.2N would suffice in determining the optimal feature. But how can we know a priori which m is good enough?

Averaging over many training iterations, in Figure 3, we plot the probability that the optimal feature is among the top-performing features when trained on only an m-subset of the data. This gives us an idea as to how small m can be while still correctly predicting the optimal feature.

Figure 3. Probability that the optimal feature is among the K top-performing features. The red curve corresponds to K = 1, the purple to K = 10, and the blue to K = 1% of all features. Note the knee-point around Zm/Z ≈ 75%, at which the optimal feature is among the 10 preliminary-best features over 95% of the time.

From Figure 3, we see that the optimal feature is not amongst the top performing features until a large enough m-subset is used – in this case, Zm/Z ≈ 75%. Although "optimality" is not quite appropriate to use in the context of greedy stage-wise procedures (such as boosted trees), consistently choosing sub-optimal parameters at each stump empirically leads to substantially poorer performance, and should be avoided. In the following section, we outline our approach, which determines the optimal stump parameters using the smallest possible m-subset.
4. Pruning Underachieving Features

Figure 3 suggests that the optimal feature can often be estimated at a fraction of the computational cost using the following heuristic:

Faulty Stump Training
1. Train each feature only using samples in the m-subset where Zm/Z ≈ 75%.
2. Prune all but the 10 best performing features.
3. For each un-pruned feature, complete training on the entire data set.
4. Finally, report the best performing feature (and corresponding parameters).

This heuristic is not guaranteed to return the optimal feature, since premature pruning can occur in step 2. However, if we were somehow able to bound the error, we would be able to prune features that would provably underachieve (i.e. would no longer have any chance of being optimal in the end).

Definition: a feature k is denoted underachieving if it is guaranteed to perform worse than the best-so-far feature k′ on the entire training data.

Proposition 1: for a feature k, the following bound holds (proof given in Section 4.2): given two subsets (where one is larger than the other), the product of subset mass and preliminary error is always greater for the larger subset:

$$m \leq n \;\; \Rightarrow \;\; Z_m \, \varepsilon^{(k)}_m \leq Z_n \, \varepsilon^{(k)}_n \qquad (1)$$

Let us assume that the best-so-far error ε′ has been determined over a few of the features (and the parameters that led to this error have been stored). Hence, this is an upper bound for the error of the stump currently being trained. For the next feature k in the queue, suppose that even on a smaller (m < N)-subset we already find $Z_m \varepsilon^{(k)}_m \geq Z \varepsilon'$. Then, by Proposition 1:

$$Z_m \, \varepsilon^{(k)}_m \geq Z \varepsilon' \;\; \Rightarrow \;\; Z \, \varepsilon^{(k)} \geq Z \varepsilon' \;\; \Rightarrow \;\; \varepsilon^{(k)} \geq \varepsilon'$$

Therefore, if $Z_m \varepsilon^{(k)}_m \geq Z \varepsilon'$, then feature k is underachieving and can safely be pruned. Note that the lower the best-so-far error ε′, the harsher the bound; consequently, it is desirable to train a relatively low-error feature early on.

Accordingly, we propose a new method based on comparing feature performance on subsets of data, and consequently, pruning underachieving features:

Quick Stump Training
1. Train each feature only using data in a relatively small m-subset.
2. Sort the features based on their preliminary errors (from best to worst).
3. Continue training one feature at a time on progressively larger subsets, updating $\varepsilon^{(k)}_m$:
   • if it is underachieving, prune it immediately;
   • if it trains to completion, save it as best-so-far.
4. Finally, report the best performing feature (and corresponding parameters).
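The procedure above can be rendered compactly as follows (a simplified sketch under the assumptions of the earlier sketches, i.e. quantized features and samples sorted by decreasing weight; it is not the authors' implementation). It reuses the illustrative preliminary_error helper from Section 3.3 and takes an increasing schedule of subset sizes ms ending at N:

```python
import numpy as np
# Assumes preliminary_error(xq_k, y, w, m, B) from the sketch in Section 3.3,
# returning (eps_m, tau_m, p_m) for one feature on the heaviest m-subset.

def quick_train_stump(Xq, y, w, ms, B=256):
    """Quick Stump Training: prune features that provably cannot be optimal.

    Xq : (N, K) quantized features, with rows sorted by decreasing weight w
    ms : increasing schedule of subset sizes (at least two entries, ending at N)
    Returns (error, k, tau, p), identical to exhaustive training.
    """
    N, K = Xq.shape
    Z = w.sum()
    Zm = np.cumsum(w)                                    # Z_1, ..., Z_N
    m0 = ms[0]

    # Steps 1-2: preliminary errors on the smallest subset, best features first.
    prelim = [preliminary_error(Xq[:, k], y, w, m0, B)[0] for k in range(K)]
    order = np.argsort(prelim)

    best = (np.inf, None, None, None)                    # best-so-far (eps', k, tau, p)
    for k in order:
        pruned = False
        for m in ms[1:]:                                 # step 3: progressively larger subsets
            eps_m, tau, p = preliminary_error(Xq[:, k], y, w, m, B)
            # Pruning test from Section 4: if Z_m * eps_m >= Z * eps', then by
            # Proposition 1 the full error of feature k cannot beat eps'.
            if Zm[m - 1] * eps_m >= Z * best[0]:
                pruned = True
                break
        if not pruned and eps_m < best[0]:               # trained to completion (m = N)
            best = (eps_m, k, tau, p)
    return best
```

Because the pruning test uses the bound of Proposition 1 (Equation 1), the stump returned is identical to the one exhaustive training would produce; the speedup comes from the features that are discarded after touching only a fraction of the samples (and, in a real implementation, from updating the histograms incrementally between checkpoints rather than recomputing them).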
                                                           Figure 4. Computational cost of training boosted trees
4.1. Subset Scheduling

Deciding which schedule of m-subsets to use is a subtlety that requires further explanation. Although this choice does not affect the optimality of the trained stump, it may affect the speedup. If the first "relatively small" m-subset (as prescribed in step 1) is too small, we may lose out on low-error features, leading to less-harsh pruning. If it is too large, we may be doing unnecessary computation. Furthermore, since the calculation of preliminary error does incur some (albeit low) computational cost, it is impractical to use every m when training on progressively larger subsets.

To address this issue, we implement a simple schedule: the first m-subset is determined by the parameter ηkp such that Zm/Z ≈ ηkp, and M following subsets are equally spaced out between ηkp and 1. Figure 4 shows a parameter sweep over ηkp and M, from which we fix ηkp = 90% and M = 20 and use this setting for all of our experiments.

Figure 4. Computational cost of training boosted trees over a range of ηkp and M, averaged over several types of runs (with varying numbers and depths of trees). Red corresponds to higher cost, blue to lower cost. The lowest computational cost is achieved at ηkp = 90% and M = 20.
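A small sketch of this schedule (illustrative only; ηkp and M as defined above, weights assumed sorted in decreasing order):

```python
import numpy as np

def subset_schedule(w_sorted, eta_kp=0.90, M=20):
    """Checkpoint sizes: first m with Z_m/Z ~ eta_kp, then M more sizes up to N."""
    N = len(w_sorted)
    Zm = np.cumsum(w_sorted)
    m0 = min(int(np.searchsorted(Zm, eta_kp * Zm[-1])) + 1, N)
    ms = np.unique(np.linspace(m0, N, M + 1).round().astype(int))
    return ms.tolist()                           # increasing, ends at N
```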
Using our quick stump training procedure, we determine the optimal parameters without having to consider every sample for each feature. By pruning underachieving features, a lot of computation is saved. We now outline the full Boosting procedure using our quick training method:
Quickly Boosting Decision Trees
1. Initialize weights (sorted in decreasing order).
2. Train decision tree ht (one node at a time) using the Quick Stump Training method.
3. Perform standard Boosting steps:
   (a) determine the optimal αt (i.e. using line-search);
   (b) update sample weights given the misclassification error of ht and the variant of Boosting used;
   (c) if more Boosting iterations are needed, sort sample weights in decreasing order, increment the iteration number t, and go to step 2.
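For concreteness, here is a bare-bones rendering of this loop for an ensemble of single stumps with a discrete-AdaBoost weight update (a sketch built on the earlier illustrative helpers, not the authors' implementation; real trees are grown one node at a time, and other Boosting variants simply change steps 3(a) and 3(b)):

```python
import numpy as np
# Assumes Stump (Section 3.1), quick_train_stump (Section 4), and
# subset_schedule (Section 4.1) from the earlier illustrative sketches.

def quick_adaboost(Xq, y, T, B=256):
    """Boost T stumps, training each one with Quick Stump Training."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    order = np.argsort(-w, kind="stable")          # indices sorted by decreasing weight
    stumps, alphas = [], []
    for t in range(T):
        Xs, ys, ws = Xq[order], y[order], w[order] # step 2: train on sorted data
        eps, k, tau, p = quick_train_stump(Xs, ys, ws, subset_schedule(ws), B)
        eps = min(max(eps, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)      # step 3(a): closed form for AdaBoost
        h = Stump(k, tau, p)
        w *= np.exp(-alpha * y * h.predict(Xq))    # step 3(b): up-weight mistakes
        w /= w.sum()
        order = np.argsort(-w, kind="stable")      # step 3(c); see the O(N) merge note below
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas
```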
We note that sorting the weights in step 3(c) above is an O(N) operation. Given an initially sorted set, Boosting updates the sample weights based on whether the samples were correctly classified or not. All correctly classified samples are weighted down, but they maintain their respective ordering. Similarly, all misclassified samples are weighted up, also maintaining their respective ordering. Finally, these two sorted lists are merged in O(N).
We now give a proof for the bound that our method is based on, and in the following section, we demonstrate its effectiveness in practice.

4.2. Proof of Proposition 1

As previously defined, $\varepsilon^{(k)}_m$ is the preliminary weighted classification error computed using feature k on the samples in the m-subset; hence:

$$Z_m \, \varepsilon^{(k)}_m = \sum_{\substack{i \leq m \\ x_i[k] \leq \tau^{(k)}_m}} w_i \, \mathbf{1}_{\{y_i = +p^{(k)}_m\}} + \sum_{\substack{i \leq m \\ x_i[k] > \tau^{(k)}_m}} w_i \, \mathbf{1}_{\{y_i = -p^{(k)}_m\}}$$

Proposition 1: $\; m \leq n \;\Rightarrow\; Z_m \, \varepsilon^{(k)}_m \leq Z_n \, \varepsilon^{(k)}_n$

Proof: $\varepsilon^{(k)}_m$ is the best achievable preliminary error on the m-subset (and correspondingly, $\{p^{(k)}_m, \tau^{(k)}_m\}$ are the best preliminary parameters); therefore:

$$Z_m \, \varepsilon^{(k)}_m \leq \sum_{\substack{i \leq m \\ x_i[k] \leq \tau}} w_i \, \mathbf{1}_{\{y_i = +p\}} + \sum_{\substack{i \leq m \\ x_i[k] > \tau}} w_i \, \mathbf{1}_{\{y_i = -p\}} \qquad \forall \, p, \tau$$

Hence, switching the optimal parameters $\{p^{(k)}_m, \tau^{(k)}_m\}$ for potentially sub-optimal ones $\{p^{(k)}_n, \tau^{(k)}_n\}$ (note the subtle change in indices):

$$Z_m \, \varepsilon^{(k)}_m \leq \sum_{\substack{i \leq m \\ x_i[k] \leq \tau^{(k)}_n}} w_i \, \mathbf{1}_{\{y_i = +p^{(k)}_n\}} + \sum_{\substack{i \leq m \\ x_i[k] > \tau^{(k)}_n}} w_i \, \mathbf{1}_{\{y_i = -p^{(k)}_n\}}$$

Further, by summing over a larger subset (n ≥ m), the resulting sum can only increase:

$$Z_m \, \varepsilon^{(k)}_m \leq \sum_{\substack{i \leq n \\ x_i[k] \leq \tau^{(k)}_n}} w_i \, \mathbf{1}_{\{y_i = +p^{(k)}_n\}} + \sum_{\substack{i \leq n \\ x_i[k] > \tau^{(k)}_n}} w_i \, \mathbf{1}_{\{y_i = -p^{(k)}_n\}}$$

But the right-hand side is equivalent to $Z_n \, \varepsilon^{(k)}_n$; hence:

$$Z_m \, \varepsilon^{(k)}_m \leq Z_n \, \varepsilon^{(k)}_n \qquad \text{Q.E.D.}$$

For similar proofs using information gain, Gini impurity, or variance minimization as split criteria, please see the supplementary material.
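As a quick numerical illustration of the proposition (a self-contained check on random data, not part of the proof), the quantity $Z_m \varepsilon^{(k)}_m$ is indeed non-decreasing in m for a fixed feature:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 200, 16
x = rng.integers(0, B, N)                       # one quantized feature
y = rng.choice([-1, 1], N)
w = np.sort(rng.exponential(size=N))[::-1]      # weights sorted in decreasing order

def best_preliminary_cost(m):
    """min over (p, tau) of the weighted error on the m heaviest samples, i.e. Z_m * eps_m."""
    costs = []
    for tau in range(B):
        for p in (+1, -1):
            pred = p * np.where(x[:m] > tau, 1.0, -1.0)
            costs.append(np.sum(w[:m] * (pred != y[:m])))
    return min(costs)

vals = [best_preliminary_cost(m) for m in range(1, N + 1)]
assert all(a <= b + 1e-9 for a, b in zip(vals, vals[1:]))   # Z_m eps_m <= Z_n eps_n for m <= n
```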

5. Experiments

In the previous section, we proposed an efficient stump training algorithm and showed that it has a lower expected computational cost than the traditional method. In this section, we describe experiments that are designed to assess whether the method is practical and whether it delivers significant training speedup. We train and test on three real-world datasets and empirically compare the speedups.

5.1. Datasets

We trained AdaBoosted ensembles of shallow decision trees of various depths on the following three datasets:

1. CMU-MIT Faces dataset (Rowley et al., 1996); 8.5·10³ training and 4.0·10³ test samples. The 4.3·10³ features used are the result of convolutions with Haar-like wavelets (Viola & Jones, 2004); the classifier uses 2000 stumps as in (Viola & Jones, 2004).

2. INRIA Pedestrian dataset (Dalal & Triggs, 2005); 1.7·10⁴ training and 1.1·10⁴ test samples. The 5.1·10³ features used are Integral Channel Features (Dollár et al., 2009); the classifier has 4000 depth-2 trees as in (Dollár et al., 2009).

3. MNIST Digits (LeCun & Cortes, 1998); 6.0·10⁴ training and 1.0·10⁴ test samples. The 7.8·10² features used are grayscale pixel values; the ten-class classifier uses 1000 depth-4 trees based on (Kégl & Busa-Fekete, 2009).

5.2. Comparisons

Quick Boosting can be used in conjunction with all previously mentioned heuristics to provide further gains in training speed. We report all computational costs in units proportional to FLOPs, since running time (in seconds) is dependent on compiler optimizations which are beyond the scope of this work.
Figure 5. Computational cost versus training loss (top plots) and versus test error (bottom plots) for the various heuristics on three datasets: (a1,2) CMU-MIT Faces, (b1,2) INRIA Pedestrians, and (c1,2) MNIST Digits [see text for details]. Each panel compares AdaBoost, Weight-Trimming 99%, Weight-Trimming 90%, LazyBoost 90%, LazyBoost 50%, and StochasticBoost 50%. Dashed lines correspond to the "quick" versions of the heuristics (using our proposed method) and solid lines correspond to the original heuristics. Test error is defined as the area over the ROC curve.

In Figure 5, we plot the computational cost versus training loss and versus test error. We compare vanilla (no heuristics) AdaBoost, Weight-Trimming with η = 90% and 99% [see Section 3.2], LazyBoost 90% and 50% (only 90% or 50% of randomly selected features are used to train each weak learner), and StochasticBoost (only a 50% random subset of the samples is used to train each weak learner). To these six heuristics, we apply our method to produce six "quick" versions.

We further note that our goal in these experiments is not to tweak and enhance the performance of the classifiers, but to compare the performance of the heuristics with and without our proposed method.
[Figures 6 and 7: bar charts over the heuristics AdaBoost, WT:99%, WT:90%, Lazy:90%, Lazy:50%, and Stoch:50%.]

Figure 6. Relative (x-fold) reduction in test error (area over the ROC) due to our method, given a fixed computational cost. Triplets of bars of the same color correspond to the three datasets: CMU-MIT Faces, INRIA Pedestrians, and MNIST Digits.

Figure 7. Relative speedup in training time due to our proposed method to achieve the desired minimal error rate. Triplets of bars of the same color correspond to the three datasets: CMU-MIT Faces, INRIA Pedestrians, and MNIST Digits.

5.3. Results and Discussion

From Figure 5, we make several observations. "Quick" versions require less computational cost than (and produce classifiers identical to) their slow counterparts. From the training loss plots (5a1, 5b1, 5c1), we gauge the speed-up offered by our method: often around an order of magnitude. Quick-LazyBoost-50% and Quick-StochasticBoost-50% are the least computationally-intensive heuristics, and vanilla AdaBoost always achieves the smallest training loss and attains the lowest test error in two of the three datasets.

The motivation behind this work was to speed up training such that: (i) for a fixed computational budget, the best possible classifier could be trained, and (ii) given a desired performance, a classifier could be trained with the least computational cost.

For each dataset, we find the lowest-cost heuristic and set that computational cost as our budget. We then boost as many weak learners as our budget permits for each of the heuristics (with and without our method) and compare the test errors achieved, plotting the relative gains in Figure 6. For most of the heuristics, there is a two to eight-fold reduction in test error, whereas for weight-trimming, we see less of a benefit. In fact, for the second dataset, Weight-Trimming-90% runs at the same cost with and without our speedup.

Conversely, in Figure 7, we compare how much less computation is required to achieve the best test error rate by using our method for each heuristic. Most heuristics see an eight to sixteen-fold reduction in computational cost, whereas for weight-trimming, there is still a speedup, albeit again a much smaller one (between one and two-fold).

As discussed in Section 3.2, weight-trimming acts similarly to our proposed method in that it prunes features, although it does so naively, without adhering to a provable bound. This results in a speed-up (at times almost equivalent to our own), but also leads to classifiers that do not perform as well as those trained using the other heuristics.

6. Conclusions

We presented a principled approach for speeding up training of boosted decision trees. Our approach is built on a novel bound on classification or regression error, guaranteeing that gains in speed do not come at a loss in classification performance.

Experiments show that our method is able to reduce training cost by an order of magnitude or more, or, given a computational budget, is able to train classifiers that reduce errors on average by two-fold or more. Our ideas may be applied concurrently with other techniques for speeding up Boosting (e.g. subsampling of large datasets) and do not limit the generality of the method, enabling the use of Boosting in applications where fast training of classifiers is key.

Acknowledgements

This work was supported by NSERC 420456-2012, MURI-ONR N00014-10-1-0933, ARO/JPL-NASA Stennis NAS7.03001, and the Moore Foundation.
References

Asadi, N. and Lin, J. Training efficient tree-based models for document ranking. In European Conference on Information Retrieval (ECIR), 2013.

Birkbeck, N., Sofka, M., and Zhou, S. K. Fast boosting trees for classification, pose detection, and boundary detection on a GPU. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011.

Bradley, J. K. and Schapire, R. E. FilterBoost: Regression and classification on large datasets. In Neural Information Processing Systems (NIPS), 2007.

Breiman, L. Arcing classifiers. In The Annals of Statistics, 1998.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Chapman & Hall, New York, NY, 1984.

Bühlmann, P. and Hothorn, T. Boosting algorithms: Regularization, prediction and model fitting. In Statistical Science, 2007.

Burgos-Artizzu, X. P., Dollár, P., Lin, D., Anderson, D. J., and Perona, P. Social behavior recognition in continuous videos. In Computer Vision and Pattern Recognition (CVPR), 2012.

Coates, A., Baumstarck, P., Le, Q., and Ng, A. Y. Scalable learning for object detection with GPU hardware. In Intelligent Robots and Systems (IROS), 2009.

Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), 2005.

Dollár, P., Tu, Z., Tao, H., and Belongie, S. Feature mining for image classification. In Computer Vision and Pattern Recognition (CVPR), June 2007.

Dollár, P., Tu, Z., Perona, P., and Belongie, S. Integral channel features. In British Machine Vision Conference (BMVC), 2009.

Dollár, P., Appel, R., and Kienzle, W. Crosstalk cascades for frame-rate pedestrian detection. In European Conference on Computer Vision (ECCV), 2012.

Domingo, C. and Watanabe, O. Scaling up a boosting-based learner via adaptive sampling. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000.

Domingos, P. and Hulten, G. Mining high-speed data streams. In International Conference on Knowledge Discovery and Data Mining, 2000.

Dubout, C. and Fleuret, F. Boosting with maximum adaptive sampling. In Neural Information Processing Systems (NIPS), 2011.

Freund, Y. Boosting a weak learning algorithm by majority. In Information and Computation, 1995.

Freund, Y. and Schapire, R. E. Experiments with a new boosting algorithm. In Machine Learning International Workshop, 1996.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. In The Annals of Statistics, 2000.

Friedman, J. H. Greedy function approximation: A gradient boosting machine. In The Annals of Statistics, 2000.

Friedman, J. H. Stochastic gradient boosting. In Computational Statistics & Data Analysis, 2002.

Kégl, B. and Busa-Fekete, R. Accelerating AdaBoost using UCB. KDD-Cup 2009 competition, 2009.

Kotsiantis, S. B. Supervised machine learning: A review of classification techniques. Informatica, 31:249-268, 2007.

LeCun, Y. and Cortes, C. The MNIST database of handwritten digits, 1998.

Mnih, V., Szepesvári, C., and Audibert, J. Empirical Bernstein stopping. In International Conference on Machine Learning (ICML), 2008.

Paul, B., Athithan, G., and Murty, M. N. Speeding up AdaBoost classifier with random projection. In International Conference on Advances in Pattern Recognition (ICAPR), 2009.

Quinlan, R. J. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

Quinlan, R. J. Bagging, boosting, and C4.5. In National Conference on Artificial Intelligence, 1996.

Ridgeway, G. The state of boosting. In Computing Science and Statistics, 1999.

Rowley, H., Baluja, S., and Kanade, T. Neural network-based face detection. In Computer Vision and Pattern Recognition (CVPR), 1996.

Schapire, R. E. The strength of weak learnability. In Machine Learning, 1990.

Sharp, T. Implementing decision trees and forests on a GPU. In European Conference on Computer Vision (ECCV), 2008.

Svore, K. M. and Burges, C. J. Large-scale learning to rank using boosted decision trees. In Scaling Up Machine Learning: Parallel and Distributed Approaches, 2011.

Viola, P. A. and Jones, M. J. Robust real-time face detection. International Journal of Computer Vision (IJCV), 2004.

Wu, J., Brubaker, S. C., Mullin, M. D., and Rehg, J. M. Fast asymmetric learning for cascade face detection. Pattern Analysis and Machine Intelligence (PAMI), 2008.