ICSEng '14 International Conference on Systems Engineering (DRAFT VERSION)

                       Accelerated learning for Restricted Boltzmann
                              Machine with momentum term

                         Szymon Zaręba, Adam Gonczarek, Jakub M. Tomczak, Jerzy Świątek

                               Institute of Computer Science, Wroclaw University of Technology
                                            Wyb. Wyspiańskiego 27, 50-370, Wrocław
                       Email:{szymon.zareba,jakub.tomczak,adam.gonczarek,jerzy.swiatek}@pwr.wroc.pl

                           Abstract. Restricted Boltzmann Machines are generative models which can be used as standalone feature extractors, or as a parameter initialization for deeper models. Typically, these models are trained using the Contrastive Divergence algorithm, an approximation of the stochastic gradient descent method. In this paper, we aim at speeding up the convergence of the learning procedure by applying the momentum method and the Nesterov's accelerated gradient technique. We evaluate these two techniques empirically on the MNIST image dataset.

                           Keywords: Deep learning, Contrastive Divergence, stochastic gradient
                           descent, Nesterov’s momentum

                   1     Introduction
Deep learning has recently become a field of interest due to its ability to automatically extract high-level features [1]. Deep architectures are powerful models that achieve high performance on difficult pattern recognition problems, such as image analysis [2], motion tracking [3], speech recognition [4], and other applications, e.g., collaborative filtering [5] and text analysis [6].
    Typically, the building blocks of deep models are Restricted Boltzmann Machines (RBM). RBM are generative models with hidden variables which aim at modelling a distribution of visible variables. In the case of classification problems, RBM are used as standalone feature extractors, or as a parameter initialization for deeper models. However, when RBM are trained in an unsupervised fashion, there is no guarantee that they provide discriminative features. To address this problem, information about the class label can be incorporated into the RBM, which leads to the Classification Restricted Boltzmann Machine (ClassRBM) [12].
    Although the representational power of deep models, including RBM, is very tempting, their broad applicability was limited until recently because of the difficulty of learning deep architectures. Deep learning became a topic of high interest thanks to the breakthrough learning algorithm called Contrastive Divergence [13]. Contrastive Divergence allows one to efficiently perform an approximate stochastic gradient descent learning procedure. Since Contrastive Divergence serves as an elemental learning procedure, different techniques can be applied in order to speed up the convergence or to obtain more reliable estimates. The most common approach is to modify the learning objective by adding an extra regularization term, e.g., an ℓ2 regularizer (weight decay) to keep the parameters below some threshold, or a sparsity regularizer to enforce sparse activation of hidden units [14,15,16]. Another regularization procedure is based on randomly dropping a subset of hidden units (or, alternatively, a subset of weight parameters) during one updating epoch. This leads to techniques such as dropout and its extensions [17,18,19,20]. A different approach is to apply optimization techniques, e.g., the momentum method [15] or the Nesterov's accelerated gradient (Nesterov's momentum) [21].
     In this paper, we aim at accelerating learning of RBM by applying the mo-
 mentum term and the Nesterov’s momentum. We formulate the following re-
 search questions:

Q1: Does the application of the momentum term or the Nesterov’s momentum
    speed up the convergence of learning?
Q2: Does the application of the momentum term or the Nesterov’s momentum
    increase the classification accuracy?
Q3: Does the Nesterov’s momentum perform better than the momentum term?

In order to answer these questions, we carry out experiments on the MNIST image dataset.

 2    Classification using Restricted Boltzmann Machine

 Let x ∈ {0, 1}D be the D-dimensional vector of visible variables, h ∈ {0, 1}M be
 the M -dimensional vector of hidden variables, and y be the vector coding the
 class label using 1-of-K coding scheme, i.e., yk = 1 iff observation x belongs to
 the class k, where k = 1, . . . , K. Joint dependency between observed and hidden
 variables, (x, y, h), is described by the following energy function:

              E(x, y, h|θ) = −bT x − cT h − dT y − hT Wx − hT Uy                   (1)

 where xi, yk and hj denote the binary state of observed unit i, the k-th entry in the vector coding the class label, and the binary state of hidden unit j, respectively. Further, bi, dk and cj denote the bias parameters associated with xi, yk and hj, respectively. Finally, Wji is the weight parameter modeling the relationship between xi and hj, whereas Ujk is the weight between hj and yk. For brevity, let θ = {b, c, d, W, U} denote the model parameters. ClassRBM defines the joint probability distribution over observed and hidden variables as follows:
                             p(x, y, h|θ) = (1/Z) exp(−E(x, y, h|θ)),                       (2)

 where Z is the partition function obtained by summing over all possible configurations of the observed and hidden variables, Z = Σx Σy Σh exp(−E(x, y, h|θ)).
The input, hidden, and label variables are conditionally independent given the other variables, and the conditional probabilities can be written as follows:

                       p(xi = 1|h, θ) = sigm(bi + hT W·i )                            (3)
                    p(hj = 1|x, y, θ) = sigm(cj + Wj· x + Ujl )                       (4)

where sigm(·) is the logistic sigmoid function, and Wj· , W·i denote the j-th row and the i-th column of the matrix W, respectively. Finally, l denotes the index of the active label, i.e., l = {k : yk = 1}.
    It turns out that calculation of the exact form of the conditional probability
distribution p(y|x, θ) is tractable:

           p(yl = 1|x, θ) = exp(dl ) Π_{j=1}^{M} (1 + exp(cj + Wj· x + Ujl ))
                            / Σ_{k=1}^{K} [ exp(dk ) Π_{j=1}^{M} (1 + exp(cj + Wj· x + Ujk )) ].       (5)

   Further in the paper, we refer to learning the joint distribution p(x, y|θ) of the ClassRBM as the generative ClassRBM, while learning the conditional distribution p(y|x, θ) is referred to as the discriminative ClassRBM.
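
For illustration, the tractable conditional (5) can be evaluated directly. The following minimal NumPy sketch is ours (not taken from the paper); it assumes W of shape M × D, U of shape M × K, and bias vectors c and d as defined above, and the function name predict_proba is our own.

import numpy as np

def predict_proba(x, W, U, c, d):
    """Compute p(y_k = 1 | x, theta) for all K classes, following Eq. (5)."""
    # Hidden pre-activations c_j + W_j. x + U_jk for every class k: shape (M, K)
    a = (c + W @ x)[:, None] + U
    # Work in the log domain: d_k + sum_j log(1 + exp(a_jk))
    log_num = d + np.logaddexp(0.0, a).sum(axis=0)
    # Normalize over classes (softmax); subtracting the max avoids overflow
    p = np.exp(log_num - log_num.max())
    return p / p.sum()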

3     Learning
Let D be the training set containing N observation-label pairs, D = {(xn , yn )}.
In generative learning we aim at minimizing the following negative log-likelihood
function:
                 LG (θ) = − Σ_{n=1}^{N} ln p(xn , yn |θ)
                        = − Σ_{n=1}^{N} ln p(yn |xn , θ) − Σ_{n=1}^{N} ln p(xn |θ)          (6)

    Therefore, the generative learning objective consists of two components, namely, the supervised learning objective Σ_{n=1}^{N} ln p(yn |xn , θ), where we fit the parameters to predict the label given the observation, and the unsupervised objective Σ_{n=1}^{N} ln p(xn |θ), which can be seen as a data-dependent regularizer.
    Hence, omitting the last component in (6) leads to the following objective:

                          LD (θ) = − Σ_{n=1}^{N} ln p(yn |xn , θ).                    (7)

Notice that we obtained the negative conditional log-likelihood which is the
natural objective in discriminative learning.
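
As a small illustration of (7), the discriminative objective can be evaluated with the hypothetical predict_proba helper sketched in Section 2 (our illustration, not part of the paper):

import numpy as np

def discriminative_nll(data, W, U, c, d):
    """Negative conditional log-likelihood L_D(theta), Eq. (7).

    data: iterable of (x, y) pairs with y encoded in the 1-of-K scheme.
    """
    loss = 0.0
    for x, y in data:
        p = predict_proba(x, W, U, c, d)    # p(y_k = 1 | x, theta), Eq. (5)
        l = int(np.argmax(y))               # index of the active label
        loss -= np.log(p[l])
    return loss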
4     Classical momentum term and Nesterov’s momentum

Application of the classical momentum term to modify the search direction is a simple method for increasing the rate of convergence [15]. In general, the idea is to modify the update step of a gradient-based method by adding a fraction of the previous update (the velocity):

                             v(new) = αv(old) − η∆(θ),                           (8)

where ∆(θ) is the step dependent on the current parameter values, v denotes the velocity, i.e., the accumulated change of the parameters, η is the learning rate, and α is the momentum parameter determining the influence of the previous velocity on the new one.
    Recently, it has been shown that a modified version of the momentum term, based on the Nesterov's accelerated gradient technique, gives even better results [21]. The Nesterov's accelerated gradient (henceforth called Nesterov's momentum) evaluates the gradient at the parameters shifted by the accumulated velocity, i.e., the velocity is applied before the gradient is computed. Such an approach helps to avoid instabilities and to correct an inappropriately chosen search direction more quickly. The Nesterov's momentum is calculated as follows:

                        v(new) = αv(old) − η∆(θ + αv(old) ).                     (9)
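
The only difference between (8) and (9) is the point at which the step ∆ is evaluated. A minimal sketch of both updates (our illustration; grad stands for any routine returning the step ∆(θ), e.g., the Contrastive Divergence step of Section 4.1):

import numpy as np

def update(theta, v, grad, eta, alpha, nesterov=False):
    """One update with the classical momentum (8) or the Nesterov's momentum (9).

    theta, v : NumPy arrays of the same shape (parameters and velocity).
    grad     : callable returning the step Delta(theta) for given parameters.
    """
    # Eq. (9) evaluates the step at the look-ahead point theta + alpha*v,
    # Eq. (8) at the current parameters theta.
    step = grad(theta + alpha * v) if nesterov else grad(theta)
    v_new = alpha * v - eta * step          # velocity update
    return theta + v_new, v_new             # theta := theta + v(new)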

4.1    Learning generative ClassRBM

In ClassRBM learning one optimizes the objective function (6) with respect to the parameters θ = {W, U, b, c, d}. However, gradient-based optimization methods cannot be applied directly because the exact gradient calculation is intractable. Fortunately, we can adopt the Contrastive Divergence algorithm, which approximates the exact gradient using sampling methods.
    In fact, the Contrastive Divergence algorithm aims at minimizing the difference between two Kullback-Leibler divergences [14]:

                              KL(Q||P ) − KL(Pτ ||P )                           (10)

where Q is the empirical probability distribution, Pτ is the probability distribution over the visible variables after τ steps of the Gibbs sampler, and P is the distribution defined by the model.
    The approximation error vanishes for τ → ∞, i.e., when Pτ becomes the stationary distribution and KL(Pτ ||P ) = 0 with probability 1. Therefore, it would be beneficial to choose a large value of τ; however, it has been noticed that choosing τ = 1 works well in practice [1]. The procedure of the Contrastive Divergence for the generative ClassRBM is presented in Algorithm 1.

 Algorithm 1: Contrastive Divergence algorithm for generative ClassRBM
 Input: data D, learning rate η, momentum parameter α
 Output: parameters θ
 for each example (x, y) do
    1. Generate samples using Gibbs sampling:
       Set (x(0), y(0)) := (x, y).
       Calculate probabilities ĥ(0), where ĥj(0) := p(hj = 1|x(0), y(0), θtemp).
       Generate sample h̄(0) from the probabilities ĥ(0).
       for t = 0 to τ − 1 do
          Calculate probabilities x̂(t+1), where x̂i(t+1) := p(xi = 1|h̄(t), θtemp).
          Generate sample x̄(t+1) from the probabilities x̂(t+1).
          Calculate probabilities ŷ(t+1), where ŷk(t+1) := p(yk = 1|x̄(t+1), θtemp).
          Generate sample ȳ(t+1) from the probabilities ŷ(t+1).
          Calculate probabilities ĥ(t+1), where ĥj(t+1) := p(hj = 1|x̂(t+1), ŷ(t+1), θtemp).
          Generate sample h̄(t+1) from the probabilities ĥ(t+1).
       end for
    2. Compute the step ∆(θtemp). In particular:
       ∆b := x(0) − x̄(τ)
       ∆c := ĥ(0) − ĥ(τ)
       ∆d := y(0) − ȳ(τ)
       ∆W := ĥ(0) x(0)T − ĥ(τ) x̄(τ)T
       ∆U := ĥ(0) y(0)T − ĥ(τ) ȳ(τ)T
    3. Update the momentum term v(new) := αv(old) − η∆(θtemp) and set v(old) := v(new).
    4. Update the parameters θ := θ + v(new), and set θtemp := θ in case of the classical momentum (8) or θtemp := θ + αv(old) in case of the Nesterov's momentum (9).
 end for
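
The loop body of Algorithm 1 with τ = 1 can be written compactly in NumPy. The sketch below follows our reading of the algorithm; the array shapes, the helpers sigm and sample, and the dictionary-based parameter handling are our own assumptions, not the authors' code.

import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def cd1_step(x, y, params, velocity, eta=0.005, alpha=0.5, nesterov=False, rng=None):
    """One CD-1 update of the generative ClassRBM (Algorithm 1) for a single example (x, y).

    params, velocity: dicts with 'W' (M, D), 'U' (M, K), 'b' (D,), 'c' (M,), 'd' (K,).
    """
    rng = rng or np.random.default_rng(0)
    # theta_temp: look-ahead point for Nesterov's momentum (9), otherwise theta itself (8)
    tmp = {k: params[k] + alpha * velocity[k] for k in params} if nesterov else params

    # Positive phase: hidden probabilities and a sample given the data, Eq. (4)
    h0 = sigm(tmp['c'] + tmp['W'] @ x + tmp['U'] @ y)
    hs = sample(h0, rng)

    # One Gibbs step (negative phase): reconstruct x, Eq. (3)
    x1p = sigm(tmp['b'] + tmp['W'].T @ hs)
    x1 = sample(x1p, rng)
    # Reconstruct the class from p(y | x_bar), Eq. (5)
    act = (tmp['c'] + tmp['W'] @ x1)[:, None] + tmp['U']
    logits = tmp['d'] + np.logaddexp(0.0, act).sum(axis=0)
    y1p = np.exp(logits - logits.max())
    y1p /= y1p.sum()
    y1 = np.zeros_like(y1p)
    y1[rng.choice(y1p.size, p=y1p)] = 1.0
    # Hidden probabilities given the reconstruction probabilities
    h1 = sigm(tmp['c'] + tmp['W'] @ x1p + tmp['U'] @ y1p)

    # Step Delta(theta_temp): positive phase minus negative phase, as in Algorithm 1
    delta = {'b': x - x1,
             'c': h0 - h1,
             'd': y - y1,
             'W': np.outer(h0, x) - np.outer(h1, x1),
             'U': np.outer(h0, y) - np.outer(h1, y1)}

    # Momentum update (8)/(9) followed by the parameter update
    for k in params:
        velocity[k] = alpha * velocity[k] - eta * delta[k]
        params[k] = params[k] + velocity[k]
    return params, velocity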

4.2   Learning discriminative ClassRBM
In discriminative learning we need to minimize the objective (7) with respect to the parameters θ = {W, U, b, c, d}. Unlike in the generative case, here the gradient can be calculated analytically. Hence, the parameters of the discriminative ClassRBM can be determined using the stochastic gradient descent algorithm. The learning procedure for the discriminative case is presented in Algorithm 2.

 Algorithm 2: Stochastic gradient algorithm for discriminative ClassRBM
 Input: data D, learning rate η, momentum parameter α
 Output: parameters θ
 for each example (x, y) do
    1. Compute the step ∆(θtemp). In particular:
       Set l := {k : yk = 1}.
       Calculate probabilities ŷ, where ŷk := p(yk = 1|x, θtemp).
       Calculate σ, where σjk := sigm(cj + Wj· x + Ujk), evaluated at θtemp.
       ∆c := Σ_{k=1}^{K} ŷk σ·k − σ·l
       ∆d := ŷ − y
       ∆W := Σ_{k=1}^{K} ŷk σ·k xT − σ·l xT
       ∆U := σ·l (ŷ − 1)T
    2. Update the momentum term v(new) := αv(old) − η∆(θtemp) and set v(old) := v(new).
    3. Update the parameters θ := θ + v(new), and set θtemp := θ in case of the classical momentum (8) or θtemp := θ + αv(old) in case of the Nesterov's momentum (9).
 end for
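
The same structure applies here; the sketch below transcribes the step of Algorithm 2 into NumPy (our illustration; it reuses the sigm helper and the parameter dictionaries introduced above and assumes numpy imported as np):

def discriminative_step(x, y, params, velocity, eta=0.05, alpha=0.9, nesterov=False):
    """One stochastic gradient update of the discriminative ClassRBM (Algorithm 2)."""
    tmp = {k: params[k] + alpha * velocity[k] for k in params} if nesterov else params

    # sigma_{jk} = sigm(c_j + W_j. x + U_jk), evaluated at theta_temp; shape (M, K)
    act = (tmp['c'] + tmp['W'] @ x)[:, None] + tmp['U']
    sigma = sigm(act)
    # y_hat_k = p(y_k = 1 | x, theta_temp), Eq. (5)
    logits = tmp['d'] + np.logaddexp(0.0, act).sum(axis=0)
    y_hat = np.exp(logits - logits.max())
    y_hat /= y_hat.sum()
    l = int(np.argmax(y))                       # index of the active label

    delta = {'b': np.zeros_like(params['b']),   # b does not enter p(y|x), so no step is listed
             'c': sigma @ y_hat - sigma[:, l],
             'd': y_hat - y,
             'W': np.outer(sigma @ y_hat - sigma[:, l], x),
             'U': np.outer(sigma[:, l], y_hat - 1.0)}   # as listed in Algorithm 2

    for k in params:
        velocity[k] = alpha * velocity[k] - eta * delta[k]
        params[k] = params[k] + velocity[k]
    return params, velocity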

5       Experiments

5.1      Details

Dataset. We evaluate the presented learning techniques on the MNIST image corpus¹. The MNIST dataset contains 50,000 training, 10,000 validation, and 10,000 test images (28 × 28 pixels) of ten hand-written digits (from 0 to 9).

Learning details. We performed learning using Contrastive Divergence with mini-batches of 10 examples. Both the generative and the discriminative ClassRBM were trained with and without the classical momentum and the Nesterov's momentum. The learning rate was set to η = 0.005 and η = 0.05 for the generative and the discriminative ClassRBM, respectively. The momentum parameter was set to α ∈ {0.5, 0.9}. The number of iterations over the training set was determined using early stopping according to the validation set classification error, with a look-ahead of 15 iterations. Each experiment was repeated five times.
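
For reference, the early stopping rule described above can be sketched as follows (our illustration; train_epoch and validation_error are hypothetical callbacks, not part of the paper):

def train_with_early_stopping(params, velocity, train_epoch, validation_error, lookahead=15):
    """Stop when the validation classification error has not improved for `lookahead` epochs."""
    best_err, best_params, since_best = float('inf'), None, 0
    while since_best < lookahead:
        params, velocity = train_epoch(params, velocity)      # one pass over the training set
        err = validation_error(params)
        if err < best_err:
            best_err, since_best = err, 0
            best_params = {k: v.copy() for k, v in params.items()}
        else:
            since_best += 1
    return best_params, best_err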

Evaluation methodology. We use the classification accuracy as the evaluation metric. We compare plain Contrastive Divergence (CD) and stochastic gradient descent (SGD) with learning augmented by the classical momentum term (CM) and the Nesterov's momentum (N). The classification accuracy (expressed in %) is calculated on the test set.

    ¹ http://yann.lecun.com/exdb/mnist/
5.2     Results and Discussion

The results (mean values and standard deviations over five repetitions) for the generative ClassRBM are presented in Table 1, while those for the discriminative ClassRBM are given in Table 2. The mean classification accuracy for the generative ClassRBM is presented in Figure 1 and for the discriminative ClassRBM in Figure 2.²
    We notice that the application of the momentum term and the Nesterov's momentum indeed increases the speed of convergence (see Figures 1 and 2). This phenomenon is especially noticeable in the case of the discriminative ClassRBM with α = 0.9 (Figure 2). The Nesterov's momentum converges slightly faster than the classical momentum term only during the first epochs (see Figure 2).
    On the other hand, for larger numbers of hidden units, i.e., for M larger than 400 in the case of the generative ClassRBM and M larger than 100 for the discriminative ClassRBM, the application of the momentum term and the Nesterov's momentum resulted in an increase of the classification accuracy and more stable outcomes (smaller standard deviations). However, the Nesterov's momentum is slightly more robust than the classical momentum in terms of the standard deviations (see Tables 1 and 2). This effect can be explained in the following way. The Nesterov's momentum first applies a partial update of the parameters and then computes the gradient of the objective with respect to the partially updated parameters. Such a procedure allows the velocity to be corrected more quickly if an undesirable increase in the objective occurs. Our result is another empirical confirmation of this phenomenon, previously reported in [21].

Table 1: Classification accuracy for generative ClassRBM with the Contrastive
Divergence (CD) and the classical momentum term (CM) and the Nesterov’s
momentum (N).

                        CD       CM α = 0.5    CM α = 0.9    N α = 0.5    N α = 0.9
Hidden units        mean std     mean std      mean std      mean std     mean std
      9             54.79 4.15   60.79 4.47    63.60 1.33    63.69 2.09   63.04 3.13
     25             81.21 0.33   80.32 1.26    82.21 1.01    80.51 0.48   81.76 0.60
    100             90.81 0.36   90.74 0.32    90.70 0.48    90.67 0.30   91.11 0.50
    400             94.02 0.29   94.34 0.27    94.51 0.35    94.54 0.20   94.61 0.23
    900             95.08 0.23   95.62 0.08    95.53 0.42    95.45 0.19   95.25 0.23

    ² In both cases the number of hidden units was equal to 900.
Table 2: Classification accuracy for discriminative ClassRBM with the stochas-
tic gradient descent (SGD) and the classical momentum term (CM) and the
Nesterov’s momentum (N).

                   SGD       CM α = 0.5   CM α = 0.9   N α = 0.5    N α = 0.9
Hidden units    mean std     mean std     mean std     mean std     mean std
      9         92.39 0.14   92.34 0.26   91.91 0.32   92.30 0.46   91.41 0.35
     25         95.74 0.32   95.49 0.36   95.05 0.14   95.51 0.29   95.15 0.09
    100         97.36 0.09   97.35 0.17   97.46 0.05   97.40 0.06   97.67 0.11
    400         97.63 0.07   97.67 0.02   97.96 0.05   97.77 0.10   97.86 0.06
    900         97.52 0.11   97.72 0.01   97.85 0.14   97.73 0.04   97.82 0.02

Fig. 1: Convergence of the generative ClassRBM with the Contrastive Divergence
(CD), and with the classical momentum term (CM) and Nesterov’s momentum
(N) with α = 0.5 and α = 0.9 for the classification accuracy measured on the
test set.
Fig. 2: Convergence of the discriminative ClassRBM with the stochastic gradient
descent (SGD), and with the classical momentum term (CM) and Nesterov’s
momentum (N) with α = 0.5 and α = 0.9 for the classification accuracy mea-
sured on the test set.

6   Conclusions
In this paper, we have presented accelerated learning of the Restricted Boltzmann Machine with the classical momentum term and the Nesterov's momentum. We have applied the outlined learning procedure to the generative and the discriminative Classification Restricted Boltzmann Machine. In order to evaluate the approach and verify the stated research questions, we have performed experiments on the MNIST image corpus for different numbers of hidden units. The obtained results show that the application of the momentum term and the Nesterov's momentum indeed accelerates the convergence of learning and increases the classification accuracy. However, our comparative analysis does not indicate the superiority of the Nesterov's momentum over the classical momentum term; the two techniques behave alike.

References
 1. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine
    Learning 2(1) (2009) 1–127
 2. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with
    neural networks. Science 313(5786) (2006) 504–507
 3. Taylor, G.W., Hinton, G.E., Roweis, S.T.: Modeling human motion using binary
    latent variables. In Schölkopf, B., Platt, J.C., Hoffman, T., eds.: NIPS, MIT Press
    (2006) 1345–1352
 4. Mohamed, A.R., Hinton, G.E.: Phone recognition using restricted Boltzmann machines. In: ICASSP, IEEE (2010) 4354–4357
5. Salakhutdinov, R., Mnih, A., Hinton, G.E.: Restricted Boltzmann machines for
    collaborative filtering. In Ghahramani, Z., ed.: ICML. Volume 227 of ACM Inter-
    national Conference Proceeding Series., ACM (2007) 791–798
 6. Salakhutdinov, R., Hinton, G.E.: Replicated softmax: an undirected topic model.
    In Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A., eds.:
    NIPS, Curran Associates, Inc. (2009) 1607–1614
 7. Neapolitan, R.E.: Probabilistic reasoning in expert systems - theory and algo-
    rithms. Wiley (1990)
 8. Pearl, J.: Probabilistic reasoning in intelligent systems - networks of plausible infer-
    ence. Morgan Kaufmann series in representation and reasoning. Morgan Kaufmann
    (1989)
 9. Hopfield, J.J.: Neural networks and physical systems with emergent collective
    computational abilities. Proceedings of the National Academy of Sciences of the
    United States of America 79(8) (1982) 2554–2558
10. Hopfield, J.J.: The effectiveness of neural computing. In: IFIP Congress. (1989)
    503–507
11. Ackley, D.H., Hinton, G.E., Sejnowski, T.J.: A learning algorithm for Boltzmann
    Machines. Cognitive Science 9(1) (1985) 147–169
12. Larochelle, H., Bengio, Y.: Classification using discriminative restricted Boltzmann
    machines. In Cohen, W.W., McCallum, A., Roweis, S.T., eds.: ICML. Volume 307
    of ACM International Conference Proceeding Series., ACM (2008) 536–543
13. Hinton, G.E.: Training products of experts by minimizing contrastive divergence.
    Neural Computation 14(8) (2002) 1771–1800
14. Fischer, A., Igel, C.: An introduction to Restricted Boltzmann Machines. In
    Álvarez, L., Mejail, M., Déniz, L.G., Jacobo, J.C., eds.: CIARP. Volume 7441 of
    Lecture Notes in Computer Science., Springer (2012) 14–36
15. Hinton, G.E.: A practical guide to training restricted Boltzmann machines. In:
    Neural Networks: Tricks of the Trade (2nd ed.). (2012) 599–619
16. Swersky, K., Chen, B., Marlin, B.M., de Freitas, N.: A tutorial on stochastic
    approximation algorithms for training restricted boltzmann machines and deep
    belief nets. In: ITA, IEEE (2010) 80–89
17. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Im-
    proving neural networks by preventing co-adaptation of feature detectors. CoRR
    abs/1207.0580 (2012)
18. Wager, S., Wang, S., Liang, P.: Dropout training as adaptive regularization. CoRR
    abs/1307.1493 (2013)
19. Wan, L., Zeiler, M.D., Zhang, S., LeCun, Y., Fergus, R.: Regularization of neural
    networks using dropconnect. In: ICML (3). (2013) 1058–1066
20. Wang, S., Manning, C.D.: Fast dropout training. In: ICML (2). (2013) 118–126
21. Sutskever, I., Martens, J., Dahl, G.E., Hinton, G.E.: On the importance of ini-
    tialization and momentum in deep learning. In: ICML (3). Volume 28 of JMLR
    Proceedings., JMLR.org (2013) 1139–1147