Three New Graphical Models for Statistical Language Modelling


Andriy Mnih (amnih@cs.toronto.edu)
Geoffrey Hinton (hinton@cs.toronto.edu)
Department of Computer Science, University of Toronto, Canada

Abstract

The supremacy of n-gram models in statistical language modelling has recently been challenged by parametric models that use distributed representations to counteract the difficulties caused by data sparsity. We propose three new probabilistic language models that define the distribution of the next word in a sequence given several preceding words by using distributed representations of those words. We show how real-valued distributed representations for words can be learned at the same time as learning a large set of stochastic binary hidden features that are used to predict the distributed representation of the next word from previous distributed representations. Adding connections from the previous states of the binary hidden features improves performance, as does adding direct connections between the real-valued distributed representations. One of our models significantly outperforms the very best n-gram models.

1. Introduction

One of the main tasks of statistical language modelling is learning probability distributions of word sequences. This problem is usually reduced to learning the conditional distribution of the next word given a fixed number of preceding words, a task at which n-gram models have been very successful (Chen & Goodman, 1996). Density estimation for discrete distributions is inherently difficult because there is no simple way to do smoothing based on input similarity. Since all discrete values are equally similar (or dissimilar), assigning similar probabilities to similar inputs, which is typically done for continuous inputs, does not work in the discrete case. Representing discrete structures such as words using continuous-valued distributed representations and then assigning probability to these structures based on their representations automatically introduces smoothing into the density estimation problem, making the data sparsity problem less severe. Recently, significant progress in statistical language modelling has been made by using models that rely on such distributed representations. Feed-forward neural networks that operate on real-valued vector representations of words have been both the most popular and most successful models of this type (Bengio et al., 2003; Morin & Bengio, 2005; Emami et al., 2003). A number of techniques have been proposed to address the main drawback of these models – their long training times (Bengio & Senécal, 2003; Schwenk & Gauvain, 2005; Morin & Bengio, 2005). Hierarchical alternatives to feed-forward networks, which are faster to train and use, have been considered (Morin & Bengio, 2005; Blitzer et al., 2005b), but they do not perform quite as well. Recently, a stochastic model with hidden variables has been proposed for language modelling (Blitzer et al., 2005a). Unlike the previous models, it uses distributed representations that consist of stochastic binary variables as opposed to real numbers. Unfortunately, this model does not scale well to large vocabulary sizes due to the difficulty of inference in the model.

In this paper, we describe three new probabilistic language models that use distributed word representations to define the distribution of the next word in a sequence given several preceding words. We start with an undirected graphical model that uses a large number of hidden binary variables to capture the desired conditional distribution. Then we augment it with temporal connections between hidden units to increase the number of preceding words taken into account without significantly increasing the number of model parameters. Finally, we investigate a model that predicts the distributed representation for the next word using a linear function of the distributed representations of the preceding words, without using any additional latent variables.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

2. The Factored Restricted Boltzmann Machine Language Model

Our goal is to design a probabilistic model for word sequences that uses distributed representations for words and captures the dependencies between words in a sequence using stochastic hidden variables. The main choice to be made here is between directed and undirected interactions between the hidden variables and the visible variables representing words. Blitzer et al. (2005a) have proposed a model with directed interactions for this task. Unfortunately, training their model required exact inference, which is exponential in the number of hidden variables. As a result, only a very small number of hidden variables can be used, which greatly limits the expressive power of the model.

In order to be able to handle a large number of hidden variables, we use a Restricted Boltzmann Machine (RBM) that has undirected interactions between multinomial visible units and binary hidden units. While maximum likelihood learning in RBMs is intractable, RBMs can be trained efficiently using contrastive divergence learning (Hinton, 2002), and the learning rule is unaffected when binary units are replaced by multinomial ones.

Assuming that the words we are dealing with come from a finite dictionary of size N_w, we model the observed words as multinomial random variables that take on N_w values.

To define an RBM model for the probability distribution of the next word in a sentence given the word's context w_{1:n-1}, which is the previous n - 1 words w_1, ..., w_{n-1}, we must first specify the energy function for a joint configuration of the visible and hidden units. The simplest choice is probably

    E_0(w_n, h; w_{1:n-1}) = -\sum_{i=1}^{n} v_i^T G_i h,    (1)

where v_i is a binary column vector of length N_w with a 1 in the w_i-th position and zeros everywhere else, and h is a column vector of length N_h containing the configuration of the hidden variables. Here the matrix G_i specifies the interaction between the multinomial[1] visible unit v_i and the binary hidden units. For simplicity, we have ignored the bias terms – the ones that depend only on h or only on v_n. Unfortunately, this parameterization requires n N_w N_h free parameters, which can be unacceptably large for vocabularies of even moderate size. Another weakness of this model is that each word is associated with a different set of parameters for each of the n positions it can occupy.

[1] Technically, we do not have to assume any distribution for w_1, ..., w_{n-1} since we always condition on these random variables. As a result, the energy function can depend on them in an arbitrary, nonlinear way.

Both of these drawbacks can be addressed by introducing distributed representations (i.e. feature vectors) for words. Generalization is made easier by sharing feature vectors across all sequence positions, and by defining all of the interactions involving a word via its feature vector. This type of parameterization has been used in feed-forward neural networks for modelling symbolic relations (Hinton, 1986) and for statistical language modelling (Bengio et al., 2003).

We represent each word using a real-valued feature vector of length N_f and make the energy depend on the word only through its feature vector. Let R be an N_w × N_f matrix with row i being the feature vector for the ith word in the dictionary. Then the feature vector for word w_i is given by v_i^T R. Using this notation, we define the joint energy of a sequence of words w_1, ..., w_n along with the configuration of the hidden units h as

    E(w_n, h; w_{1:n-1}) = -\left( \sum_{i=1}^{n} v_i^T R W_i \right) h - b_h^T h - b_r^T R^T v_n - b_v^T v_n.    (2)

Here the matrix W_i specifies the interaction between the vector of hidden variables and the feature vector for the visible variable v_i. The vector b_h contains biases for the hidden units, while the vectors b_v and b_r contain biases for words and word features respectively.[2] To simplify the notation, we do not explicitly show the dependence of the energy functions and probability distributions on model parameters. In other words, we write P(w_n | w_{1:n-1}) instead of P(w_n | w_{1:n-1}, W_i, R, ...).

[2] The inclusion of per-word biases contradicts our philosophy of using only the feature vectors to define the energy function. However, since the number of these bias parameters is small compared to the total number of parameters in the model, generalization is not negatively affected. In fact, the inclusion of per-word biases does not seem to affect generalization but does speed up the early stages of learning.

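To make the parameterization concrete, the following is a minimal NumPy sketch of the joint energy in Eq. 2. It is an editorial illustration rather than code from the paper: the function and argument names are ours, and the array shapes simply follow the definitions above.

    import numpy as np

    def frbm_energy(word_ids, h, R, W, b_h, b_r, b_v):
        """Joint energy E(w_n, h; w_1:n-1) of Eq. 2 for one n-word window.

        word_ids : list of n word indices w_1, ..., w_n
        h        : (N_h,) binary hidden configuration
        R        : (N_w, N_f) feature matrix; row i is the feature vector of word i
        W        : list of n interaction matrices W_i, each of shape (N_f, N_h)
        b_h, b_r, b_v : hidden, word-feature and per-word biases
        """
        w_n = word_ids[-1]
        # sum_i v_i^T R W_i : each word's feature vector mapped into the hidden space
        hidden_input = sum(R[w] @ W[i] for i, w in enumerate(word_ids))
        return -(hidden_input @ h + b_h @ h + b_r @ R[w_n] + b_v[w_n])
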
Defining these interactions on the N_f-dimensional feature vectors instead of directly on the N_w-dimensional visible variables leads to a much more compact parameterization of the model, since typically N_f is much smaller than N_w. Using the same feature matrix R for all visible variables forces it to capture position-invariant information about words, as well as further reducing the number of model parameters. With 18,000 words, 1000 hidden units and a context of size 2 (n = 3), the use of 100-dimensional feature vectors reduces the number of parameters by a factor of 25, from 54 million to a mere 2.1 million. As can be seen from Eqs. 1 and 2, the feature-based parameterization constrains each of the visible-hidden interaction matrices G_i to be a product of two low-rank matrices R and W_i, while the original parameterization does not constrain G_i in any way.

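The factor-of-25 reduction quoted above can be checked with a few lines of arithmetic (the numbers are taken directly from the text; the variable names are ours):

    N_w, N_h, N_f, n = 18_000, 1_000, 100, 3

    unfactored = n * N_w * N_h            # one unconstrained G_i per position: 54,000,000
    factored = N_w * N_f + n * N_f * N_h  # shared R plus one W_i per position:  2,100,000

    print(unfactored, factored, round(unfactored / factored, 1))  # 54000000 2100000 25.7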

Figure 1. a) The diagram for the Factored RBM and the Temporal Factored RBM. The dashed part is included only for the TFRBM. b) The diagram for the log-bilinear model.

The joint conditional distribution of the next word and the hidden configuration h is defined in terms of the energy function in Eq. 2 as

    P(w_n, h | w_{1:n-1}) = \frac{1}{Z_c} \exp(-E(w_n, h; w_{1:n-1})),    (3)

where Z_c = \sum_{w_n} \sum_h \exp(-E(w_n, h; w_{1:n-1})) is a context-dependent normalization term. The conditional distribution of the next word given its context, which is the distribution we are ultimately interested in, can be obtained from the joint by marginalizing over the hidden variables:

    P(w_n | w_{1:n-1}) = \frac{1}{Z_c} \sum_h \exp(-E(w_n, h; w_{1:n-1})).    (4)

Thus, we obtain a conditional model, which does not try to model the distribution of w_{1:n-1} since we always condition on those variables.

2.1. Making Predictions

One attractive property of RBMs is that the probability of a configuration of visible units can be computed up to a multiplicative constant in time linear in the number of hidden units (Hinton, 2002). The normalizing constant, however, is usually infeasible to compute because it is a sum of an exponential number of terms (in the number of visible units). In the proposed language model, though, this computation is easy because normalization is performed only over v_n, resulting in a sum containing only N_w terms. Moreover, in some applications, such as speech recognition, we are interested in ratios of probabilities of words from a short list, and as a result we do not have to compute the normalizing constant at all (Bengio et al., 2003).

The unnormalized probability of the next word can be efficiently computed using the formula

    P(w_n | w_{1:n-1}) \propto \exp(b_r^T R^T v_n + b_v^T v_n) \prod_i (1 + \exp(T_i)),    (5)

where T_i is the total input to hidden unit i when w_1, ..., w_n is the input to the model.

Since the normalizing constant for this distribution can be computed in time linear in the dictionary size, exact inference in this model has time complexity linear in the number of hidden variables and the dictionary size. This compares favourably with the complexity of exact inference in the latent variable model proposed in (Blitzer et al., 2005a), which is exponential in the number of hidden variables.

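As an illustration of how cheap exact prediction is, the sketch below computes the full normalized distribution of Eq. 4 using the product form of Eq. 5, obtaining Z_c as a single sum over the N_w candidate words. It is a sketch under the notation above, not code from the paper, and the names are illustrative.

    import numpy as np

    def next_word_distribution(context_ids, R, W, b_h, b_r, b_v):
        """P(w_n | w_1:n-1) for every word in the dictionary (Eqs. 4-5)."""
        # Hidden input contributed by the n-1 context words, plus the hidden biases
        context_input = b_h + sum(R[w] @ W[i] for i, w in enumerate(context_ids))
        # Hidden input contributed by each candidate next word through W_n
        candidate_input = R @ W[-1]                      # shape (N_w, N_h)
        T = context_input + candidate_input              # total inputs T_i, one row per word
        # Log of the unnormalized probability in Eq. 5, kept in the log domain
        log_p = R @ b_r + b_v + np.logaddexp(0.0, T).sum(axis=1)
        log_p -= log_p.max()                             # numerical stability only
        p = np.exp(log_p)
        return p / p.sum()                               # divide by Z_c (a sum of N_w terms)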

2.2. Learning

The model can be trained on a dataset D of word sequences using maximum likelihood learning. The log-likelihood of the dataset (assuming IID sequences) simplifies to

    L(D) = \sum \log P(w_n | w_{1:n-1}),    (6)

where P is defined by the model and the sum is over all word subsequences w_1, ..., w_n of length n in the dataset. L(D) can be maximized w.r.t. the model parameters using gradient ascent. The contributions made by a subsequence w_1, ..., w_n from D to the derivatives of L(D) are given by

    \frac{\partial}{\partial R} \log P(w_n | w_{1:n-1}) = \left\langle \sum_{i=1}^{n} v_i h^T W_i^T + v_n b_r^T \right\rangle_D - \left\langle \sum_{i=1}^{n} v_i h^T W_i^T + v_n b_r^T \right\rangle_M,    (7)

    \frac{\partial}{\partial W_i} \log P(w_n | w_{1:n-1}) = \left\langle R^T v_i h^T \right\rangle_D - \left\langle R^T v_i h^T \right\rangle_M,    (8)
where \langle \cdot \rangle_D and \langle \cdot \rangle_M denote expectations w.r.t. the distributions P(h | w_{1:n}) and P(v_n, h | w_{1:n-1}) respectively. The derivative of the log-likelihood of the dataset w.r.t. each parameter is then simply the sum of the contributions by all n-word subsequences in the dataset.

Computing these derivatives exactly can be computationally expensive because computing an expectation w.r.t. P(v_n, h | w_{1:n-1}) takes O(N_w N_h) time for each context. One alternative is to approximate the expectation using a Monte Carlo method, by generating samples from P(v_n, h | w_{1:n-1}) and averaging the expression we are interested in over them. Unfortunately, generating an exact sample from P(v_n, h | w_{1:n-1}) is just as expensive as computing the original expectation. Instead of using exact sampling, we can generate samples from the distribution using a Markov Chain Monte Carlo method such as Gibbs sampling, which involves starting v_n and h in some random configuration and alternating between sampling v_n and h from their respective conditional distributions, given by

    P(w_n | h, w_{1:n-1}) \propto \exp\left( (h^T W_n^T + b_r^T) R^T v_n + b_v^T v_n \right),    (9)

    P(h | w_{1:n}) \propto \exp\left( \left( \sum_{i=1}^{n} v_i^T R W_i + b_h^T \right) h \right).    (10)

However, a large number of alternating updates might have to be performed to obtain a single sample from the joint distribution.

Fortunately, there is an approximate learning procedure called Contrastive Divergence (CD) learning which is much more efficient. It is obtained by making two changes to the MCMC-based learning method. First, instead of starting v_n in a random configuration, we initialize v_n with the state corresponding to w_n. Second, instead of running the Markov chain to convergence, we perform three alternating updates (first h, then v_n, and then h again). While the resulting configuration (v_n, h) (called a "confabulation" or "reconstruction") is not a sample from P(v_n, h | w_{1:n-1}), it has been shown empirically that learning still works well when confabulations are used instead of samples from P(v_n, h | w_{1:n-1}) in the learning rules given above (Hinton, 2002).

In this paper, we train all our models that have hidden variables using CD learning. In some cases, we use a version of CD that, instead of sampling v_n from P(w_n | h, w_{1:n-1}) to obtain a binary vector with a single 1 in the w_n-th position, sets v_n to the vector of probabilities given by P(w_n | h, w_{1:n-1}). This can be viewed as using mean-field updates for the visible units and stochastic updates for the hidden units, which is common practice when training RBMs (Hinton, 2002). Using these mean-field updates instead of stochastic ones reduces the noise in the parameter derivatives, allowing larger learning rates to be used.

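The following sketch shows what one such CD update might look like for a single n-word window, using the mean-field visible update described above. It is again an editorial illustration: the names are ours, the update for R itself (Eq. 7) is omitted for brevity, and mini-batching and learning-rate details are ignored.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def cd_step(word_ids, R, W, b_h, b_r, b_v, lr=1e-3):
        """One contrastive-divergence update for a single window w_1, ..., w_n."""
        context, w_n = word_ids[:-1], word_ids[-1]
        v_n = np.zeros(R.shape[0])
        v_n[w_n] = 1.0

        def hidden_probs(vn_vec):
            # Eq. 10: inputs from the context words plus the (possibly soft) next word
            total = b_h + sum(R[w] @ W[i] for i, w in enumerate(context))
            return sigmoid(total + (vn_vec @ R) @ W[-1])

        # Positive phase: hidden probabilities given the observed next word
        h_data = hidden_probs(v_n)
        h_sample = (rng.random(h_data.shape) < h_data).astype(float)

        # Reconstruction: mean-field visible update (Eq. 9), then hidden update again
        v_recon = softmax(R @ (W[-1] @ h_sample + b_r) + b_v)
        h_recon = hidden_probs(v_recon)

        # Data statistics minus reconstruction statistics (Eq. 8 and the bias terms)
        for i, w in enumerate(context):
            W[i] += lr * np.outer(R[w], h_data - h_recon)
        W[-1] += lr * (np.outer(R[w_n], h_data) - np.outer(v_recon @ R, h_recon))
        b_h += lr * (h_data - h_recon)
        b_v += lr * (v_n - v_recon)
        b_r += lr * (R[w_n] - v_recon @ R)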

3. The Temporal Factored RBM

The language model proposed above, like virtually all statistical language models, is based on the assumption that, given a word's context, which is the n - 1 words that immediately precede it, the word is conditionally independent of all other preceding words. This assumption, which is clearly false, is made in order to keep the number of model parameters relatively small. In n-grams, for example, the number of parameters is exponential in the context size, which makes n-grams unsuitable for handling large contexts. While the dependence of the number of model parameters on context size is usually linear for models that use distributed representations for words, for larger context sizes this number might still be very large.

Ideally, a language model should be able to take advantage of indefinitely large contexts without needing a very large number of parameters. We propose a simple extension to the factored RBM language model to achieve that goal (at least in theory), following Sutskever and Hinton (2007). Suppose we want to predict word w_{t+n} from w_1, ..., w_{t+n-1} for some large t. We can apply a separate instance of our model (with the same parameters) to words w_τ, ..., w_{τ+n-1} for each τ in {1, ..., t}, obtaining a distributed representation of the τ-th n-tuple of words in the hidden state h^τ of the τ-th instance of the model.

In order to propagate context information forward through the sequence towards the word we want to predict, we introduce directed connections from h^τ to h^{τ+1} and compute the hidden state of model τ+1 using the inputs from the hidden state of model τ as well as its visible units. By introducing these dependencies between the hidden states of successive instances, and specifying them using a shared parameter matrix A, we make the distribution of w_{t+n} under the model depend on all previous words in the sequence.

3.1. Making Predictions

Exact inference in the resulting model is intractable – it takes time exponential in the number of hidden variables (N_h) in the model being instantiated. However, since predicting the next word given its (near) infinite context is an online problem, we take the filtering approach to making this prediction, which requires storing only the last n - 1 words in the sequence. In contrast, exact inference requires that the whole context be stored, which might be infeasible or undesirable.

Unfortunately, exact filtering is also intractable in this model. To get around this problem, we treat the hidden state h^τ as fixed at p^τ when inferring the distribution P(h^{τ+1} | w_{1:τ+n}), where p^τ_j = P(h^τ_j = 1 | w_{1:τ+n-1}). Then, given a sequence of words w_1, ..., w_{t+n-1}, we infer the posterior over the hidden states of the model instances using the following recursive procedure. The posterior for the first model instance is given by Eq. 10. Given the (factorial) posterior for model instance τ, the posterior for model instance τ+1 is computed as

    P(h^{\tau+1} | w_{1:\tau+n}) \propto \exp\left( \left( \sum_{i=1}^{n} v_i^T R W_i + (b_h + A p^\tau)^T \right) h^{\tau+1} \right).    (11)

Thus, computing the posterior for model instance τ+1 amounts to adding A p^τ to that model's vector of biases for the hidden units and performing inference in the resulting model using Eq. 10.

Finally, the predictive distribution over w_n is obtained by applying the procedure described in Section 2.1 to model instance t after shifting its biases appropriately.

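A sketch of this filtering pass is given below (illustrative only; the indexing conventions and names are ours). It computes p^τ for successive n-word windows by adding A p^{τ-1} to the hidden biases, as in Eq. 11, and returns the final vector, which can then be used to shift the biases of the last instance before applying the prediction procedure of Section 2.1.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def filter_hidden_states(word_ids, n, R, W, b_h, A):
        """Approximate filtering of Section 3.1: returns the last mean hidden state p."""
        p = None
        for tau in range(len(word_ids) - n + 1):
            window = word_ids[tau:tau + n]
            total = b_h + sum(R[w] @ W[i] for i, w in enumerate(window))
            if p is not None:
                total = total + A @ p   # temporal contribution A p^{tau-1} from Eq. 11
            p = sigmoid(total)          # factorial posterior for this model instance
        return p
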
3.2. Learning

Maximum likelihood learning in the temporal FRBM model is intractable because it requires performing exact inference. Since we would like to be able to train models with large numbers of hidden variables, we have to resort to an approximate algorithm.

Instead of performing exact inference, we simply apply the filtering algorithm from the previous section to compute the approximate posterior for the model. For each model instance the algorithm produces a vector which, when added to that model's vector of hidden unit biases, makes the model posterior be the posterior produced by the filtering operation. Then we compute the parameter updates for each model instance separately using the usual CD learning rule and average them over all instances.

The temporal parameters are learned using the following rule, applied to each training sequence separately:

    \Delta A \propto \sum_{\tau=1}^{t-1} (p^{\tau+1} - \hat{p}^{\tau+1}) (p^\tau)^T.    (12)

Here \hat{p}^{\tau+1}_i is the probability of hidden unit i being on in the confabulation produced by model instance τ+1. See (Sutskever & Hinton, 2007) for a more detailed description of learning in temporal RBMs.

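Read as a matrix update (so that ΔA has the same N_h × N_h shape as A), Eq. 12 accumulates one outer product per time step. A minimal sketch under that reading, with the filtered probabilities p and the confabulation probabilities p_hat supplied by the CD procedure above (names are ours):

    import numpy as np

    def temporal_update(p_seq, p_hat_seq, lr=1e-3):
        """Accumulate the learning signal for the temporal matrix A (Eq. 12).

        p_seq     : filtered hidden probabilities p^1, ..., p^t
        p_hat_seq : the corresponding probabilities from each instance's confabulation
        """
        N_h = p_seq[0].size
        dA = np.zeros((N_h, N_h))
        for tau in range(len(p_seq) - 1):
            dA += np.outer(p_seq[tau + 1] - p_hat_seq[tau + 1], p_seq[tau])
        return lr * dA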

4. A Log-Bilinear Language Model

An alternative to using stochastic binary variables for modelling the conditional distribution of the next word given several previous words is to directly parameterize the distribution and thus avoid introducing stochastic hidden variables altogether. As before, we start by specifying the energy function:

    E(w_n; w_{1:n-1}) = -\left( \sum_{i=1}^{n-1} v_i^T R C_i \right) R^T v_n - b_r^T R^T v_n - b_v^T v_n.    (13)

Here C_i specifies the interaction between the feature vector of w_i and the feature vector of w_n, while b_r and b_v specify the word biases as in Eq. 2. Just like the energy function for the factored RBM, this energy function defines a bilinear interaction. However, in the FRBM energy function the interaction is between the word feature vectors and the hidden variables, whereas in this model the interaction is between the feature vectors for the context words and the feature vector for the predicted word. Intuitively, the model predicts a feature vector for the next word by computing a linear function of the context word feature vectors. Then it assigns probabilities to all words in the vocabulary based on the similarity of their feature vectors to the predicted feature vector, as measured by the dot product. The resulting predictive distribution is given by P(w_n | w_{1:n-1}) = \frac{1}{Z_c} \exp(-E(w_n; w_{1:n-1})), where Z_c = \sum_{w_n} \exp(-E(w_n; w_{1:n-1})).

This model is similar to the energy-based model proposed in (Bengio et al., 2003). However, our model uses a bilinear energy function, while their energy function is a one-hidden-layer neural network.

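A sketch of the resulting predictive distribution is given below (illustrative names, not code from the paper): the predicted feature vector is the linear function of the context feature vectors described above, and Z_c is a single sum over the vocabulary.

    import numpy as np

    def log_bilinear_distribution(context_ids, R, C, b_r, b_v):
        """P(w_n | w_1:n-1) under the log-bilinear model of Eq. 13.

        context_ids : n-1 word indices
        R           : (N_w, N_f) feature matrix
        C           : list of n-1 matrices C_i, each of shape (N_f, N_f)
        """
        # Predicted feature vector for the next word
        r_hat = sum(R[w] @ C[i] for i, w in enumerate(context_ids))
        # Score every word by the dot product with its feature vector, plus biases
        scores = R @ (r_hat + b_r) + b_v
        scores -= scores.max()          # numerical stability only
        p = np.exp(scores)
        return p / p.sum()
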
4.1. Learning

Training the above model is considerably simpler and faster than training the FRBM models because no stochastic hidden variables are involved. The gradients required for maximum likelihood learning are given by

    \frac{\partial}{\partial C_i} \log P(w_n | w_{1:n-1}) = \left\langle R^T v_i v_n^T R \right\rangle_D - \left\langle R^T v_i v_n^T R \right\rangle_M,    (14)

    \frac{\partial}{\partial R} \log P(w_n | w_{1:n-1}) = \left\langle \sum_{i=1}^{n-1} (v_n v_i^T R C_i + v_i v_n^T R C_i^T) + v_n b_r^T \right\rangle_D - \left\langle \sum_{i=1}^{n-1} (v_n v_i^T R C_i + v_i v_n^T R C_i^T) + v_n b_r^T \right\rangle_M.    (15)

Note that averaging over the model distribution in this case is the same as averaging over the predictive distribution over v_n. As a result, these gradients are faster to compute than the gradients for a FRBM.

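Because the model expectation in Eq. 14 is just an average over the predictive distribution, the exact gradient for each C_i can be computed directly; a minimal sketch (names are illustrative, and the analogous update for R from Eq. 15 is omitted):

    import numpy as np

    def update_context_matrices(context_ids, target_id, R, C, b_r, b_v, lr=1e-3):
        """One exact gradient-ascent step for the matrices C_i (Eq. 14)."""
        # Predictive distribution, as in the previous sketch
        r_hat = sum(R[w] @ C[i] for i, w in enumerate(context_ids))
        scores = R @ (r_hat + b_r) + b_v
        scores -= scores.max()
        p = np.exp(scores)
        p /= p.sum()

        expected_features = p @ R        # E_M[ R^T v_n ], shape (N_f,)
        for i, w in enumerate(context_ids):
            # <R^T v_i v_n^T R>_D  -  <R^T v_i v_n^T R>_M
            C[i] += lr * (np.outer(R[w], R[target_id]) - np.outer(R[w], expected_features))
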
5. Experimental Results

We evaluated our models using the Associated Press News (APNews) dataset consisting of a text stream of about 16 million words. The dataset has been preprocessed by replacing proper nouns and rare words with the special "proper noun" and "unknown word" symbols respectively, while keeping all the punctuation marks, resulting in 17964 unique words. A more detailed description of the preprocessing can be found in (Bengio et al., 2003).

We performed our experiments in two stages. First, we compared the performance of our models to that of n-gram models on a smaller subset of the dataset to determine which type of our models showed the most promise. Then we performed a more thorough comparison between the models of that type and the n-gram models using the full dataset.

In the first experiment, we used a 10 million word training set, a 0.5 million word validation set, and a 0.5 million word test set. We trained one non-temporal and one temporal FRBM, as well as two log-bilinear models. All of our models used 100-dimensional feature vectors, and both FRBM models had 1000 stochastic hidden units.

The models were trained using mini-batches of 1000 examples each. For the non-temporal models with n = 3, each training case consisted of a two-word context and a vector of probabilities specifying the distribution of the next word for this context. These probabilities were precomputed on the training set and stored in a sparse array. During training of the non-temporal FRBM, v_n was initialized with the precomputed probability vector instead of the usual binary vector with a single 1 indicating the single "correct" next word for this instance of the context. The use of probability vectors as inputs, along with mean-field updates for the visible unit as described in Section 2.2, allowed us to use relatively high learning rates. Since precomputing and storing the probability vectors for all 5-word contexts turned out to be non-trivial, the model with a context size of 5 was trained directly on 6-word sequences.

All the parameters of the non-temporal models except for biases were initialized to small random values. Per-word bias parameters b_v were initialized based on word frequencies in the training set. All other bias parameters were initialized to zero.

The temporal FRBM model was initialized by copying the parameters from the trained FRBM and initializing the temporal parameters (A) to zero. During training, stochastic updates were used for the visible units, and v_n was always initialized to a binary vector representing the actual next word in the sequence (as opposed to a distribution over words).

For all models we used weight decay of 10^-4 for word representations and 10^-5 for all other weights. No weight decay was applied to network biases. Weight decay values as well as other learning parameters were chosen using the validation set. Each model was trained until its performance on a subset of the validation set stopped improving. We did not observe overfitting in any of our models, which suggests that using models with more parameters might lead to improved performance.

We compared our models to n-gram models estimated using Good-Turing and modified Kneser-Ney discounting. Training and testing of the n-gram models was performed using programs from the SRI Language Modelling toolkit (Stolcke, 2002). To make the comparison fair, the n-gram models treated punctuation marks (including full stops) as if they were ordinary words, since that is how they were treated by the network models.

Models were compared based on their perplexity on the test set. Perplexity is a standard performance measure for probabilistic language models, which is computed as

    P = \exp\left( -\frac{1}{N} \sum_{w_{1:n}} \log P(w_n | w_{1:n-1}) \right),    (16)

where the sum is over all subsequences of length n in the dataset, N is the number of such subsequences, and P(w_n | w_{1:n-1}) is the probability under the model of the nth word in the subsequence given the previous n - 1 words. To make the comparison between models of different context size fair, given a test sequence of length L, we ignored the first C words and tested the models at predicting words C+1, ..., L in the sequence, where C was the largest context size among the models compared.

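Computed from per-position log probabilities, Eq. 16 is straightforward to evaluate; a small helper for reference (illustrative, not from the paper):

    import numpy as np

    def perplexity(log_probs):
        """exp of the negative mean log P(w_n | w_1:n-1) over all test positions (Eq. 16)."""
        log_probs = np.asarray(list(log_probs), dtype=float)
        return float(np.exp(-log_probs.mean()))
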
For each network model we also computed the perplexity for a mixture of that model with the best n-gram model (modified Kneser-Ney 6-gram). The predictive distribution for the mixture was obtained simply by averaging the predictive distributions produced by the network and the n-gram model (giving them equal weight). The resulting model perplexities are given in Table 1.

Table 1. Perplexity scores for the models trained on the 10M word training set. The mixture test score is the perplexity obtained by averaging the model's predictions with those of the Kneser-Ney 6-gram model. The first four models use 100-dimensional feature vectors. The FRBM models have 1000 stochastic hidden units. GTn and KNn refer to back-off n-grams with Good-Turing and modified Kneser-Ney discounting respectively.

    Model type       Context size   Model test score   Mixture test score
    FRBM                  2              169.4               110.6
    Temporal FRBM         2              127.3                95.6
    Log-bilinear          2              132.9               102.2
    Log-bilinear          5              124.7                96.5
    Back-off GT3          2              135.3
    Back-off KN3          2              124.3
    Back-off GT6          5              124.4
    Back-off KN6          5              116.2

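The mixture predictions reported in Table 1 are plain equal-weight averages of the two predictive distributions, for instance:

    import numpy as np

    def mixture_prediction(p_network, p_ngram):
        """Equal-weight average of two predictive distributions over the vocabulary."""
        return 0.5 * (np.asarray(p_network) + np.asarray(p_ngram))
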
The results show that three of the four network models we tested are competitive with n-gram models. Only the non-temporal FRBM is significantly outperformed by all n-gram models. Adding temporal connections to it, however, to obtain a temporal FRBM, improves the model dramatically, as indicated by a 33% drop in perplexity, suggesting that the temporal connections do increase the effective context size of the model. The log-bilinear models perform quite well: their scores are on par with Good-Turing n-grams with the same context size.

Averaging the predictions of any network model with the predictions of the best n-gram model produced better predictions than any single model, which suggests that the network and n-gram models complement each other in at least some cases. The best results were obtained by averaging with the temporal network model, resulting in a 21% reduction in perplexity over the best n-gram model.

Since the log-bilinear models performed well and, compared to the FRBMs, were fast to train, we used only log-bilinear models in the second experiment. Similarly, we chose to use n-grams with Kneser-Ney discounting, as they significantly outperformed the n-grams with Good-Turing discounting. For this experiment we used the full APNews dataset, which was split into a 14 million word training set, a 1 million word validation set, and a 1 million word test set.

We trained two log-bilinear models: one with a context of size 5, the other with a context of size 10. The training parameters were the same as in the first experiment with the exception of the learning rate, which was annealed to a lower value than before.[3] The results of the second experiment are summarized in Table 2. The perplexity scores clearly show that the log-bilinear models outperform the n-gram models. Even the smaller log-bilinear model, with a context of size 5, outperforms all the n-gram models. The table also shows that using n-grams with a larger context size does not necessarily lead to better results. In fact, the best performance on this dataset was achieved for a context size of 4. Log-bilinear models, on the other hand, do benefit from larger context sizes: increasing the context size from 5 to 10 decreases the model perplexity by 8%.

[3] This difference is one of the reasons for the log-bilinear model with a context of size 5 performing considerably better in the second experiment than in the first one. The increased training set size and the different test set are unlikely to be the only reasons, because the n-gram models actually performed slightly better in the first experiment.

Table 2. Perplexity scores for the models trained on the 14M word training set. The mixture test score is the perplexity obtained by averaging the model's predictions with those of the Kneser-Ney 5-gram model. The log-bilinear models use 100-dimensional feature vectors.

    Model type       Context size   Model test score   Mixture test score
    Log-bilinear          5              117.0                97.3
    Log-bilinear         10              107.8                92.1
    Back-off KN3          2              129.8
    Back-off KN5          4              123.2
    Back-off KN6          5              123.5
    Back-off KN9          8              124.6

In our second experiment, we used the same training, validation, and test sets as in (Bengio et al., 2003), which means that our results are directly comparable to theirs.[4] In (Bengio et al., 2003), a neural language model with a context size of 5 is trained, but its individual score on the test set is not reported. Instead, the score of 109 is reported for the mixture of a Kneser-Ney 5-gram model and the neural language model. That score is significantly worse than the score of 97.3 obtained by the corresponding mixture in our experiments. Moreover, our best single model has a score of 107.8, which means it performs at least as well as their mixture.

[4] Due to limitations of the SRILM toolkit in dealing with very long strings, instead of treating the dataset as a single string, we broke up each of the test, training, and validation sets into 1000 strings of equal length. This might explain why the n-gram scores we obtained are slightly different from the scores reported in (Bengio et al., 2003).

6. Discussion and Future Work

We have proposed three new probabilistic models for language, one of which achieves state-of-the-art performance on the APNews dataset. In two of our models the interaction between the feature vectors for the context words and the feature vector for the next word is controlled by binary hidden variables. In the third model the interaction is modelled directly, using a bilinear interaction between the feature vectors. This third model appears to be preferable to the other two models because it is considerably faster to train and make predictions with. We plan to combine these two interaction types in a single model, which, due to its greater representational power, should perform better than any of our current models.

Unlike other language models, with the exception of the energy-based model in (Bengio et al., 2003), our models use distributed representations not only for context words, but for the word being predicted as well. This means that our models should be able to generalize well even for datasets with very large vocabularies. We intend to test this hypothesis by comparing our models to n-gram models on the APNews dataset without removing rare words first.

In this paper we have used only a single layer of hidden units, but now that we have shown how to use factored matrices to take care of the very large number of free parameters, it would be easy to make use of the greedy, multilayer learning algorithm for RBMs that was described in (Hinton et al., 2006).

Acknowledgements

We thank Yoshua Bengio for providing us with the APNews dataset. This research was funded by grants from NSERC, CIAR and CFI.

References

Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.

Bengio, Y., & Senécal, J.-S. (2003). Quick training of probabilistic neural nets by importance sampling. AISTATS'03.

Blitzer, J., Globerson, A., & Pereira, F. (2005a). Distributed latent variable models of lexical co-occurrences. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics.

Blitzer, J., Weinberger, K., Saul, L., & Pereira, F. (2005b). Hierarchical distributed representations for statistical language modeling. Advances in Neural Information Processing Systems 18. Cambridge, MA: MIT Press.

Chen, S. F., & Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics (pp. 310–318). San Francisco: Morgan Kaufmann Publishers.

Emami, A., Xu, P., & Jelinek, F. (2003). Using a connectionist model in a syntactical based language model. Proceedings of ICASSP (pp. 372–375).

Hinton, G. E. (1986). Learning distributed representations of concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 1–12). Amherst, MA.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.

Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief networks. Neural Computation, 18, 1527–1554.

Morin, F., & Bengio, Y. (2005). Hierarchical probabilistic neural network language model. AISTATS'05 (pp. 246–252).

Schwenk, H., & Gauvain, J.-L. (2005). Training neural network language models on very large corpora. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (pp. 201–208). Vancouver, Canada.

Stolcke, A. (2002). SRILM – an extensible language modeling toolkit. Proceedings of the International Conference on Spoken Language Processing (pp. 901–904). Denver.

Sutskever, I., & Hinton, G. E. (2007). Learning multilevel distributed representations for high-dimensional sequences. AISTATS'07.