Conditional Generators of Words Definitions

Artyom Gadetsky (National Research University Higher School of Economics) artygadetsky@yandex.ru
Ilya Yakubovskiy (Joom) yakubovskiy@joom.com
Dmitry Vetrov (National Research University Higher School of Economics; Samsung-HSE Laboratory) vetrovd@yandex.ru

arXiv:1806.10090v1 [cs.CL] 26 Jun 2018

Abstract

We explore the recently introduced definition modeling technique, which provides a tool for evaluating different distributed vector representations of words by modeling dictionary definitions of words. In this work, we study the problem of word ambiguity in definition modeling and propose a possible solution that employs latent variable modeling and soft attention mechanisms. Our quantitative and qualitative evaluation and analysis of the models show that taking word ambiguity and polysemy into account leads to performance improvements.

1   Introduction

Continuous representations of words are used in many natural language processing (NLP) applications. Using pre-trained high-quality word embeddings is most effective when only a limited number of training examples is available, which is the case for most tasks in NLP (Kumar et al., 2016; Karpathy and Fei-Fei, 2015). Recently, several unsupervised methods were introduced to learn word vectors from large corpora of texts (Mikolov et al., 2013; Pennington et al., 2014; Joulin et al., 2016). The learned vector representations have been shown to have useful and interesting properties. For example, Mikolov et al. (2013) showed that vector operations such as subtraction or addition reflect semantic relations between words. Despite all these properties, it is hard to evaluate embeddings precisely, because analogy and word similarity tasks measure the learned information only indirectly.

Quite recently, Noraset et al. (2017) introduced a more direct way to evaluate word embeddings. The authors suggested using definition modeling as the evaluation task. In definition modeling, vector representations of words are used for conditional generation of the corresponding word definitions. The primary motivation is that a high-quality word embedding should contain all the information needed to reconstruct the definition. An important drawback of the definition models of Noraset et al. (2017) is that they cannot take into account words with several different meanings. This issue is related to the word sense disambiguation task, a common problem in natural language processing: the meanings of polysemantic words such as "bank" or "spring" can only be disambiguated using their contexts. In such cases, the proposed models tend to generate definitions based on the most frequent meaning of the corresponding word. Therefore, building models that incorporate word sense disambiguation is an important research direction in natural language processing.

In this work, we study the problem of word ambiguity in the definition modeling task and propose several models as possible solutions. One of them is based on the recently proposed Adaptive Skip-gram model (Bartunov et al., 2016), a generalized version of the original Skip-gram Word2Vec that can distinguish word meanings using word context. The second is an attention-based model that uses the context of the word being defined to determine the components of its embedding that refer to the relevant word meaning. Our contributions are as follows: (1) we introduce two models based on recurrent neural network (RNN) language models; (2) we collect a new dataset of definitions, which contains more unique words than the dataset of Noraset et al. (2017) and supplements each entry with an example of word usage; and (3) in the experiments section we show that our models outperform previously proposed models and are able to generate definitions that depend on the meaning of the word.
2   Related Work

2.1   Constructing Embeddings Using Dictionary Definitions

Several works utilize word definitions to learn embeddings. For example, Hill et al. (2016) use definitions to construct sentence embeddings. The authors propose to train a recurrent neural network that produces an embedding of the dictionary definition close to the embedding of the corresponding word. The model is evaluated on the reverse dictionary task. Bahdanau et al. (2017) suggest using definitions to compute embeddings for out-of-vocabulary words. In comparison to the work of Hill et al. (2016), their dictionary reader network is trained end-to-end for a specific task.

2.2   Definition Modeling

Definition modeling was introduced by Noraset et al. (2017). The goal of the definition model p(D | w*) is to predict the probability of the words in the definition D = {w_1, ..., w_T} given the word being defined w*. The joint probability is decomposed into separate conditional probabilities, each of which is modeled by a recurrent neural network with a soft-max activation applied to its logits:

    p(D | w*) = ∏_{t=1}^{T} p(w_t | w_{<t}, w*)                                   (1)

3   Word Embeddings

Distributed word representations serve as inputs to many machine learning models; therefore, learning high-quality vector representations is an important task.

3.1   Skip-gram

One of the most popular and frequently used vector representations is the Skip-gram model. The original Skip-gram model consists of grouped word prediction tasks. Each task is formulated as the prediction of a word v given a word w, using their input and output representations:

    p(v | w, θ) = exp(in_w^T out_v) / Σ_{v'=1}^{V} exp(in_w^T out_{v'})           (2)

where θ and V stand for the set of input and output word representations and the dictionary size, respectively. These individual prediction tasks are grouped so as to independently predict all adjacent words y = {y_1, ..., y_C} (within some sliding window) given the central word x:

    p(y | x, θ) = ∏_j p(y_j | x, θ)                                               (3)

The joint probability of the model is written as follows:

    p(Y | X, θ) = ∏_{i=1}^{N} p(y_i | x_i, θ)                                     (4)

where (X, Y) = {x_i, y_i}_{i=1}^{N} are training pairs of central words and their contexts.
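As a concrete illustration of Eqs. (2)-(4), the following sketch computes the Skip-gram probability of a context word with a full softmax over toy embedding matrices. It is not taken from any particular implementation; the array and function names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
V, dim = 1000, 50                      # vocabulary size and embedding dimensionality
in_emb = rng.normal(size=(V, dim))     # input representations in_w
out_emb = rng.normal(size=(V, dim))    # output representations out_v

def skipgram_prob(w, v):
    """p(v | w, theta) from Eq. (2): softmax over inner products with in_w."""
    scores = out_emb @ in_emb[w]               # in_w . out_v' for every v' in the vocabulary
    scores -= scores.max()                     # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[v]

def window_log_prob(x, window):
    """log p(y | x, theta) from Eq. (3): context words are predicted independently."""
    return sum(np.log(skipgram_prob(x, y)) for y in window)

# Example: log-probability of a small context window around central word 42.
print(window_log_prob(42, [7, 99, 512]))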
3.2   Adaptive Skip-gram

In comparison with Skip-gram, AdaGram assumes several meanings for each word and therefore keeps several vector representations for each word. It introduces a latent variable z that encodes the index of the meaning and extends (2) to p(v | z, w, θ). A hierarchical soft-max approach is used rather than negative sampling to overcome the cost of computing the denominator:

    p(v | z = k, w, θ) = ∏_{n ∈ path(v)} σ(ch(n) in_{wk}^T out_n)                 (5)

Here in_{wk} stands for the input representation of word w with meaning index k, and the output representations are associated with the nodes of a binary tree whose leaves are all possible words in the model vocabulary, each with a unique path from the root to the corresponding leaf. ch(n) is a function that returns 1 or -1 for each node in path(·), depending on whether n is a left or a right child of the previous node in the path. A Huffman tree is often used for computational efficiency.
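The path product in Eq. (5) is straightforward to state in code. The sketch below is a minimal illustration with random vectors and an arbitrary path; it is not the AdaGram implementation, and the construction of the tree itself is omitted.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_prob(in_wk, path):
    """Sketch of Eq. (5): p(v | z=k, w, theta) as a product of sigmoids along a tree path.

    in_wk: input vector of word w under meaning index k.
    path:  list of (out_n, ch) pairs for the inner nodes on the root-to-leaf path of v,
           where ch is +1 for a left child and -1 for a right child.
    """
    return np.prod([sigmoid(ch * (in_wk @ out_n)) for out_n, ch in path])

# Toy example: a 50-dimensional sense vector and a path of depth 3.
rng = np.random.default_rng(0)
in_wk = rng.normal(size=50)
path = [(rng.normal(size=50), ch) for ch in (+1, -1, +1)]
print(hierarchical_prob(in_wk, path))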
To automatically determine the number of meanings for each word, the authors use the constructive definition of the Dirichlet process via its stick-breaking representation, p(z = k | w, β), which is a commonly used prior distribution over discrete latent variables when the number of possible values is unknown (e.g. infinite mixtures).

One important property of the model is its ability to disambiguate words using context. More formally, after training on data D = {x_i, y_i}_{i=1}^{N}, we may compute the posterior probability of a word meaning given the context and take the word vector with the highest probability:

    p(z = k | x, y, θ) ∝ p(y | x, z = k, θ) ∫ p(z = k | β, x) q(β) dβ             (8)

This knowledge about the word meaning will be further utilized in one of our models as disambiguation(x | y).
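A minimal sketch of this disambiguation step is given below, assuming per-sense context log-likelihoods and a sense prior have already been estimated. It collapses the integral over q(β) in Eq. (8) to a point estimate of the prior, which is a simplification of the actual AdaGram inference.

import numpy as np

def disambiguate(sense_log_likelihoods, sense_prior):
    """Pick the most probable meaning index of a word given its context.

    sense_log_likelihoods[k] approximates log p(y | x, z=k, theta) summed over the
    context words y; sense_prior[k] approximates the stick-breaking prior p(z=k | x)
    (the integral over q(beta) in Eq. (8) is replaced by a point estimate here).
    """
    log_post = np.asarray(sense_log_likelihoods) + np.log(np.asarray(sense_prior))
    log_post -= log_post.max()
    posterior = np.exp(log_post) / np.exp(log_post).sum()   # normalized p(z=k | x, y, theta)
    return posterior.argmax(), posterior

# Toy example with three candidate senses of a word.
k, post = disambiguate([-12.3, -9.1, -15.8], [0.5, 0.3, 0.2])
print(k, post)   # index of the most probable sense and the full posterior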
4   Models

In this section, we describe our extensions to the original definition model. The goal of the extended definition model is to predict the probability of a definition D = {w_1, ..., w_T} given a word being defined w* and its context C = {c_1, ..., c_m} (e.g. an example of the use of this word). As motivated earlier, the context provides the necessary information about the word meaning. The joint probability is again decomposed into conditional probabilities, each of which is provided with the information about the context:

    p(D | w*, C) = ∏_{t=1}^{T} p(w_t | w_{<t}, w*, C)                             (9)
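To make the factorization in Eq. (9) concrete, here is a hypothetical PyTorch sketch of a conditional definition language model in which the conditioning vector of the defined word (v*, or its masked version a* introduced below) is concatenated with every definition-token embedding. The class, argument names and architecture details are assumptions for illustration, not the authors' exact configuration.

import torch
import torch.nn as nn

class ConditionalDefinitionLM(nn.Module):
    """Sketch of p(D | w*, C) = prod_t p(w_t | w_{<t}, w*, C) with an LSTM."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The conditioning vector is concatenated with every definition-token embedding.
        self.lstm = nn.LSTM(2 * emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, def_tokens, cond_vec):
        # def_tokens: (batch, T) token ids of the definition prefix w_{<t}
        # cond_vec:   (batch, emb_dim) embedding of the defined word given its context
        tok = self.embed(def_tokens)                              # (batch, T, emb_dim)
        cond = cond_vec.unsqueeze(1).expand(-1, tok.size(1), -1)  # repeat for every step
        h, _ = self.lstm(torch.cat([tok, cond], dim=-1))
        return self.out(h)                                        # logits over the next definition word

# Usage: next-word logits for one definition prefix of length 4.
model = ConditionalDefinitionLM(vocab_size=30000)
logits = model(torch.randint(0, 30000, (1, 4)), torch.randn(1, 300))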
4.2   Input Attention

This may result in similar vector representations with smoothed meanings, due to the theoretical guarantees on the number of learned components. To overcome this problem, and to get rid of careful tuning of this hyper-parameter, we introduce the following model:

    h_t = g([a*; v_t], h_{t-1})
    a* = v* ⊙ mask                                                                (11)
    mask = σ(W (1/m) Σ_{i=1}^{m} ANN(c_i) + b)

where ⊙ is an element-wise product, σ is the logistic sigmoid function, and ANN is the attention neural network, a feed-forward neural network. We motivate these updates by the fact that, after a Skip-gram model is learned on a large corpus, the vector representation of each word absorbs information about every meaning of the word. Using a soft binary mask that depends on the word context, we extract the components of the word embedding relevant to the corresponding meaning. We refer to this model as Input Attention (I-Attention).
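A possible PyTorch reading of Eq. (11) follows. The internal structure of ANN (a single tanh layer here) and all names are assumptions made for illustration only.

import torch
import torch.nn as nn

class InputAttention(nn.Module):
    """Sketch of Eq. (11): a* = v* ⊙ σ(W · mean_i ANN(c_i) + b)."""

    def __init__(self, emb_dim=300, hidden_dim=256):
        super().__init__()
        self.ann = nn.Sequential(                    # feed-forward attention network ANN
            nn.Linear(emb_dim, hidden_dim),
            nn.Tanh(),
        )
        self.proj = nn.Linear(hidden_dim, emb_dim)   # the weights W and bias b

    def forward(self, word_vec, context_vecs):
        # word_vec:     (batch, emb_dim) Skip-gram vector v* of the defined word
        # context_vecs: (batch, m, emb_dim) embeddings of the context words c_1..c_m
        pooled = self.ann(context_vecs).mean(dim=1)  # average of ANN(c_i) over the context
        mask = torch.sigmoid(self.proj(pooled))      # soft binary mask in (0, 1)^emb_dim
        return word_vec * mask                       # a*: the components relevant to this meaning

# Usage: masked embedding for one word with a 5-word context.
att = InputAttention()
a_star = att(torch.randn(1, 300), torch.randn(1, 5, 300))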
4.3   Attention SkipGram

For the attention-based model, we use different embeddings for the context words. Because of that, we pre-train the attention block, which contains the embeddings, the attention neural network and the linear layer weights, by optimizing a negative sampling loss function in the same manner as the original Skip-gram model:

    log σ(v'_{w_O}^T v_{w_I}) + Σ_{i=1}^{k} E_{w_i ~ P_n(w)} [log σ(-v'_{w_i}^T v_{w_I})]        (12)

where v'_{w_O}, v_{w_I} and v'_{w_i} are the vector representations of the "positive" example, the anchor word and a negative example, respectively. The vector v_{w_I} is computed using the embedding of w_I and the attention mechanism proposed in the previous section.
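The objective in Eq. (12) can be sketched as follows, written as a loss to be minimized; anchor_vec stands for the attention-weighted vector v_{w_I}, and the negative samples are assumed to be drawn elsewhere from the noise distribution P_n(w). This is an illustration, not the authors' training code.

import torch
import torch.nn.functional as F

def negative_sampling_loss(anchor_vec, pos_vec, neg_vecs):
    """Negated Eq. (12), averaged over the batch, so it can be minimized.

    anchor_vec: (batch, dim) attention-weighted vector v_{w_I} of the central word
    pos_vec:    (batch, dim) output vector v'_{w_O} of the observed context word
    neg_vecs:   (batch, k, dim) output vectors v'_{w_i} of k words drawn from P_n(w)
    """
    pos_score = F.logsigmoid((pos_vec * anchor_vec).sum(-1))                   # log σ(v'_O · v_I)
    neg_score = F.logsigmoid(-(neg_vecs * anchor_vec.unsqueeze(1)).sum(-1))    # log σ(-v'_i · v_I)
    return -(pos_score + neg_score.sum(-1)).mean()

# Usage with toy vectors and k = 5 negative samples per anchor word.
loss = negative_sampling_loss(torch.randn(8, 300), torch.randn(8, 300), torch.randn(8, 5, 300))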
5   Experiments

5.1   Data

We collected a new dataset of definitions using the OxfordDictionaries.com (2018) API. Each entry is a triplet containing a word, its definition and an example of the use of this word in the given meaning. It is important to note that in our dataset words can have one or more meanings, depending on the corresponding entries in the Oxford Dictionary. Table 1 shows basic statistics of the new dataset.

    Split           train        val       test
    #Words         33,128      8,867      8,850
    #Entries       97,855     12,232     12,232
    #Tokens     1,078,828    134,486    133,987
    Avg length      11.03      10.99      10.95

    Table 1: Statistics of the new dataset.

5.2   Pre-training

It is well known that a good language model can often improve metrics such as BLEU for a particular NLP task (Jozefowicz et al., 2016). Accordingly, we decided to pre-train our models. For this purpose, the WikiText-103 dataset (Merity et al., 2016) was chosen. During pre-training we set v* (eq. 10) to the zero vector to make our models purely unconditional. The embeddings of these language models were initialized with Google Word2Vec vectors (https://code.google.com/archive/p/word2vec/) and fine-tuned. Figure 1 shows that this procedure helps to decrease perplexity and prevents over-fitting. The Attention Skip-gram vectors were also trained on WikiText-103.

    Figure 1: Perplexities of the S+I Attention model for the case of pre-training (solid lines) and for the case when the model is trained from scratch (dashed lines).
5.3   Results

Both of our models are LSTM networks (Hochreiter and Schmidhuber, 1997) with an embedding layer. The attention-based model has its own embedding layer, mapping context words to vector representations. First, we pre-train our models using the procedure described above. Then, we train them on the collected dataset, maximizing the log-likelihood objective using Adam (Kingma and Ba, 2014). We also anneal the learning rate by a factor of 10 if the validation loss does not decrease between epochs. We use the original Adaptive Skip-gram vectors, obtained from the official repository (https://github.com/sbos/AdaGram.jl), as inputs to S+I-Adaptive. We compare the different models using perplexity and BLEU score on the test set. The BLEU score is computed only for the models with the lowest perplexity and only on the test words that have multiple meanings. The results are presented in Table 3. We see that both models that utilize knowledge about the meaning of the word perform better than the competing one. We generated definitions using the S+I-Attention model with a simple temperature sampling algorithm (τ = 0.1). Table 2 shows the examples. The source code and dataset will be freely available at https://github.com/agadetsky/pytorch-definitions.

    Model                     PPL         BLEU
    S+G+CH+HE (1)           45.62     11.62 ± 0.05
    S+G+CH+HE (2)           46.12          -
    S+G+CH+HE (3)           46.80          -
    S + I-Adaptive (2)      46.08     11.53 ± 0.03
    S + I-Adaptive (3)      46.93          -
    S + I-Attention (2)     43.54     12.08 ± 0.02
    S + I-Attention (3)     44.9           -

    Table 3: Performance comparison between the best model proposed by Noraset et al. (2017) and our models on the test set. The number in brackets is the number of LSTM layers. BLEU is averaged across 3 trials.

    Word        Context                                               Definition
    star        she got star treatment                                a person who is very important
    star        bright star in the sky                                a small circle of a celestial object or planet that is seen in a circle
    sentence    sentence in prison                                    an act of restraining someone or something
    sentence    write up the sentence                                 a piece of text written to be printed
    head        the head of a man                                     the upper part of a human body
    head        he will be the head of the office                     the chief part of an organization, institution, etc
    reprint     they never reprinted the famous treatise              a written or printed version of a book or other publication
    rape        the woman was raped on her way home at night          the act of killing
    invisible   he pushed the string through an inconspicuous hole    not able to be seen
    shake       my faith has been shaken                              cause to be unable to think clearly
    nickname    the nickname for the u.s. constitution is 'old ironsides'    a name for a person or thing that is not genuine

    Table 2: Examples of definitions generated by the S + I-Attention model for words and contexts from the test set.
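The definitions in Table 2 were produced with simple temperature sampling at τ = 0.1. For reference, here is a minimal sketch of such a sampler over next-word logits; the function name and interface are ours, not part of the released code.

import numpy as np

def sample_with_temperature(logits, tau=0.1, seed=None):
    """Sample a token id from next-word logits sharpened by temperature tau.

    A low tau (e.g. 0.1) makes the distribution close to greedy argmax decoding,
    while a higher tau makes sampling more diverse.
    """
    rng = np.random.default_rng(seed)
    scaled = np.asarray(logits, dtype=np.float64) / tau
    scaled -= scaled.max()                          # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

# Usage: pick the next definition word from toy logits over a 5-word vocabulary.
print(sample_with_temperature([2.0, 1.5, 0.3, -1.0, 0.0], tau=0.1))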
6   Conclusion

In this paper, we proposed two definition models that can work with polysemantic words. We evaluate them using perplexity and measure the definition generation accuracy with the BLEU score. The obtained results show that incorporating information about word senses leads to improved metrics. Moreover, the generated definitions show that even an implicit word context can help to distinguish word meanings. In future work, we plan to explore the individual components of word embeddings and the mask produced by our attention-based model, to get a deeper understanding of the vector representations of words.

Acknowledgments

This work was partly supported by Samsung Research, Samsung Electronics, Sberbank AI Lab and the Russian Science Foundation grant 17-71-20072.
References

Dzmitry Bahdanau, Tom Bosc, Stanislaw Jastrzebski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. 2017. Learning to compute word embeddings on the fly. arXiv preprint arXiv:1706.00286.

Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. 2016. Breaking sticks and ambiguities with adaptive skip-gram. In Artificial Intelligence and Statistics, pages 130–138.

Felix Hill, KyungHyun Cho, Anna Korhonen, and Yoshua Bengio. 2016. Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Linguistics, 4:17–30.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1378–1387. PMLR.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Andriy Mnih and Geoffrey E. Hinton. 2009. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems 21, pages 1081–1088. Curran Associates, Inc.

Thanapon Noraset, Chen Liang, Larry Birnbaum, and Doug Downey. 2017. Definition modeling: Learning to define word embeddings in natural language. In 31st AAAI Conference on Artificial Intelligence, AAAI 2017, pages 3259–3266. AAAI Press.

OxfordDictionaries.com. 2018. Oxford University Press.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.