Still a Pain in the Neck: Evaluating Text Representations on Lexical Composition


Vered Shwartz and Ido Dagan
Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel
vered1986@gmail.com, dagan@cs.biu.ac.il

arXiv:1902.10618v2 [cs.CL] 19 May 2019

Abstract

Building meaningful phrase representations is challenging because phrase meanings are not simply the sum of their constituent meanings. Lexical composition can shift the meanings of the constituent words and introduce implicit information. We tested a broad range of textual representations for their capacity to address these issues. We found that, as expected, contextualized word representations perform better than static word embeddings, more so on detecting meaning shift than on recovering implicit information, in which their performance is still far from that of humans. Our evaluation suite, consisting of six tasks related to lexical composition effects, can serve future research aiming to improve representations.

1 Introduction

Modeling the meaning of phrases involves addressing semantic phenomena that pose non-trivial challenges for common text representations, which derive a phrase representation from those of its constituent words. One such phenomenon is meaning shift, which happens when the meaning of the phrase departs from the meanings of its constituent words. This is especially common among verb-particle constructions (carry on), idiomatic noun compounds (guilt trip) and other multi-word expressions (MWEs, lexical units that form a distinct concept), making them "a pain in the neck" for NLP applications (Sag et al., 2002).

A second phenomenon is common to both MWEs and free word combinations such as noun compounds and adjective-noun compositions. It occurs when the composition introduces an implicit meaning that often requires world knowledge to uncover: for example, that hot refers to the temperature of tea but to the manner of debate (Hartung, 2015), or that olive oil is made of olives while baby oil is made for babies (Shwartz and Waterson, 2018).

There has been a line of attempts to learn compositional phrase representations (e.g. Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Wieting et al., 2017; Poliak et al., 2017), but many of these are tailored to a specific type of phrase or to a fixed number of constituent words, and they all disregard the surrounding context. Recently, contextualized word representations have achieved dramatic performance improvements on a range of NLP tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). Such models serve as a function for computing word representations in a given context, making them potentially more capable of addressing meaning shift. These models were also shown to capture some world knowledge (e.g. Zellers et al., 2018), which may help with uncovering implicit information.

In this paper we test how well various text representations address these composition-related phenomena. Methodologically, we follow recent work that applied "black-box" testing to assess various capacities of distributed representations (e.g. Adi et al., 2017; Conneau et al., 2018). We construct an evaluation suite with six tasks related to the above two phenomena, as shown in Figure 1, and develop generic models that rely on pre-trained representations. We test six representations, including static word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) and contextualized word embeddings (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). Our contributions are as follows:

1. We created a unified framework that tests the capacity of representations to address lexical composition via classification tasks, focusing on detecting meaning shift and recovering implicit meaning. We test six representations and provide an in-depth analysis of the results.
2. We relied on existing annotated datasets used in various tasks, and recast them to fit our classification framework. We additionally annotated a sample from each test set to confirm the data validity and estimate the human performance on each task.

3. We provide the classification framework, including data and code, at https://github.com/vered1986/lexcomp, which allows testing future models for their capacity to address lexical composition.

Figure 1: A map of our tasks according to type of phrase (MWE/free word combination) and the phenomenon they test (meaning shift/implicit meaning).

Our results confirm that the contextualized word embeddings perform better than the static ones. In particular, we show that modeling context indeed contributes to recognizing meaning shift: on such tasks, contextualized models performed on par with humans.

Conversely, despite hopes of filling in missing information with the world knowledge captured by the contextualized representations, the signal they yield for recovering implicit information is much weaker, and the gap between the best performing model and the human performance on such tasks remains substantial. We expect that improving the ability of such representations to reveal implicit meaning would require more than a language model training objective. In particular, one future direction is a richer training objective that simultaneously models multiple co-occurrences of the constituent words across different texts, as is commonly done in noun compound interpretation (e.g. Ó Séaghdha and Copestake, 2009; Shwartz and Waterson, 2018; Shwartz and Dagan, 2018).

2 Composition Tasks

We experimented with six tasks that address the meaning shift and implicit meaning phenomena, summarized in Table 1 and detailed below. We rely on existing tasks and datasets, but make substantial changes and augmentations in order to create a uniform framework. First, we cast all tasks as classification tasks. Second, we add sentential contexts where the original datasets annotate the phrases out of context, by extracting average-length sentences (15-20 words) from English Wikipedia (January 2018 dump) in which the target phrase appears. We assume that the annotation does not depend on the context, an assumption that holds in most cases, judging by the human performance scores (Section 6). Finally, we split each dataset into roughly 80% train and 10% for each of the validation and test sets, under lexical constraints as detailed for each task.
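To make the lexical-constraint idea concrete, the following is a minimal sketch of such a split; the data format and the helper name are our own illustration and are not part of the released code.

```python
import random
from collections import defaultdict

def lexical_split(examples, key_fn, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split examples so that all items sharing a lexical key (e.g. the same
    head verb or head noun) land in the same subset (train/validation/test)."""
    groups = defaultdict(list)
    for ex in examples:
        groups[key_fn(ex)].append(ex)
    keys = list(groups)
    random.Random(seed).shuffle(keys)
    cut1 = int(ratios[0] * len(keys))
    cut2 = cut1 + int(ratios[1] * len(keys))
    subsets = (keys[:cut1], keys[cut1:cut2], keys[cut2:])
    return [[ex for k in subset for ex in groups[k]] for subset in subsets]

# e.g. for VPC classification: train, val, test = lexical_split(data, lambda ex: ex["verb"])
```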
2.1 Recognizing Verb-Particle Constructions

A verb-particle construction (VPC) consists of a head verb and a particle, typically in the form of an intransitive preposition, which changes the verb's meaning (e.g. carry on vs. carry).

Task Definition. Given a sentence s that includes a verb V followed by a preposition P, the goal is to determine whether it is a VPC or not.

Data. We use the dataset of Tu and Roth (2012), which consists of 1,348 sentences from the British National Corpus (BNC), each containing a verb V and a preposition P annotated as to whether they form a VPC or not. The dataset focuses on 23 different phrasal verbs derived from six of the most frequently used verbs (take, make, have, get, do, give) and their combinations with common prepositions or particles. To reduce label bias, we split the dataset lexically by verb, i.e. the train, test, and validation sets contain distinct verbs in their V and P combinations.

2.2 Recognizing Light Verb Constructions

The meaning of a light verb construction (LVC, e.g. make a decision) is mainly derived from its noun object (decision), while the meaning of its main verb (make) is "light" (Jespersen, 1965). As a rule of thumb, an LVC can be replaced by the verb usage of its direct object noun (decide) without changing the meaning of the sentence.

Task Definition. Given a sentence s that includes a potential light verb construction ("make an easy decision"), the goal is to determine whether it is an LVC or not.
Task | Data Source | Train/Val/Test Size | Input | Output | Context
VPC Classification | Tu and Roth (2012) | 919/209/220 | sentence s, VP = w1 w2 | is VP a VPC? | O
LVC Classification | Tu and Roth (2011) | 1521/258/383 | sentence s, span = w1 ... wk | is the span an LVC? | O
NC Literality | Reddy et al. (2011), Tratz (2011) | 2529/323/138 | sentence s, NC = w1 w2, target w ∈ {w1, w2} | is w literal in NC? | A
NC Relations | SemEval 2013 Task 4 (Hendrickx et al., 2013) | 1274/162/130 | sentence s, NC = w1 w2, paraphrase p | does p explicate NC? | A
AN Attributes | HeiPLAS (Hartung, 2015) | 837/108/106 | sentence s, AN = w1 w2, paraphrase p | does p describe the attribute in AN? | A
Phrase Type | STREUSLE (Schneider and Smith, 2015) | 3017/372/376 | sentence s | label per token | O

Table 1: A summary of the composition tasks included in our evaluation suite. In the Context column, O means the context is part of the original dataset, while A is used for datasets in which the context was added in this work.

Data. We use the dataset of Tu and Roth (2011), which contains 2,162 sentences from the BNC in which a potential light verb construction was found (with the same six common verbs as in Section 2.1), annotated as to whether it is an LVC in the given context or not. We split the dataset lexically by verb.

2.3 Noun Compound Literality

Task Definition. Given a noun compound NC = w1 w2 in a sentence s, and a target word w ∈ {w1, w2}, the goal is to determine whether the meaning of w in NC is literal. For instance, market has a literal meaning in flea market but flea does not.

Data. We use the dataset of Reddy et al. (2011), which consists of 90 noun compounds along with human judgments about the literality of each constituent word. Scores are given on a scale of 0-5, 0 being non-literal and 5 being literal. For each noun compound and each of its constituents, we consider examples with a score ≥ 4 as literal and ≤ 2 as non-literal, ignoring the middle range. We obtain 72 literal and 72 non-literal examples.

To increase the dataset size we augment it with literal examples from the Tratz (2011) dataset of noun compound classification. Compounds in this dataset are annotated with the semantic relation that holds between w1 and w2. Most relations (except for lexicalized, which we ignore) define the meaning of NC as some trivial combination of w1 and w2, allowing us to regard both words as literal. This method produces an additional 3,061 literal examples.

Task Adaptation. We add sentential contexts from Wikipedia, keeping up to 10 sentences per example. We downsample the literal examples to balance the dataset, allowing for a ratio of up to 4 literal to non-literal examples. We split the dataset lexically by head, i.e. if w1 w2 is in one set, there are no w'1 w2 NCs in the other sets.¹

¹ We chose to split by head rather than by modifier based on the majority baseline that achieved better performance.

2.4 Noun Compound Relations

Task Definition. Given a noun compound NC = w1 w2 in a sentential context s, and a paraphrase p, the goal is to determine whether p describes the semantic relation between w1 and w2 or not. For example, "part that makes up body" is a valid paraphrase for body part, but "replacement part bought for body" is not.

Data. We use the data from SemEval 2013 Task 4 (Hendrickx et al., 2013). The dataset consists of 356 noun compounds annotated with 12,446 human-proposed free-text paraphrases.

Task Adaptation. The goal of the SemEval task was to generate a list of free-form paraphrases for a given noun compound, which explicate the implicit semantic relation between its constituents. To match our other tasks, we cast the task as a binary classification problem where the input is a noun compound NC and a paraphrase p, and the
goal is to predict whether p is a correct description of NC.

The positive examples for an NC w1 w2 are trivially derived from the original data by sampling up to 5 of its paraphrases and creating an (NC, p, TRUE) example for each paraphrase p. The same number of negative examples is then created using negative sampling from the paraphrase templates of other noun compounds w'1 w2 and w1 w'2 in the dataset that share a constituent word with NC. For example, "replacement part bought for body" is a negative example constructed from the paraphrase template "replacement [w2] bought for [w1]", which appeared for car part. We require one shared constituent in order to form more fluent paraphrases (which would otherwise be easily classifiable as negative). To reduce the chances of creating negative examples which are in fact valid paraphrases, we only consider negative paraphrases whose verbs never occurred in the positive paraphrase set for the given NC.

We add sentential contexts from Wikipedia, randomly selecting one sentence per example, and split the dataset lexically by head.

2.5 Adjective-Noun Attributes

Task Definition. Given an adjective-noun composition AN in a sentence s, and an attribute AT, the goal is to determine whether AT is implicitly conveyed in AN. For example, the attribute temperature is conveyed in hot water, but not in hot argument (emotionality).

Data. We use the HeiPLAS data set (Hartung, 2015), which contains 1,598 adjective-noun compositions annotated with their implicit attribute meaning. The data was extracted from WordNet and manually filtered. The label set consists of attribute synsets in WordNet which are linked to adjective synsets.

Task Adaptation. Since the dataset is small and the number of labels is large (254), we recast the task as a binary classification task. The input to the task is an AN and a paraphrase created from the template "[A] refers to the [AT] of [N]" (e.g. "loud refers to the volume of thunder"). The goal is to predict whether the paraphrase is correct or not with respect to the given AN.

We create up to 3 negative instances for each positive instance by replacing AT with another attribute that appeared with either A or N, for example, (hot argument, temperature, False). To reduce the chance that the negative attribute is in fact a valid attribute for AN, we compute the Wu-Palmer similarity (Wu and Palmer, 1994) between the original and the negative attribute, taking only attributes whose similarity to the original attribute is below a threshold (0.4).
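As an illustration of this similarity filter, a sketch using NLTK's WordNet interface is shown below; the helper and the plain-string attribute names are our own simplification (the actual labels are WordNet attribute synsets), and NLTK is not named in the paper.

```python
from nltk.corpus import wordnet as wn  # requires the WordNet data: nltk.download("wordnet")

WUP_THRESHOLD = 0.4  # negative attributes must be sufficiently dissimilar

def is_valid_negative(original_attr, candidate_attr, threshold=WUP_THRESHOLD):
    """Keep a candidate negative attribute only if its Wu-Palmer similarity
    to the original attribute falls below the threshold."""
    orig = wn.synsets(original_attr, pos=wn.NOUN)
    cand = wn.synsets(candidate_attr, pos=wn.NOUN)
    if not orig or not cand:
        return False
    similarity = orig[0].wup_similarity(cand[0]) or 0.0
    return similarity < threshold

# e.g. is_valid_negative("temperature", "emotionality")
```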
Similarly to the previous task, we attach a context sentence from Wikipedia to each example. Finally, we split the dataset lexically by adjective, i.e. if A N is in one set, there are no A N' examples in the other sets.

2.6 Identifying Phrase Type

The last task covers multiple phrase types and addresses both meaning shift and implicit meaning.

Task Definition. The task is defined as sequence labeling with BIO tags. Given a sentence, each word is classified according to whether it is part of a phrase, and to the specific type of that phrase.

Data. We use the STREUSLE corpus (Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions, Schneider and Smith, 2015). The corpus contains texts from the web reviews portion of the English Web Treebank, along with various semantic annotations, from which we use the BIO annotations. Each token is labeled with a tag: a B-X tag marks the beginning of a span of type X, I occurs inside a span, and O outside of it. B tags mark specific types of phrase.²

² Sorted by frequency: noun phrase, weak (compositional) MWE, verb-particle construction, verbal idioms, prepositional phrase, auxiliary, adposition, discourse / pragmatic expression, inherently adpositional verb, adjective, determiner, adverb, light verb construction, non-possessive pronoun, full verb or copula, conjunction.

Task Adaptation. We are interested in a simpler version of the annotations. Specifically, we exclude the discontinuous spans (e.g. a span like "turn the [TV] off" would not be considered as part of a phrase). The corpus distinguishes between "strong" MWEs (fixed or idiomatic phrases) and "weak" MWEs (ad hoc compositional phrases). The weak MWEs are untyped, hence we label them as COMP (compositional).
3 Representations

We experimented with six common word representations from two different families, detailed below.
Model | Training Objective | Corpus (#words) | Output Dimension | Basic Unit
word2vec | Predicting surrounding words | Google News (100B) | 300 | word
GloVe | Predicting co-occurrence probability | Wikipedia + Gigaword 5 (6B) | 300 | word
fastText | Predicting surrounding words | Wikipedia + UMBC + statmt.org (16B) | 300 | subword
ELMo | Language model | 1B Word Benchmark (1B) | 1024 | character
OpenAI GPT | Language model | BooksCorpus (800M) | 768 | subword
BERT | Masked language model (Cloze) | BooksCorpus + Wikipedia (3.3B) | 768 | subword

Table 2: Architectural differences of the specific pre-trained representations used in this paper. The first three are static word embeddings; the last three are contextualized word embeddings.

Table 2 summarizes the differences between the pretrained models used in this paper.

Word Embeddings. Word embedding models provide a fixed d-dimensional vector for each word in the vocabulary. Their training is based on the distributional hypothesis, according to which semantically-similar words tend to appear in the same contexts (Harris, 1954). word2vec (Mikolov et al., 2013) can be trained with one of two objectives; we use the embeddings trained with the Skip-Gram objective, which predicts the context words given the target word. GloVe (Pennington et al., 2014) learns word vectors with the objective of estimating the log-probability of a word pair's co-occurrence. fastText (Bojanowski et al., 2017) extends word2vec by adding information about subwords (a bag of character n-grams). This is especially helpful in morphologically rich languages, but can also help in handling rare or misspelled words in English.

Contextualized Word Embeddings are functions computing dynamic word embeddings for words given their context sentence, largely addressing polysemy. They are pre-trained as general-purpose language models on a large-scale unannotated corpus, and can later be used as a representation layer in downstream tasks (either fine-tuned to the task along with the other model parameters, or kept fixed). The representations used in this paper have multiple output layers. We either use only the last layer, which was shown to capture semantic information (Peters et al., 2018), or learn a task-specific scalar mix of the layers (see Section 6).
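A scalar mix in this sense can be sketched as a softmax-normalized weighted sum of the layers with a global scale, as introduced for ELMo (Peters et al., 2018); the PyTorch module below is our minimal illustration under that assumption, not the exact component used in the experiments.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted average of the output layers of a contextualized encoder."""

    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one learned scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))               # global scale

    def forward(self, layers):
        # layers: list of tensors, each of shape (batch, seq_len, dim)
        s = torch.softmax(self.weights, dim=0)
        mixed = sum(w * layer for w, layer in zip(s, layers))
        return self.gamma * mixed
```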
ELMo (Embeddings from Language Models, Peters et al., 2018) is obtained by learning a character-based language model using a deep biLSTM (Graves and Schmidhuber, 2005). Working at the character level allows using morphological clues to form robust representations for out-of-vocabulary words unseen in training. The OpenAI GPT (Generative Pre-Training, Radford et al., 2018) has a similar training objective, but the underlying encoder is a transformer (Vaswani et al., 2017). It uses subwords as the basic unit, employing byte-pair encoding. Finally, BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2019) is also based on the transformer, but it is bidirectional, as opposed to left-to-right as in the OpenAI GPT, and the directions are dependent, as opposed to ELMo's independently trained left-to-right and right-to-left LSTMs. It also introduces a somewhat different objective called "masked language model": during training, some tokens are randomly masked, and the objective is to restore them from the context.

4 Classification Models

We implemented minimal "Embed-Encode-Predict" models that use the representations from Section 3 as inputs, keeping them fixed during training. The rationale behind the model design was to keep the models uniform, for easy comparison between the representations, and simple, so that a model's success can be attributed directly to the input representations.

Embed. We use the embedding model to embed each word in the sentence s = w_1 ... w_n, obtaining:

    \vec{v}_1, ..., \vec{v}_n = Embed(s)   (1)

where \vec{v}_i stands for the word embedding of word w_i, which may be computed as a function of the entire sentence (in the case of contextualized word embeddings).
Depending on the specific task, we may have another input w'_1, ..., w'_l to embed separately from the sentence: the paraphrases in the NC Relations and AN Attributes tasks, and the target word in the NC Literality task (to obtain an out-of-context representation of the target word). We embed this extra input as follows:

    \vec{v}'_1, ..., \vec{v}'_l = Embed(w'_1, ..., w'_l)   (2)

Encode. We encode the embedded sequences \vec{v}_1, ..., \vec{v}_n and \vec{v}'_1, ..., \vec{v}'_l using one of the following three encoder variants. As opposed to the pre-trained embeddings, the encoder parameters are updated during training to fit the specific task.

• biLM: encoding the embedded sequence using a biLSTM with hidden dimension d, where d is the input embedding dimension:

    \vec{u}_1, ..., \vec{u}_n = biLSTM(\vec{v}_1, ..., \vec{v}_n)   (3)

• Att: encoding the embedded sequence using self-attention (sketched in code after this list). Each word is represented as the concatenation of its embedding and a weighted average over the other words in the sentence:

    \vec{u}_i = [\vec{v}_i ; \sum_{j=1}^{n} a_{i,j} \vec{v}_j]   (4)

  The weights \vec{a}_i are computed by applying a dot product between \vec{v}_i and every other word vector, and normalizing the scores using softmax:

    \vec{a}_i = softmax(\vec{v}_i^T \cdot \vec{v})   (5)

• None: we do not encode the embedded text, but simply define:

    \vec{u}_1, ..., \vec{u}_n = \vec{v}_1, ..., \vec{v}_n   (6)

For all encoder variants, \vec{u}_i stands for the vector representing w_i, which is used as input to the classifier.
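As a concrete reading of Equations 4-5, the Att variant can be sketched roughly as follows in PyTorch; this is our own illustration rather than the released implementation (in particular, it also attends to the word itself).

```python
import torch

def att_encode(v):
    """Self-attention encoding (Eqs. 4-5): concatenate each word vector with a
    softmax-weighted average of the sentence's word vectors.
    v: tensor of shape (seq_len, dim); returns (seq_len, 2 * dim)."""
    scores = v @ v.t()                      # dot product between every pair of words
    a = torch.softmax(scores, dim=-1)       # Eq. 5: normalize each word's scores
    context = a @ v                         # weighted average of word vectors
    return torch.cat([v, context], dim=-1)  # Eq. 4: [v_i ; sum_j a_ij * v_j]
```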
                                                                   ceptance rate of at least 98% on at least 500 prior
 Predict. We represent a span by concatenating                     HITs, and had them pass a qualification test.
 its end-point vectors, e.g. ~ui...i+k = [~ui ; ~ui+k ]               In each annotation task, we showed the work-
 is the target span vector of wi , ..., wi+k . In                  ers the context sentence with the target span high-
 tasks which require a second span, we similarly                   lighted, and asked them questions regarding the
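A minimal PyTorch sketch of the span classifier in Equations 7-8 follows; the layer sizes match the text, but the module itself is our illustration rather than the authors' code.

```python
import torch
import torch.nn as nn

class SpanClassifier(nn.Module):
    """Predict step (Eqs. 7-8): concatenated span end-points -> MLP -> softmax."""

    def __init__(self, input_dim, num_labels, hidden_dim=300, dropout=0.2):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)  # h
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_dim, num_labels)    # W

    def forward(self, x):
        # x: the concatenated span (and optional extra-span) vectors, Eq. 7
        h = torch.relu(self.dropout(self.hidden(x)))    # ReLU(Dropout(h(x)))
        return torch.softmax(self.out(h), dim=-1)       # Eq. 8
```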
Implementation Details. We implemented the models using the AllenNLP library (Gardner et al., 2018), which is based on the PyTorch framework (Paszke et al., 2017). We train them for up to 500 epochs, stopping early if the validation performance doesn't improve for 20 epochs.

The phrase type model is a sequence tagging model that predicts a label for each embedded (and potentially encoded) word w_i. During decoding, we enforce a single constraint requiring that a B-X tag must precede I tag(s).

5 Baselines

5.1 Human Performance

The human performance on each task can be used as a performance upper bound, which reflects both the inherent ambiguity in the task and the limitations of the particular dataset. We estimated the human performance on each task by sampling and re-annotating 100 examples from each test set.

The annotation was carried out on Amazon Mechanical Turk. We asked 3 workers to annotate each example, taking the majority label as the final prediction. To control the quality of the annotations, we required that workers have an acceptance rate of at least 98% on at least 500 prior HITs, and had them pass a qualification test.

In each annotation task, we showed the workers the context sentence with the target span highlighted, and asked them questions regarding the target span, as exemplified in Table 3. In addition to the possible answers given in the table, annotators were always given the choice of "I can't tell" or "the sentence does not make sense".
VPC Classification (agreement: 84.17%)
Sentence: "I feel there are others far more suited to take on the responsibility."
Question: What is the verb in the highlighted span? (take / take on)

LVC Classification (agreement: 83.78%)
Sentence: "Jamie made a decision to drop out of college."
Question: Mark all that apply to the highlighted span in the given context:
1. It describes an action of "making something", in the common meaning of "make".
2. The essence of the action is described by "decision".
3. The span could be rephrased without "make" but with a verb like "decide", without changing the meaning of the sentence.

NC Literality (agreement: 80.81%)
Sentence: "He is driving down memory lane and reminiscing about his first love."
Question: Is "lane" used literally or non-literally? (literal / non-literal)

NC Relations (agreement: 86.21%)
Sentence: "Strawberry shortcakes were held as celebrations of the summer fruit harvest."
Question: Can "summer fruit" be described by "fruit that is ripe in the summer"? (yes / no)

AN Attributes (agreement: 86.42%)
Sentence: "Send my warm regards to your parents."
Question: Does "warm" refer to temperature? (yes / no)

Table 3: The worker agreement (%) and the question(s) displayed to the workers in each annotation task.

Model Family | VPC Classification (Acc) | LVC Classification (Acc) | NC Literality (Acc) | NC Relations (Acc) | AN Attributes (Acc) | Phrase Type (F1)
Majority Baselines | 23.6 | 43.7 | 72.5 | 50.0 | 50.0 | 26.6
Word Embeddings | 60.5 | 74.6 | 80.4 | 51.2 | 53.8 | 44.0
Contextualized | 90.0 | 82.5 | 91.3 | 54.3 | 65.1 | 64.8
Human | 93.8 | 83.8 | 91.0 | 77.8 | 86.4 | -

Table 4: Summary of the best performance of each family of representations on the various tasks. The evaluation metric is accuracy, except for the Phrase Type task, in which we report span-based F1 score, excluding O tags.

Of all the annotation tasks, the LVC classification task was the most challenging and required careful examination of the different criteria for LVCs. In the example given in Table 3 with the candidate LVC "make a decision", we considered a worker's answer as positive (LVC) either if 1) the worker indicated that "make a decision" does not describe an action of "making something" AND that the essence of "make a decision" is in the word "decision"; or 2) the worker indicated that "make a decision" can be replaced in the given sentence with "decide" without changing the meaning of the sentence. The second criterion was given in the original guidelines of Tu and Roth (2011). The replacement verb "decide" was selected as it is linked to "decision" in WordNet via the derivationally-related relation.

We didn't compute the estimated human performance on the phrase type task, which is more complicated and requires expert annotation.

5.2 Majority Baselines

We implemented three majority baselines, sketched in the snippet below:

• MajorityALL is computed by assigning the most common label in the training set to all the test items. Note that the label distribution may be different between the train and test sets, resulting in accuracy < 50% even for binary classification tasks.

• Majority1 assigns to each test item the most common label in the training set for items with the same first constituent. For example, in the VPC classification task, it classifies get through as positive in all its contexts because the verb get appears in more positive than negative examples.

• Majority2 symmetrically assigns the label according to the last (typically second) constituent.
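The sketch below illustrates the three baselines under the assumption that every example stores its two constituents and a label; the data format and helper names are ours.

```python
from collections import Counter

def majority_baselines(train, test):
    """train/test: lists of dicts like {"w1": ..., "w2": ..., "label": ...}.
    Returns predictions for MajorityALL, Majority1 (keyed by the first
    constituent) and Majority2 (keyed by the last constituent)."""
    def most_common(labels):
        return Counter(labels).most_common(1)[0][0]

    overall = most_common([ex["label"] for ex in train])
    by_w1, by_w2 = {}, {}
    for ex in train:
        by_w1.setdefault(ex["w1"], []).append(ex["label"])
        by_w2.setdefault(ex["w2"], []).append(ex["label"])

    def predict(index, key):
        return [most_common(index[ex[key]]) if ex[key] in index else overall
                for ex in test]

    return [overall] * len(test), predict(by_w1, "w1"), predict(by_w2, "w2")
```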
6 Results

Table 4 displays the best performance of each model family on the various tasks.

Representations. The general trend across tasks is that performance improves from the majority baselines, through the word embeddings, to the contextualized word representations, with a large gap in some of the tasks. Among the contextualized word embeddings, BERT performed best on 4 out of 6 tasks, with no consistent preference between ELMo and the OpenAI GPT. The best word embedding representations were GloVe (4/6) followed by fastText (2/6).
Model | VPC Classification | LVC Classification | NC Literality | NC Relations | AN Attributes | Phrase Type
ELMo | All, Att | All, biLM | All, Att/None | Top, biLM | All, None | All, biLM
OpenAI GPT | All, None | Top, Att/None | Top, None | All, biLM | Top, None | All, biLM
BERT | All, Att | All, biLM | All, Att | All, None | All, None | All, biLM

Table 5: The best setting (layer, encoding) for each contextualized word embedding model on the various tasks. Bold entries are the best performers on each task.

Model | VPC Classification | LVC Classification | NC Literality | NC Relations | AN Attributes | Phrase Type
word2vec | biLM | Att | biLM/Att | None | - | None/biLM
GloVe | biLM | Att | Att | biLM | - | biLM
fastText | Att | biLM | biLM | biLM | Att | biLM

Table 6: The best encoding for each word embedding model on the various tasks. Bold entries are the best performers on each task. A dash marks no preference.

Phenomena. The gap between the best model performance (achieved by the contextualized representations) and the estimated human performance varies considerably across tasks. The best performance in NC Literality is on par with human performance, and only a few points short of it in VPC Classification and LVC Classification. This is evidence for the utility of contextualized word embeddings in detecting meaning shift, which has positive implications for the yet unsolved problem of detecting MWEs.

Conversely, the gap between the best model and the human performance is as high as 23.5 and 21.3 points in NC Relations and AN Attributes, respectively, suggesting that tasks requiring revealing implicit meaning are more challenging for the existing representations.

Layer. Table 5 elaborates on the best setting for each representation on the various tasks. In most cases, there was a preference for learning a scalar mix of the layers rather than using only the top layer. We extracted the learned layer weights for each of the All models, and found that the model usually learned a balanced mix of the top and bottom layers.

Encoding. We did not find one encoding setting that performed best across tasks and contextualized word embedding models. Instead, it seems that tasks related to meaning shift typically prefer Att or no encoding, while tasks related to implicit meaning performed better with either biLM or no encoding.

When it comes to word embedding models, Table 6 shows that biLM was preferred more often. This is not surprising: a contextualized word embedding of a certain word is, by definition, already aware of the surrounding words, obviating the need for a second layer of order-aware encoding. A word embedding based model, on the other hand, must rely on a biLSTM to learn the same.

Finally, the best settings on the Phrase Type task use biLM across representations. This may suggest that predicting a label for each word benefits from a more structured modelling of word order. Looking into the errors made by the best model (BERT+All+biLM) reveals that most of the errors were predictions of O, i.e. missing the occurrence of a phrase. With respect to specific phrase types, near-perfect performance was achieved on the more syntactic categories, specifically auxiliaries ("Did they think we were [going to] feel lucky to get any reservation at all?"), adverbs ("any longer"), and determiners ("a bit"). In accordance with the VPC Classification task, the VPC label achieved 85% accuracy; 10% were missed (classified as O) and 5% were confused with a "weak" MWE. Two of the more difficult types were "weak" MWEs (which are judged as more compositional and less idiomatic) and idiomatic verbs. The former achieved an accuracy of 22% (68% were classified as O) and the latter only 8% (62% were classified as O). Overall, it seems that the model relied mostly on syntactic cues, failing to recognize semantic subtleties such as idiomatic meaning and level of compositionality.
Figure 2: t-SNE projection of BERT representations of verb-preposition candidates for VPC. Blue (dark) points are positive examples and orange (light) points are negative.

7 Analysis

We focus on the contextualized word embeddings, and look into the representations they provide.

7.1 Meaning Shift

Does the representation capture VPCs? The best performer on the VPC Classification task was BERT+All+Att. To get a better idea of the signal that BERT contains for VPCs, we chose several ambiguous verb-preposition pairs from the dataset. We define a verb-preposition pair as ambiguous if it appeared in at least 8 examples as a VPC and in at least 8 examples as a non-VPC. For a given pair we computed the BERT representations of the sentences in which it appears in the dataset and, similarly to the model, represented the pair as the concatenation of its word vectors. In each vector we averaged the layers using the weights learned by the model. Finally, we projected the computed vectors into 2D space using t-SNE (Maaten and Hinton, 2008). Figure 2 demonstrates 4 example pairs. The other pairs we plotted had similar t-SNE plots, confirming that the signal for separating different verb usages comes directly from BERT.
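A rough sketch of this projection is given below, using the Hugging Face transformers interface to BERT and scikit-learn's t-SNE (neither library is named in the paper, which used AllenNLP), with a uniform layer average standing in for the task-learned layer weights.

```python
import torch
from transformers import BertModel, BertTokenizerFast
from sklearn.manifold import TSNE

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def pair_vector(words, verb_idx, prep_idx):
    """Concatenate layer-averaged BERT vectors of the verb and the preposition.
    words: the sentence as a list of tokens; indices point into that list."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states        # tuple of (1, seq_len, 768) tensors
    avg = torch.stack(hidden).mean(dim=0)[0]       # uniform average over all layers
    v = avg[enc.word_to_tokens(verb_idx).start]    # first sub-token of the verb
    p = avg[enc.word_to_tokens(prep_idx).start]    # first sub-token of the preposition
    return torch.cat([v, p]).numpy()

# vectors = [pair_vector(s.split(), vi, pi) for s, vi, pi in candidates]
# points = TSNE(n_components=2).fit_transform(vectors)
```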
Non-literality as a rare sense. Nunberg et al. (1994) considered some non-literal compounds to be "idiosyncratically decomposable", i.e. they can be decomposed into possibly rare senses of their constituents, as in considering bee to have a sense of "competition" in spelling bee and crocodile to stand for "manipulative" in crocodile tears. Using this definition, we could possibly use the NC literality data for word sense induction, in which recent work has shown that contextualized word representations are successful (Stanovsky and Hopkins, 2018; Amrami and Goldberg, 2018). We are interested in testing not only whether the contextualized models are capable of detecting rare senses induced by non-literal usage, which we have confirmed in Section 6, but whether they can also model these senses. To that end, we sample target words that appear in both literal and non-literal examples, and use each contextualized word embedding model as a language model to predict the best substitutes of the target word in each context. Table 7 exemplifies some of these predictions.

Bold words are words judged reasonable in the given context, even if they don't have the exact same meaning as the target word. It is apparent that there are more reasonable substitutes for the literal examples across models (left part of the table), but BERT performs better than the others. The OpenAI GPT shows a clear disadvantage of being uni-directional, often choosing a substitute that doesn't go well with the next word ("a train to from").

The success is only partial among the non-literal examples. While some correct substitutes are predicted for (guilt) trip, the predictions are much worse for the other examples. The meaning of diamond in diamond wedding is "60th", and ELMo makes the closest prediction, 10th (which would make it "tin wedding"). 400th is a borderline prediction: it is also an ordinal number, but an unreasonable one in the context of years of marriage.

Finally, the last example, snake oil, is unsurprisingly a difficult one, possibly "non-decomposable" (Nunberg et al., 1994), as both constituents are non-literal. Some predicted substitutes, rogue and charmer, are valid replacements for the entire noun compound (e.g. "you are a rogue salesman"). Others go well with the literal meaning of snake, creating phrases denoting concepts which can indeed be sold (snake egg, snake skin).

Overall, while contextualized representations excel at detecting shifts from the common meanings of words, their ability to obtain meaningful representations for such rare senses is much more limited.
Literal contexts:

"The Queen and her husband were on a train [trip]L from Sydney to Orange."
  ELMo:       ride 1.24%, carriage 1.02%, journey 0.73%, heading 0.72%, carrying 0.39%
  OpenAI GPT: to 0.02%, headed 0.01%, heading 0.01%, that 0.009%, and 0.005%
  BERT:       travelling 19.51%, running 8.20%, journey 7.57%, going 6.56%, headed 5.75%

"Richard Cromwell so impressed the king with his valour, that he was given a [diamond]L ring from the king's own finger."
  ELMo:       diamond 0.23%, wedding 0.19%, pearl 0.18%, knighthood 0.16%, hollow 0.15%
  OpenAI GPT: and 0.01%, of 0.01%, to 0.01%, ring 0.01%, in 0.01%
  BERT:       silver 15.99%, gold 14.93%, diamond 13.18%, golden 12.79%, new 4.61%

"China is attempting to secure its future [oil]L share and establish deals with other countries."
  ELMo:       beyond 0.44%, engagement 0.44%, market 0.34%, nuclear 0.33%, majority 0.29%
  OpenAI GPT: in 0.01%, and 0.01%, for 0.01%, government 0.01%, supply 0.01%
  BERT:       market 98.60%, export 0.45%, trade 0.14%, trading 0.09%, production 0.04%

Non-literal contexts:

"Creating a guilt [trip]N in another person may be considered to be psychological manipulation in the form of punishment for a perceived transgression."
  ELMo:       tolerance 0.44%, fest 0.23%, avoidance 0.16%, onus 0.15%, association 0.14%
  OpenAI GPT: that 0.03%, so 0.02%, trip 0.01%, he 0.01%, she 0.01%
  BERT:       reaction 8.32%, feeling 8.17%, attachment 8.12%, sensation 4.73%, note 3.31%

"She became the first British monarch to celebrate a [diamond]N wedding anniversary in November 2007."
  ELMo:       customary 0.20%, royal 0.17%, sacrifice 0.15%, 400th 0.13%, 10th 0.13%
  OpenAI GPT: new 0.11%, british 0.02%, victory 0.01%, french 0.01%, royal 0.01%
  BERT:       royal 1.58%, 1912 1.23%, recent 1.10%, 1937 1.08%, 1902 1.08%

"Put bluntly, I believe you are a snake [oil]N salesman, a narcissist that would say anything to draw attention to himself."
  ELMo:       auto 0.52%, egg 0.42%, hunter 0.42%, consummate 0.39%, rogue 0.37%
  OpenAI GPT: in 0.05%, and 0.01%, that 0.01%, of 0.01%, charmer 0.008%
  BERT:       oil 32.5%, pit 2.94%, bite 2.65%, skin 2.36%, jar 2.23%

Table 7: Top substitutes for a target word in literal and non-literal contexts, along with model scores. Bold words are words judged reasonable (not necessarily meaning preserving) in the given context, and underlined words are suitable substitutes for the entire noun compound, but not for a single constituent.
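To make the substitution test concrete, the following is a rough sketch of how substitute distributions like the BERT column of Table 7 can be read off a masked language model. It assumes the HuggingFace transformers library and a single-wordpiece target marked with "___"; the paper's exact scoring setup, including how ELMo and the OpenAI GPT are scored, is not reproduced here.

    import torch
    from transformers import BertForMaskedLM, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
    mlm.eval()

    def top_substitutes(sentence, k=5):
        # Replace the "___" placeholder by [MASK] and score fillers with the MLM head.
        enc = tokenizer(sentence.replace("___", tokenizer.mask_token), return_tensors="pt")
        mask_index = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
        with torch.no_grad():
            logits = mlm(**enc).logits[0, mask_index]
        probs = torch.softmax(logits, dim=-1)
        top_probs, top_ids = probs.topk(k)
        return list(zip(tokenizer.convert_ids_to_tokens(top_ids.tolist()), top_probs.tolist()))

    # top_substitutes("The Queen and her husband were on a train ___ from Sydney to Orange.")
    # returns (token, probability) pairs comparable in spirit to the BERT column above.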

                      NC Relations    AN Attributes
  Majority                50.0            50.0
  -Phrase                 50.0            55.66
  -Context                45.06           63.21
  -(Context+Phrase)       45.06           59.43
  Full Model              54.3            65.1

Table 8: Accuracy scores of ablations of the phrase, context sentence, and both features from the best models in the NC Relations and AN Attributes tasks (ELMo+Top+biLM and BERT+All+None, respectively).
7.2   Implicit Meaning

The performance of the various models on the tasks that involve revealing implicit meaning is substantially worse than on the other tasks. In NC Relations, ELMo performs best with the biLM-encoded model using only the top layer of the representation, surpassing the majority baseline by only 4.3 points in accuracy. The best performer in AN Attributes is BERT, with no encoding and using all the layers, achieving an accuracy of 65.1%, well above the majority baseline (50%).

We are interested in finding out where the knowledge of the implicit meaning originates. Is it encoded in the phrase representation itself, or does it appear explicitly in the context sentences? Finally, could it be that the performance gap from the majority baseline is due to the models learning to recognize which paraphrases are more probable than others, regardless of the phrase itself?

To try to answer these questions, we performed ablation tests for each of the tasks, using the best performing setting for each (ELMo+Top+biLM for NC Relations and BERT+All+None for AN Attributes). We trained the following models, where -X signifies the ablation of the X feature (a schematic sketch of the resulting inputs follows the list):

1. -Phrase: where we mask the phrase in its context sentence, e.g. replacing "Today, the house has become a wine bar or bistro called Barokk" with "Today, the house has become a something or bistro called Barokk". Success in this setting may indicate that the implicit information is given explicitly in some of the context sentences.[3]

2. -Context: the out-of-context version of the original task, in which we replace the context sentence by the phrase itself, as in setting it to "wine bar". Success in this setting may indicate that the phrase representation contains this implicit information.

3. -(Context+Phrase): in which we omit the context sentence altogether, and provide only the paraphrase, as in "bar where people buy and drink wine". Success in this setting may indicate that negatively sampled paraphrases form sentences which are less probable in English.

[3] A somewhat similar phenomenon was recently reported by Senaldi et al. (2019). Their model managed to distinguish idioms from non-idioms, but their ablation study showed the model was in fact learning to distinguish between abstract contexts (in which idioms tend to appear) and concrete ones.
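The sketch below shows how the inputs of the three ablations could be derived from a single example. The function and field names are illustrative rather than the actual data format, and only the context/phrase side of the input is shown; the candidate paraphrases are handled as in the original task.

    def build_ablation_inputs(context, phrase, paraphrase):
        # Return the text fed to the model under each setting for one example.
        return {
            "Full": context,                                   # the phrase in its original context
            "-Phrase": context.replace(phrase, "something"),   # mask the phrase inside the context
            "-Context": phrase,                                # drop the context, keep only the phrase
            "-(Context+Phrase)": paraphrase,                   # keep only the candidate paraphrase
        }

    inputs = build_ablation_inputs(
        context="Today, the house has become a wine bar or bistro called Barokk",
        phrase="wine bar",
        paraphrase="bar where people buy and drink wine",
    )
    # inputs["-Phrase"] == "Today, the house has become a something or bistro called Barokk"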
Table 8 shows the results of this experiment. A first observation is that the full model performs best on both tasks, suggesting that the model captures implicit meaning from various sources. In the NC Relations task, all variants perform on par with or worse than the majority baseline, achieving a few points less than the full model. In the AN Attributes task it is easier to see that the phrase (AN) is important for the classification, while the context is secondary.
8   Related Work

Probing Tasks. One way to test whether dense representations capture a certain linguistic property is to design a probing task for this property, and build a model that takes the tested representation as an input. This kind of "black box" testing has become popular recently. Adi et al. (2017) studied whether sentence embeddings capture properties such as sentence length and word order. Conneau et al. (2018) extended their work with a large number of sentence embeddings, and tested various properties at the surface, syntactic, and semantic levels. Others focused on intermediate representations in neural machine translation systems (e.g. Shi et al., 2016; Belinkov et al., 2017; Dalvi et al., 2017; Sennrich, 2017), or on specific linguistic properties such as agreement (Giulianelli et al., 2018) and tense (Bacon and Regier, 2018).

More recently, Tenney et al. (2019) and Liu et al. (2019) each designed a suite of tasks to test contextualized word embeddings on a broad range of sub-sentence tasks, including part-of-speech tagging, syntactic constituent labeling, dependency parsing, named entity recognition, semantic role labeling, coreference resolution, semantic proto-role, and relation classification. Tenney et al. (2019) found that all the models produced strong representations for syntactic phenomena, but gained smaller performance improvements upon the baselines in the more semantic tasks. Liu et al. (2019) found that some tasks (e.g., identifying the tokens that comprise the conjuncts in a coordination construction) required fine-grained linguistic knowledge which was not available in the representations unless they were fine-tuned for the task. To the best of our knowledge, we are the first to provide an evaluation suite consisting of tasks related to lexical composition.

Lexical Composition. There is a vast literature on multi-word expressions in general (e.g. Sag et al., 2002; Vincze et al., 2011), and research focusing on noun compounds (e.g. Nakov, 2013; Nastase et al., 2013), adjective-noun compositions (e.g. Baroni and Zamparelli, 2010; Boleda et al., 2013), verb-particle constructions (e.g. Baldwin, 2005; Pichotta and DeNero, 2013), and light verb constructions (e.g. Tu and Roth, 2011; Chen et al., 2015).

In recent years, word embeddings have been used to predict the compositionality of phrases (Salehi et al., 2015; Cordeiro et al., 2016), and to identify the implicit relation in adjective-noun compositions (Hartung et al., 2017) and in noun compounds (Surtani and Paul, 2015; Dima, 2016; Shwartz and Waterson, 2018; Shwartz and Dagan, 2018).

Pavlick and Callison-Burch (2016) created a simpler variant of the recognizing textual entailment task (RTE, Dagan et al., 2013) that tests whether an adjective-noun composition entails the noun alone and vice versa in a given context. They tested various standard models for RTE and found that the models performed poorly with respect to this phenomenon. To the best of our knowledge, contextualized word embeddings haven't been employed for tasks related to lexical composition yet.

Phrase Representations. With respect to obtaining meaningful phrase representations, there is a prominent line of work in learning a composition function over pairs of words. Mitchell and Lapata (2010) suggested simple composition via vector arithmetics. Baroni and Zamparelli (2010) and later Maillard and Clark (2015) treated adjectival modifiers as functions that operate on nouns and change their meanings, and represented them as matrices. Zanzotto et al. (2010) and Dinu et al. (2013) extended this approach and composed any two words by multiplying each word vector by a composition matrix. These models start by computing the phrase's distributional representation (i.e. treating it as a single token) and then learning a composition function that approximates it.
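The composition functions mentioned in this paragraph can be written schematically as follows; the dimensionality and the randomly initialized matrices are placeholders for illustration (in the cited work the matrices are learned, and the exact formulations differ in their details).

    import numpy as np

    d = 300                       # embedding dimensionality (illustrative)
    u = np.random.rand(d)         # vector of the first constituent (e.g. an adjective)
    v = np.random.rand(d)         # vector of the second constituent (e.g. a noun)

    additive = u + v              # Mitchell and Lapata (2010): vector addition
    multiplicative = u * v        # Mitchell and Lapata (2010): element-wise product

    A = np.random.rand(d, d)      # an adjectival modifier represented as a matrix
    adjective_noun = A @ v        # Baroni and Zamparelli (2010): the adjective matrix acts on the noun

    W1, W2 = np.random.rand(d, d), np.random.rand(d, d)
    two_word = W1 @ u + W2 @ v    # one way to compose any two words via per-word composition matrices

In the distributional-approximation setup described above, such parameters would be fit so that the composed vector approximates the phrase's corpus-based, single-token vector.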
The main drawbacks of this approach are that it assumes compositionality and that it operates on phrases with a pre-defined number of words. Moreover, we can expect the resulting compositional vectors to capture properties inherited from the constituent words, but it is unclear whether they also capture new properties introduced by the phrase. For example, the compositional representation of olive oil may capture properties like green (from olive) and fat (from oil), but would it also capture properties like expensive (a result of the extraction process)?

Alternatively, other approaches were suggested for learning general phrase embeddings, either using direct supervision for paraphrase similarity (Wieting et al., 2016), indirectly from an extrinsic task (Socher et al., 2012), or in an unsupervised manner by extending the word2vec objective (Poliak et al., 2017). While these methods do not constrain the phrase length, they still suffer from the two other drawbacks above: they assume that the meaning of the phrase can always be composed from its constituent meanings, and it is unclear whether they can incorporate implicit information and new properties of the phrase. We expected that contextualized word embeddings, which assign a different vector to a word in each given context, would address at least the first issue by producing completely different vectors for literal vs. non-literal word occurrences.
9   Discussion and Conclusion

We showed that contextualized word representations perform generally better than static word embeddings on tasks related to lexical composition. However, while they are on par with human performance in recognizing meaning shift, they are still far from it in revealing implicit meaning. This gap may suggest a limit on the information that distributional models currently provide about the meanings of phrases.

Going beyond distributional models, an approach to building meaningful phrase representations can take some inspiration from the way that humans process phrases. A study on how L2 learners process idioms found that the most common and successful strategies were inferring from the context (57% success) and relying on the literal meanings of the constituent words (22% success) (Cooper, 1999). As opposed to distributional models that aim to learn from a large number of (possibly noisy and uninformative) contexts, the sentential contexts in this experiment were manually selected, and a follow-up study found that extended contexts (stories) help the interpretation further (Asl, 2013). The participants didn't simply rely on adjacent words or phrases, but also employed reasoning. For example, in the sentence "Robert knew that he was robbing the cradle by dating a sixteen-year-old girl", the participants inferred that 16 is too young to date, combined it with the knowledge that a cradle is where a baby sleeps, and concluded that rob the cradle means dating a very young person. This level of context modeling seems to be beyond the scope of current text representations.

We expect that improving the ability of representations to reveal implicit meaning will require training them to handle this specific phenomenon. Our evaluation suite, data, and code will be made available. It is easily extensible, and may be used in the future to evaluate new representations for their ability to address lexical composition.

Acknowledgments

This work was supported in part by an Intel ICRI-CI grant, the Israel Science Foundation grant 1951/17, and the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1). Vered is also supported by the Clore Scholars Programme (2017).

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations (ICLR).

Asaf Amrami and Yoav Goldberg. 2018. Word sense induction with neural biLM and symmetric patterns. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4860–4867, Brussels, Belgium. Association for Computational Linguistics.
Fatemeh Mohamadi Asl. 2013. The impact of context on learning idioms in EFL classes. TESOL Journal, 37(1):2.

Geoff Bacon and Terry Regier. 2018. Probing sentence embeddings for structure-dependent tense. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 334–336. Association for Computational Linguistics.

Timothy Baldwin. 2005. Deep lexical acquisition of verb–particle constructions. Computer Speech & Language, 19(4):398–414.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193, Cambridge, MA. Association for Computational Linguistics.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Gemma Boleda, Marco Baroni, The Nghia Pham, and Louise McNally. 2013. Intensionality was only alleged: On adjective-noun composition in distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, pages 35–46, Potsdam, Germany. Association for Computational Linguistics.

Wei-Te Chen, Claire Bonial, and Martha Palmer. 2015. English light verb construction identification using lexical knowledge. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. Association for Computational Linguistics.

Thomas C. Cooper. 1999. Processing of idioms by L2 learners of English. TESOL Quarterly, 33(2):233–262.

Silvio Cordeiro, Carlos Ramisch, Marco Idiart, and Aline Villavicencio. 2016. Predicting the compositionality of nominal compounds: Giving word embeddings a hard time. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1986–1997, Berlin, Germany. Association for Computational Linguistics.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Zanzotto. 2013. Recognizing Textual Entailment. Morgan & Claypool Publishers.

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, and Stephan Vogel. 2017. Understanding and improving morphological learning in the neural machine translation decoder. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 142–151, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota. Association for Computational Linguistics.

Corina Dima. 2016. On the compositionality and semantic interpretation of English noun compounds. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 27–39.

Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. General estimation and evaluation of compositional distributional semantic models. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 50–58, Sofia, Bulgaria. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 240–248. Association for Computational Linguistics.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Z. Harris. 1954. Distributional Hypothesis. Word, 10(23):146–162.

Matthias Hartung. 2015. Distributional Semantic Models of Attribute Meaning in Adjectives and Nouns. Ph.D. thesis, Heidelberg University.

Matthias Hartung, Fabian Kaupmann, Soufian Jebbara, and Philipp Cimiano. 2017. Learning compositionality functions on word embeddings for modelling attribute meaning in adjective-noun phrases. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 54–64, Valencia, Spain. Association for Computational Linguistics.

Iris Hendrickx, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, and Tony Veale. 2013. SemEval-2013 task 4: Free paraphrases of noun compounds. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 138–143. Association for Computational Linguistics.

Otto Jespersen. 1965. A Modern English Grammar: On Historical Principles. George Allen & Unwin Limited.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota. Association for Computational Linguistics.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Jean Maillard and Stephen Clark. 2015. Learning adjective meanings with a tensor-based Skip-Gram model. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 327–331, Beijing, China. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In International Conference on Learning Representations (ICLR).

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Preslav Nakov. 2013. On the interpretation of noun compounds: Syntax, semantics, and entailment. Natural Language Engineering, 19(3):291–330.

Vivi Nastase, Preslav Nakov, Diarmuid Ó Séaghdha, and Stan Szpakowicz. 2013. Semantic Relations between Nominals. Synthesis Lectures on Human Language Technologies, 6(1):1–119.

Geoffrey Nunberg, Ivan A. Sag, and Thomas Wasow. 1994. Idioms. Language, 70(3):491–538.

Diarmuid Ó Séaghdha and Ann Copestake. 2009. Using lexical and relational similarity to classify semantic relations. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 621–629, Athens, Greece. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Autodiff Workshop, NIPS 2017.

Ellie Pavlick and Chris Callison-Burch. 2016. Most "babies" are "little" and most "problems" are "huge": Compositional entailment in adjective-nouns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2164–2173, Berlin, Germany. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Karl Pichotta and John DeNero. 2013. Identifying phrasal verbs using many bilingual corpora. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 636–646, Seattle, Washington, USA. Association for Computational Linguistics.

Adam Poliak, Pushpendre Rastogi, M. Patrick Martin, and Benjamin Van Durme. 2017. Efficient, compositional, order-sensitive n-gram embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 503–508, Valencia, Spain. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

Siva Reddy, Diana McCarthy, and Suresh Manandhar. 2011. An empirical study on compositionality in compound nouns. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 210–218, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15. Springer.

Bahar Salehi, Paul Cook, and Timothy Baldwin. 2015. A word embedding approach to predicting the compositionality of multiword expressions. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 977–983, Denver, Colorado. Association for Computational Linguistics.

Nathan Schneider and Noah A. Smith. 2015. A corpus and model integrating multiword expressions and supersenses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1537–1547, Denver, Colorado. Association for Computational Linguistics.

Marco Silvio Giuseppe Senaldi, Yuri Bizzoni, and Alessandro Lenci. 2019. What Do Neural Networks Actually Learn, When They Learn to Identify Idioms? Proceedings of the Society for Computation in Linguistics, 2(1):310–313.

Rico Sennrich. 2017. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation