What do you mean, BERT? Assessing BERT as a Distributional Semantics Model


Timothee Mickus (Université de Lorraine, CNRS, ATILF) tmickus@atilf.fr
Denis Paperno (Utrecht University) d.paperno@uu.nl
Mathieu Constant (Université de Lorraine, CNRS, ATILF) mconstant@atilf.fr
Kees van Deemter (Utrecht University) c.j.vandeemter@uu.nl

Abstract

Contextualized word embeddings, i.e. vector representations for words in context, are naturally seen as an extension of previous non-contextual distributional semantic models. In this work, we focus on BERT, a deep neural network that produces contextualized embeddings and has set the state-of-the-art in several semantic tasks, and study the semantic coherence of its embedding space. While showing a tendency towards coherence, BERT does not fully live up to the natural expectations for a semantic vector space. In particular, we find that the position of the sentence in which a word occurs, while having no meaning correlates, leaves a noticeable trace on the word embeddings and disturbs similarity relationships.

1   Introduction

A recent success story of NLP, BERT (Devlin et al., 2018) stands at the crossroad of two key innovations that have brought about significant improvements over previous state-of-the-art results. On the one hand, BERT models are an instance of contextual embeddings (McCann et al., 2017; Peters et al., 2018), which have been shown to be subtle and accurate representations of words within sentences. On the other hand, BERT is a variant of the Transformer architecture (Vaswani et al., 2017) which has set a new state-of-the-art on a wide variety of tasks ranging from machine translation (Ott et al., 2018) to language modeling (Dai et al., 2019). BERT-based models have significantly increased state-of-the-art over the GLUE benchmark for natural language understanding (Wang et al., 2019b) and most of the best scoring models for this benchmark include or elaborate on BERT. Using BERT representations has become in many cases a new standard approach: for instance, all submissions at the recent shared task on gendered pronoun resolution (Webster et al., 2019) were based on BERT. Furthermore, BERT serves both as a strong baseline and as a basis for a fine-tuned state-of-the-art word sense disambiguation pipeline (Wang et al., 2019a).

Analyses aiming to understand the mechanical behavior of Transformers in general, and BERT in particular, have suggested that they compute word representations through implicitly learned syntactic operations (Raganato and Tiedemann, 2018; Clark et al., 2019; Coenen et al., 2019; Jawahar et al., 2019, a.o.): representations computed through the ‘attention’ mechanisms of Transformers can arguably be seen as weighted sums of intermediary representations from the previous layer, with many attention heads assigning higher weights to syntactically related tokens (however, contrast with Brunner et al., 2019; Serrano and Smith, 2019).

Complementing these previous studies, in this paper we adopt a more theory-driven lexical semantic perspective. While a clear parallel was established between ‘traditional’ noncontextual embeddings and the theory of distributional semantics (a.o. Lenci, 2018; Boleda, 2019), this link is not automatically extended to contextual embeddings: some authors (Westera and Boleda, 2019) even explicitly consider only “context-invariant” representations as distributional semantics. Hence we study to what extent BERT, as a contextual embedding architecture, satisfies the properties expected from a natural contextualized extension of distributional semantics models (DSMs).

DSMs assume that meaning is derived from use in context. DSMs are nowadays systematically represented using vector spaces (Lenci, 2018). They generally map each word in the domain of the model to a numeric vector on the basis of distributional criteria; vector components are inferred from text data. DSMs have also been computed for linguistic items other than words, e.g.,
word senses—based both on meaning inventories (Rothe and Schütze, 2015) and word sense induction techniques (Bartunov et al., 2015)—or meaning exemplars (Reisinger and Mooney, 2010; Erk and Padó, 2010; Reddy et al., 2011). The default approach has however been to produce representations for word types. Word properties encoded by DSMs vary from morphological information (Marelli and Baroni, 2015; Bonami and Paperno, 2018) to geographic information (Louwerse and Zwaan, 2009), to social stereotypes (Bolukbasi et al., 2016) and to referential properties (Herbelot and Vecchi, 2015).

A reason why contextualized embeddings have not been equated to distributional semantics may lie in that they are “functions of the entire input sentence” (Peters et al., 2018). Whereas traditional DSMs match word types with numeric vectors, contextualized embeddings produce distinct vectors per token. Ideally, the contextualized nature of these embeddings should reflect the semantic nuances that context induces in the meaning of a word—with varying degrees of subtlety, ranging from broad word-sense disambiguation (e.g. ‘bank’ as a river embankment or as a financial institution) to narrower subtypes of word usage (‘bank’ as a corporation or as a physical building) and to more context-specific nuances.

Regardless of how apt contextual embeddings such as BERT are at capturing increasingly finer semantic distinctions, we expect the contextual variation to preserve the basic DSM properties. Namely, we expect that the space structure encodes meaning similarity and that variation within the embedding space is semantic in nature. Similar words should be represented with similar vectors, and only semantically pertinent distinctions should affect these representations. We connect our study with previous work in section 2 before detailing the two approaches we followed. First, we verify in section 3 that BERT embeddings form natural clusters when grouped by word types, which on any account should be groups of similar words and thus be assigned similar vectors. Second, we test in sections 4 and 5 that contextualized word vectors do not encode semantically irrelevant features: in particular, leveraging some knowledge from the architectural design of BERT, we address whether there is no systematic difference between BERT representations in odd and even sentences of running text—a property we refer to as cross-sentence coherence. In section 4, we test whether we can observe cross-sentence coherence for single tokens, whereas in section 5 we study to what extent incoherence of representations across sentences affects the similarity structure of the semantic space. We summarize our findings in section 6.

2   Theoretical background & connections

Word embeddings have been said to be ‘all-purpose’ representations, capable of unifying the otherwise heterogeneous domain that is NLP (Turney and Pantel, 2010). To some extent this claim spearheaded the evolution of NLP: focus recently shifted from task-specific architectures with limited applicability to universal architectures requiring little to no adaptation (Radford, 2018; Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019; Liu et al., 2019, a.o.).

Word embeddings are linked to the distributional hypothesis, according to which “you shall know a word from the company it keeps” (Firth, 1957). Accordingly, the meaning of a word can be inferred from the effects it has on its context (Harris, 1954); as this framework equates the meaning of a word to the set of its possible usage contexts, it corresponds more to holistic theories of meaning (Quine, 1960, a.o.) than to truth-value accounts (Frege, 1892, a.o.). In early works on word embeddings (Bengio et al., 2003, e.g.), a straightforward parallel between word embeddings and distributional semantics could be made: the former are distributed representations of word meaning, the latter a theory stating that a word’s meaning is drawn from its distribution. In short, word embeddings could be understood as a vector-based implementation of the distributional hypothesis. This parallel is much less obvious for contextual embeddings: are constantly changing representations truly an apt description of the meaning of a word?

More precisely, the literature on distributional semantics has put forth and discussed many mathematical properties of embeddings: embeddings are equivalent to count-based matrices (Levy and Goldberg, 2014b), expected to be linearly dependent (Arora et al., 2016), expressible as a unitary matrix (Smith et al., 2017) or as a perturbation of an identity matrix (Yin and Shen, 2018). All these properties have however been formalized for non-contextual embeddings: they were formulated using the tools of matrix algebra, under the assumption that embedding matrix rows correspond
to word types. Hence they cannot be applied as such to contextual embeddings. This disconnect in the literature leaves unanswered the question of what consequences there are to framing contextualized embeddings as DSMs.

The analyses that contextual embeddings have been subjected to differ from most analyses of distributional semantics models. Peters et al. (2018) analyzed through an extensive ablation study of ELMo what information is captured by each layer of their architecture. Devlin et al. (2018) discussed what part of their architecture is critical to the performances of BERT, comparing pre-training objectives, number of layers and training duration. Other works (Raganato and Tiedemann, 2018; Hewitt and Manning, 2019; Clark et al., 2019; Voita et al., 2019; Michel et al., 2019) have introduced specific procedures to understand how attention-based architectures function on a mechanical level. Recent research has however questioned the pertinence of these attention-based analyses (Serrano and Smith, 2019; Brunner et al., 2019); moreover these works have focused more on the inner workings of the networks than on their adequacy with theories of meaning.

One trait of DSMs that is very often encountered, discussed and exploited in the literature is the fact that the relative positions of embeddings are not random. Early vector space models, by design, required that words with similar meanings lie near one another (Salton et al., 1975); as a consequence, regions of the vector space describe coherent semantic fields.¹ Despite the importance of this characteristic, the question whether BERT contextual embeddings depict a coherent semantic space on their own has been left mostly untouched by papers focusing on analyzing BERT or Transformers (with some exceptions, e.g. Coenen et al., 2019). Moreover, many analyses of how meaning is represented in attention-based networks or contextual embeddings include “probes” (learned models such as classifiers) as part of their evaluation setup to ‘extract’ information from the embeddings (Peters et al., 2018; Tang et al., 2018; Coenen et al., 2019; Chang and Chen, 2019, e.g.). Yet this methodology has been criticized as potentially conflicting with the intended purpose of studying the representations themselves (Wieting and Kiela, 2019; Cover, 1965); cf. also Hewitt and Liang (2019) for a discussion. We refrain from using learned probes in favor of a more direct assessment of the coherence of the semantic space.

¹ Vectors encoding contrasts between words are also expected to be coherent (Mikolov et al., 2013b), although this assumption has been subjected to criticism (Linzen, 2016).

3   Experiment 1: Word Type Cohesion

The trait of distributional spaces that we focus on in this study is that similar words should lie in similar regions of the semantic space. This should hold all the more so for identical words, which ought to be maximally similar. By design, contextualized embeddings like BERT exhibit variation within vectors corresponding to identical word types. Thus, if BERT is a DSM, we expect that word types form natural, distinctive clusters in the embedding space. Here, we assess the coherence of word type clusters by means of their silhouette scores (Rousseeuw, 1987).

3.1   Data & Experimental setup

Throughout our experiments, we used the Gutenberg corpus as provided by the NLTK platform, out of which we removed older texts (King John’s Bible and Shakespeare). Sentences are enumerated two by two; each pair of sentences is then used as a distinct input source for BERT. As we treat the BERT algorithm as a black box, we retrieve only the embeddings from the last layer, discarding all intermediary representations and attention weights. We used the bert-large-uncased model in all experiments²; therefore most of our experiments are done on word-pieces.

² Measurements were conducted before the release of the bert-large-uncased-whole-words model.

To study the basic coherence of BERT’s semantic space, we can consider types as clusters of tokens—i.e. specific instances of contextualized embeddings—and thus leverage the tools of cluster analysis. In particular, the silhouette score is generally used to assess whether a specific observation $\vec{v}$ is well assigned to a given cluster $C_i$ drawn from a set of possible clusters $C$. The silhouette score is defined in eq. 1:

$$\mathrm{sep}(\vec{v}, C_i) = \min_{C_j \in C \setminus \{C_i\}} \operatorname*{mean}_{\vec{v}' \in C_j} d(\vec{v}, \vec{v}')$$
$$\mathrm{coh}(\vec{v}, C_i) = \operatorname*{mean}_{\vec{v}' \in C_i \setminus \{\vec{v}\}} d(\vec{v}, \vec{v}')$$
$$\mathrm{silh}(\vec{v}, C_i) = \frac{\mathrm{sep}(\vec{v}, C_i) - \mathrm{coh}(\vec{v}, C_i)}{\max\{\mathrm{sep}(\vec{v}, C_i),\ \mathrm{coh}(\vec{v}, C_i)\}} \qquad (1)$$

We used Euclidean distance for d. In our case, observations $\vec{v}$ therefore correspond to tokens (that is, word-piece tokens), and clusters $C_i$ to types.
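As an illustration of this setup, the sketch below shows one way to retrieve last-layer embeddings for paired sentences and to compute the centroid-based silhouette variant described below. It is a minimal sketch rather than the authors' implementation: it assumes the HuggingFace transformers and NLTK packages (with a recent transformers version exposing last_hidden_state), uses the bert-large-uncased model named in the paper, and restricts itself to a single Gutenberg text and a handful of sentence pairs purely for illustration.

    import nltk
    import numpy as np
    import torch
    from collections import defaultdict
    from nltk.corpus import gutenberg
    from transformers import BertModel, BertTokenizer

    nltk.download("gutenberg", quiet=True)
    tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
    model = BertModel.from_pretrained("bert-large-uncased")
    model.eval()

    # Enumerate sentences two by two; each pair is one BERT input.
    sentences = [" ".join(s) for s in gutenberg.sents("austen-emma.txt")]
    pairs = list(zip(sentences[0::2], sentences[1::2]))[:50]  # small illustrative sample

    tokens_by_type = defaultdict(list)  # word-piece type -> list of last-layer vectors
    with torch.no_grad():
        for first, second in pairs:
            enc = tokenizer(first, second, return_tensors="pt", truncation=True, max_length=512)
            last_layer = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)
            pieces = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
            for piece, vec in zip(pieces, last_layer):
                tokens_by_type[piece].append(vec.numpy())

    # Centroid-based silhouette: cohesion is the distance to the centroid of the
    # token's own type, separation the distance to the nearest other centroid.
    centroids = {t: np.mean(vs, axis=0) for t, vs in tokens_by_type.items() if len(vs) > 1}
    types = list(centroids)
    matrix = np.stack([centroids[t] for t in types])
    index = {t: k for k, t in enumerate(types)}

    def silhouette(vec, own_type):
        dists = np.linalg.norm(matrix - vec, axis=1)
        coh = dists[index[own_type]]
        sep = np.min(np.delete(dists, index[own_type]))
        return (sep - coh) / max(sep, coh)

    scores = np.array([silhouette(v, t) for t in types for v in tokens_by_type[t]])
    print("mean silhouette:", scores.mean(), "- share below zero:", (scores < 0).mean())

Using centroids rather than all pairwise distances, as discussed in the next paragraph, keeps the computation tractable at the cost of ignoring the inner structure of the clusters.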
Silhouette scores consist in computing for each vector observation $\vec{v}$ a cohesion score (viz. the average distance to the other observations in the cluster $C_i$) and a separation score (viz. the minimal average distance to the observations of any other cluster, i.e. the minimal ‘cost’ of assigning $\vec{v}$ to any other cluster than $C_i$). Optimally, cohesion is to be minimized and separation is to be maximized, and this is reflected in the silhouette score itself: scores are defined between -1 and 1; -1 denotes that the observation $\vec{v}$ should be assigned to another cluster than $C_i$, whereas 1 denotes that the observation $\vec{v}$ is entirely consistent with the cluster $C_i$. Keeping track of silhouette scores for a large number of vectors quickly becomes intractable, hence we use a slightly modified version of the above definition, and compute separation and cohesion using the distance to the average vector for a cluster rather than the average distance to other vectors in a cluster, as suggested by Vendramin et al. (2013). Though results are not entirely equivalent, as they ignore the inner structure of clusters, they still present a gross view of the consistency of the vector space under study.

We do note two caveats with our proposed methodology. Firstly, BERT uses subword representations, and thus BERT tokens do not necessarily correspond to words. However we may conjecture that some subwords exhibit coherent meanings, based on whether they tightly correspond to morphemes—e.g. ‘##s’, ‘##ing’ or ‘##ness’. Secondly, we group word types based on character strings; yet only monosemous words should describe perfectly coherent clusters—whereas we expect some degree of variation for polysemous words and homonyms according to how widely their meanings may vary.

3.2   Results & Discussion

We compared cohesion to separation scores using a paired Student’s t-test, and found a significant effect (p-value < $2 \cdot 2^{-16}$). This highlights that cohesion scores are lower than separation scores. The effect size as measured by Cohen’s d (Cohen’s d = -0.121) is however rather small, suggesting that cohesion scores are only 12% lower than separation scores. More problematically, we can see in figure 1 that 25.9% of the tokens have a negative silhouette score: one out of four tokens would be better assigned to some other type than the one they belong to. When aggregating scores by types, we found that 10% of types contained only tokens with negative silhouette score.

[Figure 1: Distribution of token silhouette scores]

The standards we expect of DSMs are not always upheld strictly; the median and mean score are respectively at 0.08 and 0.06, indicating a general trend of low scores, even when they are positive. We previously noted that both the use of subword representations in BERT as well as polysemy and homonymy might impact these results. The amount of meaning variation induced by polysemy and homonymy can be estimated by using a dictionary as a sense inventory. The number of distinct entries for a type serves as a proxy measure of how much its meaning varies in use. We thus used a linear model to predict silhouette scores with log-scaled frequency and log-scaled definition counts, as listed in the Wiktionary, as predictors. We selected tokens for which we found at least one entry in the Wiktionary, out of which we then randomly sampled 10000 observations. Both definition counts and frequency were found to be significant predictors, leading the silhouette score to decrease. This suggests that polysemy degrades the cohesion score of the type cluster, which is compatible with what one would expect from a DSM. We moreover observed that monosemous words yielded higher silhouette scores than polysemous words (p < $2 \cdot 2^{-16}$, Cohen’s d = 0.236), though they still include a substantial number of tokens with negative silhouette scores.

Similarity also includes related words, and not only tokens of the same type. Other studies (Vial et al., 2019; Coenen et al., 2019, e.g.) already stressed that BERT embeddings perform well on word-level semantic tasks. To directly assess whether BERT captures this broader notion of similarity, we used the MEN word similarity dataset
(Bruni et al., 2014), which lists pairs of English words with human annotated similarity ratings. We removed pairs containing words for which we had no representation, leaving us with 2290 pairs. We then computed the Spearman correlation between similarity ratings and the cosine of the average BERT embeddings of the two paired word types, and found a correlation of 0.705, showing that cosine similarity of average BERT embeddings encodes semantic similarity. For comparison, a word2vec DSM (Mikolov et al., 2013a, henceforth W2V) trained on BooksCorpus (Zhu et al., 2015) using the same tokenization as BERT achieved a correlation of 0.669.

4   Experiment 2: Cross-Sentence Coherence

As observed in the previous section, overall the word type coherence in BERT tends to match our basic expectations. In this section, we do further tests, leveraging our knowledge of the design of BERT. We detail the effects of jointly using segment encodings to distinguish between paired input sentences and residual connections.

4.1   Formal approach

We begin by examining the architectural design of BERT. We give some elements relevant to our study here and refer the reader to the original papers by Vaswani et al. (2017) and Devlin et al. (2018), introducing Transformers and BERT, for a more complete description. On a formal level, BERT is a deep neural network composed of superposed layers of computations. Each layer is composed of two “sub-layers”: the first performing “multi-head attention”, the second being a simple feed-forward network. Throughout all layers, after each sub-layer, residual connections and layer normalization are applied, thus the intermediary output $\vec{o}_L$ after sub-layer L can be written as a function of the input $\vec{x}_L$, as $\vec{o}_L = \mathrm{LayerNorm}(\mathrm{Sub}_L(\vec{x}_L) + \vec{x}_L)$.

BERT is optimized on two training objectives. The first, called masked language model, is a variation on the Cloze test for reading proficiency (Taylor, 1953). The second, called next sentence prediction (NSP), corresponds to predicting whether two sentences are found one next to the other in the original corpus or not. Each example passed as input to BERT is comprised of two sentences, either contiguous sentences from a document, or randomly selected sentences. A special token [SEP] is used to indicate sentence boundaries, and the full input is prepended with a second special token [CLS] used to perform the actual prediction for NSP. Each token is transformed into an input vector using an input embedding matrix. To distinguish between tokens from the first and the second sentence, the model adds a learned feature vector $\vec{seg}_A$ to all tokens from first sentences, and a distinct learned feature vector $\vec{seg}_B$ to all tokens from second sentences; these feature vectors are called ‘segment encodings’. Lastly, as Transformer models do not have an implicit representation of word-order, information regarding the index i of the token in the sentence is added using a positional encoding $\vec{p}(i)$. Therefore, if the initial training example was “My dog barks. It is a pooch.”, the actual input would correspond to the following sequence of vectors:

$$\vec{\mathrm{[CLS]}} + \vec{p}(0) + \vec{seg}_A,\ \vec{\mathrm{My}} + \vec{p}(1) + \vec{seg}_A,\ \vec{\mathrm{dog}} + \vec{p}(2) + \vec{seg}_A,\ \vec{\mathrm{barks}} + \vec{p}(3) + \vec{seg}_A,\ \vec{.} + \vec{p}(4) + \vec{seg}_A,\ \vec{\mathrm{[SEP]}} + \vec{p}(5) + \vec{seg}_A,$$
$$\vec{\mathrm{It}} + \vec{p}(6) + \vec{seg}_B,\ \vec{\mathrm{is}} + \vec{p}(7) + \vec{seg}_B,\ \vec{\mathrm{a}} + \vec{p}(8) + \vec{seg}_B,\ \vec{\mathrm{pooch}} + \vec{p}(9) + \vec{seg}_B,\ \vec{.} + \vec{p}(10) + \vec{seg}_B,\ \vec{\mathrm{[SEP]}} + \vec{p}(11) + \vec{seg}_B$$

Due to the general use of residual connections, marking the sentences using the segment encodings $\vec{seg}_A$ and $\vec{seg}_B$ can introduce a systematic offset within sentences. Consider that the first layer uses as input vectors corresponding to word, position, and sentence information: $\vec{w}_i + \vec{p}(i) + \vec{seg}_i$; for simplicity, let $\vec{i}_i = \vec{w}_i + \vec{p}(i)$; we also ignore the rest of the input as it does not impact this reformulation. The output from the first sub-layer $\vec{o}^{\,1}_i$ can be written:

$$\begin{aligned}
\vec{o}^{\,1}_i &= \mathrm{LayerNorm}\big(\mathrm{Sub}^1(\vec{i}_i + \vec{seg}_i) + \vec{i}_i + \vec{seg}_i\big) \\
&= \vec{b}^{\,1} + \vec{g}^{\,1} \odot \frac{1}{\sigma^1_i}\,\mathrm{Sub}^1(\vec{i}_i + \vec{seg}_i) + \vec{g}^{\,1} \odot \frac{1}{\sigma^1_i}\,\vec{i}_i - \vec{g}^{\,1} \odot \frac{1}{\sigma^1_i}\,\mu\big(\mathrm{Sub}^1(\vec{i}_i + \vec{seg}_i) + \vec{i}_i + \vec{seg}_i\big) + \vec{g}^{\,1} \odot \frac{1}{\sigma^1_i}\,\vec{seg}_i \\
&= \vec{\tilde{o}}^{\,1}_i + \vec{g}^{\,1} \odot \frac{1}{\sigma^1_i}\,\vec{seg}_i
\end{aligned} \qquad (2)$$

This equation is obtained by simply injecting the
definition for layer-normalization.³ Therefore, by recurrence, the final output $\vec{o}^{\,L}_i$ for a given token $\vec{w}_i + \vec{p}(i) + \vec{seg}_i$ can be written as:

$$\vec{o}^{\,L}_i = \vec{\tilde{o}}^{\,L}_i + \Big(\bigodot_{l=1}^{L} \vec{g}^{\,l}\Big) \odot \Big(\prod_{l=1}^{L} \frac{1}{\sigma^l_i}\Big) \times \vec{seg}_i \qquad (3)$$

³ Layer normalization after sub-layer l is defined as:
$$\mathrm{LayerNorm}_l(\vec{x}) = \vec{b}_l + \vec{g}_l \odot \frac{\vec{x} - \mu(\vec{x})}{\sigma} = \vec{b}_l + \vec{g}_l \odot \frac{1}{\sigma}\,\vec{x} - \vec{g}_l \odot \frac{1}{\sigma}\,\mu(\vec{x})$$
where $\vec{b}_l$ is a bias, $\odot$ denotes element-wise multiplication, $\vec{g}_l$ is a “gain” parameter, $\sigma$ is the standard deviation of the components of $\vec{x}$ and $\mu(\vec{x}) = \langle \bar{x}, \dots, \bar{x} \rangle$ is a vector with all components defined as the mean of the components of $\vec{x}$.

This rewriting trick shows that segment encodings are partially preserved in the output. All embeddings within a sentence contain a shift in a specific direction, determined only by the initial segment encoding and the learned gain parameters for layer normalization. In figure 2, we illustrate what this systematic shift might entail. Prior to the application of the segment encoding bias, the semantic space is structured by similarity (‘pooch’ is near ‘dog’); with the bias, we find a different set of characteristics: in our toy example, tokens are linearly separable by sentences.

[Figure 2: Segment encoding bias; (a) without segment encoding bias, (b) with segment encoding bias]

4.2   Data & Experimental setup

If BERT properly describes a semantic vector space, we should, on average, observe no significant difference in token encoding imputable to the segment the token belongs to. For a given word type w, we may constitute two groups: $w_{seg_A}$, the set of tokens for this type w belonging to first sentences in the inputs, and $w_{seg_B}$, the set of tokens of w belonging to second sentences. If BERT counterbalances the segment encodings, random differences should cancel out, and therefore the mean of all tokens in $w_{seg_A}$ should be equivalent to the mean of all tokens in $w_{seg_B}$.

We used the same dataset as in section 3. This setting (where all paired input sentences are drawn from running text) allows us to focus on the effects of the segment encodings. We retrieved the output embeddings of the last BERT layer and grouped them per word type. To assess the consistency of a group of embeddings with respect to a purported reference, we used a mean of squared errors (MSE): given a group of embeddings E and a reference vector $\vec{r}$, we computed how much each vector in E strayed from the reference $\vec{r}$. It is formally defined as:

$$\mathrm{MSE}(E, \vec{r}) = \frac{1}{\#E} \sum_{\vec{v} \in E} \sum_{d} (\vec{v}_d - \vec{r}_d)^2 \qquad (4)$$

This MSE can also be understood as the average squared distance to the reference $\vec{r}$. When $\vec{r} = \bar{E}$, i.e. $\vec{r}$ is set to be the average vector in E, the MSE measures the variance of E via Euclidean distance. We then used the MSE function to construct pairs of observations: for each word type w, and for each segment encoding $seg_i$, we computed two scores: $\mathrm{MSE}(w_{seg_i}, \overline{w_{seg_i}})$—which gives us an assessment of how coherent the set of embeddings $w_{seg_i}$ is with respect to the mean vector in that set—and $\mathrm{MSE}(w_{seg_i}, \overline{w_{seg_j}})$—which assesses how coherent the same group of embeddings is with respect to the mean vector for the embeddings of the same type, but from the other segment $seg_j$. If no significant contrast between these two scores can be observed, then BERT counterbalances the segment encodings and is coherent across sentences.

4.3   Results & Discussion

We compared results using a paired Student’s t-test, which highlighted a significant difference based on which segment types belonged to (p-value < $2 \cdot 2^{-16}$); the effect size (Cohen’s d =
-0.527) was found to be stronger than what we computed when assessing whether tokens cluster according to their types (cf. section 3). A visual representation of these results, log-scaled, is shown in figure 3. For all sets $w_{seg_i}$, the average embedding from the set itself was systematically a better fit than the average embedding from the paired set $w_{seg_j}$. We also noted that a small number of items yielded a disproportionate difference in MSE scores and that frequent word types had smaller differences in MSE scores: roughly speaking, very frequent items—punctuation signs, stopwords, frequent word suffixes—received embeddings that are almost coherent across sentences.

[Figure 3: Log-scaled MSE per reference]

Although the observed positional effect of embeddings’ inconsistency might be entirely due to segment encodings, additional factors might be at play. In particular, BERT uses absolute positional encoding vectors to order words within a sequence: the first word $w_1$ is marked with the positional encoding $\vec{p}(1)$, the second word $w_2$ with $\vec{p}(2)$, and so on until the last word, $w_n$, marked with $\vec{p}(n)$. As these positional encodings are added to the word embeddings, the same remark made earlier on the impact of residual connections may apply to these positional encodings as well. Lastly, we also note that many downstream applications use a single segment encoding per input, and thus sidestep the caveat stressed here.

5   Experiment 3: Sentence-level structure

We have seen that BERT embeddings do not fully respect cross-sentence coherence; the same type receives somewhat different representations for occurrences in even and odd sentences. However, comparing tokens of the same type in consecutive sentences is not necessarily the main application of BERT and related models. Does the segment-based representational variance affect the structure of the semantic space, instantiated in similarities between tokens of different types? Here we investigate how segment encodings impact the relation between any two tokens in a given sentence.

5.1   Data & Experimental setup

Consistent with previous experiments, we used the same dataset (cf. section 3); in this experiment also, mitigating the impact of the NSP objective was crucial. Sentences were thus passed two by two as input to the BERT model. As cosine has been traditionally used to quantify semantic similarity between words (Mikolov et al., 2013b; Levy and Goldberg, 2014a, e.g.), we then computed pairwise cosines of the tokens in each sentence. This allows us to reframe our assessment of whether lexical contrasts are coherent across sentences as a comparison of semantic dissimilarity across sentences. More formally, we compute the following set of cosine scores $C_S$ for each sentence S:

$$C_S = \{\cos(\vec{v}, \vec{u}) \mid \vec{v} \neq \vec{u} \wedge \vec{v}, \vec{u} \in E_S\} \qquad (5)$$

with $E_S$ the set of embeddings for the sentence S. In this analysis, we compare the union of all sets of cosine scores for first sentences against the union of all sets of cosine scores for second sentences. To avoid asymmetry, we remove the [CLS] token (only present in first sentences), and as with previous experiments we neutralize the effects of the NSP objective by using only consecutive sentences as input.

5.2   Results & Discussion

We compared cosine scores for first and second sentences using a Wilcoxon rank sum test. We observed a significant effect, however small (Cohen’s d = 0.011). This may perhaps be due to data idiosyncrasies, and indeed when comparing with a W2V (Mikolov et al., 2013a) trained on BooksCorpus (Zhu et al., 2015) using the same tokenization as BERT, we do observe a significant effect (p < 0.05). However the effect size is six times smaller (d = 0.002) than what we found for BERT representations; moreover, when varying the sample size (cf. figure 4), p-values for BERT representations drop much faster to statistical significance.
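As a concrete illustration of this comparison, the sketch below collects the per-sentence cosine sets of eq. 5 for first and second sentences and compares the two populations with a Wilcoxon rank sum test. It is a minimal illustration assuming NumPy and SciPy, not the authors' code; the random arrays merely stand in for BERT token embeddings that would in practice be grouped by input sentence (e.g. with the extraction sketch of section 3.1).

    import numpy as np
    from itertools import combinations
    from scipy.stats import ranksums

    def pairwise_cosines(embeddings):
        # All cos(v, u) for distinct token vectors of one sentence (eq. 5).
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        return [float(normed[i] @ normed[j]) for i, j in combinations(range(len(normed)), 2)]

    def compare_segments(first_sentences, second_sentences):
        # Inputs: lists of (n_tokens, dim) arrays, with [CLS] already removed
        # from first sentences to avoid asymmetry between the two populations.
        cos_first = [c for emb in first_sentences for c in pairwise_cosines(emb)]
        cos_second = [c for emb in second_sentences for c in pairwise_cosines(emb)]
        statistic, p_value = ranksums(cos_first, cos_second)
        return np.mean(cos_first), np.mean(cos_second), p_value

    # Toy usage with random vectors standing in for BERT outputs:
    rng = np.random.default_rng(0)
    firsts = [rng.normal(size=(12, 1024)) for _ in range(100)]
    seconds = [rng.normal(size=(12, 1024)) for _ in range(100)]
    print(compare_segments(firsts, seconds))

Re-running the test on random subsamples of the two score populations at increasing sizes gives the kind of sample-size curve reported in figure 4.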
[Figure 4: Wilcoxon tests, 1st vs. 2nd sentences. p-value as a function of sample size (1K, 10K, 100K, 1M, Full) for BERT and W2V, with the p = 0.05 threshold marked.]

Model                 STS cor.    SICK-R cor.
Skip-Thought          0.25560     0.48762
USE                   0.66686     0.68997
InferSent             0.67646     0.70903
BERT, 2 sent. ipt.    0.35913     0.36992
BERT, 1 sent. ipt.    0.48241     0.58695
W2V                   0.37017     0.53356

Table 1: Correlation (Spearman ρ) of cosine similarity and relatedness ratings on the STS and SICK-R benchmarks

A possible reason for the larger discrepancy observed in BERT representations might be that BERT uses absolute positional encodings, i.e. the kth word of the input is encoded with $\vec{p}(k)$. Therefore, although all first sentences of a given length l will be indexed with the same set of positional encodings $\{\vec{p}(1), \dots, \vec{p}(l)\}$, only second sentences of a given length l preceded by first sentences of a given length j share the exact same set of positional encodings $\{\vec{p}(j+1), \dots, \vec{p}(j+l)\}$. As highlighted previously, the residual connections ensure that the segment encodings were partially preserved in the output embedding: the same argument can be made for positional encodings. In any event, the fact is that we do observe on BERT representations an effect of segment on sentence-level structure. This effect is greater than one can blame on data idiosyncrasies, as verified by the comparison with a traditional DSM such as W2V. If we are to consider BERT as a DSM, we must do so at the cost of cross-sentence coherence.

The analysis above suggests that embeddings for tokens drawn from first sentences live in a different semantic space than tokens drawn from second sentences, i.e. that BERT contains two DSMs rather than one. If so, the comparison between two sentence representations from a single input would be meaningless, or at least less coherent than the comparison of two sentence representations drawn from the same sentence position. To test this conjecture, we use two compositional semantics benchmarks: STS (Cer et al., 2017) and SICK-R (Marelli et al., 2014). These datasets are structured as triplets, grouping a pair of sentences with a human-annotated relatedness score. The original presentation of BERT (Devlin et al., 2018) did include a downstream application to these datasets, but employed a learned classifier, which obfuscates results (Wieting and Kiela, 2019; Cover, 1965; Hewitt and Liang, 2019). Hence we simply reduce the sequence of tokens within each sentence into a single vector by summing them, a simplistic yet robust semantic composition method. We then compute the Spearman correlation between the cosines of the two sum vectors and the sentence pair’s relatedness score. We compare two setups: a “two sentences input” scheme (or 2 sent. ipt. for short)—where we use the sequences of vectors obtained by passing the two sentences as a single input—and a “one sentence input” scheme (1 sent. ipt.)—using two distinct inputs of a single sentence each.

Results are reported in table 1; we also provide comparisons with three different sentence encoders and the aforementioned W2V model. As we had suspected, using sum vectors drawn from a two sentence input scheme degrades performances below the W2V baseline. On the other hand, a one sentence input scheme seems to produce coherent sentence representations: in that scenario, BERT performs better than W2V and the older sentence encoder Skip-Thought (Kiros et al., 2015), but worse than the modern USE (Cer et al., 2018) and InferSent (Conneau et al., 2017). The comparison with W2V also shows that BERT representations over a coherent input are more likely to include some form of compositional knowledge than traditional DSMs; however it is difficult to decide whether some true form of compositionality is achieved by BERT or whether these performances are entirely a by-product of the positional encodings. In favor of the former, other research has suggested that Transformer-
based architectures perform syntactic operations (Raganato and Tiedemann, 2018; Hewitt and Manning, 2019; Clark et al., 2019; Jawahar et al., 2019; Voita et al., 2019; Michel et al., 2019). In all, these results suggest that the semantic space of token representations from second sentences differs from that of embeddings from first sentences.

6   Conclusions

Our experiments have focused on testing to what extent similar words lie in similar regions of BERT’s latent semantic space. Although we saw that type-level semantics seem to match our general expectations about DSMs, focusing on details leaves us with a much foggier picture.

The main issue stems from BERT’s “next sentence prediction objective”, which requires tokens to be marked according to which sentence they belong to. This introduces a distinction between first and second sentence of the input that runs contrary to our expectations in terms of cross-sentence coherence. The validity of such a distinction for lexical semantics may be questioned, yet its effects can be measured. The primary assessment conducted in section 3 shows that token representations did tend to cluster naturally according to their types, yet a finer study detailed in section 4 highlights that tokens from distinct sentence positions (even vs. odd) tend to have different representations. This can be seen as a direct consequence of BERT’s architecture: residual connections, along with the use of specific vectors to encode sentence position, entail that tokens for a given sentence position are ‘shifted’ with respect to tokens for the other position. Encodings have a substantial effect on the structure of the semantic subspaces of the two sentences in BERT input. Our experiments demonstrate that assuming sameness of the semantic space across the two input sentences can lead to a significant performance drop in semantic textual similarity.

One way to overcome this violation of cross-sentence coherence would be to consider first and second sentence representations as belonging to distinct distributional semantic spaces. The fact that first sentences were shown to have on average higher pairwise cosines than second sentences can be partially explained by the use of absolute positional encodings in BERT representations. Although positional encodings are required so that the model does not devolve into a bag-of-words system, absolute encodings are not: other works have proposed alternative relative position encodings (Shaw et al., 2018; Dai et al., 2019, e.g.); replacing the former with the latter may alleviate the gap in lexical contrasts. Other related questions that we must leave to future work encompass testing on other BERT models such as the whole-words model, or that of Liu et al. (2019) which differs only by its training objectives, as well as other contextual embedding architectures.

Our findings suggest that the formulation of the NSP objective of BERT obfuscates its relation to distributional semantics, by introducing a systematic distinction between first and second sentences which impacts the output embeddings. Similarly, other works (Lample and Conneau, 2019; Yang et al., 2019; Joshi et al., 2019; Liu et al., 2019) stress that the usefulness and pertinence of the NSP task were not obvious. These studies favored an empirical point of view; here, we have shown what sorts of caveats came along with such artificial distinctions from the perspective of a theory of lexical semantics. We hope that future research will extend and refine these findings, and further our understanding of the peculiarities of neural architectures as models of linguistic structure.

Acknowledgments

We thank Quentin Gliosca, whose remarks have been extremely helpful to this work. We also thank Olivier Bonami as well as three anonymous reviewers for their thoughtful criticism. The work was supported by a public grant overseen by the French National Research Agency (ANR) as part of the “Investissements d’Avenir” program: Idex Lorraine Université d’Excellence (reference: ANR-15-IDEX-0004).

References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. Linear algebraic structure of word senses, with applications to polysemy. CoRR, abs/1601.03764.

Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry P. Vetrov. 2015. Breaking sticks and ambiguities with adaptive skip-gram. CoRR, abs/1502.07257.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.
References

abs/1502.07257.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.
Gemma Boleda. 2019. Distributional semantics and linguistic theory. CoRR, abs/1905.01896.
Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4349–4357. Curran Associates, Inc.
Olivier Bonami and Denis Paperno. 2018. A characterisation of the inflection-derivation opposition in a distributional vector space. Lingua e Linguaggio.
Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. J. Artif. Intell. Res., 49:1–47.
Gino Brunner, Yang Liu, Damián Pascual, Oliver Richter, and Roger Wattenhofer. 2019. On the validity of self-attention as explanation in transformer models. arXiv e-prints, arXiv:1908.04211.
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.
Daniel M. Cer, Mona T. Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, pages 1–14.
Ting-Yun Chang and Yun-Nung Chen. 2019. What does this word mean? Explaining contextualized embeddings with natural language definition.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.
Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, and Martin Wattenberg. 2019. Visualizing and measuring the geometry of BERT. arXiv e-prints, arXiv:1906.02715.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
Thomas M. Cover. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Computers, 14(3):326–334.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Katrin Erk and Sebastian Padó. 2010. Exemplar-based models for word meaning in context. In ACL.
J. R. Firth. 1957. A synopsis of linguistic theory 1930-55. Studies in Linguistic Analysis (special volume of the Philological Society), 1952-59:1–32.
Gottlob Frege. 1892. Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100:25–50.
Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.
Aurélie Herbelot and Eva Maria Vecchi. 2015. Building a shared world: Mapping distributional to model-theoretic semantic spaces. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 22–32. Association for Computational Linguistics.
John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. arXiv e-prints, arXiv:1909.03368.
John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In NAACL-HLT.
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. CoRR, abs/1907.10529.
Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3294–3302. Curran Associates, Inc.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. CoRR, abs/1901.07291.
Alessandro Lenci. 2018. Distributional models of word meaning. Annual Review of Linguistics, 4:151–171.
Omer Levy and Yoav Goldberg. 2014a. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180. Association for Computational Linguistics.
Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2177–2185. Curran Associates, Inc.
Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 13–18, Berlin, Germany. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Max M. Louwerse and Rolf A. Zwaan. 2009. Language encodes geographical information. Cognitive Science, 33(1):51–73.
Marco Marelli and Marco Baroni. 2015. Affixation in semantic space: Modeling morpheme meanings with compositional distributional semantics. Psychological Review, 122(3):485–515.
Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 216–223, Reykjavik, Iceland. European Language Resources Association (ELRA).
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.
Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? arXiv e-prints, arXiv:1905.10650.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 1–9, Brussels, Belgium. Association for Computational Linguistics.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Willard Van Orman Quine. 1960. Word and Object. MIT Press.
Alec Radford. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium. Association for Computational Linguistics.
Siva Reddy, Ioannis P. Klapaftis, Diana McCarthy, and Suresh Manandhar. 2011. Dynamic and static prototype vectors for semantic composition. In IJCNLP.
Joseph Reisinger and Raymond Mooney. 2010. Multi-prototype vector-space models of word meaning. In NAACL-HLT, pages 109–117.
Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. CoRR, abs/1507.01127.
Peter Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20(1):53–65.
G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620.
Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.
Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
Gongbo Tang, Rico Sennrich, and Joakim Nivre. 2018. An analysis of attention mechanisms: The case of word sense disambiguation in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 26–35, Brussels, Belgium. Association for Computational Linguistics.
Wilson Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30:415–433.
Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Lucas Vendramin, Pablo A. Jaskowiak, and Ricardo J. G. B. Campello. 2013. On the combination of relative clustering validity criteria. In Proceedings of the 25th International Conference on Scientific and Statistical Database Management, SSDBM, pages 4:1–4:12, New York, NY, USA. ACM.
Loïc Vial, Benjamin Lecouteux, and Didier Schwab. 2019. Sense vocabulary compression through the semantic knowledge of WordNet for neural word sense disambiguation. In Global WordNet Conference, Wroclaw, Poland.
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.
Kellie Webster, Marta R. Costa-jussà, Christian Hardmeier, and Will Radford. 2019. Gendered ambiguous pronoun (GAP) shared task at the gender bias in NLP workshop 2019. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 1–7, Florence, Italy. Association for Computational Linguistics.
Matthijs Westera and Gemma Boleda. 2019. Don't blame distributional semantics if it can't do entailment. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers, pages 120–133, Gothenburg, Sweden. Association for Computational Linguistics.
John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.
Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 887–898. Curran Associates, Inc.
Yukun Zhu, Jamie Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV), pages 19–27.