The Unstoppable Rise of Computational Linguistics in Deep Learning

James Henderson
Idiap Research Institute, Switzerland
james.henderson@idiap.ch

arXiv:2005.06420v3 [cs.CL] 11 Jun 2020
Accepted for publication at ACL 2020, in the theme track.

Abstract

In this paper, we trace the history of neural networks applied to natural language understanding tasks, and identify key contributions which the nature of language has made to the development of neural network architectures. We focus on the importance of variable binding and its instantiation in attention-based models, and argue that Transformer is not a sequence model but an induced-structure model. This perspective leads to predictions of the challenges facing research in deep learning architectures for natural language understanding.

1   Introduction

When neural networks first started being applied to natural language in the 1980s and 90s, they represented a radical departure from standard practice in computational linguistics. Connectionists had vector representations and learning algorithms, and they didn’t see any need for anything else. Everything was a point in a vector space, and everything about the nature of language could be learned from data. On the other hand, most computational linguists had linguistic theories and the poverty-of-the-stimulus argument. Obviously some things were learned from data, but all the interesting things about the nature of language had to be innate.

A quarter century later, we can say two things with certainty: they were both wrong. Vector-space representations and machine learning algorithms are much more powerful than was thought. Much of the linguistic knowledge which computational linguists assumed needed to be innate can in fact be learned from data. But the unbounded discrete structured representations they used have not been replaced by vector-space representations. Instead, the successful uses of neural networks in computational linguistics have replaced specific pieces of computational-linguistic models with new neural network architectures which bring together continuous vector spaces with structured representations in ways which are novel for both machine learning and computational linguistics.

Thus, the great progress which we have made through the application of neural networks to natural language processing should not be viewed as a conquest, but as a compromise. As well as the unquestionable impact of machine learning research on NLP, the nature of language has had a profound impact on progress in machine learning. In this paper we trace this impact, and speculate on future progress and its limits.

We start with a sketch of the insights from grammar formalisms about the nature of language, with their multiple levels, structured representations and rules. The rules were soon learned with statistical methods, followed by the use of neural networks to replace symbols with induced vectors, but the most effective models still kept structured representations, such as syntactic trees. More recently, attention-based models have replaced hand-coded structures with induced structures. The resulting models represent language with multiple levels of structured representations, much as has always been done. Given this perspective, we identify remaining challenges in learning language from data, and its possible limitations.

2   Grammar Formalisms versus Connectionism

2.1   Grammar Formalisms

Our modern understanding of the computational properties of language started with the introduction of grammar formalisms. Context Free Grammars (Chomsky, 1959) illustrated how a formal system could model the infinite generative capacity of language with a bounded grammar. This formalism soon proved inadequate to account for the diversity
of phenomena in human languages, and a number of linguistically-motivated grammar formalisms were proposed (e.g. HPSG (Pollard and Sag, 1987), TAG (Joshi, 1987), CCG (Steedman, 2000)).
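To make the earlier point about generative capacity concrete, the following toy sketch (purely illustrative, not taken from any of these formalisms) shows how a bounded set of context-free rules generates an unbounded set of sentences, because a recursive rule can be applied any number of times:

```python
import random

# A toy context-free grammar: a bounded set of rules, but the recursive
# NP -> NP PP rule lets it generate sentences of unbounded length.
rules = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"], ["NP", "PP"]],   # second expansion is recursive
    "PP": [["P", "NP"]],
    "VP": [["V", "NP"]],
    "N":  [["dog"], ["cat"], ["park"]],
    "P":  [["near"], ["in"]],
    "V":  [["saw"], ["chased"]],
}

def generate(symbol="S", depth=0, max_depth=4):
    """Expand a symbol top-down; the depth cap only keeps this demo finite."""
    if symbol not in rules:
        return [symbol]                   # terminal word
    expansions = rules[symbol]
    if depth >= max_depth:
        expansions = expansions[:1]       # past the cap, avoid the recursive option
    rhs = random.choice(expansions)
    return [word for sym in rhs for word in generate(sym, depth + 1, max_depth)]

random.seed(0)
for _ in range(3):
    print(" ".join(generate()))
```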
All these grammar formalisms shared certain properties, motivated by the understanding of the nature of languages in Linguistics. They all postulate representations which decompose an utterance into a set of sub-parts, with labels of the parts and a structure of inter-dependence between them. And they all assume that this decomposition happens at multiple levels of representation. For example, spoken utterances can be decomposed into sentences, sentences can be decomposed into words, words can be decomposed into morphemes, and morphemes can be decomposed into phonemes, before we reach the observable sound signal. In the interests of uniformity, we will refer to the sub-parts in each level of representation as its entities, their labels as their properties, and their structure of inter-dependence as their relations. The structure of inter-dependence between entities at different levels will also be referred to as relations.

In addition to these representations, grammar formalisms include specifications of the allowable structures. These may take the form of hard constraints or soft objectives, or of deterministic rules or stochastic processes. In all cases, the purpose of these specifications is to account for the regularities found in natural languages. In the interests of uniformity, we will refer to all these different kinds of specifications of allowable structures as rules. These rules may apply within or between levels of representation.

In addition to explicit rules, computational linguistic formalisms implicitly make claims about the regularities found in natural languages through their expressive power. Certain types of rules simply cannot be specified, thus claiming that such rules are not necessary to capture the regularities found in any natural language. These claims differ across formalisms, but the study of the expressive power of grammar formalisms has identified certain key principles (Joshi et al., 1990). Firstly, that the set of rules in a given grammar is bounded. This in turn implies that the set of properties and relations in a given grammar is also bounded.

But language is unbounded1 in nature, since sentences and texts can be arbitrarily long. Grammar formalisms capture this unboundedness by allowing an unbounded number of entities in a representation, and thus an unbounded number of rule applications. It is generally accepted that the number of entities grows linearly with the length of the sentence (Joshi et al., 1990), so each level can have at most a number of entities which is linear in the number of entities at the level(s) below.

1 A set of things (e.g. the sentences of a language) has unbounded size if for any finite size there is always some element in the set which is larger than that.

Computational linguistic grammar formalisms also typically assume that the properties and relations are discrete, called symbolic representations. These may be atomic categories, as in CFGs, TAGs, CCG and dependency grammar, or they may be feature structures, as in HPSG.

2.2   Connectionism

Other researchers who were more interested in the computational properties of neurological systems found this reliance on discrete categorical representations untenable. Processing in the brain used real-valued representations distributed across many neurons. Based on successes following the development of multi-layered perceptrons (MLPs) (Rumelhart et al., 1986b), an approach to modelling cognitive phenomena was developed called connectionism. Connectionism uses vector-space representations to reflect the distributed continuous nature of representations in the brain. Similarly, their rules are specified with vectors of continuous parameters. MLPs are so powerful that they are arbitrary function approximators (Hornik et al., 1989). And thanks to backpropagation learning (Rumelhart et al., 1986a) in neural network models, such as MLPs and Simple Recurrent Networks (SRNs) (Elman, 1990), these vector-space representations and rules could be learned from data.

The ability to learn powerful vector-space representations from data led many connectionists to argue that the complex discrete structured representations of computational linguistics were neither necessary nor desirable (e.g. Smolensky (1988, 1990); Elman (1991); Miikkulainen (1993); Seidenberg (2007)). Distributed vector-space representations were thought to be so powerful that there was no need for anything else. Learning from data made linguistic theories irrelevant. (See also (Collobert and Weston, 2008; Collobert et al., 2011; Sutskever et al., 2014) for more recent incarnations.)

The idea that vector-space representations are adequate for natural language and other cognitive phenomena was questioned from several directions.
From neuroscience, researchers questioned how a simple vector could encode features of more than one thing at a time. If we see a red square together with a blue triangle, how do we represent the difference between that and a red triangle with a blue square, since the vector elements for red, blue, square and triangle would all be active at the same time? This is known as the variable binding problem, so called because variables are used to do this binding in symbolic representations, as in red(x) ∧ triangle(x) ∧ blue(y) ∧ square(y). One proposal has been that the precise timing of neuron activation spikes could be used to encode variable binding, called Temporal Synchrony Variable Binding (von der Malsburg, 1981; Shastri and Ajjanagadde, 1993). Neural spike trains have both a phase and a period, so the phase could be used to encode variable binding while still allowing the period to be used for sequential computation. This work indicated how entities could be represented in a neurally-inspired computational architecture.
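The binding problem itself can be made concrete with a toy illustration (not from the cited work; just a minimal sketch of the point): superimposing all the active features in a single vector loses the bindings, while one vector per entity preserves them.

```python
import numpy as np

features = {"red": 0, "blue": 1, "square": 2, "triangle": 3}

def one_hot(names):
    """Feature vector with the named features active."""
    v = np.zeros(len(features))
    for name in names:
        v[features[name]] = 1.0
    return v

# Superimposing all features in a single vector loses the bindings.
scene_a = one_hot(["red", "square"]) + one_hot(["blue", "triangle"])
scene_b = one_hot(["red", "triangle"]) + one_hot(["blue", "square"])
print(np.array_equal(scene_a, scene_b))   # True: the two scenes look identical

# One vector per entity (playing the role of the variables x and y above)
# keeps each colour bound to its shape.
scene_a_entities = [one_hot(["red", "square"]), one_hot(["blue", "triangle"])]
scene_b_entities = [one_hot(["red", "triangle"]), one_hot(["blue", "square"])]
print(any(np.array_equal(u, w) for u in scene_a_entities for w in scene_b_entities))  # False
```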
The adequacy of vector-space representations was also questioned based on the regularities found in natural language. In particular, Fodor and Pylyshyn (1988) argued that connectionist architectures were not adequate to account for regularities which they characterised as systematicity (see also (Smolensky, 1990; Fodor and McLaughlin, 1990)). In essence, systematicity requires that learned rules generalise in a way that respects structured representations. Here again the issue is representing multiple entities at the same time, but with the additional requirement of representing the structural relationships between these entities. Only rules which are parameterised in terms of such representations can generalise in a way which accounts for the generalisations found in language.

Early work on neural networks for natural language recognised the significance of variable binding for solving the issues with systematicity (Henderson, 1996, 2000). Henderson (1994, 2000) argued that extending neural networks with temporal synchrony variable binding made them powerful enough to account for the regularities found in language. Using time to encode variable bindings means that learning could generalise in a linguistically appropriate way (Henderson, 1996), since rules (neuronal synapses) learned for one variable (time) would systematically generalise to other variables. Although relations were not stored explicitly, it was claimed that for language understanding it is adequate to recover them from the features of the entities (Henderson, 1994, 2000). But these arguments were largely theoretical, and it was not clear how they could be incorporated in learning-based architectures.

2.3   Statistical Models

Although researchers in computational linguistics did not want to abandon their representations, they did recognise the importance of learning from data. The first successes in this direction came from learning rules with statistical methods, such as part-of-speech tagging with hidden Markov models. For syntactic parsing, the development of the Penn Treebank led to many statistical models which learned the rules of grammar (Collins, 1997, 1999; Charniak, 1997; Ratnaparkhi, 1999).

These statistical models were very successful at learning from the distributions of linguistic representations which had been annotated in the corpus they were trained on. But they still required linguistically-motivated designs to work well. In particular, feature engineering is necessary to make sure that these statistical machine-learning methods can search a space of rules which is sufficiently broad to include good models but sufficiently narrow to allow learning from limited data.

3   Inducing Features of Entities

Early work on neural networks for natural language recognised the potential of neural networks for learning the features as well, replacing feature engineering. But empirically successful neural network models for NLP were only achieved with approaches where the neural network was used to model one component within an otherwise traditional symbolic NLP model.

The first work to achieve empirical success in comparison to non-neural statistical models was work on language modelling. Bengio et al. (2001, 2003) used an MLP to estimate the parameters of an n-gram language model, and showed improvements when interpolated with a statistical n-gram language model. A crucial innovation of this model was the introduction of word embeddings. The idea that the properties of a word could be represented by a vector reflecting the distribution of the word in text was introduced earlier in non-neural statistical models (e.g. (Deerwester et al., 1990; Schütze, 1993; Burgess, 1998; Padó and Lapata, 2007; Erk, 2010)). This work showed that similarity in the
Table 1: PTB Constituents

  model                               LP     LR     F1
  Costa et al. (2001)          PoS    57.8   64.9   61.1
  Henderson (2003)             PoS    83.3   84.3   83.8
  Henderson (2003)                    88.8   89.5   89.1
  Henderson (2004)                    89.8   90.4   90.1
  Vinyals et al. (2015) seq2seq

ble parses. These models have also been applied to syntactic dependency parsing (Titov and Henderson, 2007b; Yazdani and Henderson, 2015) and joint syntactic-semantic dependency parsing (Henderson et al., 2013).
In contrast to seq2seq models, there have also been neural network models of parsing which directly represent linguistic structure, rather than just derivation structure, giving them induced vector representations which map one-to-one with the entities in the linguistic representation. Typically, a recursive neural network is used to compute embeddings of syntactic constituents bottom-up. Dyer et al. (2015) showed improvements by adding these representations to a model of the derivation structure. Socher et al. (2013a) only modelled the linguistic structure, making it difficult to do decoding efficiently. But the resulting induced constituent embeddings have a clear linguistic interpretation, making it easier to use them within other tasks, such as sentiment analysis (Socher et al., 2013b). Similarly, models based on Graph Convolutional Networks have induced embeddings with clear linguistic interpretations within pre-defined model structures (e.g. (Marcheggiani and Titov, 2017; Marcheggiani et al., 2018)).

All these results demonstrate the incredible effectiveness of inducing vector-space representations with neural networks, relieving us from the need to do feature engineering. But neural networks do not relieve us of the need to understand the nature of language when designing our models. Instead of feature engineering, these results show that the best accuracy is achieved by engineering the inductive bias of deep learning models through their model structure. By designing a hand-coded model structure which reflects the linguistic structure, locality in the model structure can reflect locality in the linguistic structure. The neural network then induces features of the entities in this model structure.

4   Inducing Relations between Entities

With the introduction of attention-based models, the model structure can now be learned. By choosing the nodes to be linguistically-motivated entities, learning the model structure in effect learns the statistical inter-dependencies between entities, which is what we have been referring to as relations.

4.1   Attention-Based Models and Variable Binding

The first proposal of an attention-based neural model learned a soft alignment between the target and source words in neural machine translation (NMT) (Bahdanau et al., 2015). The model structure of the source sentence encoder and the model structure of the target sentence decoder are both flat sequences, but when each target word is generated, it computes attention weights over all source words. These attention weights directly express how target words are correlated with source words, and in this sense can be seen as a soft version of the alignment structure. In traditional statistical machine translation, this alignment structure is determined with a separate alignment algorithm, and then frozen while training the model. In contrast, the attention-based NMT model learns the alignment structure jointly with learning the encoder and decoder, inside the deep learning architecture (Bahdanau et al., 2015).

This attention-based approach to NMT was also applied to mapping a sentence to its syntactic parse (Vinyals et al., 2015). The attention function learns the structure of the relationship between the sentence and its syntactic derivation sequence, but does not have any representation of the structure of the syntactic derivation itself. Empirical results are much better than their seq2seq model (Vinyals et al., 2015), but not as good as models which explicitly model both structures (see Table 1).

The change from the sequential LSTM decoders of previous NMT models to LSTM decoders with attention seems like a simple addition, but it fundamentally changes the kinds of generalisations which the model is able to learn. At each step in decoding, the state of a sequential LSTM model is a single vector, whereas adding attention means that the state needs to include the unboundedly large set of vectors being attended to. This use of an unbounded state is more similar to the above models with predefined model structure, where an unboundedly large stack is needed to specify the parser state. This change in representation leads to a profound change in the generalisations which can be learned. Parameterised rules which are learned when paying attention to one of these vectors (in the set or in the stack) automatically generalise to the other vectors. In other words, attention-based models have variable binding, which sequential LSTMs do not. Each vector represents the features for one entity, multiple entities can be kept in memory at the same time, and rules generalise across these entities. In this sense it is wrong to refer to attention-based models as sequence models; they are in fact induced-structure models. We will expand on this perspective in the rest of this section.
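As a minimal sketch of why this matters (illustrative only, not the exact architecture of Bahdanau et al. (2015)), note that a single set of attention parameters scores every source vector with the same learned rule, so the same parameters handle a source sentence of any length:

```python
import numpy as np

def additive_attention(dec_state, src_states, W1, W2, v):
    """Score every source vector with the same parameters; return weights and context."""
    scores = np.tanh(src_states @ W1 + dec_state @ W2) @ v   # one score per source entity
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # soft alignment over source words
    context = weights @ src_states                            # blended source representation
    return weights, context

rng = np.random.default_rng(0)
d = 8
W1, W2 = rng.normal(size=(d, d)) * 0.3, rng.normal(size=(d, d)) * 0.3
v = rng.normal(size=d)
dec_state = rng.normal(size=d)

# The same parameters handle 3, 7 or 12 source entities: the relevant state is
# an unboundedly large set of vectors, not one fixed-size vector.
for length in (3, 7, 12):
    src_states = rng.normal(size=(length, d))
    weights, _ = additive_attention(dec_state, src_states, W1, W2, v)
    print(length, weights.round(2))
```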
4.2   Transformer and Systematicity

The generality of attention as a structure-induction method soon became apparent, culminating in the development of the Transformer architecture (Vaswani et al., 2017). Transformer has multiple stacked layers of self-attention (attention to the other words in the same sequence), interleaved with nonlinear functions applied to individual vectors. Each attention layer has multiple attention heads, allowing each head to learn a different type of relation. A Transformer-encoder has one column of stacked vectors for each position in the input sequence, and the model parameters are shared across positions. A Transformer-decoder adds attention over an encoded text, and predicts words one at a time after encoding the prefix of previously generated words.

Although it was developed for encoding and generating sequences, in Transformer the sequential structure is not hard-coded into the model structure, unlike previous models of deep learning for sequences (e.g. LSTMs (Hochreiter and Schmidhuber, 1997) and CNNs (LeCun and Bengio, 1995)). Instead, the sequential structure is input in the form of position embeddings. In our formulation, position embeddings are just properties of individual entities (typically words or subwords). As such, these inputs facilitate learning about absolute positions. But they are also designed to allow the model to easily calculate relative position between entities. This allows the model’s attention functions to learn to discover the relative position structure of the underlying sequence. In fact, explicitly inputting relative position relations as embeddings into the attention functions works even better (Shaw et al., 2018) (discussed further below). Whether input as properties or as relations, these inputs are just features, not hard-coded model structure. The attention weight functions can then learn to use these features to induce their own structure.
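The following single-head sketch (illustrative, in the spirit of Shaw et al. (2018) rather than their exact formulation) shows relative-position relations entering the attention weights as features, with no hard-coded sequential structure in the model itself:

```python
import numpy as np

def relative_self_attention(X, Wq, Wk, Wv, rel_emb, max_dist=4):
    """One self-attention head where the only order information is a learned
    embedding of the clipped relative offset j - i, added into the score."""
    n, dk = X.shape[0], Wk.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    offsets = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -max_dist, max_dist)
    A = rel_emb[offsets + max_dist]                             # (n, n, dk) relation features
    scores = (Q @ K.T + np.einsum('id,ijd->ij', Q, A)) / np.sqrt(dk)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))                                     # word embeddings only
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
rel_emb = rng.normal(size=(2 * 4 + 1, d)) * 0.1                 # one vector per clipped offset
print(relative_self_attention(X, Wq, Wk, Wv, rel_emb).shape)    # (5, 8)
```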
The appropriateness and generality for natural language of the Transformer architecture became even more apparent with the development of pretrained Transformer models like BERT (Devlin et al., 2019). BERT models are large Transformer models trained mostly on a masked language model objective, as well as a next-sentence prediction objective. After training on a very large amount of unlabelled text, the resulting pretrained model can be fine-tuned for various tasks, with very impressive improvements in accuracy across a wide variety of tasks. The success of BERT has led to various analyses of what it has learned, including the structural relations learned by the attention functions. Although there is no exact mapping from these structures to the structures posited by linguistics, there are clear indications that the attention functions are learning to extract linguistic relations (Voita et al., 2019; Tenney et al., 2019; Reif et al., 2019).

With variable binding for the properties of entities and attention functions for relations between entities, Transformer can represent the kinds of structured representations argued for above. With parameters shared across entities and sensitive to these properties and relations, learned rules are parameterised in terms of these structures. Thus Transformer is a deep learning architecture with the kind of generalisation ability required to exhibit systematicity, as in (Fodor and Pylyshyn, 1988).

Interestingly, the relations are not stored explicitly. Instead they are extracted from pairs of vectors by the attention functions, as with the use of position embeddings to compute relative position relations. For the model to induce its own structure, lower levels must learn to embed their relations in pairs of token embeddings, which higher levels of attention then extract.

That Transformer learns to embed relations in pairs of token embeddings is apparent from recent work on dependency parsing (Kondratyuk and Straka, 2019; Mohammadshahi and Henderson, 2019, 2020). Earlier models of dependency parsing successfully use BiLSTMs to embed syntactic dependencies in pairs of token embeddings (e.g. (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016)), which are then extracted to predict the dependency tree. Mohammadshahi and Henderson (2019, 2020) use their proposed Graph-to-Graph Transformer to encode dependencies in pairs of token embeddings, for transition-based and graph-based dependency parsing respectively. Graph-to-Graph Transformer also inputs previously predicted dependency relations into its attention functions (like relative position encoding (Shaw et al., 2018)). These parsers achieve state-of-the-art accuracies, indicating that Transformer finds it easy to input and predict syntactic dependency relations via pairs of token embeddings. Interestingly, initialising the model with pretrained BERT results in large improvements, indicating that BERT representations also encode syntactically-relevant
relations in pairs of token embeddings.
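As a toy sketch of what extracting a relation from a pair of token embeddings can look like (illustrative only, not the cited parsers' implementations), a biaffine scorer assigns every ordered pair of tokens a score for the head-dependent relation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 7, 16
H = rng.normal(size=(n, d))          # token embeddings from an encoder (e.g. a Transformer)
U = rng.normal(size=(d, d))          # learned biaffine interaction matrix
b = rng.normal(size=d)               # learned head-side bias vector

# score[i, j]: how plausible it is that token j is the syntactic head of token i,
# read off directly from the pair of embeddings (h_i, h_j).
scores = H @ U @ H.T + (H @ b)[None, :]
np.fill_diagonal(scores, -np.inf)    # a token cannot be its own head
heads = scores.argmax(axis=1)        # greedy head choice; real parsers decode a tree
print(heads)
```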
4.3   Nonparametric Representations

As we have seen, the problem with vector-space models is not simply about representations, but about the way learned rules generalise. In work on grammar formalisms, generalisation is analysed by looking at the unbounded case, since any bounded case can simply be memorised. But the use of continuous representations does not fit well with the theory of grammar formalisms, which assumes a bounded vocabulary of atomic categories. Instead we propose an analysis of the generalisation abilities of Transformer in terms of theory from machine learning, Bayesian nonparametric learning (Jordan, 2010). We argue that the representations of Transformer are the minimal nonparametric extension of a vector space.

To connect Transformer to Bayesian probabilities, we assume that a Transformer representation can be thought of as the parameters of a probability distribution. This is natural, since a model’s state represents a belief about the input, and in Bayesian approaches beliefs are probability distributions. From this perspective, computing a representation is inferring the parameters of a probability distribution from the observed input. This is analogous to Bayesian learning, where we infer the parameters of a distribution over models from observed training data. In this section, we outline how theory from Bayesian learning helps us understand how the representations of Transformer lead to better generalisation.

We do not make any specific assumptions about what probability distributions are specified by a Transformer representation, but it is useful to keep in mind an example. One possibility is a mixture model, where each vector specifies the parameters of a multi-dimensional distribution, and the total distribution is the weighted sum across the vectors of these distributions. For example, we can interpret the vectors x = x_1, ..., x_n in a Transformer's representation as specifying a belief about the queries q that will be received from a downstream attention function, as in:

   P(q|x)   = Σ_i P(i|x) P(q|x_i)
   P(i|x)   = exp(½ ||x_i||²) / Σ_i exp(½ ||x_i||²)
   P(q|x_i) = N(q; μ=x_i, σ=1)

With this interpretation of x, we can use the fact that P(i|x,q) ∝ P(i|x) P(q|x_i) ∝ exp(q·x_i) (ignoring factors independent of i) to reinterpret a standard attention function.
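This claim can be checked numerically (a small self-contained check, not code from the paper): under the mixture interpretation above, the posterior P(i|x,q) equals the softmax of the dot products q·x_i, i.e. standard attention weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                               # bag of entity vectors x_i
q = rng.normal(size=4)                                    # query from a downstream attention head

prior = np.exp(0.5 * (X ** 2).sum(axis=1))                # P(i|x) ∝ exp(½ ||x_i||²)
likelihood = np.exp(-0.5 * ((q - X) ** 2).sum(axis=1))    # N(q; x_i, I) up to a constant
posterior = prior * likelihood
posterior /= posterior.sum()

attention = np.exp(X @ q)                                 # logits q · x_i
attention /= attention.sum()

print(np.allclose(posterior, attention))                  # True: the weightings coincide
```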
Since Transformer has a discrete segmentation of its representation into positions (which we call entities), but no explicit representation of structure, we can think of this representation as a bag of vectors (BoV, i.e. a set of instances of vectors). Each layer has a BoV representation, which is aligned with the BoV representation below it. The final output only becomes a sequence if the downstream task imposes explicit sequential structure on it, which attention alone does not.

These bag of vector representations have two very interesting properties for natural language. First, the number of vectors in the bag can grow arbitrarily large, which captures the unbounded nature of language. Secondly, the vectors in the bag are exchangeable, in the sense of Jordan (2010). In other words, renumbering the indices used to refer to the different vectors will not change the interpretation of the representation.3 This is because the learned parameters in Transformer are shared across all positions. These two properties are clearly related; exchangeability allows learning to generalise to unbounded representations, since there is no need to learn about indices which are not in the training data.

3 These indices should not be confused with position embeddings. In fact, position embeddings are needed precisely because the indices are meaningless to the model.
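Exchangeability can also be checked directly on a toy single-head self-attention layer (again an illustrative sketch, not the full Transformer): with shared parameters and no position embeddings, permuting the input vectors simply permutes the output vectors.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a bag of vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                               # 6 entities, dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) / np.sqrt(8) for _ in range(3))
perm = rng.permutation(6)                                 # renumber the indices

out = self_attention(X, Wq, Wk, Wv)
out_permuted = self_attention(X[perm], Wq, Wk, Wv)
print(np.allclose(out[perm], out_permuted))               # True: renumbering is harmless
```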
These properties mean that BoV representations are nonparametric representations. In other words, the specification of a BoV representation cannot be done just by choosing values for a fixed set of parameters. The number of parameters you need grows with the size of the bag. This is crucial for language because the amount of information conveyed by a text grows with the length of the text, so we need nonparametric representations.

To illustrate the usefulness of this view of BoVs as nonparametric representations, we propose to use methods from Bayesian learning to define a prior distribution over BoVs where the size of the bag is not known. Such a prior would be needed for learning the number of entities in a Transformer representation, discussed below, using variational Bayesian approaches. For this example, we will use the above interpretation of a BoV x = {x_i | 1 ≤ i ≤ k} as specifying a distribution over queries, P(q|x) = Σ_i P(i|x) P(q|x_i). A prior distribution over these P(q|x) distributions can be
specified, for example, with a Dirichlet Process, DP(α, G_0). The concentration parameter α controls the generation of a sequence of probabilities ρ_1, ρ_2, ..., which correspond to the P(i|x) distribution (parameterised by the ||x_i||). The base distribution G_0 controls the generation of the P(q|x_i) distributions (parameterised by the x_i).
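The following sketch (illustrative only; the standard truncated stick-breaking construction, not anything specific to Transformer) draws one such prior sample: α controls the weights ρ_1, ρ_2, ..., and G_0, here a standard Gaussian, draws the vectors x_i.

```python
import numpy as np

def sample_bov_prior(alpha=2.0, dim=4, truncation=20, seed=0):
    """Truncated stick-breaking draw from DP(alpha, G0), as a weighted bag of vectors."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=truncation)             # stick-breaking proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                                # the ρ_i, i.e. the P(i|x) weights
    vectors = rng.normal(size=(truncation, dim))               # x_i drawn from G0
    return weights, vectors

weights, vectors = sample_bov_prior()
print(weights.round(3).max(), weights.sum().round(3))          # weights decay, sum ≈ 1
```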
The use of exchangeability to support generalisation to unbounded representations implies a third interesting property, discrete segmentation into entities. In other words, the information in a BoV is spread across an integer number of vectors. A vector cannot be half included in a BoV; it is either included or not. In changing from a vector space to a bag-of-vector space, the only change is this discrete segmentation into entities. In particular, no discrete representation of structure is added to the representation. Thus, the BoV representation of Transformer is the minimal nonparametric extension of a vector space.

With this minimal nonparametric extension, Transformer is able to explicitly represent entities and their properties, and implicitly represent a structure of relations between these entities. The continuing astounding success of Transformer in natural language understanding tasks suggests that this is an adequate deep learning architecture for the kinds of structured representations needed to account for the nature of language.

5   Looking Forward: Inducing Levels and their Entities

As argued above, the great success of neural networks in NLP has not been because they are radically different from pre-neural computational theories of language, but because they have succeeded in replacing hand-coded components of those models with learned components which are specifically designed to capture the same generalisations. We predict that there is at least one more hand-coded aspect of these models which can be learned from data, but question whether they all can be.

Transformer can learn representations of entities and their relations, but current work (to the best of our knowledge) all assumes that the set of entities is a predefined function of the text. Given a sentence, a Transformer does not learn how many vectors it should use to represent it. The number of positions in the input sequence is given, and the number of token embeddings is the same as the number of input positions. When a Transformer decoder generates a sentence, the number of positions is chosen by the model, but it is simply trying to guess the number of positions that would have been given if this was a training example. These Transformer models never try to induce the number of token embeddings they use in an unsupervised way.4

4 Recent work on inducing sparsity in attention weights (Correia et al., 2019) effectively learns to reduce the number of entities used by individual attention heads, but not by the model as a whole.

Given that current models hard-code different token definitions for different tasks (e.g. character embeddings versus word embeddings versus sentence embeddings), it is natural to ask whether a specification of the set of entities at a given level of representation can be learned. There are models which induce the set of entities in an input text, but these are (to the best of our knowledge) not learned jointly with a downstream deep learning model. Common examples include BPE (Sennrich et al., 2016) and the unigram language model (Kudo, 2018), which use statistics of character n-grams to decide how to split words into subwords. The resulting subwords then become the entities for a deep learning model, such as Transformer (e.g. BERT), but they do not explicitly optimise the performance of this downstream model. In a more linguistically-informed approach to the same problem, statistical models have been proposed for morphology induction (e.g. (Elsner et al., 2013)). Also, Semi-Markov CRF models (Sarawagi and Cohen, 2005) can learn segmentations of an input string, which have been used in the output layers of neural models (e.g. (Kong et al., 2015)). The success of these models in finding useful segmentations of characters into subwords suggests that learning the set of entities can be integrated into a deep learning model. But this task is complicated by the inherently discrete nature of the segmentation into entities. It remains to find effective neural architectures for learning the set of entities jointly with the rest of the neural model, and for generalising such methods from the level of character strings to higher levels of representation.
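As a concrete reference point for what is currently hand-coded (a minimal sketch of the BPE idea mentioned above, not the cited implementations), subword entities can be induced by repeatedly merging the most frequent adjacent pair of symbols in a corpus:

```python
from collections import Counter

def merge_word(word, pair):
    """Replace each adjacent occurrence of `pair` in the symbol tuple `word`."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges=8):
    """word_freqs maps a word (a tuple of characters) to its corpus count."""
    vocab, merges = dict(word_freqs), []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_word(w, best): f for w, f in vocab.items()}
    return merges, vocab

corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
merges, segmented = learn_bpe(corpus)
print(segmented)      # each word is now a tuple of induced subword entities
```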
The other remaining hand-coded component of computational linguistic models is levels of representation. Neural network models of language typically only represent a few levels, such as the character sequence plus the word sequence, the word sequence plus the syntax tree, or the word sequence plus the syntax tree plus the predicate-argument structure (Henderson et al., 2013; Swayamdipta
et al., 2016). And these levels and their entities are defined before training starts, either in preprocessing or in annotated data. If we had methods for inducing the set of entities at a given level (discussed above), then we could begin to ask whether we can induce the levels themselves.

One common approach to inducing levels of representation in neural models is to deny it is a problem. Seq2seq and end2end models typically take this approach. These models only include representations at a lower level, both for input and output, and try to achieve equivalent performance to models which postulate some higher level of representation (e.g. (Collobert and Weston, 2008; Collobert et al., 2011; Sutskever et al., 2014; Vinyals et al., 2015)). The most successful example of this approach has been neural machine translation. The ability of neural networks to learn such models is impressive, but the challenge of general natural language understanding is much greater than machine translation. Nonetheless, models which do not explicitly model levels of representation can show that they have learned about different levels implicitly (Peters et al., 2018; Tenney et al., 2019).

We think that it is far more likely that we will be able to design neural architectures which induce multiple levels of representation than it is that we can ignore this problem entirely. However, it is not at all clear that even this will be possible. Unlike the components previously learned, no linguistic theory postulates different levels of representation for different languages. Generally speaking, there is a consensus that the levels minimally include phonology, morphology, syntactic structure, predicate-argument structure, and discourse structure. This language-universal nature of levels of representation suggests that in humans the levels of linguistic representation are innate. This draws into question whether levels of representation can be learned at all. Perhaps they are innate because human brains are not able to learn them from data. If so, perhaps it is the same for neural networks, and so attempts to induce levels of representation are doomed to failure.

Or perhaps we can find new neural network architectures which are even more powerful than what is now thought possible. It wouldn’t be the first time!

6   Conclusions

We conclude that the nature of language has influenced the design of deep learning architectures in fundamental ways. Vector space representations (as in MLPs) are not adequate, nor are vector spaces which evolve over time (as in LSTMs). Attention-based models are fundamentally different because they use bag-of-vector representations. BoV representations are nonparametric representations, in that the number of vectors in the bag can grow arbitrarily large, and these vectors are exchangeable.

With BoV representations, attention-based neural network models like Transformer can model the kinds of unbounded structured representations that computational linguists have found to be necessary to capture the generalisations in natural language. And deep learning allows many aspects of these structured representations to be learned from data.

However, successful deep learning architectures for natural language currently still have many hand-coded aspects. The levels of representation are hand-coded, based on linguistic theory or available resources. Often deep learning models only address one level at a time, whereas a full model would involve levels ranging from the perceptual input to logical reasoning. Even within a given level, the set of entities is a pre-defined function of the text.

This analysis suggests that an important next step in deep learning architectures for natural language understanding will be the induction of entities. It is not clear what advances in deep learning methods will be necessary to improve over our current fixed entity definitions, nor whether the resulting entities will be any different from the ones postulated by linguistic theory. If we can induce the entities at a given level, a more challenging task will be the induction of the levels themselves. The presumably-innate nature of linguistic levels suggests that this might not even be possible.

But of one thing we can be certain: the immense success of adapting deep learning architectures to fit with our computational-linguistic understanding of the nature of language will doubtless continue, with greater insights for both natural language processing and machine learning.

Acknowledgements

We would like to thank Paola Merlo, Suzanne Stevenson, Ivan Titov, members of the Idiap NLU group, and the anonymous reviewers for their comments and suggestions.
References                                                  neural networks with multitask learning. In Proceed-
                                                            ings of the Twenty-Fifth International Conference
Daniel Andor, Chris Alberti, David Weiss, Aliaksei          (ICML 2008), pages 160–167, Helsinki, Finland.
  Severyn, Alessandro Presta, Kuzman Ganchev, Slav
  Petrov, and Michael Collins. 2016. Globally normal-     Gonçalo M. Correia, Vlad Niculae, and André F. T.
  ized transition-based neural networks. In Proceed-        Martins. 2019. Adaptively sparse transformers. In
  ings of the 54th Annual Meeting of the Association        Proceedings of the 2019 Conference on Empirical
  for Computational Linguistics (Volume 1: Long Pa-         Methods in Natural Language Processing and the
  pers), pages 2442–2452, Berlin, Germany. Associa-         9th International Joint Conference on Natural Lan-
  tion for Computational Linguistics.                       guage Processing (EMNLP-IJCNLP), pages 2174–
                                                            2184, Hong Kong, China. Association for Computa-
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-            tional Linguistics.
  gio. 2015. Neural machine translation by jointly
  learning to align and translate. In Proceedings of      Fabrizio Costa, Vincenzo Lombardo, Paolo Frasconi,
  ICLR.                                                     and Giovanni Soda. 2001. Wide coverage incre-
                                                            mental parsing by learning attachment preferences.
Yoshua Bengio, Réjean Ducharme, and Pascal Vincent.        pages 297–307.
  2001. A neural probabilistic language model. In
  Advances in Neural Information Processing Systems       Scott Deerwester, Susan T. Dumais, George W. Fur-
  13, pages 932–938. MIT Press.                             nas, Thomas K. Landauer, and Richard Harshman.
                                                            1990. Indexing by latent semantic analysis. Jour-
Yoshua Bengio, Réjean Ducharme, Pascal Vincent,            nal of the American Society for Information Science,
  and Christian Janvin. 2003. A neural probabilis-          41(6):391–407.
  tic language model. J. Machine Learning Research,
  3:1137–1155.                                            Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
                                                             Kristina Toutanova. 2019. BERT: Pre-training of
Curt Burgess. 1998. From simple associations to the          deep bidirectional transformers for language under-
  building blocks of language: Modeling meaning in           standing. In Proceedings of the 2019 Conference
  memory with the HAL model. Behavior Research               of the North American Chapter of the Association
  Methods, Instruments, & Computers, 30(2):188–              for Computational Linguistics: Human Language
  198.                                                      Technologies, Volume 1 (Long and Short Papers),
                                                             pages 4171–4186, Minneapolis, Minnesota. Associ-
Eugene Charniak. 1997. Statistical parsing with a            ation for Computational Linguistics.
  context-free grammar and word statistics. In Proc.
  14th National Conference on Artificial Intelligence,    Timothy Dozat and Christopher D. Manning. 2016.
  Providence, RI. AAAI Press/MIT Press.                     Deep biaffine attention for neural dependency pars-
                                                            ing. CoRR, abs/1611.01734. ICLR 2017.
Danqi Chen and Christopher Manning. 2014. A fast
  and accurate dependency parser using neural net-        Chris Dyer, Miguel Ballesteros, Wang Ling, Austin
  works. In Proceedings of the 2014 Conference on           Matthews, and Noah A. Smith. 2015. Transition-
  Empirical Methods in Natural Language Processing          based dependency parsing with stack long short-
  (EMNLP), pages 740–750, Doha, Qatar. Association          term memory. In Proceedings of the 53rd Annual
  for Computational Linguistics.                            Meeting of the Association for Computational Lin-
                                                            guistics and the 7th International Joint Conference
Noam Chomsky. 1959. On certain formal properties of         on Natural Language Processing (Volume 1: Long
  grammars. Information and Control, 2:137–167.             Papers), pages 334–343, Beijing, China. Associa-
                                                            tion for Computational Linguistics.
Michael Collins. 1997. Three generative, lexicalized
  models for statistical parsing. In Proc. 35th Meeting   Jeffrey L. Elman. 1990. Finding structure in time. Cog-
  of Association for Computational Linguistics and           nitive Science, 14(2):179–212.
  8th Conf. of European Chapter of Association for
  Computational Linguistics, pages 16–23, Somerset,       Jeffrey L. Elman. 1991. Distributed representations,
  New Jersey.                                                simple recurrent networks, and grammatical struc-
                                                             ture. Machine Learning, 7:195–225.
Michael Collins. 1999. Head-Driven Statistical Mod-
  els for Natural Language Parsing. Ph.D. thesis, Uni-    Micha Elsner, Sharon Goldwater, Naomi Feldman, and
  versity of Pennsylvania, Philadelphia, PA.                Frank Wood. 2013. A joint learning model of word
                                                            segmentation, lexical acquisition, and phonetic vari-
R. Collobert, J. Weston, L. Bottou, M. Karlen,              ability. In Proceedings of the 2013 Conference on
  K. Kavukcuoglu, and P. Kuksa. 2011. Natural lan-          Empirical Methods in Natural Language Processing,
  guage processing (almost) from scratch. Journal of        pages 42–54, Seattle, Washington, USA. Associa-
  Machine Learning Research, 12:2493–2537.                  tion for Computational Linguistics.

Ronan Collobert and Jason Weston. 2008. A unified         Katrin Erk. 2010. What is word meaning, really? (and
  architecture for natural language processing: deep        how can distributional models help us describe it?).
In Proceedings of the 2010 Workshop on GEometri-         M.I. Jordan. 2010. Bayesian nonparametric learn-
  cal Models of Natural Language Semantics, pages            ing: Expressive priors for intelligent systems. In
  17–26, Uppsala, Sweden. Association for Computa-           R. Dechter, H. Geffner, and J. Halpern, editors,
  tional Linguistics.                                        Heuristics, Probability and Causality: A Tribute to
                                                             Judea Pearl, chapter 10. College Publications.
Jerry A. Fodor and B. McLaughlin. 1990. Connectionism and the problem of systematicity: Why Smolensky's solution doesn't work. Cognition, 35:183–204.

Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28:3–71.

James Henderson. 1994. Description Based Parsing in a Connectionist Network. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA. Technical Report MS-CIS-94-46.

James Henderson. 1996. A connectionist architecture with inherent systematicity. In Proceedings of the Eighteenth Conference of the Cognitive Science Society, pages 574–579, La Jolla, CA.

James Henderson. 2000. Constituency, context, and connectionism in syntactic parsing. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing, pages 189–209. Cambridge University Press, Cambridge UK.

James Henderson. 2003. Inducing history representations for broad coverage statistical parsing. In Proc. joint meeting of North American Chapter of the Association for Computational Linguistics and the Human Language Technology Conf., pages 103–110, Edmonton, Canada.

James Henderson. 2004. Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 95–102, Barcelona, Spain.

James Henderson, Paola Merlo, Ivan Titov, and Gabriele Musillo. 2013. Multilingual joint parsing of syntactic and semantic dependencies with a latent variable model. Computational Linguistics, 39(4):949–998.

E.K.S. Ho and L.W. Chan. 1999. How to design a connectionist holistic parser. Neural Computation, 11(8):1995–2016.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

K. Hornik, M. Stinchcombe, and H. White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366.

Ajay N. Jain. 1991. PARSEC: A Connectionist Learning Architecture for Parsing Spoken Language. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA.

M.I. Jordan. 2010. Bayesian nonparametric learning: Expressive priors for intelligent systems. In R. Dechter, H. Geffner, and J. Halpern, editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl, chapter 10. College Publications.

Aravind K. Joshi. 1987. An introduction to tree adjoining grammars. In Alexis Manaster-Ramer, editor, Mathematics of Language. John Benjamins, Amsterdam.

Aravind K. Joshi, K. Vijay-Shanker, and David Weir. 1990. The convergence of mildly context-sensitive grammatical formalisms. In Peter Sells, Stuart Shieber, and Tom Wasow, editors, Foundational Issues in Natural Language Processing. MIT Press, Cambridge MA. Forthcoming.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2015. Segmental recurrent neural networks.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Yann LeCun and Yoshua Bengio. 1995. Convolutional networks for images, speech, and time-series. In Michael A. Arbib, editor, The handbook of brain theory and neural networks (Second ed.), pages 276–278. MIT Press.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

C. von der Malsburg. 1981. The correlation theory of brain function. Technical Report 81-2, Max-Planck-Institute for Biophysical Chemistry, Göttingen.

Diego Marcheggiani, Joost Bastings, and Ivan Titov. 2018. Exploiting semantics in neural machine translation with graph convolutional networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 486–492, New Orleans, Louisiana. Association for Computational Linguistics.
Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515, Copenhagen, Denmark. Association for Computational Linguistics.

Risto Miikkulainen. 1993. Subsymbolic Natural Language Processing: An integrated model of scripts, lexicon, and memory. MIT Press, Cambridge, MA.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Alireza Mohammadshahi and James Henderson. 2019. Graph-to-graph transformer for transition-based dependency parsing.

Alireza Mohammadshahi and James Henderson. 2020. Recursive non-autoregressive graph-to-graph transformer for dependency parsing with iterative refinement.

Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Carl Pollard and Ivan A. Sag. 1987. Information-Based Syntax and Semantics. Vol 1: Fundamentals. Center for the Study of Language and Information, Stanford, CA.

Adwait Ratnaparkhi. 1999. Learning to parse natural language with maximum entropy models. Machine Learning, 34:151–175.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viégas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8594–8603. Curran Associates, Inc.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 1986a. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, Vol 1, pages 318–362. MIT Press, Cambridge, MA.

D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. 1986b. Parallel Distributed Processing: Explorations in the microstructure of cognition, Vol 1. MIT Press, Cambridge, MA.

Sunita Sarawagi and William W. Cohen. 2005. Semi-Markov conditional random fields for information extraction. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1185–1192. MIT Press.

Hinrich Schütze. 1993. Word space. In Advances in Neural Information Processing Systems 5, pages 895–902. Morgan Kaufmann.

Mark S. Seidenberg. 2007. Connectionist models of reading. In Gareth Gaskell, editor, Oxford Handbook of Psycholinguistics, chapter 14, pages 235–250. Oxford University Press.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Lokendra Shastri and Venkat Ajjanagadde. 1993. From simple associations to systematic reasoning: A connectionist representation of rules, variables, and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences, 16:417–451.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.

Paul Smolensky. 1988. On the proper treatment of connectionism. Behavioral and Brain Sciences, 11:1–17.

Paul Smolensky. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159–216.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 455–465, Sofia, Bulgaria. Association for Computational Linguistics.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Mark Steedman. 2000. The Syntactic Process. MIT Press, Cambridge.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Swabha Swayamdipta, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Greedy, joint syntactic-semantic parsing with stack LSTMs. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 187–197, Berlin, Germany. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ivan Titov and James Henderson. 2007a. A latent variable model for generative dependency parsing. In Proceedings of the Tenth International Conference on Parsing Technologies, pages 144–155, Prague, Czech Republic. Association for Computational Linguistics.

Ivan Titov and James Henderson. 2007b. A latent variable model for generative dependency parsing. In Proceedings of the International Conference on Parsing Technologies (IWPT'07), Prague, Czech Republic. Association for Computational Linguistics.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2773–2781. Curran Associates, Inc.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Majid Yazdani and James Henderson. 2015. Incremental recurrent neural network dependency parser with search-based discriminative training. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 142–152, Beijing, China. Association for Computational Linguistics.