Attention Can Reflect Syntactic Structure

(If You Let It)
Vinit Ravishankar†∗  Artur Kulmizev‡∗  Mostafa Abdou§  Anders Søgaard§  Joakim Nivre‡

† Language Technology Group, Department of Informatics, University of Oslo
‡ Department of Linguistics and Philology, Uppsala University
§ Department of Computer Science, University of Copenhagen

† vinitr@ifi.uio.no  ‡ {artur.kulmizev,joakim.nivre}@lingfil.uu.se  § {abdou,soegaard}@di.ku.dk

arXiv:2101.10927v1 [cs.CL] 26 Jan 2021

Abstract

Since the popularization of the Transformer as a general-purpose feature encoder for NLP, many studies have attempted to decode linguistic structure from its novel multi-head attention mechanism. However, much of such work focused almost exclusively on English — a language with rigid word order and a lack of inflectional morphology. In this study, we present decoding experiments for multilingual BERT across 18 languages in order to test the generalizability of the claim that dependency syntax is reflected in attention patterns. We show that full trees can be decoded above baseline accuracy from single attention heads, and that individual relations are often tracked by the same heads across languages. Furthermore, in an attempt to address recent debates about the status of attention as an explanatory mechanism, we experiment with fine-tuning mBERT on a supervised parsing objective while freezing different series of parameters. Interestingly, in steering the objective to learn explicit linguistic structure, we find much of the same structure represented in the resulting attention patterns, with interesting differences with respect to which parameters are frozen.

1 Introduction

In recent years, the attention mechanism proposed by Bahdanau et al. (2014) has become an indispensable component of many NLP systems. Its widespread adoption was, in part, heralded by the introduction of the Transformer architecture (Vaswani et al., 2017a), which constrains a soft alignment to be learned across discrete states in the input (self-attention), rather than across input and output (e.g., Xu et al., 2015; Rocktäschel et al., 2015). The Transformer has, by now, supplanted the popular LSTM (Hochreiter and Schmidhuber, 1997) as NLP's feature-encoder-of-choice, largely due to its compatibility with parallelized training regimes and ability to handle long-distance dependencies.

Certainly, the nature of attention as a distribution over tokens lends itself to a straightforward interpretation of a model's inner workings. Bahdanau et al. (2014) illustrate this nicely in the context of seq2seq machine translation, showing that the attention learned by their models reflects expected cross-lingual idiosyncrasies between English and French, e.g., concerning word order. With self-attentive Transformers, interpretation becomes slightly more difficult, as attention is distributed across words within the input itself. This is further compounded by the use of multiple layers and heads, each combination of which yields its own alignment, representing a different (possibly redundant) view of the data. Given the similarity of such attention matrices to the score matrices employed in arc-factored dependency parsing (McDonald et al., 2005a,b), a salient question concerning interpretability becomes: Can we expect some combination of these parameters to capture linguistic structure in the form of a dependency tree, especially if the model performs well on NLP tasks? If not, can we relax the expectation and examine the extent to which subcomponents of the linguistic structure, such as subject-verb relations, are represented? This prospect was first posed by Raganato et al. (2018) for MT encoders, and later explored by Clark et al. (2019) for BERT. Ultimately, the consensus of these and other studies (Voita et al., 2019; Htut et al., 2019; Limisiewicz et al., 2020) was that, while there appears to exist no “generalist” head responsible for extracting full dependency structures, standalone heads often specialize in capturing individual grammatical relations.

∗ Equal contribution. Order was decided by a coin toss.

Unfortunately, most of such studies focused their
experiments entirely on English, which is typologically favored to succeed in such scenarios due to its rigid word order and lack of inflectional morphology. It remains to be seen whether the attention patterns of such models can capture structural features across typologically diverse languages, or if the reported experiments on English are a misrepresentation of local positional heuristics as such. Furthermore, though previous work has investigated how attention patterns might change after fine-tuning on different tasks (Htut et al., 2019), a recent debate about attention as an explanatory mechanism (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019) has cast the entire enterprise in doubt. Indeed, it remains to be seen whether fine-tuning on an explicit structured prediction task, e.g. dependency parsing, can force attention to represent the structure being learned, or if the patterns observed in pretrained models are not altered in any meaningful way.

To address these issues, we investigate the prospect of extracting linguistic structure from the attention weights of multilingual Transformer-based language models. In light of the surveyed literature, our research questions are as follows:

1. Can we decode dependency trees for some languages better than others?
2. Do the same layer–head combinations track the same relations across languages?
3. How do attention patterns change after fine-tuning with explicit syntactic annotation?
4. Which components of the model are involved in these changes?

In answering these questions, we believe we can shed further light on the (cross-)linguistic properties of Transformer-based language models, as well as address the question of attention patterns being a reliable representation of linguistic structure.

2 Attention as Structure

Transformers  The focus of the present study is mBERT, a multilingual variant of the exceedingly popular language model (Devlin et al., 2019). BERT is built upon the Transformer architecture (Vaswani et al., 2017b), which is a self-attention-based encoder-decoder model (though only the encoder is relevant to our purposes). A Transformer takes a sequence of vectors x = [x_1, x_2, ..., x_n] as input and applies a positional encoding to them, in order to retain the order of words in a sentence. These inputs are then transformed into query (Q), key (K), and value (V) vectors via three separate linear transformations and passed to an attention mechanism. A single attention head computes scaled dot-product attention between K and Q, outputting a weighted sum of V:

    Attention(Q, K, V) = softmax(QK^⊤ / √d_k) V    (1)

For multihead attention (MHA), the same process is repeated for k heads, allowing the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017b). Ultimately, the output of all heads is concatenated and passed through a linear projection W^O:

    H_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)

    MHA(Q, K, V) = concat(H_1, H_2, ..., H_k) W^O    (3)

Every layer also consists of a feed-forward network (FFN), consisting of two dense layers with ReLU activation functions. For each layer, therefore, the output of MHA is passed through a LayerNorm with residual connections, passed through the FFN, and then through another LayerNorm with residual connections.

Searching for structure  Often, the line of inquiry regarding interpretability in NLP has been concerned with extracting and analyzing linguistic information from neural network models of language (Belinkov and Glass, 2019). Recently, such investigations have targeted Transformer models (Hewitt and Manning, 2019; Rosa and Mareček, 2019; Tenney et al., 2019), at least in part because the self-attention mechanism employed by these models offers a possible window into their inner workings. With large-scale machine translation and language models being openly distributed for experimentation, several researchers have wondered if self-attention is capable of representing syntactic structure, despite not being trained with any overt parsing objective.

In pursuit of this question, Raganato et al. (2018) applied a maximum-spanning-tree algorithm over the attention weights of several trained MT models, comparing them with gold trees from Universal Dependencies (Nivre et al., 2016, 2020). They found that, while the accuracy was not comparable to that of a supervised parser, it was nonetheless higher than several strong baselines, implying that some
structure was consistently represented. Clark et al. (2019) corroborated the same findings for BERT when decoding full trees, but observed that individual dependency relations were often tracked by specialized heads and were decodable with much higher accuracy than some fixed-offset baselines. Concurrently, Voita et al. (2019) made a similar observation about heads specializing in specific dependency relations, proposing a coarse taxonomy of head attention functions: positional, where heads attend to adjacent tokens; syntactic, where heads attend to specific syntactic relations; and rare words, where heads point to the least frequent tokens in the sentence. Htut et al. (2019) followed Raganato et al. (2018) in decoding dependency trees from BERT-based models, finding that fine-tuning on two classification tasks did not produce syntactically plausible attention patterns. Lastly, Limisiewicz et al. (2020) modified UD annotation to better represent attention patterns and introduced a supervised head-ensembling method for consolidating shared syntactic information across heads.

Does attention have explanatory value?  Though many studies have yielded insight about how attention behaves in a variety of models, the question of whether it can be seen as a “faithful” explanation of model predictions has been subject to much recent debate. For example, Jain and Wallace (2019) present compelling arguments that attention does not offer a faithful explanation of predictions. Primarily, they demonstrate that there is little correlation between standard feature importance measures and attention weights. Furthermore, they contend that there exist counterfactual attention distributions, which are substantially different from learned attention weights but that do not alter a model's predictions. Using a similar methodology, Serrano and Smith (2019) corroborate that attention does not provide an adequate account of an input component's importance.

In response to these findings, Wiegreffe and Pinter (2019) question the assumptions underlying such claims. Attention, they argue, is not a primitive, i.e., it cannot be detached from the rest of a model's components as is done in the experiments of Jain and Wallace (2019). They propose a set of four analyses to test whether a given model's attention mechanism can provide meaningful explanation and demonstrate that the alternative attention distributions found via adversarial training methods do, in fact, perform poorly compared to standard attention mechanisms. On a theoretical level, they argue that, although attention weights do not give an exclusive “faithful” explanation, they do provide a meaningful plausible explanation.

This discussion is relevant to our study because it remains unclear whether or not attending to syntactic structure serves, in practice, as a plausible explanation for model behavior, or whether or not it is even capable of serving as such. Indeed, the studies of Raganato et al. (2018) and Clark et al. (2019) relate a convincing but incomplete picture — tree decoding accuracy just marginally exceeds baselines and various relations tend to be tracked across varying heads and layers. Thus, our fine-tuning experiments (detailed in the following section) serve to enable an “easy” setting wherein we explicitly inform our models of the same structure that we are trying to extract. We posit that, if, after fine-tuning, syntactic structures were still not decodable from the attention weights, one could safely conclude that these structures are being stored via a non-transparent mechanism that may not even involve attention weights. Such an insight would allow us to conclude that attention weights cannot provide even a plausible explanation for models relying on syntax.

3 Experimental Design

To examine the extent to which we can decode dependency trees from attention patterns, we run a tree decoding algorithm over mBERT's attention heads — before and after fine-tuning via a parsing objective. We surmise that doing so will enable us to determine if attention can be interpreted as a reliable mechanism for capturing linguistic structure.

3.1 Model

We employ mBERT (https://github.com/google-research/bert) in our experiments, which has been shown to perform well across a variety of NLP tasks (Hu et al., 2020; Kondratyuk and Straka, 2019a) and capture aspects of syntactic structure cross-lingually (Pires et al., 2019; Chi et al., 2020). mBERT features 12 layers with 768 hidden units and 12 attention heads, with a joint WordPiece sub-word vocabulary across languages. The model was trained on the concatenation of WikiDumps for the top 104 languages with the largest Wikipedias, where principled sampling was employed to enforce a balance between high- and
low-resource languages.

3.2 Decoding Algorithm

For decoding dependency trees, we follow Raganato et al. (2018) in applying the Chu-Liu-Edmonds maximum spanning tree algorithm (Chu, 1965) to every layer/head combination available in mBERT (12 × 12 = 144 in total). In order for the matrices to correspond to gold treebank tokenization, we remove the cells corresponding to the BERT delimiter tokens ([CLS] and [SEP]). In addition to this, we sum the columns and average the rows corresponding to the constituent subwords of gold tokens, respectively (Clark et al., 2019). Lastly, since attention patterns across heads may differ in whether they represent heads attending to their dependents or vice versa, we take our input to be the element-wise product of a given attention matrix and its transpose (A ◦ A⊤). We liken this to the joint probability of a head attending to its dependent and a dependent attending to its head, similarly to Limisiewicz et al. (2020). Per this point, we also follow Htut et al. (2019) in evaluating the decoded trees via Undirected Unlabeled Attachment Score (UUAS) — the percentage of undirected edges recovered correctly. Since we discount directionality, this is effectively a less strict measure than UAS, but one that has a long tradition in unsupervised dependency parsing since Klein and Manning (2004).

3.3 Data

For our data, we employ the Parallel Universal Dependencies (PUD) treebanks, as collected in UD v2.4 (Nivre et al., 2019). PUD was first released as part of the CoNLL 2017 shared task (Zeman et al., 2018), containing 1000 parallel sentences, which were (professionally) translated from English, German, French, Italian, and Spanish to 14 other languages. The sentences are taken from two domains, news and wikipedia, the latter implying some overlap with mBERT's training data (though we did not investigate this). We include all PUD treebanks except Thai, the only treebank that does not have a non-PUD treebank available in UD, which we need for our fine-tuning experiments.

3.4 Fine-Tuning Details

In addition to exploring pretrained mBERT's attention weights, we are also interested in how attention might be guided by a training objective that learns the exact tree structure we aim to decode. To this end, we employ the graph-based decoding algorithm of the biaffine parser introduced by Dozat and Manning (2016). We replace the standard BiLSTM encoder for this parser with the entire mBERT network, which we fine-tune with the parsing loss. The full parser decoder consists of four dense layers, two for head/child representations for dependency arcs (dim. 500) and two for head/child representations for dependency labels (dim. 100). These are transformed into the label space via a bilinear transform.

After training the parser, we can decode the fine-tuned mBERT parameters in the same fashion as described in Section 3.2. We surmise that, if attention heads are capable of tracking hierarchical relations between words in any capacity, it is precisely in this setting that this ability would be attested. In addition to this, we are interested in which individual components of the mBERT network are capable of steering attention patterns towards syntactic structure. We believe that addressing this question will help us not only in interpreting decisions made by BERT-based neural parsers, but also in developing syntax-aware models in general (Strubell et al., 2018; Swayamdipta et al., 2018). As such — beyond fine-tuning all parameters of the mBERT network (our basic setting) — we perform a series of ablation experiments wherein we update only one set of parameters per training cycle, e.g. the Query weights W_i^Q, and leave everything else frozen. This gives us a set of 6 models, described below; a minimal code sketch of this selective freezing follows the list. For each model, all non-BERT parser components are always left unfrozen.

• KEY: only the K components of the transformer are unfrozen; these are the representations of tokens that are paying attention to other tokens.
• QUERY: only the Q components are unfrozen; these, conversely, are the representations of tokens being paid attention to.
• KQ: both keys and queries are unfrozen.
• VALUE: semantic value vectors per token (V) are unfrozen; they are composed after being weighted with attention scores obtained from the K/Q matrices.
• DENSE: the dense feed-forward networks in the attention mechanism; all three per layer are unfrozen.
• NONE: the basic setting with nothing frozen; all parameters are updated with the parsing loss.
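To make the freezing scheme concrete, the snippet below is a minimal PyTorch sketch of one such setting (VALUE). It assumes the HuggingFace transformers checkpoint of mBERT and its parameter naming convention; it illustrates the idea rather than reproducing the authors' implementation, and the biaffine parser stacked on top is omitted.

```python
import torch
from transformers import BertModel

# Illustrative sketch of the VALUE ablation setting: freeze every mBERT
# parameter except the per-layer value projections. Parameter names follow
# the HuggingFace convention, e.g. "encoder.layer.3.attention.self.value.weight".
UNFROZEN_MARKER = "attention.self.value"

mbert = BertModel.from_pretrained("bert-base-multilingual-cased")
for name, param in mbert.named_parameters():
    # Only the chosen component receives gradient updates; everything else
    # in mBERT stays frozen. The parser decoder (not shown) stays trainable.
    param.requires_grad = UNFROZEN_MARKER in name

n_trainable = sum(p.numel() for p in mbert.parameters() if p.requires_grad)
print(f"trainable mBERT parameters: {n_trainable}")
```

Substituting the marker string with "query", "key", or the dense sub-layer names would yield the other settings sketched above.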
            AR     CS     DE     EN     ES     FI     FR     HI     ID     IT     JA     KO     PL     PT     RU     SV     TR     ZH
BASELINE    50     40     36     36     40     42     40     46     47     40     43     55     45     41     42     39     52     41
PRE         53     53     49     47     50     48     41     48     50     41     45     64     52     50     51     51     55     42
            7-6    10-8   10-8   10-8   9-5    10-8   2-3    2-3    9-5    6-4    2-3    9-2    10-8   9-5    10-8   10-8   3-8    2-3
NONE        76     78     76     71     77     66     45     72     75     58     42     64     75     76     75     74     55     38
            11-10  11-10  11-10  10-11  10-11  10-11  11-10  11-10  11-10  11-10  11-10  11-10  11-10  11-10  10-8   10-8   3-8    2-3
KEY         62     64     58     53     59     56     41     54     59     47     44     62     64     58     61     59     55     41
            10-8   10-8   11-12  10-8   11-12  10-8   7-12   10-8   10-8   9-2    2-3    10-8   10-8   11-12  10-8   12-10  3-12   2-3
QUERY       69     74     70     66     73     63     42     62     67     54     45     65     72     70     70     68     56     42
            11-4   10-8   11-4   11-4   11-4   10-8   11-4   11-4   11-4   11-4   2-3    10-8   11-4   11-4   10-8   11-4   10-8   2-3
KQ          71     76     70     65     74     62     43     64     69     55     44     64     73     73     69     69     55     41
            11-4   11-4   11-4   11-4   11-4   11-4   10-11  11-4   11-4   11-4   2-3    11-4   11-4   11-4   11-4   11-4   11-4   2-3
VALUE       75     72     72     64     76     59     45     63     73     55     45     66     73     74     69     65     57     42
            12-5   12-5   12-5   12-5   12-5   12-5   12-5   12-5   12-5   12-5   2-3    10-8   12-5   12-5   12-5   12-5   12-5   3-8
DENSE       68     71     65     60     67     61     42     65     66     49     44     64     70     64     67     64     55     40
            11-10  11-10  11-10  10-8   12-10  11-10  10-8   11-10  11-10  9-5    3-12   11-10  11-10  12-5   11-10  11-10  11-10  3-12

Table 1: Adjacent-branching baseline and maximum UUAS decoding accuracy per PUD treebank, expressed as the best score (upper row of each block) and the layer-head combination achieving it (lower row). PRE refers to the basic mBERT model before fine-tuning, while all rows below correspond to the different fine-tuned models described in Section 3.4. Best score indicated in bold.
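To ground the kind of numbers reported in Table 1, the following sketch scans every layer-head combination of pretrained mBERT, symmetrizes each attention matrix as A ◦ A⊤, decodes a tree, and reports the best UUAS. It assumes the HuggingFace transformers API for retrieving attention weights, skips the subword merging of Section 3.2 for brevity, and uses an undirected maximum spanning tree in place of Chu-Liu-Edmonds (the evaluation is undirected in any case); the toy sentence and gold edges are hypothetical.

```python
import networkx as nx
import torch
from transformers import BertModel, BertTokenizer

def uuas_for_head(attn, gold_edges):
    """Symmetrise one head's attention as A * A^T, decode a spanning tree,
    and score it against gold (undirected) edges."""
    scores = attn * attn.T
    n = scores.shape[0]
    g = nx.Graph()
    g.add_weighted_edges_from(
        (i, j, float(scores[i, j])) for i in range(n) for j in range(i + 1, n)
    )
    pred = {frozenset(e) for e in nx.maximum_spanning_tree(g).edges()}
    gold = {frozenset(e) for e in gold_edges}
    return len(pred & gold) / len(gold)

tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

sentence = "The dog chased the cat"
gold = {(2, 1), (1, 0), (2, 4), (4, 3)}   # toy (head, dependent) pairs over the 5 words
enc = tok(sentence, return_tensors="pt")
with torch.no_grad():
    # Tuple of 12 tensors, each of shape (1, 12 heads, seq_len, seq_len).
    attentions = model(**enc, output_attentions=True).attentions

# Strip [CLS]/[SEP] and keep the best-scoring layer-head combination.
best = max(
    (uuas_for_head(attentions[l][0, h, 1:-1, 1:-1].numpy(), gold), l + 1, h + 1)
    for l in range(12) for h in range(12)
)
print("best UUAS %.2f at layer-head %d-%d" % best)
```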
We fine-tune each of these models on a concatenation of all PUD treebanks for 20 epochs, which effectively makes our model multilingual. We do so in order to 1) control for domain and annotation confounds, since all PUD sentences are parallel and are natively annotated (unlike converted UD treebanks, for instance); 2) increase the number of training samples for fine-tuning, as each PUD treebank features only 1000 sentences; and 3) induce a better parser through multilinguality, as in Kondratyuk and Straka (2019b). Furthermore, in order to gauge the overall performance of our parser across all ablated settings, we evaluate on the test set of the largest non-PUD treebank available for each language, since PUD only features test partitions. When training, we employ a combined dense/sparse Adam optimiser, at a learning rate of 3 × 10^-5. We rescale gradients to have a maximum norm of 5.
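As a schematic summary of this optimisation setup (20 epochs, Adam at 3 × 10^-5, gradients clipped to a maximum norm of 5), the sketch below uses plain Adam as a stand-in for the combined dense/sparse optimiser, and a dummy module with random batches in place of mBERT, the biaffine parser, and the concatenated PUD data:

```python
import torch
from torch import nn

# Schematic fine-tuning loop: 20 epochs, Adam at 3e-5, gradient norm
# rescaled to at most 5. A small placeholder module and random tensors
# stand in for the real model and treebank batches.
model = nn.Linear(768, 768)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=3e-5
)

for epoch in range(20):
    for _ in range(10):                      # placeholder mini-batches
        batch = torch.randn(8, 768)
        loss = model(batch).pow(2).mean()    # placeholder for the parsing loss
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
```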

4 Decoding mBERT Attention

[Figure 1: UUAS of MST decoding per layer and head, across languages. Heads (y-axis) are sorted by accuracy for easier visualization.]

The second row of Table 1 (PRE) depicts the UUAS after running our decoding algorithm over mBERT attention matrices, per language. We see a familiar pattern to that in Clark et al. (2019) among others — namely that attention patterns extracted directly from mBERT appear to be incapable of decoding dependency trees beyond a threshold of 50–60% UUAS accuracy. However, we also note that, in all languages, the attention-decoding algorithm outperforms a BASELINE (row 1) that draws an (undirected) edge between any two adjacent words in linear order, which implies that some non-linear structures are captured with regularity. Indeed, head 8 in layer 10 appears to be particularly strong in this regard, returning the highest UUAS for 7 languages. Interestingly, the accuracy patterns
across layers depicted in Figure 1 tend to follow an identical trend for all languages, with nearly all heads in layer 7 returning high within-language accuracies.

It appears that attention for some languages (Arabic, Czech, Korean, Turkish) is comparatively easier to decode than others (French, Italian, Japanese, Chinese). A possible explanation for this result is that dependency relations between content words, which are favored by the UD annotation, are more likely to be adjacent in the morphologically rich languages of the first group (without intervening function words). This assumption seems to be corroborated by the high baseline scores for Arabic, Korean and Turkish (but not Czech). Conversely, the low baseline scores and the likewise low decoding accuracies for the latter four languages are difficult to characterize. Indeed, we could not identify what factors — typological, annotation, tokenization or otherwise — would set French and Italian apart from the remaining languages in terms of score. However, we hypothesize that the tokenization and our treatment of subword tokens play a part in attempting to decode attention from Chinese and Japanese representations. Per the mBERT documentation (https://github.com/google-research/bert/blob/master/multilingual.md), Chinese and Japanese Kanji character spans within the CJK Unicode range are character-tokenized. This lies in contrast with all other languages (Korean Hangul and Japanese Hiragana and Katakana included), which rely on whitespace and WordPiece (Wu et al., 2016). It is thus possible that the attention distributions for these two languages (at least where CJK characters are relevant) are devoted to composing words, rather than structural relations, which will distort the attention matrices that we compute to correspond with gold tokenization (e.g. by maxing rows and averaging columns).

[Figure 2: Left: UUAS per relation across languages (best layer/head combination indicated in cell). Right: best UUAS as a function of the best positional baseline (derived from the treebank), for selected relations.]

Relation analysis  We can disambiguate what sort of structures are captured with regularity by looking at the UUAS returned per dependency relation. Figure 2 (left) shows that adjectival modifiers (amod, mean UUAS = 85 ± 12) and determiners (det, 88 ± 6) are among the easiest relations to decode across languages. Indeed, words that are connected by these relations are often adjacent to each other and may be simple to decode if a head is primarily concerned with tracking linear order. To verify the extent to which this might be happening, we plot the aforementioned decoding accuracy as a function of select relations' positional baselines in Figure 2 (right). The positional baselines, in this case, are calculated by picking the most frequent offset at which a dependent occurs with respect to its head, e.g., −1 for det in English, meaning one position to the left of the head. Interestingly, while we observe significant variation across the positional baselines for amod and det, the decoding
accuracy remains quite high.

In slight contrast to this, the core subject (nsubj, 58 ± 16 SD) and object (obj, 64 ± 13) relations prove to be more difficult to decode. Unlike the aforementioned relations, nsubj and obj are much more sensitive to the word order properties of the language at hand. For example, while a language like English, with Subject-Verb-Object (SVO) order, might have the subject frequently appear to the left of the verb, an SOV language like Hindi might have it several positions further away, with an object and its potential modifiers intervening. Indeed, the best positional baseline for English nsubj is 39 UUAS, while it is only 10 for Hindi. Despite this variation, the relation seems to be tracked with some regularity by the same head (layer 3, head 9), returning 60 UUAS for English and 52 for Hindi. The same can largely be said for obj, where the positional baselines return 51 ± 18. In this latter case, however, the heads tend to be distributed much more variably across languages. Finally, the results for the obj relation provide some support for our earlier explanation concerning morphologically rich languages, as Arabic, Czech, Korean and Turkish all have among the highest accuracies (as well as positional baselines).

5 Fine-Tuning Experiments

Next, we investigate the effect fine-tuning has on UUAS decoding. Row 3 in Table 1 (NONE) indicates that fine-tuning does result in large improvements to UUAS decoding across most languages, often by margins as high as ∼30%. This shows that with an explicit parsing objective, attention heads are capable of serving as explanatory mechanisms for syntax; syntactic structure can be made to be transparently stored in the heads, in a manner that does not require additional probe fitting or parameterized transformation to extract.

Given that we do manage to decode reasonable syntactic trees, we can then refine our question — what components are capable of learning these trees? One obvious candidate is the key/query component pair, given that attention weights are a scaled softmax of a composition of the two. Figure 3 (top) shows the difference between pretrained UUAS and fine-tuned UUAS per layer, across models and languages. Interestingly, the best parsing accuracies do not appear to vary much depending on what component is frozen. We do see a clear trend, however, in that decoding the attention patterns of the fine-tuned model typically yields better UUAS than the pretrained model, particularly in the highest layers. Indeed, the lowest layer at which fine-tuning appears to improve decoding is layer 7. This implies that, regardless of which component remains frozen, the parameters facing any sort of significant and positive update tend to be those appearing towards the higher end of the network, closer to the output.

For the frozen components, the best improvements in UUAS are seen at the final layer in VALUE, which is also the only model that shows consistent improvement, as well as the highest average improvement in mean scores for the last few layers (the inner average is over all heads; the outer is over all languages). Perhaps most interestingly, the mean UUAS (Figure 3, bottom) for our “attentive” components – keys, queries, and their combination – does not appear to have improved by much after fine-tuning. In contrast, the maximum does show considerable improvement; this seems to imply that although all components appear to be more or less equally capable of learning decodable heads, the attentive components, when fine-tuned, appear to sharpen fewer heads.

Note that the only difference between keys and queries in an attention mechanism is that keys are transposed to index attention from/to appropriately. Surprisingly, KEY and QUERY appear to act somewhat differently, with QUERY being almost uniformly better than KEY with the best heads, whilst KEY is slightly better with averages, implying distinctions in how both store information. Furthermore, allowing both keys and queries seems to result in an interesting contradiction – the ultimate layer, which has reasonable maximums and averages for both KEY and QUERY, now seems to show a UUAS drop almost uniformly. This is also true for the completely unfrozen encoder.

Supervised Parsing  In addition to decoding trees from attention matrices, we also measure supervised UAS/LAS on a held-out test set (the test set in our scenario is from the actual, non-parallel language treebank; as such, we left Korean out of this comparison due to annotation differences). Based on Figure 4, it is apparent that all settings result in generally the same UAS. This is somewhat expected; Lauscher et al. (2020) see better results on parsing with the entire encoder frozen, implying that the task is easy enough for a biaffine parser
[Figure 3: (Top) Best scores across all heads, per language; (bottom) mean scores across all heads, per language. The languages (hidden from the x-axis for brevity) are, in order: ar, cs, de, en, es, fi, fr, hi, id, it, ja, ko, pl, pt, ru, sv, tr, zh.]

to learn, given frozen mBERT representations (due to training on concatenated PUD sets, however, our results are not directly comparable). The LAS distinction is, however, rather interesting: there is a marked difference between how important the dense layers are, as opposed to the attentive components. This is likely not reflected in our UUAS probe as, strictly speaking, labelling arcs is not equivalent to searching for structure in sentences, but more akin to classifying pre-identified structures. We also note that DENSE appears to be better than NONE on average, implying that non-dense components might actually be hurting labelling capacity.

[Figure 4: Mean UAS and LAS when evaluating different models on language-specific treebanks (Korean excluded due to annotation differences). MBERT refers to models where the entire mBERT network is frozen as input to the parser.]

In brief, consolidating the two sets of results above, we can draw three interesting conclusions about the components:

1. Value vectors are best aligned with syntactic dependencies; this is reflected both in the best head at the upper layers, and in the average score across all heads.
2. Dense layers appear to have moderate informative capacity, but appear to have the best learning capacity for the task of arc labelling.
3. Perhaps most surprisingly, Key and Query vectors do not appear to make any outstanding contributions, save for sharpening a smaller subset of heads.

Our last result is especially surprising for UUAS decoding. Keys and queries, fundamentally, combine to form the attention weight matrix, which is precisely what we use to decode trees. One would expect that allowing these components to learn from labelled syntax would result in the best improvements to decoding, but all three have surprisingly negligible mean improvements. This indicates that we need to further improve our understanding of how attentive structure and weighting really works.

Cross-linguistic observations  We notice no clear cross-linguistic trends here across different component sets; however, certain languages do stand out as being particularly hard to decode from the fine-tuned parser. These include Japanese, Korean, Chinese, French and Turkish. For the first three, we hypothesise that tokenization clashes with
mBERT's internal representations may play a role. Indeed, as we hypothesized in Section 3.2, it could be the case that the composition of CJK characters into gold tokens for Chinese and Japanese may degrade the representations (and their corresponding attention) therein. Furthermore, for Japanese and Korean specifically, it has been observed that tokenization strategies employed by different treebanks could drastically influence the conclusions one may draw about their inherent hierarchical structure (Kulmizev et al., 2020). Turkish and French are admittedly more difficult to diagnose. Note, however, that we fine-tuned our model on a concatenation of all PUD treebanks. As such, any deviation from PUD's annotation norms is therefore likely to be heavily penalised, by virtue of signal from other languages drowning out these differences.

6 Conclusion

In this study, we revisited the prospect of decoding dependency trees from the self-attention patterns of Transformer-based language models. We elected to extend our experiments to 18 languages in order to gain better insight about how tree decoding accuracy might be affected in the face of (modest) typological diversity. Surprisingly, across all languages, we were able to decode dependency trees from attention patterns more accurately than an adjacent-linking baseline, implying that some structure was indeed being tracked by the mechanism. In looking at specific relation types, we corroborated previous studies in showing that particular layer-head combinations tracked the same relation with regularity across languages, despite typological differences concerning word order, etc.

In investigating the extent to which attention can be guided to properly capture structural relations between input words, we fine-tuned mBERT as input to a dependency parser. This, we found, yielded large improvements over the pretrained attention patterns in terms of decoding accuracy, demonstrating that the attention mechanism was learning to represent the structural objective of the parser. In addition to fine-tuning the entire mBERT network, we conducted a series of experiments wherein we updated only select components of the model and left the remainder frozen. Most surprisingly, we observed that the Transformer parameters designed for composing the attention matrix, K and Q, were only modestly capable of guiding the attention towards resembling the dependency structure. In contrast, it was the Value (V) parameters, which are used for computing a weighted sum over the KQ-produced attention, that yielded the most faithful representations of the linguistic structure via attention.

Though prior work (Kovaleva et al., 2019; Zhao and Bethard, 2020) seems to indicate that there is a lack of a substantial change in attention patterns after fine-tuning on syntax- and semantics-oriented classification tasks, the opposite effect has been observed with fine-tuning on negation scope resolution, where a more explanatory attention mechanism can be induced (Htut et al., 2019). Our results are similar to the latter, and we demonstrate that given explicit syntactic annotation, attention weights do end up storing more transparently decodable structure. It is, however, still unclear which sets of transformer parameters are best suited for learning this information and storing it in the form of attention.

Acknowledgements

Our experiments were run on resources provided by UNINETT Sigma2 - the National Infrastructure for High Performance Computing and Data Storage in Norway, under the NeIC-NLPL umbrella. Mostafa and Anders were funded by a Google Focused Research Award. We would like to thank Daniel Dakota and Ali Basirat for some fruitful discussions and the anonymous reviewers for their excellent feedback.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics.

Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396–1400.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R. Bowman. 2019. Do attention heads in BERT track syntactic dependencies? arXiv preprint arXiv:1911.12246.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556.

Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 479–486.

Dan Kondratyuk and Milan Straka. 2019a. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795.

Dan Kondratyuk and Milan Straka. 2019b. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4364–4373, Hong Kong, China. Association for Computational Linguistics.

Artur Kulmizev, Vinit Ravishankar, Mostafa Abdou, and Joakim Nivre. 2020. Do neural language models show preferences for syntactic formalisms? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4077–4091, Online. Association for Computational Linguistics.

Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers. arXiv preprint arXiv:2005.00633.

Tomasz Limisiewicz, Rudolf Rosa, and David Mareček. 2020. Universal Dependencies according to BERT: Both more specific and more general. arXiv preprint arXiv:2004.14620.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 91–98.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530.

Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2019. Universal Dependencies 2.4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Dan Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and
Moro, Bjartur Mortensen, Bohdan Moskalevskyi,               Evaluation (LREC).
Kadri Muischnek, Yugo Murawaki, Kaili Müürisep,
Pinkey Nainwani, Juan Ignacio Navarro Horñiacek,         Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin-
Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Lng             ter, Jan Hajič, Christopher D. Manning, Sampo
Nguy˜ên Thi., Huy`ên Nguy˜ên Thi. Minh, Yoshi-           Pyysalo, Sebastian Schuster, Francis Tyers, and Dan
hiro Nikaido, Vitaly Nikolaev, Rattima Nitisaroj,           Zeman. 2020. Universal Dependencies v2: An ev-
Hanna Nurmi, Stina Ojala, Adédayo. Olúòkun, Mai          ergrowing multilingual treebank collection. In Pro-
Omura, Petya Osenova, Robert Östling, Lilja Øvre-          ceedings of the 12th International Conference on
lid, Niko Partanen, Elena Pascual, Marco Passarotti,        Language Resources and Evaluation (LREC).
Agnieszka Patejuk, Guilherme Paulino-Passos, An-
gelika Peljak-Łapińska, Siyao Peng, Cenel-Augusto        Telmo Pires, Eva Schlinger, and Dan Garrette. 2019.
Perez, Guy Perrier, Daria Petrova, Slav Petrov,             How multilingual is multilingual bert? In Proceed-
Jussi Piitulainen, Tommi A Pirinen, Emily Pitler,           ings of the 57th Annual Meeting of the Association
                                                            for Computational Linguistics, pages 4996–5001.
Alessandro Raganato, Jörg Tiedemann, et al. 2018. An      Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V
  analysis of encoder representations in transformer-        Le, Mohammad Norouzi, Wolfgang Macherey,
  based machine translation. In Proceedings of the           Maxim Krikun, Yuan Cao, Qin Gao, Klaus
  2018 EMNLP Workshop BlackboxNLP: Analyzing                 Macherey, et al. 2016. Google’s neural machine
  and Interpreting Neural Networks for NLP. The As-          translation system: Bridging the gap between hu-
  sociation for Computational Linguistics.                   man and machine translation.      arXiv preprint
                                                             arXiv:1609.08144.
Tim Rocktäschel, Edward Grefenstette, Karl Moritz
  Hermann, Tomáš Kočiskỳ, and Phil Blunsom. 2015.      Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho,
  Reasoning about entailment with neural attention.          Aaron Courville, Ruslan Salakhudinov, Rich Zemel,
  arXiv preprint arXiv:1509.06664.                           and Yoshua Bengio. 2015. Show, attend and tell:
                                                             Neural image caption generation with visual atten-
Rudolf Rosa and David Mareček. 2019. Inducing syn-          tion. In International conference on machine learn-
  tactic trees from bert representations. arXiv preprint     ing, pages 2048–2057.
  arXiv:1906.11511.
                                                           Daniel Zeman, Jan Hajic, Martin Popel, Martin Pot-
                                                             thast, Milan Straka, Filip Ginter, Joakim Nivre, and
Sofia Serrano and Noah A Smith. 2019. Is attention           Slav Petrov. 2018. Conll 2018 shared task: Mul-
  interpretable? arXiv preprint arXiv:1906.03731.            tilingual parsing from raw text to universal depen-
                                                             dencies. In Proceedings of the CoNLL 2018 Shared
Emma Strubell, Patrick Verga, Daniel Andor,                  Task: Multilingual parsing from raw text to univer-
  David Weiss, and Andrew McCallum. 2018.                    sal dependencies, pages 1–21.
  Linguistically-informed self-attention for semantic
  role labeling. In Proceedings of the 2018 Confer-        Yiyun Zhao and Steven Bethard. 2020. How does
  ence on Empirical Methods in Natural Language              BERT’s attention change when you fine-tune? An
  Processing, pages 5027–5038, Brussels, Belgium.            analysis methodology and a case study in negation
  Association for Computational Linguistics.                 scope. page 19.

Swabha Swayamdipta, Sam Thomson, Kenton Lee,
  Luke Zettlemoyer, Chris Dyer, and Noah A. Smith.
  2018. Syntactic scaffolds for semantic structures.
  In Proceedings of the 2018 Conference on Em-
  pirical Methods in Natural Language Processing,
  pages 3772–3782, Brussels, Belgium. Association
  for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019.
   Bert rediscovers the classical nlp pipeline. arXiv
   preprint arXiv:1905.05950.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
  Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
  Kaiser, and Illia Polosukhin. 2017a. Attention is all
  you need. In Advances in neural information pro-
  cessing systems, pages 5998–6008.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
  Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
  Kaiser, and Illia Polosukhin. 2017b. Attention Is
  All You Need. arXiv:1706.03762 [cs]. ArXiv:
  1706.03762.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sen-
  nrich, and Ivan Titov. 2019. Analyzing multi-
  head self-attention: Specialized heads do the heavy
  lifting, the rest can be pruned. arXiv preprint
  arXiv:1905.09418.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is
  not not explanation. In Proceedings of the 2019 Con-
  ference on Empirical Methods in Natural Language
  Processing and the 9th International Joint Confer-
  ence on Natural Language Processing (EMNLP-
  IJCNLP), pages 11–20.
Figure 5: Positional scores across relations for all languages.
Figure 6: Decoding UUAS as a function of best positional baselines.
Figure 7: Parsing scores across components and languages.