Climbing the Tower of Treebanks: Improving Low-Resource Dependency Parsing via Hierarchical Source Selection

 

Goran Glavaš
University of Mannheim, Data and Web Science Group
goran@informatik.uni-mannheim.de

Ivan Vulić
University of Cambridge, Language Technology Lab
iv250@cam.ac.uk

Abstract

Recent work on multilingual dependency parsing focused on developing highly multilingual parsers that can be applied to a wide range of low-resource languages. In this work, we substantially outperform such a "one model to rule them all" approach with a heuristic selection of languages and treebanks on which to train the parser for a specific target language. Our approach, dubbed TOWER, first hierarchically clusters all Universal Dependencies languages based on their mutual syntactic similarity computed from human-coded URIEL vectors. For each low-resource target language, we then climb this language hierarchy starting from the leaf node of that language and heuristically choose the hierarchy level at which to collect training treebanks. This treebank selection heuristic is based on: (i) the aggregate size of all treebanks subsumed by the hierarchy level and (ii) the similarity of the languages in the training sample with the target language. For languages without development treebanks, we additionally use (ii) for model selection (i.e., early stopping) in order to prevent overfitting to the development treebanks of the closest languages. Our TOWER approach shows substantial gains for low-resource languages over two state-of-the-art multilingual parsers, with more than 20 LAS point gains for some of those languages. Parsing models and code available at: https://github.com/codogogo/towerparse.

1 Introduction

Syntactic parsing – grounded in a wide variety of formalisms (Taylor et al., 2003; De Marneffe et al., 2006; Hockenmaier and Steedman, 2007; Nivre et al., 2016, inter alia) – has been the backbone of natural language processing (NLP) for decades, and an indispensable preprocessing step for tackling higher-level language understanding tasks. A recent major paradigm shift in NLP towards large-scale pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020) and their end-to-end fine-tuning for downstream tasks has reduced the downstream relevance of supervised syntactic parsing. What is more, there is mounting evidence that PLMs implicitly acquire rich syntactic knowledge through large-scale pretraining (Hewitt and Manning, 2019; Chi et al., 2020) and that exposing them to explicit syntax from human-coded treebanks does not offer significant language understanding benefits (Kuncoro et al., 2020; Glavaš and Vulić, 2021). In order to implicitly acquire syntactic competencies, however, PLMs need language-specific corpora at a scale at which they can be obtained for only a tiny portion of the world's 7,000+ languages. For the remaining vast majority of languages – with limited-size monolingual corpora – explicit syntax still provides valuable linguistic bias for more sample-efficient learning in downstream NLP tasks.

¹ One can create 2^N − 1 different training sets from a collection of N source treebanks.

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4878–4888, August 1–6, 2021. ©2021 Association for Computational Linguistics

Reliable syntactic parsing requires annotated treebanks of reasonable size: this prerequisite is, unfortunately, satisfied for even fewer languages. Despite multi-year, well-coordinated annotation efforts such as the Universal Dependencies project (Nivre et al., 2016, 2020), language-specific treebanks are unlikely to appear anytime soon for most world languages. This renders the transfer of syntactic knowledge from high-resource languages with annotated treebanks a necessity. A truly zero-shot transfer for low-resource languages assumes a set of training treebanks from resource-rich source languages and a target language without any syntactic annotations. Effectively, the task is then to identify the subset of source treebanks such that a parser trained on it yields the best parsing performance for the target language. An exhaustive search over all possible subsets of source treebanks is not only computationally intractable¹ but also
uninformative in true zero-shot scenarios in which there is no development treebank (i.e., any syntactically annotated data) for the target language. Most existing transfer methods therefore either (1) choose one (or a few) best source languages for each target language (Rosa and Zabokrtsky, 2015; Agić, 2017; Lin et al., 2019; Litschko et al., 2020) or (2) train a single multilingual parser on all available treebanks; such parsers, based on pretrained multilingual encoders, currently produce the best results in low-resource parsing (Kondratyuk and Straka, 2019; Üstün et al., 2020). Other transfer approaches, e.g., those based on data augmentation (Şahin and Steedman, 2018; Vania et al., 2019), violate the zero-shot setup by assuming a small target-language treebank – a requirement unfulfilled for most world languages.²

In this work, we propose a simple and effective heuristic for selecting a good set of source treebanks for any given low-resource target language. In our approach, named TOWER, we first hierarchically cluster all Universal Dependencies (UD) languages. To this end, we compute the syntactic similarity of languages by comparing manually coded vectors of their syntactic properties from the URIEL database (Littell et al., 2017). We then iteratively 'climb' that language hierarchy level by level, starting from the leaf node of the target language. We stop 'climbing' (i.e., select the set of source treebanks subsumed by the current hierarchy level) when the relative decrease in linguistic similarity of the training sample w.r.t. the target language outweighs the increase in size of the training sample. We additionally exploit the linguistic similarity between the target language and its closest sources with existing development treebanks to inform a model selection (that is, early stopping) heuristic. TOWER substantially outperforms state-of-the-art multilingual parsers – UDPipe (Straka, 2018), UDify (Kondratyuk and Straka, 2019), and UDapter (Üstün et al., 2020) – on low-resource languages, while offering comparable performance for high-resource languages.

2 Climbing the TOWER of Treebanks

Constructing the TOWER. We start by hierarchically clustering the set of 89 languages from Universal Dependencies³ based on their syntactic similarity. To this end, we represent each language with its syntax_knn vector from the URIEL database (Littell et al., 2017). Features of these 103-dimensional vectors correspond to individual syntactic properties from manually coded linguistic resources such as WALS (Dryer and Haspelmath, 2013) and SSWL (Collins and Kayne, 2009). URIEL's syntax_knn strategy replaces feature values missing in those resources with kNN-based predictions (cf. Littell et al. (2017) for more details). We then carry out hierarchical agglomerative clustering with Ward's linkage (Anderberg, 2014), with Euclidean distances between URIEL vectors guiding the clustering. Figure 1 shows a dendrogram of one part of the resulting hierarchy; we display the complete hierarchy in the Appendix. The syntax-based clustering largely reflects memberships in language (sub)families, with a few notable exceptions: e.g., Tagalog (tl), from the Austronesian family, appears to be syntactically similar to (and is joined with) Scottish Gaelic (gd), Irish (ga), and Welsh (cy) from the Celtic branch of the Indo-European family.

[Figure 1: Part of the syntax-based hierarchical clustering of UD languages (ISO 639-1 codes).]

Treebank Selection (TBS). For a given test treebank, we start climbing the hierarchy from the leaf node of the treebank's language. Let s_l denote the number of climbing steps we take from the target leaf node l. If the target test treebank also has a corresponding training portion, in-treebank training constitutes the first training configuration (we denote this configuration with s_l = −1). For resource-rich languages with several training treebanks, we create the next training sample by concatenating all of those treebanks (we denote this level with s_l = 0).⁴ For low-resource target languages without any training treebanks, the first training sample is collected at s_l = 1, where the language is joined with other languages. The training set corresponding to a hierarchy level (i.e., to each join in the tree) concatenates all training treebanks of all languages (i.e., leaf nodes) of the respective hierarchy subtree.⁵

Let {S_n}_{n=0 (or −1)}^{N} be the set of training configurations collected by climbing the hierarchy starting from the target language l, and let S_n = ∪{T_k}_{k=1}^{K} be the n-th training set, consisting of K training treebanks. As we climb the hierarchy (i.e., as n increases), the training set S_n is bound to grow; at the same time, the sample of training languages becomes increasingly dissimilar w.r.t. the target language l. In other words, as we climb higher up the induced syntactic hierarchy of languages, we train on more data but from a mixture of (syntactically) more distant languages. Let l_k be the language of the training treebank T_k. We then quantify the syntactic similarity sim(S_n, l) between the training set S_n and the target language l as follows:

    sim(S_n, l) = (1 / |S_n|) · Σ_{k=1}^{K} |T_k| · cos(l_k, l)    (1)

with cos(l_k, l) as the cosine similarity between the URIEL vectors of l_k and l, and the relative sizes of the individual treebanks |T_k| / |S_n| as weights. We then use the following simple heuristic to select the best training set S_n: we stop climbing when the relative growth of the training set becomes smaller than the relative decrease of the similarity with the target language, i.e., we select the smallest n for which the following condition is satisfied:

    |S_{n+1}| / |S_n| < sim(S_n, l) / sim(S_{n+1}, l).    (2)

Model Selection (MS). Early stopping based on model performance on a development set (dev) is an important mechanism for preventing model overfitting in supervised machine learning. In a truly zero-shot transfer setup, on the one hand, we do not have any development data in the target language. Model selection based on the development set of the source language, on the other hand, overfits the model to the source language, which may hurt the effectiveness of the cross-lingual transfer (Keung et al., 2020; Chen and Ritter, 2020). For test treebanks with a respective development portion, TOWER uses that development set for model selection. For low-resource languages l without development treebanks, we compile a proxy development set D_l = ∪{D_k}_{k=1}^{K} by collecting all development treebanks D_k from the hierarchy level closest to l that encompasses at least one treebank with a development set.⁶ Intuitively, the more syntactically similar D_l is to l, the more beneficial model selection based on D_l will be for performance on l, and the closer the optimal model checkpoint w.r.t. l should be to the checkpoint exhibiting the best performance on D_l. Accordingly, with M as the model checkpoint with the best performance on D_l, we select the model checkpoint M′ = ⌊sim(D_l, l) · M⌋ (see Eq. (1)) as the "optimal" checkpoint for the target language l.

Shallow Biaffine Parser. TOWER employs the shallow biaffine parser of Glavaš and Vulić (2021), stacked on top of the pretrained XLM-R (Conneau et al., 2020). Compared to the standard biaffine parser (Dozat and Manning, 2017; Kondratyuk and Straka, 2019; Üstün et al., 2020), this shallow variant forwards word-level representations (aggregated from subword output) directly into the biaffine products, bypassing the deep feed-forward transformations that produce dependent- and head-specific vectors (Dozat and Manning, 2017). The shallow variant is reported to perform comparably (Glavaš and Vulić, 2021), while being faster to train.

3 Evaluation and Discussion

Treebanks and Baselines. We evaluate TOWER on 138 (test) treebanks from Universal Dependencies (Nivre et al., 2020).⁷ We compare TOWER against two state-of-the-art multilingual parsers: (1) UDify (Kondratyuk and Straka, 2019) couples the multilingual BERT (mBERT) (Devlin et al.,

² For the vast majority of world languages there does not exist a single manually annotated syntactic tree.
³ We worked with UD version 2.5.
⁴ For example, for the Russian test treebank SynTagRus, the training set at s_l = −1 consists of the train portion of the same SynTagRus treebank; at s_l = 0, we concatenate the training portions of the Russian GSD, PUD, and SynTagRus treebanks.
⁵ Note that the number of climbs s_l needed to reach some hierarchy level depends on the language l: e.g., the hierarchy level joining Tagalog (tl) with Scottish Gaelic, Irish, and Welsh ({gd, ga, cy}) is reached in s_l = 1 climbs from Tagalog, s_l = 2 climbs from Scottish Gaelic, and s_l = 3 climbs from Irish and Welsh.
⁶ E.g., D_l for l = tl consists of the development portions of the ga and gd treebanks, whereas D_l for l = cy consists only of the development set of ga.
⁷ We work with UD v2.5. Due to mismatches between XLM-R's subword tokenizer and word-level treebank tokens we skip: all Chinese treebanks, Assyrian (AS), Old Russian (RNC and TOROT), Skolt Sami (Giellagas), Japanese (Modern and BCCWJ), A. Greek (Perseus), Gothic (PROIEL), Coptic (Scriptorium), OC Slavonic (PROIEL) and Yoruba (YTB).
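To make the two heuristics of §2 concrete, the following is a minimal Python sketch of the TOWER pipeline: Ward clustering of URIEL syntax_knn vectors, the climb-and-stop rule of Eqs. (1)–(2), and the MS checkpoint rescaling M′ = ⌊sim(D_l, l) · M⌋. This is an illustration under stated assumptions, not the released implementation: the vectors and treebank sizes below are random/hypothetical placeholders, whereas the real system reads them from the URIEL database and the UD treebanks.

```python
import math
import numpy as np
from scipy.cluster.hierarchy import linkage

np.random.seed(0)
# Placeholder stand-ins for 103-dim URIEL syntax_knn vectors and
# treebank sizes; the target "tl" (Tagalog) has no treebank.
uriel = {"tl": np.random.rand(103), "ga": np.random.rand(103),
         "gd": np.random.rand(103), "cy": np.random.rand(103)}
tb_size = {"ga": 4005, "gd": 3541, "cy": 1111}  # hypothetical sizes

langs = sorted(uriel)
X = np.stack([uriel[l] for l in langs])
Z = linkage(X, method="ward")  # Ward's linkage over Euclidean distances

def cluster_members(Z, n_leaves):
    """Leaf sets of every node (leaves + merges) of the dendrogram."""
    members = [{i} for i in range(n_leaves)]
    for a, b, _, _ in Z:
        members.append(members[int(a)] | members[int(b)])
    return members

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def climb(target):
    """TBS: climb from the target leaf; stop per Eq. (2)."""
    members = cluster_members(Z, len(langs))
    t = langs.index(target)
    # Internal nodes on the leaf-to-root path, ordered bottom-up.
    path = sorted((m for m in members if t in m and len(m) > 1), key=len)
    configs = []
    for m in path:
        sample = [langs[i] for i in m
                  if langs[i] in tb_size and langs[i] != target]
        if not sample:
            continue
        size = sum(tb_size[l] for l in sample)
        # Eq. (1): treebank-size-weighted cosine similarity to the target.
        sim = sum(tb_size[l] * cos(uriel[l], uriel[target])
                  for l in sample) / size
        configs.append((sample, size, sim))
    # Eq. (2): smallest n where relative training-set growth falls
    # below the relative similarity decrease.
    for n in range(len(configs) - 1):
        if configs[n + 1][1] / configs[n][1] < configs[n][2] / configs[n + 1][2]:
            return configs[n][0]
    return configs[-1][0]

def select_checkpoint(best_step: int, sim_dl: float) -> int:
    """MS: M' = floor(sim(D_l, l) * M) -- the less similar the proxy
    dev languages, the earlier the checkpoint we fall back to."""
    return math.floor(sim_dl * best_step)
```

For instance, `climb("tl")` returns the source languages subsumed by the selected hierarchy level, and `select_checkpoint(40, 0.85)` falls back from the 40th checkpoint to the 34th.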
              ∩UDify          ∩UDapt          HIGH            LOW
Model         UAS    LAS      UAS    LAS      UAS    LAS      UAS    LAS
UDify         80.9   73.9     –      –        89.2   85.3     39.9   22.2
UDapter       –      –        63.8   52.8     90.9   87.6     43.9   29.3
TOWER         82.4   74.3     68.9   56.0     90.0   86.3     53.7   33.8
 -TBS         80.8   73.2     62.8   51.7     89.4   85.6     47.0   30.1
 -MS          82.1   74.1     67.9   55.2     89.4   85.6     51.2   32.2
 -TBS-MS      80.7   73.1     62.4   51.3     89.4   85.6     45.9   29.0

Table 1: Parsing performance (UAS, LAS) on different UD treebank subsets for the state-of-the-art multilingual parsers UDify and UDapter and variants of our TOWER method. Bold: best performance in each column.
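As described in the Training and Optimization Details of this section, early stopping monitors the dev loss 10 times per epoch and halts after 10 consecutive checkpoints without improvement. A minimal patience-counter sketch (the `dev_losses` stream is hypothetical):

```python
def stop_index(dev_losses, patience=10):
    """Return the checkpoint index at which training stops: the first
    index after `patience` consecutive non-improving dev losses, or the
    last checkpoint if patience is never exhausted."""
    best, since_best = float("inf"), 0
    for i, loss in enumerate(dev_losses):
        if loss < best:
            best, since_best = loss, 0  # new best: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return i
    return len(dev_losses) - 1
```

With `patience=10` and dev losses that plateau after three improving checkpoints, e.g. `[3, 2, 1] + [1] * 20`, training stops at checkpoint 12, i.e., ten checkpoints past the last improvement.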

[Figure 2: LAS performance of UDify, UDapter and TOWER on 12 high-resource treebanks (top figure), and 11 low-resource languages (bottom figure).]

2019) with the deep biaffine parser (Dozat and Manning, 2017) and trains on all UD treebanks; (2) UDapter (Üstün et al., 2020) extends mBERT with adapter parameters (Houlsby et al., 2019; Pfeiffer et al., 2020) that are contextually generated (Platanios et al., 2018) from URIEL vectors – the parameters of the adapter generator are trained on treebanks of 13 diverse resource-rich languages selected by Kulmizev et al. (2019). We additionally quantify the contributions of TOWER's heuristic components (TBS and MS, see §2) by evaluating variants in which we (1) remove TBS and train on the closest language with training data (-TBS), (2) remove MS and simply select the model checkpoint that performs best on the proxy dev set D_l (-MS), and (3) remove both TBS and MS (-TBS-MS).

Training and Optimization Details. We limit input sequences to 128 subword tokens. We use XLM-R Base with L = 12 layers and hidden size H = 768, and apply dropout (p = 0.1) on its outputs before forwarding them to the shallow parsing head. We train in batches of 32 sentences and optimize parameters with Adam (Kingma and Ba, 2015) (starting learning rate 10⁻⁵). We train for 30 epochs, with early stopping based on the dev loss.⁸

Results and Discussion. We show detailed results for all 138 treebanks in the Appendix. In Table 1, we show averages over different treebank subsets: treebanks on which both TOWER and (1) UDify (∩UDify; 111 treebanks) and (2) UDapter (∩UDapt; 39 treebanks) have been evaluated, (3) 12 high-resource languages on which UDapter was trained (HIGH), and (4) 11 low-resource treebanks (LOW) for which all three models have been evaluated. We show LAS scores for languages from

⁸ For low-resource languages without a dev set, we use the proxy D_l (see §2). We checkpoint the model (i.e., measure the dev loss) 10 times per epoch and stop training when the loss does not decrease over 10 consecutive checkpoints.

HIGH and LOW in Figure 2; similar trends are observed with UAS scores.

TOWER outperforms UDify and UDapter in all setups except HIGH, with especially pronounced gains for LOW. This renders TOWER particularly successful for the intended use case: low-resource languages without any training data. Admittedly, the fact that TOWER is built on XLM-R, whereas UDify and UDapter use mBERT, impedes a direct "apples-to-apples" comparison. Two sets of results, however, strongly suggest that it is TOWER's heuristics (TBS & MS) that drive its performance, rather than the XLM-R (instead of mBERT) encoder. First, UDapter outperforms TOWER on high-resource languages with large training treebanks (i.e., the HIGH setup). For these languages, however, TOWER effectively does not employ its heuristics: (i) TBS selects the large language-specific treebank(s), as adding any other language prohibitively reduces the perfect similarity sim(S_0, l) = 1 (see Eq. (1)); (ii) MS is not used, because each high-resource treebank has its own dedicated dev set. Secondly, removing TOWER's heuristics (see -TBS-MS in Table 1) brings its performance slightly below that of UDapter, rendering TBS (primarily) and MS (rather than the XLM-R encoder) crucial for TOWER's gains. Comparing -TBS and -MS reveals that, somewhat expectedly, selecting the "optimal" training sample (TBS) contributes more to the overall performance than the heuristic early stopping (MS).

Looking at individual low-resource languages (Fig. 2), we observe the largest gains for Amharic (am) and Sanskrit (sa). While Sanskrit benefits from TOWER selecting training languages from the same family (Marathi, Urdu, and Hindi), Amharic (Afro-Asiatic family), interestingly, benefits from treebanks of syntactically similar languages from another family (cf. the full TOWER hierarchy in the Appendix) – Tamil and Telugu (Dravidian family). Similarly, Tagalog (an Austronesian language) parsing massively benefits from training on the Scottish Gaelic and Irish treebanks (Indo-European, Celtic).

4 Conclusion

We proposed TOWER, a simple yet effective approach to the crucial problem of source language selection for multilingual and cross-lingual dependency parsing. TOWER hierarchically clusters languages by their syntactic similarity and uses simple heuristics to inform the source selection. A wide-scale UD evaluation and comparisons to current state-of-the-art multilingual dependency parsers validated the effectiveness of TOWER, especially on low-resource languages. Moreover, while the main experiments in this work were based on one particular state-of-the-art parsing architecture, TOWER is fully independent of the chosen underlying parsing model, and thus widely applicable.

Acknowledgments

Goran Glavaš is supported by the Baden-Württemberg Stiftung (Eliteprogramm, AGREE grant). The work of Ivan Vulić is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no. 648909).

References

Željko Agić. 2017. Cross-lingual parser selection for low-resource languages. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies, pages 1–10.

Michael R. Anderberg. 2014. Cluster Analysis for Applications. Academic Press.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of NeurIPS 2020.

Yang Chen and Alan Ritter. 2020. Model selection for cross-lingual transfer using a learned scoring function. arXiv preprint arXiv:2010.06127.

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of ACL 2020, pages 5564–5577.

Chris Collins and Richard Kayne. 2009. Syntactic structures of the world's languages.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of ACL 2020, pages 8440–8451.

Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC 2006, pages 449–
dency parsing. It leverages the language hierarchy,       454.
induced from syntax-based manually coded URIEL           Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
language vectors, and simple treebank selection             Kristina Toutanova. 2019. BERT: Pre-training of

                                                    4882
deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of ICLR 2017.

M. S. Dryer and M. Haspelmath. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Goran Glavaš and Ivan Vulić. 2021. Is supervised syntactic parsing beneficial for language understanding? An empirical investigation. In Proceedings of EACL 2021, pages 3090–3104.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL-HLT 2019, pages 4129–4138.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of ICML 2019, pages 2790–2799.

Phillip Keung, Yichao Lu, Julian Salazar, and Vikas Bhardwaj. 2020. Don't use English dev: On the zero-shot cross-lingual evaluation of contextual embeddings. In Proceedings of EMNLP 2020, pages 549–554.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR 2015.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of EMNLP-IJCNLP 2019, pages 2779–2795.

Artur Kulmizev, Miryam de Lhoneux, Johannes Gontrum, Elena Fano, and Joakim Nivre. 2019. Deep contextualized word embeddings in transition-based and graph-based dependency parsing – a tale of two parsers revisited. In Proceedings of EMNLP-IJCNLP 2019, pages 2755–2768.

Adhiguna Kuncoro, Lingpeng Kong, Daniel Fried, Dani Yogatama, Laura Rimell, Chris Dyer, and Phil Blunsom. 2020. Syntactic structure distillation pretraining for bidirectional encoders. Transactions of the ACL, 8:776–794.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of ACL 2019, pages 3125–3135.

Robert Litschko, Ivan Vulić, Željko Agić, and Goran Glavaš. 2020. Towards instance-level parser selection for cross-lingual transfer of dependency parsers. In Proceedings of COLING 2020, pages 3886–3898.

Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of EACL 2017, pages 8–14.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of LREC 2016, pages 1659–1666.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. arXiv preprint arXiv:2004.10643.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. AdapterHub: A framework for adapting Transformers. In Proceedings of EMNLP 2020: System Demonstrations, pages 46–54.

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of EMNLP 2018, pages 425–435.

Rudolf Rosa and Zdeněk Žabokrtský. 2015. KLcpos3 – a language similarity measure for delexicalized parser transfer. In Proceedings of ACL-IJCNLP 2015, pages 243–249.

Gözde Gül Şahin and Mark Steedman. 2018. Data augmentation via dependency tree morphing for low-resource languages. In Proceedings of EMNLP 2018, pages 5004–5009.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207.

Ann Taylor, Mitchell Marcus, and Beatrice Santorini. 2003. The Penn treebank: An overview. In Treebanks, pages 5–22.
Ahmet Üstün, Arianna Bisazza, Gosse Bouma, and
  Gertjan van Noord. 2020. UDapter: Language adap-
  tation for truly universal dependency parsing. In
  Proceedings of EMNLP 2020.
Clara Vania, Yova Kementchedjhieva, Anders Søgaard,
  and Adam Lopez. 2019. A systematic comparison
  of methods for low-resource dependency parsing on
  genuinely low-resource languages. In Proceedings
  of EMNLP 2019, pages 1105–1116.

Appendix

 Treebank              Method    UAS     LAS      Treebank                Method    UAS     LAS
 Afrikaans             UDPipe    89.38   86.58    Danish                  UDPipe    86.88   84.31
 AfriBooms (af)        UDify     86.97   83.48    DDT (da)                UDify     87.76    84.5
                       T OWER    88.26   85.28                            T OWER    85.60   82.14
 Akkadian              UDify     27.65    4.54    Dutch                   UDPipe    91.37   88.38
 PISANDUB (akk)        UDapter    26.4     8.2    Alpino (nl)             UDify     94.23   91.21
                       T OWER    33.45    6.12                            T OWER    93.42   90.31
 Amharic               UDify     17.38    3.49    Dutch                   UDPipe     90.2   86.39
 ATT (am)              UDapter    12.8    5.91    LassySmall (nl)         UDify     94.34   91.22
                       T OWER    72.64   38.41                            T OWER    92.45   88.29
 Ancient Greek         UDPipe    85.93   82.11    English ESL (en)        T OWER    30.22    6.45
 PROIEL (grc)          UDify     78.91   72.66
                       T OWER    85.04   79.85                            UDPipe    89.63   86.97
                                                  English                 UDify     90.96    88.5
 Arabic NYUAD (ar)     T OWER    33.53   15.94    EWT (en)                UDapter   93.12   89.67
                                                                          T OWER    92.16   89.29
                       UDPipe    87.54   82.94
 Arabic                UDify     87.72   82.88                            UDPipe    87.27   84.12
                                                  English
 PADT (ar)             UDapter   88.66   84.42                            UDify     89.14   85.73
                       T OWER    88.92   83.72    GUM (en)                T OWER    90.07   86.61
 Arabic PUD (ar)       UDify     76.17   67.07    English                 UDPipe    84.15   79.71
                       T OWER    75.94   59.72    LinES (en)              UDify     87.33   83.71
                                                                          T OWER    87.12   82.91
 Armenian              UDPipe    78.62   71.27
 ArmTDP (hy)           UDify     85.63   78.61    English PUD (en)        UDify     91.52   88.66
                       T OWER    86.57   80.51                            T OWER    90.89   87.33
 Bambara               UDify     30.28     8.6    English                 UDPipe    90.29   87.27
 CRB (bm)              UDapter    28.7     8.1    ParTUT (en)             UDify     92.84   90.14
                       T OWER    31.33    8.03                            T OWER    89.36   85.63
                       UDPipe    86.11   82.86    English Pronouns (en)   T OWER    89.50   85.37
 Basque                UDify     84.94   80.97
 BDT (eu)              UDapter   87.25   83.33    Erzya                   UDify      31.9   16.38
                       T OWER    84.28   80.02    JR (myv)                UDapter   34.21   19.15
                                                                          T OWER    36.44   19.38
                       UDPipe    78.58   72.72
 Belarusian            UDify     91.82   87.19    Estonian                UDPipe     88.0   85.18
 HSE (be)              UDapter   84.16   79.33    EDT (et)                UDify     89.53   86.67
                       T OWER    86.40   81.56                            T OWER    90.24   87.08
                       UDapter    52.9   37.34    Estonian EWT (et)       T OWER    88.80   84.54
 Bhojpuri BHTB (bho)   T OWER    52.62   35.86
                                                  Faroese                 UDify     67.24   59.26
 Breton                UDify     63.52   39.84    OFT (fo)                UDapter   77.15    69.2
                       UDapter   72.91    58.5                            T OWER    77.43   68.41
 KEB (br)              T OWER    67.73   44.47
                                                  Finnish                 UDPipe    90.68   87.89
 Bulgarian             UDPipe    93.38   90.35    FTB (fi)                UDify     86.37    81.4
                       UDify     95.54    92.4                            T OWER    91.91   89.05
 BTB (bg)              T OWER    95.67   92.03
                                                  Finnish PUD (fi)        UDify     89.76   86.58
                       UDPipe     32.6   18.83                            T OWER    88.24   82.48
 Buryat                UDify     48.43   26.28
 BDT (bxr)                                                                UDPipe    89.88   87.46
                       UDapter   48.68   28.89    Finnish                 UDify     86.42   82.03
                       T OWER    51.53   29.16    TDT (fi)                UDapter   91.87   89.01
 Catalan               UDPipe    93.22   91.06                            T OWER    92.78   90.22
 AnCora (ca)           UDify     94.25   92.33    French FQB (fr)         T OWER    93.36   87.00
                       T OWER    94.04   92.08
                                                  French FTB (fr)         T OWER    28.04   14.80
 Croatian              UDPipe     91.1   86.78
 SET (hr)              UDify     94.08   89.79    French                  UDPipe    90.65   88.06
                       T OWER    92.22   87.02    GSD (fr)                UDify      93.6   91.45
                                                                          T OWER    94.06   91.31
 Czech                 UDPipe    92.99   90.71
 CAC (cs)              UDify     94.33   92.41    French PUD (fr)         UDify     88.36   82.76
                       T OWER    94.91   92.10                            T OWER    91.02   83.52
 Czech                 UDPipe     86.9   84.03    French                  UDPipe    92.17   89.63
 CLTT (cs)             UDify     91.69   89.96    ParTUT (fr)             UDify     90.55   88.06
                       T OWER    94.11   91.38                            T OWER    87.90   79.33
 Czech                 UDPipe    92.91   89.75    French                  UDPipe    92.37   90.73
 FicTree (cs)          UDify     95.19   92.77    Sequoia (fr)            UDify     92.53   90.05
                       T OWER    95.12   91.83                            T OWER    92.07   89.93
 Czech                 UDPipe    93.33   91.31    French                  UDPipe     82.9   77.53
 PDT (cs)              UDify     94.73   92.88    Spoken (fr)             UDify     85.24   80.01
                       T OWER    95.01   92.41                            T OWER    84.41   74.77
 Czech PUD (cs)        UDify     92.59   87.95    Galician                UDPipe    86.44   83.82
                       T OWER    93.26   87.06    CTG (gl)                UDify     84.75   80.89
                                                                          T OWER    83.85   80.65

Treebank                 Method    UAS     LAS     Treebank                  Method    UAS     LAS
Galician                 UDPipe    82.72   77.69                             UDPipe     87.7   84.24
                         UDify     84.08   76.77   Korean                    UDify     82.74   74.26
TreeGal (gl)             T OWER    77.57   66.87   GSD (ko)                  UDapter   89.39   85.91
                                                                             T OWER    86.04   81.70
German                   UDPipe    85.53   81.07
GSD (de)                 UDify     87.81   83.59   Korean                    UDPipe    88.42   86.48
                         T OWER    89.11   84.19   Kaist (ko)                UDify     87.57   84.52
                                                                             T OWER    88.78   86.11
German HDT (de)          T OWER    97.65   96.54
                                                   Korean PUD (ko)           UDify     63.57   46.89
German LIT (de)          T OWER    86.55   78.74                             T OWER    61.78   38.40
German PUD (de)          UDify     89.86   84.46                             UDPipe    45.23   34.32
                         T OWER    89.15   81.02   Kurmanji                  UDify     35.86    20.4
                         UDPipe     92.1   89.79   MG (kmr)                  UDapter   26.37    12.1
Greek                                                                        T OWER    72.00   51.02
GDT (el)                 UDify     94.33   92.15
                         T OWER    94.13   91.16                             UDPipe    91.06    88.8
                                                   Latin
                         UDPipe     89.7   86.86   ITTB (la)                 UDify     92.43   90.12
Hebrew                   UDify     91.63   88.11                             T OWER    91.25   87.67
HTB (he)                 UDapter   91.86   88.75                             UDPipe    83.34   78.66
                         T OWER    90.71   87.05   Latin
                                                   PROIEL (la)               UDify     84.85   80.52
                         UDPipe    94.85   91.83                             T OWER    83.74   77.75
Hindi                    UDify     95.13   91.46
HDTB (hi)                                          Latin                     UDPipe     71.2   61.28
                         UDapter   95.29   91.96                             UDify     78.33    69.6
                         T OWER    95.12   91.42   Perseus (la)              T OWER    73.53   62.16
Hindi PUD (hi)           UDify     71.64   58.42                             UDPipe     87.2   83.35
                         T OWER    73.02   50.68   Latvian
                                                   LVTB (lv)                 UDify     89.33   85.09
                         UDPipe    84.04   79.73                             T OWER    92.26   88.52
Hungarian
Szeged (hu)              UDify     89.68   84.88   Lithuanian ALKSNIS (lt)   T OWER    87.35   81.58
                         T OWER    87.87   81.02
                                                   Lithuanian                UDPipe    51.98   42.17
Indonesian               UDPipe    85.31   78.99                             UDify     79.06   69.34
                         UDify     86.45    80.1   HSE (lt)                  T OWER    79.25   65.47
GSD (id)                 T OWER    83.71   76.84
                                                   Livvi KKPP (olo)          UDapter   57.86   43.34
Indonesian PUD (id)      UDify     77.47    56.9                             T OWER    62.77   44.62
                         T OWER    76.71   53.16
                                                   Maltese                   UDPipe    84.65   79.71
Irish                    UDPipe    80.39   72.34                             UDify     83.07   75.56
                         UDify     80.05   69.28   MUDT (mt)                 T OWER    76.64   67.31
IDT (ga)                 T OWER    80.33   66.80
                                                                             UDPipe    70.63   61.41
                         UDPipe    93.49   91.54   Marathi                   UDify     79.37   67.72
Italian                  UDify     95.54   93.69   UFAL (mr)                 UDapter   61.01    44.4
ISDT (it)                UDapter   95.32   93.46                             T OWER    70.39   57.77
                         T OWER    94.47   91.98
                                                   Mbya Guarani
Italian PUD (it)         UDify     94.18   91.76                             T OWER    18.10   5.82
                         T OWER    94.13   89.01   Dooley (gun)
                         UDPipe    92.64   90.47   Mbya Guarani
Italian                                                                      T OWER    32.36   11.23
                         UDify     95.96   93.68   Thomas (gun)
ParTUT (it)              T OWER    95.06   91.57
                                                   Moksha JR (mdf)           UDapter   40.15   26.55
Italian PoSTWITA (it)    T OWER    86.95   81.75                             T OWER    44.21   27.45
Italian TWITTIRO (it)    T OWER    86.93   80.91   Naija                     UDify     45.75   32.16
                                                   NSC (pcm)                 UDapter   49.24   36.72
Italian VIT (it)         T OWER    91.80   87.05                             T OWER    52.03   34.95
                         UDPipe    95.06   93.73   North Sami                UDPipe     78.3   73.49
Japanese                 UDify     94.37   92.08
                                                   Giella (sme)              UDify      74.3   67.13
GSD (ja)                 UDapter   94.87   92.84                             T OWER    53.53   42.05
                         T OWER    92.58   89.44
                                                   Norwegian                 UDPipe    92.39   90.49
Japanese PUD (ja)        UDify     94.89   93.62                             UDify     93.97   92.18
                         T OWER    91.12   88.41   Bokmaal (no)              T OWER    94.77   93.12
Karelian KKPP (krl)      UDapter   61.86   48.35   Norwegian                 UDPipe    92.09   90.01
                         T OWER    62.18   45.60                             UDify     94.34   92.37
                                                   Nynorsk (no)              T OWER    93.96   91.65
                         UDPipe     53.3   33.38
Kazakh                   UDify     74.77   63.66   Norwegian                 UDPipe    68.08   60.07
KTB (kk)                 UDapter   74.13   60.74                             UDify      75.4    69.6
                         T OWER    73.70   59.88   NynorskLIA (no)           T OWER    75.43   69.82
Komi Permyak UH (koi)    UDapter   36.89   23.05   Old French                UDPipe    91.74   86.83
                         T OWER    42.36   25.81                             UDify     91.74   86.65
                                                   SRCMF (fro)               T OWER    89.75   83.48
Komi Zyrian IKDP (kpv)   UDify     36.01   22.12
                         T OWER    40.87   24.71   Persian                   UDPipe    90.05   86.66
                         UDify     28.85   12.99   Seraji (fa)               UDify     89.59   85.84
Komi Zyrian                                                                  T OWER    91.29   87.43
Lattice (kpv)            UDapter    28.4    12.5
                         T OWER    33.29   17.33

Treebank              Method    UAS     LAS     Treebank                Method    UAS     LAS
Polish                UDPipe    96.58   94.76   Spanish PUD (es)        UDify     90.45   83.08
LFG (pl)              UDify     96.67   94.58                           T OWER    89.66   80.23
                      T OWER    97.06   95.18
                                                Swedish                 UDPipe    86.07   81.86
Polish PDB (pl)       T OWER    94.99   89.95   LinES (sv)              UDify     88.77   85.49
                                                                        T OWER    88.63   85.07
Polish PUD (pl)       T OWER    94.13   87.44
                                                Swedish PUD (sv)        UDify     89.17    86.1
Portuguese            UDPipe    91.36   89.04                           T OWER    89.20   84.95
Bosque (pt)           UDify     91.37   87.84
                      T OWER    91.50   88.29                           UDPipe    89.63   86.61
                                                Swedish                 UDify     91.91   89.03
Portuguese            UDPipe    93.01   91.63   Talbanken (sv)          UDapter   92.62   90.26
GSD (pt)              UDify     94.22   92.54                           T OWER    89.70   86.60
                      T OWER    93.80   91.98
                                                Swedish Sign Language   UDPipe    50.35   37.94
Portuguese PUD (pt)   UDify     87.02   80.17                           UDify     40.43   26.95
                      T OWER    87.27   77.86   SSLC (swl)              T OWER    31.56   20.57
Romanian              UDPipe    89.12    84.2                           UDapter   59.74   45.49
                      UDify     90.36   85.26   Swiss Ger. UZH (gsw)    T OWER    55.61   40.17
Nonstandard (ro)      T OWER    90.59   84.41
                                                Tagalog                 UDify     64.04   40.07
Romanian              UDPipe    91.31   86.74                           UDapter   84.78   69.52
                      UDify     93.16   88.56   TRG (tl)                T OWER    91.78   74.32
RRT (ro)              T OWER    93.61   87.70
                                                                        UDPipe    74.11   66.37
Romanian                                        Tamil                   UDify     79.34   71.29
                      T OWER    91.19   86.75   TTB (ta)
SiMoNERo (ro)                                                           UDapter   70.28   46.05
                                                                        T OWER    71.28   64.36
Russian               UDPipe    88.15   84.37
GSD (ru)              UDify     90.71   86.03                           UDPipe    91.26   85.02
                      T OWER    91.85   88.28   Telugu                  UDify     92.23   83.91
                                                MTG (te)                UDapter   83.52    71.1
Russian PUD (ru)      UDify     93.51   87.14                           T OWER    90.43   81.97
                      T OWER    94.59   88.26
                                                Thai PUD (th)           UDify     49.05   26.06
                      UDPipe     93.8   92.32                           T OWER    78.23   53.80
Russian               UDify     94.83   93.13
SynTagRus (ru)        UDapter   94.04   92.24   Turkish GB (tr)         T OWER    75.36   59.39
                      T OWER    95.28   93.75
                                                                        UDPipe    74.19   67.56
Russian               UDPipe    75.45   69.11   Turkish                 UDify     74.56   67.44
Taiga (ru)            UDify     84.02    77.8   IMST (tr)               UDapter   76.97   69.63
                      T OWER    84.83   77.71                           T OWER    77.90   70.00
Sanskrit              UDify     40.21   18.56                           UDify     67.68   46.07
                      UDapter   44.32   22.22   Turkish PUD (tr)        T OWER    62.29   41.57
UFAL (sa)             T OWER    63.05   44.66
                                                Ukrainian               UDPipe    88.29   85.25
Scottish Gaelic                                                         UDify     92.83    90.3
                      T OWER    81.32   73.82   IU (uk)                 T OWER    92.54   89.89
ARCOSG (gd)
Serbian               UDPipe     92.7   89.27                           UDPipe    45.58   34.54
                      UDify     95.68   91.95   Upper Sorbian           UDify     71.55   62.82
SET (sr)              T OWER    94.36   90.93   UFAL (hsb)              UDapter   62.28    54.2
                                                                        T OWER    70.98   60.90
Slovak                UDPipe    89.82    86.9
                      UDify     95.92   93.87   Urdu                    UDPipe     87.5   81.62
SNK (sk)              T OWER    93.77   90.87                           UDify     88.43   82.84
                                                UDTB (ur)               T OWER    87.43   81.62
Slovenian             UDPipe    92.96   91.16
                      UDify     94.74   93.07   Uyghur                  UDPipe    78.46   67.09
SSJ (sl)              T OWER    94.91   93.50                           UDify     65.89    48.8
                                                UDT (ug)                T OWER    79.11   66.41
Slovenian             UDPipe    73.51   67.51
                      UDify     80.37   75.03   Vietnamese              UDPipe    70.38   62.56
SST (sl)              T OWER    78.64   73.10                           UDify     74.11    66.0
                                                VTB (vi)                T OWER    72.40   63.50
Spanish               UDPipe    92.34   90.26
                      UDify     92.99    90.5   Warlpiri                UDify     21.66    7.96
AnCora (es)           T OWER    92.67   90.44                           UDapter    24.2    12.1
                                                UFAL (wbp)              T OWER    31.85   16.24
Spanish               UDPipe    90.71   88.03
                      UDify     90.82   87.23   Welsh CCG (cy)          UDapter   70.75   54.43
GSD (es)              T OWER    92.12   89.64                           T OWER    77.22   57.56
                                                Wolof
                                                                        T OWER    69.06   58.13
                                                WTB (wo)

Figure 3: Dendrogram of the full syntax-based hierarchical clustering of 89 languages from UD v2.5. Languages
are denoted with their ISO 639-1 codes.
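The clustering behind the dendrogram in Figure 3 can be illustrated in outline as follows. This is a minimal sketch, not the paper's implementation: the binary "syntax feature" vectors and language set below are hypothetical toy stand-ins for the human-coded URIEL vectors, and average linkage over cosine similarity is assumed as the agglomeration criterion.

```python
# Toy sketch of syntax-based agglomerative clustering of languages.
# Assumptions (not from the paper): URIEL-style binary syntax vectors,
# average linkage, cosine similarity. Toy vectors are invented.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def average_linkage(clusters, vectors, c1, c2):
    """Mean pairwise cosine similarity between two clusters of languages."""
    pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

def cluster_languages(vectors):
    """Greedy agglomerative clustering; returns merges, most similar first."""
    clusters = {i: [lang] for i, lang in enumerate(vectors)}
    merges = []
    while len(clusters) > 1:
        ids = list(clusters)
        a, b = max(
            ((x, y) for i, x in enumerate(ids) for y in ids[i + 1:]),
            key=lambda p: average_linkage(clusters, vectors, *p),
        )
        merges.append((tuple(clusters[a]), tuple(clusters[b])))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

# Hypothetical binary syntax vectors (6 toy features per language).
toy = {
    "hi": [1, 1, 0, 1, 0, 1],  # Hindi
    "ur": [1, 1, 0, 1, 0, 1],  # Urdu: identical to Hindi here on purpose
    "mr": [1, 1, 0, 1, 1, 1],  # Marathi
    "en": [0, 0, 1, 0, 1, 0],  # English
}

merges = cluster_languages(toy)
print(merges[0])  # → (('hi',), ('ur',)): identical vectors merge first
```

Read bottom-up, the merge list is the dendrogram: the Indic toy languages join before English does, mirroring how the real hierarchy groups syntactically similar languages regardless of corpus size.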
