Neural Modeling for Named Entities and Morphology (NEMO2)

Dan Bareket1,2 and Reut Tsarfaty1
1 Bar Ilan University, Ramat-Gan, Israel
2 Open Media and Information Lab (OMILab), The Open University of Israel

arXiv:2007.15620v2 [cs.CL] 10 May 2021

Abstract

Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically-Rich Languages (MRLs) pose a challenge to this basic formulation, as the boundaries of Named Entities do not necessarily coincide with token boundaries; rather, they respect morphological boundaries. To address NER in MRLs we then need to answer two fundamental questions, namely, what are the basic units to be labeled, and how can these units be detected and classified in realistic settings, i.e., where no gold morphology is available. We empirically investigate these questions on a novel NER benchmark, with parallel token-level and morpheme-level NER annotations, which we develop for Modern Hebrew, a morphologically rich and ambiguous language. Our results show that explicitly modeling morphological boundaries leads to improved NER performance, and that a novel hybrid architecture, in which NER precedes and prunes morphological decomposition, greatly outperforms the standard pipeline, where morphological decomposition strictly precedes NER, setting a new performance bar for both Hebrew NER and Hebrew morphological decomposition tasks.

1 Introduction

Named Entity Recognition (NER) is a fundamental task in the area of Information Extraction (IE), in which mentions of Named Entities (NEs) are extracted and classified in naturally-occurring texts. This task is most commonly formulated as a sequence labeling task, where extraction takes the form of assigning each input token a label that marks the boundaries of the NE (e.g., B, I, O), and classification takes the form of assigning labels to indicate entity type (PER, ORG, LOC, etc.).

Despite a common initial impression from the latest NER performance, brought about by neural models on the main English NER benchmarks — CoNLL 2003 (Tjong Kim Sang, 2003) and OntoNotes (Weischedel et al., 2013) — the NER task in real-world settings is far from solved. Specifically, NER performance is shown to greatly diminish when moving to other domains (Luan et al., 2018; Song et al., 2018), when addressing the long tail of rare, unseen, and new user-generated entities (Derczynski et al., 2017), and when handling languages with a fundamentally different structure than English. In particular, there is no readily available and empirically verified neural modeling strategy for NER in languages with complex word-internal structure, also known as morphologically-rich languages.

Morphologically-rich languages (MRLs) (Tsarfaty et al., 2010; Seddah et al., 2013; Tsarfaty et al., 2020) are languages in which substantial information concerning the arrangement of words into phrases and the relations between them is expressed at the word level, rather than in a fixed word order or a rigid structure. The extended amount of information expressed at the word level and the morpho-phonological processes creating these words result in high token-internal complexity, which poses serious challenges to the basic formulation of NER as classification of raw, space-delimited tokens. Specifically, while NER in English is formulated as the sequence labeling of space-delimited tokens, in MRLs a single token may include multiple meaning-bearing units, henceforth morphemes, only some of which are relevant for the entity mention at hand.

In this paper we formulate two questions concerning neural modeling strategies for NER in MRLs, namely: (i) what should be the granularity of the units to be labeled — space-delimited tokens or finer-grained morphological segments? and (ii) how can we effectively encode, and accurately detect, the morphological segments that are relevant to NER, specifically in realistic settings, when gold morphological boundaries are not available?

To empirically investigate these questions we develop a novel parallel benchmark, containing parallel token-level and morpheme-level NER annotations for texts in Modern Hebrew — a morphologically rich and morphologically ambiguous language, which is known to be notoriously hard to parse (More et al., 2019; Tsarfaty et al., 2019).

Our results show that morpheme-based NER is superior to token-based NER, which encourages a segmentation-first pipeline. At the same time, we demonstrate that token-based NER improves morphological segmentation in realistic scenarios, encouraging a NER-first pipeline. While these two findings may appear contradictory, we aim here to offer a climax: a hybrid architecture, in which the token-based NER predictions precede and prune the space of morphological decomposition options, while the actual morpheme-based NER takes place only after the morphological decomposition. We empirically show that the hybrid architecture we propose outperforms all token-based and morpheme-based model variants of Hebrew NER on our benchmark, and it further outperforms all previously reported results on Hebrew NER and morphological decomposition. Our error analysis further demonstrates that morpheme-based models generalize better, that is, they contribute to recognizing the long tail of entities unseen during training (out-of-vocabulary, OOV), in particular those unseen entities that turn out to be composed of previously seen morphemes.

The contribution of this paper is thus manifold. First, we define key architectural questions for neural NER modeling in MRLs and chart the space of modeling options. Second, we deliver a novel parallel benchmark that allows one to empirically compare and contrast the morpheme vs. token modeling strategies. Third, we show consistent advantages for morpheme-based NER, demonstrating the importance of morphologically-aware modeling. Next, we present a novel hybrid architecture which demonstrates an even further improved performance on both the NER and morphological decomposition tasks. Our results for Hebrew present a new bar on these tasks, outperforming the reported state-of-the-art results on various benchmarks.1

1 Data & code:

2 Research Questions: NER for MRLs

In MRLs, words are internally complex, and word boundaries do not generally coincide with the boundaries of more basic meaning-bearing units. This fact has critical ramifications for sequence labeling tasks in MRLs in general, and for NER in MRLs in particular. Consider, for instance, the three-token Hebrew phrase in (1):2

(1) טסנו מתאילנד לסין
    tasnu     mithailand     lesin
    flew.1PL  from-Thailand  to-China
    'we flew from Thailand to China'

It is clear that תאילנד/thailand (Thailand) and סין/sin (China) are NEs, and in English, each NE is its own token. In the Hebrew phrase, though, neither NE constitutes a single token. In either case, the NE occupies only one of two morphemes in the token, the other being a case-assigning preposition. This simple example demonstrates an extremely frequent phenomenon in MRLs such as Hebrew, Arabic or Turkish: the adequate boundaries for NEs do not coincide with token boundaries, and tokens must be segmented in order to obtain accurate NE boundaries.3

2 Glossing conventions are in accord with the Leipzig Glossing Rules (Comrie et al., 2008).
3 We use the term morphological segmentation (or segmentation) to refer to splitting raw tokens into morphological segments, each carrying a single Part-of-Speech tag. That is, we segment away prepositions, determiners, subordination markers and multiple kinds of pronominal clitics, which attach to their hosts via complex morpho-phonological processes. Throughout this work, we use the terms morphological segment, morpheme, and segment interchangeably.

The segmentation of tokens and the identification of adequate NE boundaries is, however, far from trivial, due to complex morpho-phonological and orthographic processes in some MRLs (Vania et al., 2018; Klein and Tsarfaty, 2020). This means that the morphemes that compose NEs are not necessarily transparent in the character sequence of the raw tokens. Consider for example phrase (2):

(2) המרוץ לבית הלבן
    hamerotz  labayit       halavan
    the-race  to-house.DEF  the-white
    'the race to the White House'

Here, the full form of the NE הבית הלבן / habayit halavan (the White House) is not present in the utterance; only the sub-string בית הלבן / bayit halavan ((the) White House) is present in (2) — due to
phonetic and orthographic processes suppressing the definite article ה/ha in certain environments. In this and many other cases, it is not only that NE boundaries do not coincide with token boundaries; they do not coincide with characters or sub-strings of the token either. This calls for accessing the more basic meaning-bearing units of the token, that is, to decompose the tokens into morphemes.

Unfortunately though, the morphological decomposition of surface tokens may be very challenging due to extreme morphological ambiguity. The sequence of morphemes composing a token is not always directly recoverable from its character sequence, and is not known in advance.4 This means that for every raw space-delimited token, there are many conceivable readings which impose different segmentations, yielding different sets of potential NE boundaries. Consider for example the token לבני (lbny) in different contexts:

(3) (a) השרה לבני
        hasara        livni
        the-minister  [Livni]PER
        'Minister [Livni]PER'
    (b) לבני גנץ
        le-beny     gantz
        for-[Benny  Gantz]PER
        'for [Benny Gantz]PER'
    (c) לבני היקר
        li-bni            hayakar
        for-son.POSS.1SG  the-dear
        'for my dear son'
    (d) לבני חימר
        livney    kheymar
        brick.CS  clay
        'clay bricks'

In (3a) the token לבני is completely consumed as a labeled NE. In (3b) לבני is only partly consumed by an NE, and in (3c) and (3d) the token is entirely outside of an NE context. In (3c) the token is composed of several morphemes, and in (3d) it consists of a single morpheme. These are only some of the possible decompositions of this surface token; other alternatives may still be available. As shown by Goldberg and Tsarfaty (2008); Green and Manning (2010); Seeker and Çetinoğlu (2015); Habash and Rambow (2005); More et al. (2019), and others, the correct morphological decomposition becomes apparent only in the larger (syntactic or semantic) context. The challenge, in a nutshell, is as follows: in order to accurately detect NE boundaries, we need to segment the raw token first; however, in order to segment tokens correctly, we need to know the greater semantic content, including, e.g., the participating entities. How can we break out of this apparent loop?

Finally, MRLs are often characterized by an extremely sparse lexicon, consisting of a long tail of out-of-vocabulary (OOV) entities unseen during training (Czarnowska et al., 2019). Even in cases where all morphemes are present in the training data, morphological compositions of seen morphemes may yield tokens and entities which were unseen during training. Take for example the utterance in (4), which the reader may inspect as familiar:

(4) טסנו מסין לתאילנד
    tasnu     misin       lethailand
    flew.1PL  from-China  to-Thailand
    'we flew from China to Thailand'

Example (4) is in fact example (1) with a switched flight direction. This subtle change creates two new surface tokens, מסין and לתאילנד, which might not have been seen during training, even if example (1) had been observed. Morphological compositions of an entity with prepositions, conjunctions, definite markers, possessive clitics and more cause mentions of seen entities to have unfamiliar surface forms, which often fail to be accurately detected and analyzed.

Given the aforementioned complexities, in order to solve NER for MRLs we ought to answer the following fundamental modeling questions:

Q1. Units: What are the discrete units upon which we need to set NE boundaries in MRLs? Are they tokens? Characters? Morphemes? A representation containing multiple levels of granularity?

Q2. Architecture: When employing morphemes in NER, the classical approach is "segmentation-first". However, segmentation errors are detrimental, and downstream NER cannot recover from them. How is it best to set up the pipeline so that segmentation and NER can interact?

Q3. Generalization: How do the different modeling choices affect NER generalization in MRLs? How can we address the long tail of OOV NEs in MRLs? Which modeling strategy best handles pseudo-OOV entities that result from a previously unseen composition of already seen morphemes?

4 This ambiguity is magnified by the fact that Semitic languages that use abjads, like Hebrew and Arabic, lack capitalization altogether and suppress all vowels (diacritics).
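The ambiguity of a token like לבני can be made concrete with a small sketch. The toy "analyzer" below is our own illustration, not the paper's morphological analyzer; the segment inventory per reading is simplified, and the readings are transcribed from example (3). The point it demonstrates is that mapping a surface token to all of its conceivable segmentations yields several different sets of potential NE boundaries, and the character sequence alone cannot choose among them:

```python
# Toy illustration of extreme morphological ambiguity (cf. example (3)).
# The analysis dictionary is a simplified assumption for illustration,
# not the paper's morphological analysis (MA) component.

ANALYSES = {
    # surface token -> list of (segments, gloss) readings
    "לבני": [
        (["לבני"],         "livni: the surname [Livni]PER, token = one NE"),
        (["ל", "בני"],     "le-beny: for-[Benny ...]PER, NE starts mid-token"),
        (["ל", "בן", "י"], "li-bni: for-son-my, no NE at all"),
        (["לבני"],         "livney: bricks-of (construct state), no NE"),
    ],
}

def morphological_analysis(token):
    """MA for a single token: return ALL conceivable readings.
    Picking the correct one requires the larger (syntactic or
    semantic) context, not just the characters of the token."""
    return ANALYSES.get(token, [([token], "unknown")])

for segments, gloss in morphological_analysis("לבני"):
    print("+".join(segments), "--", gloss)
```

Each reading implies a different labeling decision downstream, which is exactly why segmentation errors are so damaging to a segmentation-first NER pipeline.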
3 Formalizing NER for MRLs

To answer the aforementioned questions, we chart and formalize the space of modeling options for neural NER in MRLs. We cast NER as a sequence labeling task and formalize it as f : X → Y, where x ∈ X is a sequence x1, ..., xn of n discrete strings from some vocabulary xi ∈ Σ, and y ∈ Y is a sequence y1, ..., yn of the same length, where yi ∈ Labels, and Labels is a finite set of labels composed of the BIOSE tags (a.k.a. BIOLU, as described in Ratinov and Roth (2009)). Every non-O label is also enriched with an entity type label. Our list of types is presented in Table 2.

3.1 Token-Based or Morpheme-Based?

Our first modeling question concerns the discrete units upon which to set the NE boundaries. That is, what is the formal definition of the input vocabulary Σ for the sequence labeling task?

The simplest scenario, adopted in most NER studies, assumes token-based input, where each token admits a single label — hence token-single:

    NER_token-single : W → L

Here, W = {w* | w ∈ Σ} is the set of all possible token sequences in the language and L = {l* | l ∈ Labels} is the set of all possible label sequences over the label set defined above. Each token gets assigned a single label, so the input and output sequences are of the same length. The drawback of this scenario is that since the input for token-single incorporates no morphological boundaries, the exact boundaries of the NEs remain underspecified. This case is exemplified in the top row of Table 1.

Nickname      Input   Lit           Output
token-single  המרוץ   the-race      O
              לבית    to-house.DEF  B_ORG
              הלבן    the-white     E_ORG
token-multi   המרוץ   the-race      O+O
              לבית    to-house.DEF  O+B_ORG+I_ORG
              הלבן    the-white     I_ORG+E_ORG
morpheme      ה       the           O
              מרוץ    race          O
              ל       to            O
              ה       the           B_ORG
              בית     house         I_ORG
              ה       the           I_ORG
              לבן     white         E_ORG

Table 1: Input/output for token-single, token-multi and morpheme models for example (2) in Sec. 2.

There is another conceivable scenario, where the input is again the sequence of space-delimited tokens, and the output consists of complex labels (henceforth multi-labels) reflecting, for each token, the labels of its constituent morphemes; henceforth, a token-multi scenario:

    NER_token-multi : W → L*

Here, W = {w* | w ∈ Σ} is the set of sequences of tokens as in token-single. Each token is assigned a multi-label, i.e., a sequence (l* ∈ L) which indicates the labels of the token's morphemes in order. The output is a sequence of such multi-labels, one multi-label per token. This variant incorporates morphological information concerning the number and order of labeled morphemes, but lacks the precise morphological boundaries. This is illustrated in the middle of Table 1. A downstream application may require (possibly noisy) heuristics to determine the precise NE boundaries of each individual label in the multi-label for an input token.

Another possible scenario is a morpheme-based scenario, assigning a label l ∈ L for each segment:

    NER_morph : M → L

Here, M = {m* | m ∈ Morphemes} is the set of sequences of morphological segments in the language, and L = {l* | l ∈ Labels} is the set of label sequences as defined above. The upshot of this scenario is that NE boundaries are precise. An example is given in the bottom row of Table 1. But, since each token may contain many meaningful morphological segments, the length of the token sequence is not the same as the length of the sequence of morphological segments to be labeled, and the model assumes prior morphological segmentation — which in realistic scenarios is not necessarily available.

3.2 Realistic Morphological Decomposition

A major caveat with morpheme-based modeling strategies is that they often assume an ideal scenario of gold morphological decomposition of the space-delimited tokens into morphological segments (cf. Nivre et al. (2007); Pradhan et al. (2012)). But in reality, gold morphological decomposition is not known in advance; it has to be predicted automatically, and prediction errors may propagate to contaminate the downstream task.

Our second modeling question therefore concerns the interaction between the morphological decomposition and NER tasks: how would it be best to set up the pipeline so that the prediction of the two tasks can interact?
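The three labeling schemes can be contrasted concretely on example (2). The snippet below is a minimal sketch: the tokens, segments, and labels are transcribed from Table 1, and the length invariants it checks are exactly what distinguishes the schemes (token-single and token-multi align with tokens; morpheme aligns with segments and presupposes a segmentation):

```python
# Input/output of the three schemes of Table 1 for example (2),
# "המרוץ לבית הלבן" (the race to the White House).
# Data transcribed from Table 1; variable names are ours.

tokens = ["המרוץ", "לבית", "הלבן"]
# gold morphological segments per token:
segments = [["ה", "מרוץ"], ["ל", "ה", "בית"], ["ה", "לבן"]]

# token-single: one label per token; exact NE boundaries underspecified.
token_single = ["O", "B_ORG", "E_ORG"]

# token-multi: one multi-label per token, listing its morphemes' labels
# in order, but without the morphological boundaries themselves.
token_multi = [["O", "O"], ["O", "B_ORG", "I_ORG"], ["I_ORG", "E_ORG"]]

# morpheme: one label per segment; boundaries are precise, but the
# scheme presupposes a morphological segmentation of the input.
morpheme_labels = ["O", "O", "O", "B_ORG", "I_ORG", "I_ORG", "E_ORG"]

# Length invariants distinguishing the three schemes:
assert len(token_single) == len(tokens)
assert [len(ml) for ml in token_multi] == [len(s) for s in segments]
flat_segments = [m for seg in segments for m in seg]
assert len(morpheme_labels) == len(flat_segments)  # 7 segments, 7 labels
```

Note that in token-multi the number of labels per token reveals how many morphemes the model posits, which is what later allows aligning a multi-label with a predicted segmentation.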
Figure 1: Lattice for a partial list of analyses of the Hebrew tokens לבית הלבן corresponding to Table 1. Bold nodes are token boundaries. Light nodes are segment boundaries. Every path through the lattice is a single morphological analysis. The bold path is a single NE.

To answer this, we define morphological decomposition as consisting of two subtasks: morphological analysis (MA) and morphological disambiguation (MD). We view sentence-based MA as:

    MA : W → P(M)

Here W = {w* | w ∈ Σ} is the set of possible token sequences as before, M = {m* | m ∈ Morphemes} is the set of possible morpheme sequences, and P(M) is the set of subsets of M. The role of MA is then to assign a token sequence w ∈ W all of its possible morphological decomposition options. We represent this set of alternatives in a dense structure that we call a lattice (exemplified in Figure 1). MD is the task of picking the single correct morphological path M ∈ M through the MA lattice of a given sentence:

    MD : P(M) → M

Now, assume x ∈ W is a surface sentence in the language, with its morphological decomposition initially unknown and underspecified. In a Standard pipeline, MA strictly precedes MD:

    MD_Standard : M = MD(MA(x))

The main problem here is that MD errors may propagate to contaminate the NER output.

We propose a novel Hybrid alternative, in which we inject a task-specific signal, in this case NER,5 to constrain the search for M through the lattice:

    MD_Hybrid : M = MD(MA(x) ∩ NER_token(x))

Here, the restriction MA(x) ∩ NER_token(x) indicates pruning the lattice structure MA(x) to contain only MD options that are compatible with the token-based NER predictions, and only then applying MD to the pruned lattice.

5 We can do this for any sequence labeling task in MRLs.

Both MD_Standard and MD_Hybrid are disambiguation architectures that result in a morpheme sequence M ∈ M. The latter benefits from the NER signal, while the former does not. The sequence M ∈ M can be used in one of two ways. We can use M as input to a morpheme model to output morpheme labels. Or, we can rely on the output of the token-multi model and align the token's multi-label with the segments in M.

In what follows, we want to empirically assess the effect of the different modeling choices (token-single, token-multi, morpheme) and disambiguation architectures (Standard, Hybrid) on the performance of NER in MRLs. To this end, we need a corpus that allows training and evaluating NER at both token- and morpheme-level granularity.

4 The Data: A Novel NER Corpus

This work empirically investigates NER modeling strategies in Hebrew, a Semitic language known for its complex and highly ambiguous morphology. Ben-Mordecai (2005), the only previous work on Hebrew NER to date, annotated space-delimited tokens, basing their guidelines on the CoNLL 2003 shared task (Chinchor et al., 1999).

Popular Arabic NER corpora also label space-delimited tokens (ANERcorp (Benajiba et al., 2007), AQMAR (Mohit et al., 2012), TWEETS (Darwish, 2013)), with the exception of the Arabic portion of OntoNotes (Weischedel et al., 2013) and ACE (LDC, 2008), which annotate NER labels on gold morphologically pre-segmented texts. However, these works do not provide a comprehensive analysis of the performance gaps between morpheme-based and token-based scenarios.

In agglutinative languages such as Turkish, token segmentation is always performed before NER (Tür et al., 2003; Küçük and Can, 2019), reinforcing the need to contrast the token-based scenario, widely adopted for Semitic languages, with the morpheme-based scenarios in other MRLs.

Our first contribution is thus a parallel corpus for Hebrew NER: one version consists of gold-labeled tokens and the other consists of gold-labeled morphemes, for the same text. For this, we performed gold NE annotation of the Hebrew Treebank (Sima'an et al., 2001), based on the 6,143 morpho-syntactically analyzed sentences of the HAARETZ corpus, to create both token-level and morpheme-level variants, as illustrated in the topmost and lowest rows of Table 1, respectively.
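The contrast between the Standard and Hybrid disambiguation architectures can be sketched schematically. In the toy code below (our own simplification: a "lattice" is just the list of candidate morpheme-sequence analyses MA produces, and MD is a stand-in scorer that picks one path), the hybrid variant first prunes the candidate paths with the token-level NER prediction and only then disambiguates, mirroring the pruning step of MD_Hybrid:

```python
# Schematic sketch (our own simplification) of the Standard vs. Hybrid
# disambiguation pipelines. Not the paper's implementation.

def md(lattice, score):
    """MD: pick the single highest-scoring path through the lattice."""
    return max(lattice, key=score)

def md_standard(ma, score, x):
    """Standard pipeline: MA strictly precedes MD; MD errors propagate
    to the downstream (morpheme-based) NER."""
    return md(ma(x), score)

def md_hybrid(ma, score, x, ner_token, compatible):
    """Hybrid pipeline: token-based NER predictions first prune the
    lattice; MD then disambiguates among the surviving paths."""
    labels = ner_token(x)
    pruned = [path for path in ma(x) if compatible(path, labels)]
    return md(pruned or ma(x), score)  # fall back if pruning empties the lattice

# Toy demo: two candidate analyses for a one-token sentence.
ma = lambda x: [["ל", "בני"], ["לבני"]]
score = lambda path: -len(path)    # a (bad) prior: prefer fewer segments
ner = lambda x: [["O", "B_PER"]]   # token-level multi-label: two labels
fits = lambda path, labels: len(path) == len(labels[0])

print(md_standard(ma, score, "לבני"))            # prior alone picks ['לבני']
print(md_hybrid(ma, score, "לבני", ner, fits))   # NER pruning keeps ['ל', 'בני']
```

The compatibility test here is a deliberately crude assumption (segment count must match the multi-label length); the point is only the control flow: NER constrains MD, rather than consuming its possibly erroneous output.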
Annotation Scheme  We started off with the guidelines of Ben-Mordecai (2005), from which we deviate in three main ways. First, we label NE boundaries and their types on sequences of morphemes, in addition to the space-delimited token annotations.6 Secondly, we use the finer-grained entity categories list of ACE (LDC, 2008).7 Finally, we allow nested entity mentions, as in Finkel and Manning (2009); Benikova et al. (2014).8

Annotation Cycle  As Fort et al. (2009) put it, examples and rules would never cover all possible cases, because of the specificity of natural language and the ambiguity of formulation. To address this we employed the cyclic approach of agile annotation as offered by Alex et al. (2010). Every cycle consisted of: annotation, evaluation and curation, clarification and refinements. We used WebAnno (Yimam et al., 2013) as our annotation interface.

The initial annotation cycle was a two-stage pilot with 12 participants, divided into 2 teams of 6. The teams received the same guidelines, with

                              train    dev     test
Sentences                     4,937    500     706
Tokens                        93,504   8,531   12,619
Morphemes                     127,031  11,301  16,828
All mentions                  6,282    499     932
Type: Person (PER)            2,128    193     267
Type: Organization (ORG)      2,043    119     408
Type: Geo-Political (GPE)     1,377    121     195
Type: Location (LOC)          331      28      41
Type: Facility (FAC)          163      12      11
Type: Work-of-Art (WOA)       114      9       6
Type: Event (EVE)             57       12      0
Type: Product (DUC)           36       2       3
Type: Language (ANG)          33       3       1

Table 2: Basic Corpus Statistics. Standard HTB Splits.

Clarifications and Refinements  At the end of each cycle we held a clarification talk between A, B and C, in which issues that came up during the cycle were discussed. Following that talk, we refined the guidelines and updated the annotators, who went on to the next cycle. In the end we performed a final curation run to make sentences from earlier cycles comply with later refinements.10

Inter-Annotator Agreement (IAA)  IAA is
the exception of the specifications of entity bound-
                                                                  commonly measured using κ-statistic. However,
aries. One team was guided to annotate the mini-
                                                                  Pyysalo et al. (2007) show that it is not suitable
mal string that designates the entity. The other was
                                                                  for evaluating inter-annotator agreement in NER.
guided to tag the maximal string which can still be
                                                                  Instead, an F1 metric on entity mentions has in re-
considered as the entity. Our agreement analysis
                                                                  cent years been adopted for this purpose (Zhang,
showed that the minimal guideline generally led to
                                                                  2013). This metric allows for computing pair-wise
more consistent annotations. Based on this result
                                                                  IAA using standard F1 score by treating one anno-
(as well as low-level refinements) from the pilot,
                                                                  tator as gold and the other as the prediction.
we devised the full version of the guidelines.9
                                                                     Our full corpus pair-wise F1 scores are:
   Annotation, Evaluation and Curation: Every
                                                                  IAA(A,B)=89, IAA(B,C)=92, IAA(A,C)=96. Ta-
annotation cycle was performed by two annotators
                                                                  ble 2 presents final corpus statistics.
(A, B) and an annotation manager/curator (C). We
                                                                     Annotation Costs The annotation took on av-
annotated the full corpus in 7 cycles. We eval-
                                                                  erage about 35 seconds per sentence, and thus a
uated the annotation in two ways, manual cura-
                                                                  total of 60 hours for all sentences in the corpus for
tion and automatic evaluation. After each anno-
                                                                  each annotator. Six clarification talks were held
tation step, the curator manually reviewed every
                                                                  between the cycles, which lasted from thirty min-
sentence in which disagreements arose, as well
                                                                  utes to an hour. Giving a total of about 130 work
as specific points of difficulty pointed out by the
                                                                  hours of expert annotators.11
annotators. The inter-annotator agreement met-
ric described below was also used to quantitatively               5        Experimental Settings
gauge the progress and quality of the annotation.
                                                                  Goal We set out to empirically evaluate the rep-
      A single NE is always continuous. Token-morpheme dis-       resentation alternatives for the input/output se-
crepancies do not lead to discontinuous NEs.
      Entity categories are listed in Table 2. We dropped the     quences (token-single, token-multi, morpheme)
NORP category, since it introduced complexity concerning          and the effect of different architectures (Standard,
the distinction between adjectives and group names. L AW          Hybrid) on the performance of NER for Hebrew.
did not appear in our corpus.
    8                                                                 10
      Nested labels are are not modeled in this paper, but they        A, B and C annotations are published to enable research
are published with the corpus, to allow for further research.     on learning with disagreements (Plank et al., 2014).
    9                                                               11
      The complete annotation guide is publicly available at           The corpus is available at                           OnlpLab/NEMO-Corpus.
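The pair-wise IAA computation described above can be sketched in a few lines (our own illustration, not the authors' evaluation code): one annotator's mentions are treated as gold, the other's as predictions, and a strict F1 is computed over exact span-and-type matches. The mention tuples below are toy data.

```python
def mention_f1(gold, pred):
    """Strict mention-level F1: a mention counts as correct only if its
    span and entity type both match exactly."""
    gold, pred = set(gold), set(pred)
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Pair-wise IAA: treat annotator A as gold and annotator B as the prediction.
# Mentions are (start, end, type) spans over morpheme positions (toy data).
ann_a = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "GPE")}
ann_b = {(0, 2, "PER"), (5, 6, "ORG"), (9, 10, "GPE")}
iaa = mention_f1(ann_a, ann_b)  # exact matching makes the score symmetric in A and B
```

Since exact matching is symmetric, swapping which annotator plays "gold" yields the same score.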
Figure 2: The token-single and token-multi Models. The input and output correspond to rows 1,2 in Tab. 1. Triangles indicate string embeddings. Circles indicate char-based encoding.

Figure 3: The morpheme Model. The input and output correspond to row 3 in Tab. 1. Triangles indicate string embeddings. Circles indicate char-based encoding.

Modeling Variants  All experiments use the corpus we just described and employ a standard Bi-LSTM-CRF architecture for implementing the neural sequence labeling task (Huang et al., 2015). Our basic architecture12 is composed of an embedding layer for the input and a 2-layer Bi-LSTM followed by a CRF inference layer, for which we test three modeling variants.

Figures 2-3 present the variants we employ. Figure 2 shows the token-based variants, token-single and token-multi. The former outputs a single BIOSE label per token; the latter outputs a multi-label per token, a concatenation of the BIOSE labels of the morphemes composing the token. Figure 3 shows the morpheme-based variant for the same input phrase. It has the same basic architecture, but the input now consists of morphological segments instead of tokens, and the model outputs a single BIOSE label for each morphological segment in the input.

In all modeling variants, the input may be encoded in two ways: (a) string-level embeddings (token-based or morpheme-based), optionally initialized with pre-trained embeddings; (b) char-level embeddings, trained simultaneously with the main task (cf. Ma and Hovy (2016); Chiu and Nichols (2015); Lample et al. (2016)). For char-based encoding (of either tokens or morphemes) we experiment with CharLSTM, CharCNN, or NoChar, that is, no character embedding at all.

We pre-trained all token-based and morpheme-based embeddings on the Hebrew Wikipedia dump of Goldberg (2014). For morpheme-based embeddings, we decompose the input using More et al. (2019) and use the morphological segments as the embedding units.13 We compare GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017). We hypothesize that since fastText uses sub-string information, it will be more useful for analyzing OOVs.

Hyper-parameters  Following Reimers and Gurevych (2017); Yang et al. (2018), we performed hyper-parameter tuning for each of our model variants. Tuning was done on the dev set in a number of rounds of random search, independently for every input/output and char-embedding architecture. Table 3 shows our selected hyper-parameters.14

12 Using the NCRF++ suite of Yang and Zhang (2018).
13 Embeddings and Wikipedia corpus also available at:
14 A few interesting empirical observations diverging from those of Reimers and Gurevych (2017); Yang et al. (2018) are worth mentioning. We found that a lower learning rate than the one recommended by Yang et al. (2018) (0.015) led to better results and fewer occurrences of divergence. We further found that raising the number of epochs from 100 to 200 did not result in over-fitting, and significantly improved NER results. For evaluation we used the weights from the best epoch.
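The three output schemes above can be illustrated with a toy, transliterated example (our own data; the '^' separator for multi-labels is an assumption, not necessarily the paper's internal format):

```python
# One space-delimited token segmenting into the morphemes "le" ("to") + "beit"
# ("house of"). Toy, transliterated data for illustration only.
morphemes = ["le", "beit"]            # input units of the morpheme model
morph_labels = ["O", "B-ORG"]         # morpheme model: one BIOSE label per segment

# token-multi: one multi-label per token, concatenating the per-morpheme labels
token_multi = "^".join(morph_labels)  # yields "O^B-ORG"

# token-single: a single BIOSE label for the whole token; the entity boundary
# is forced to the token edge, swallowing the non-entity prefix "le"
token_single = "B-ORG"
```

The token-single scheme thus loses the information that the prefix morpheme is outside the entity, which is exactly the boundary mismatch the paper's morpheme-level evaluation penalizes.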
Interestingly, the CharCNN window size was not treated as a hyper-parameter in Reimers and Gurevych (2017) or Yang et al. (2018). However, given the token-internal complexity in MRLs, we conjecture that the window size over characters might have a crucial effect. In our experiments we found that a larger window (7) increased performance. For MRLs, further research into this hyper-parameter might be of interest.

  Parameter         Value    Parameter             Value
  Optimizer         SGD      *LR (token-single)    0.01
  *Batch Size       8        *LR (token-multi)     0.005
  LR decay          0.05     *LR (morpheme)        0.01
  Epochs            200      Dropout               0.5
  Bi-LSTM layers    2        *CharCNN window       7
  *Word Emb Dim     300      Char Emb dim          30
  Word Hidden Dim   200      *Char Hidden Dim      70

Table 3: Summary of Hyper-Parameter Tuning. The * indicates divergence from the NCRF++ proposed setup and empirical findings (Yang and Zhang, 2018).

Evaluation  Standard NER studies typically invoke the CoNLL evaluation script, which anchors NEs in token positions (Tjong Kim Sang, 2003). However, it is inadequate for our purposes, because we want to compare entities across token-based vs. morpheme-based settings. To this end, we use a revised evaluation procedure which anchors the entity in its form rather than its index. Specifically, we report F1 scores on strict, exact-match of the surface forms of the entity mentions; i.e., the gold and predicted NE spans must exactly match in their form, boundaries, and entity type. In all experiments, we report both token-level and morpheme-level F-scores, for all models.

• Token-Level evaluation. For the sake of backwards compatibility with previous work on Hebrew NER, we first define token-level evaluation. For token-single this is a straightforward calculation of F1 against gold spans. For token-multi and morpheme, we need to map the predicted label sequence of each token to a single label, and we do so using linguistically-informed rules we devise (as elaborated in Appendix A).15

• Morpheme-Level evaluation. Our ultimate goal is to obtain precise boundaries of the NEs. Thus, our main metric evaluates NEs against the gold morphological boundaries. For morpheme and token-single models, this is a straightforward F1 calculation against gold spans; note that for token-single we expect to pay a price for boundary mismatches. For token-multi, we know the number and order of labels, so we align the labels in the multi-label of the token with the morphemes in its morphological decomposition.16

For all experiments and metrics, we report the mean and confidence interval (0.95) over ten runs.

Input-Output Scenarios  We experiment with two kinds of input settings: token-based, where the input consists of the sequence of space-delimited tokens, and morpheme-based, where the input consists of morphological segments. For the morpheme input, there are three variants:

(i) Morph-gold: the morphological sequence is produced by an expert (idealistic).
(ii) Morph-standard: the morphological sequence is produced by a standard segmentation-first pipeline (realistic).
(iii) Morph-hybrid: the morphological sequence is produced by the hybrid architecture we propose (realistic).

In the token-multi case we can perform morpheme-based evaluation by aligning the individual labels in the multi-label with the morpheme sequence of the respective token. Again we have three options as to which morphemes to use:

(i) Tok-multi-gold: the multi-label is aligned with morphemes produced by an expert (idealistic).
(ii) Tok-multi-standard: the multi-label is aligned with morphemes produced by a standard pipeline (realistic).
(iii) Tok-multi-hybrid: the multi-label is aligned with morphemes produced by the hybrid architecture we propose (realistic).

Pipeline Scenarios  Assume an input sentence x. In the Standard pipeline we use YAP,17 the current state-of-the-art morpho-syntactic parser for Hebrew (More et al., 2019), for the predicted segmentation M = MD(MA(x)).

15 In the morpheme case we might encounter "illegal" label sequences in case of a prediction error. We employ similar linguistically-informed heuristics to recover from that (see Appendix A).
16 In case of a misalignment (in the number of morphemes and labels) we match the label-morpheme pairs from the final one backwards, and pad unpaired morphemes with O labels.
17 For other languages this may be done using models for canonical segmentation as in Kann et al. (2016).
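The recovery heuristic of footnote 16 — pair labels with morphemes from the final pair backwards, padding unpaired morphemes with O — can be sketched as follows (our own toy implementation, not the authors' code):

```python
def align_multilabel(morphemes, labels):
    """Align a token's morphemes with its predicted multi-label components.

    On a length mismatch, label-morpheme pairs are matched from the final
    one backwards, and unpaired (leftmost) morphemes are padded with 'O',
    following the heuristic described in footnote 16.
    """
    k = min(len(morphemes), len(labels))
    padded = ["O"] * (len(morphemes) - k) + labels[len(labels) - k:]
    return list(zip(morphemes, padded))

# Two labels predicted for a three-morpheme token: pad the leftmost with O.
pairs = align_multilabel(["le", "beit", "lavan"], ["B-ORG", "E-ORG"])
# → [('le', 'O'), ('beit', 'B-ORG'), ('lavan', 'E-ORG')]
```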
Figure 4: Token-level Eval. on Dev w/ Gold Segmentation. CharCNN for morph, CharLSTM for tok.

Figure 5: Morph-Level Eval. on Dev w/ Gold Segmentation. CharCNN for morph, CharLSTM for tok.

Figure 6: Token-Level Evaluation in Realistic Scenarios on Dev, comparing Gold, Standard and Hybrid Morphology. CharCNN for morph, CharLSTM for tok. Results for Gold, token-single and token-multi are taken from Fig 4.

Figure 7: Morph-Level Evaluation in Realistic Scenarios on Dev, comparing Gold, Standard and Hybrid Morphology. CharCNN for morph, CharLSTM for tok. Results for Gold, token-single and token-multi are taken from Fig 5.

In the Hybrid pipeline, we use YAP to first generate complete morphological lattices MA(x). Then, to obtain MA(x) ∩ NER(x), we omit lattice paths where the number of morphemes in a token's decomposition does not conform with the number of labels in the multi-label of NER_token-multi(x). Finally, we apply YAP to obtain MD(MA(x) ∩ NER(x)) on the constrained lattice. In predicted morphology scenarios (either Standard or Hybrid), we use the same model weights as trained on the gold segments, but feed predicted morphemes as input.18

6 Results

6.1 The Units: Tokens vs. Morphemes

Figure 4 shows the token-level evaluation for the different model variants we defined. We see that morpheme models perform significantly better than the token-single and token-multi variants. Interestingly, explicit modeling of morphemes leads to better NER performance even when evaluated against token-level boundaries. As expected, the performance gaps between variants are smaller with fastText than with embeddings that are unaware of characters (GloVe), or with no pre-training at all. We pursue this further in Sec. 6.3.

Figure 5 shows the morpheme-level evaluation for the same model variants as in Figure 4. The most obvious trend here is the drop in the performance of the token-single model. This is expected, reflecting the inadequacy of token boundaries for identifying accurate NE boundaries. Interestingly, the morpheme and token-multi models keep a level of performance similar to that of the token-level evaluation, only slightly lower. Their performance gap is also maintained, with morpheme performing better than token-multi. An obvious caveat is that these results are obtained with gold morphology. What happens in realistic scenarios?

Figure 8: Entity Mention Counts and Ratio by Category and OOTV Category, for Dev Set.

Figure 9: Token-Level Eval on Dev by OOTV Category. Using fastText and CharLSTM.

6.2 The Architecture: Pipeline vs. Hybrid

Figure 6 shows the token-level evaluation results in realistic scenarios. We first observe a significant drop for morpheme models when Standard predicted segmentation is introduced instead of gold. This means that MD errors are indeed detrimental to the downstream task, at a non-negligible rate. Second, we observe that much of this performance gap is recovered with the Hybrid pipeline. It is noteworthy that while morph hybrid lags behind morph gold, it is still consistently better than the token-based models, token-single and token-multi.

Figure 7 shows morpheme-level evaluation results for the same scenarios as in Figure 6. All trends from the token-level evaluation persist, including a drop for all models with predicted segmentation relative to gold, with the hybrid variant recovering much of the gap. Again morph gold outperforms token-multi, and morph hybrid shows great advantages over all tok-multi variants. This performance gap between morph (gold or hybrid) and tok-multi indicates that explicit morphological modeling is indeed crucial for accurate NER.

6.3 Morphologically-Aware OOV Evaluation

As discussed in Section 2, morphological composition introduces an extremely sparse word-level "long tail" in MRLs. In order to gauge this phenomenon and its effects on NER performance, we categorize unseen, out-of-training-vocabulary (OOTV) mentions into three categories:

• Lexical: Unknown mentions caused by an unknown token which consists of a single morpheme. This is a strictly lexical unknown, with no morphological composition (most English unknowns are in this category).

• Compositional: Unknown mentions caused by an unknown token which consists of multiple known morphemes. These are unknowns introduced strictly by morphological composition, with no lexical unknowns.

• LexComp: Unknown mentions caused by an unknown token consisting of multiple morphemes, of which (at least) one morpheme was not seen during training. In such cases, both unknown morphological composition and lexical unknowns are involved.

We group NEs based on these categories and evaluate each group separately. We consider mentions that do not fall into any category as Known.

Figure 8 shows the distributions of entity mentions in the dev set by entity type and OOTV category. OOTV categories that involve composition (Comp and LexComp) are spread across all entity categories but one, and in some they even make up more than half of all mentions.

Figure 9 shows token-level evaluation19 with fastText embeddings, grouped by OOTV type. We first observe that unknown NEs that are due to morphological composition (Comp and LexComp) indeed proved the most challenging for all models.

18 We do not re-train the morpheme models with predicted segmentation, which might achieve better performance (e.g., jackknifing). We leave this for future work.
19 This section focuses on token-level evaluation, which is a permissive evaluation metric, allowing us to compare the models on a more level playing field, where all models (including token-single) have an equal opportunity to perform.
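The three-way OOTV categorization can be sketched as a small classifier over the training vocabularies (our own simplification: the paper categorizes mentions, which may span several tokens; the names and data below are assumed):

```python
def ootv_category(token, morphemes, train_tokens, train_morphs):
    """Classify a mention token as Known / Lexical / Compositional / LexComp,
    per the OOTV scheme above (toy sketch, not the authors' code)."""
    if token in train_tokens:
        return "Known"
    if len(morphemes) == 1:
        return "Lexical"                       # unseen token, single morpheme
    if all(m in train_morphs for m in morphemes):
        return "Compositional"                 # unseen only through composition
    return "LexComp"                           # unseen composition + unseen morpheme

train_tokens = {"tel-aviv", "be"}
train_morphs = {"tel-aviv", "be", "haifa"}
# Both morphemes were seen in training, but their fused token form was not:
cat = ootv_category("betel-aviv", ["be", "tel-aviv"], train_tokens, train_morphs)
# → "Compositional"
```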
We also find that in strictly Compositional OOTV mentions, morpheme-based models exhibit their most significant performance advantage, supporting the hypothesis that explicit morphology helps to generalize. We finally observe that token-multi models perform better than token-single models for these NEs (in contrast with the trend for non-compositional NEs). This corroborates the hypothesis that even partial modeling of morphology (as in token-multi, compared with token-single) is better than none, leading to better generalization.

String-level vs. Character-level Embeddings  To further understand the generalization capacity of the different modeling alternatives in MRLs, we probe into the interplay of string-based and char-based embeddings in treating OOTV NEs.

Figure 10 presents 12 plots, each of which presents the level of performance (y-axis) for all models (x-axis). Token-based models are on the left of each x-axis, morpheme-based models on the right. We plot results with and without character embeddings,20 in orange and blue respectively. The plots are organized in a large grid, with the type of NE on the y-axes (Known, Lex, Comp, LexComp) and the type of pre-training on the x-axes (no pre-training, GloVe, fastText).

At the top-most row, plotting the accuracy for Known NEs, we see a high level of performance for all pre-training methods, with little difference between the types of pre-training, with or without character embeddings. Moving down to the row of Lexical unseen NEs, char-based representations lead to significant advantages when we assume no pre-training; with GloVe pre-training the performance substantially increases, and with fastText the differences in performance with/without char-embeddings almost entirely diminish, indicating that char-based embeddings are somewhat redundant in this case.

The two lower rows of the grid show the performance for Comp and LexComp unseen NEs, which are ubiquitous in MRLs. For Compositional NEs, pre-training closes only part of the gap between token-based and morpheme-based models. Adding char-based representations indeed helps the token-based models, but crucially does not close the gap with the morpheme-based variants.

Finally, for LexComp NEs in the lowest row, we again see that adding GloVe pre-training and char-based embeddings does not close the gap with the morpheme-based models, indicating that not all morphological information is captured by these vectors. For fastText with char-based embeddings, the gap between token-multi and morpheme greatly diminishes, but is still well above token-single. This suggests that biasing the model to learn about morphology (either via multi-labels or by incorporating morphological boundaries) has advantages for analyzing OOTV entities, beyond the contribution of char-based embeddings alone.

All in all, the biggest advantage of morpheme-based models over token-based models is their ability to generalize from observed tokens to composition-related OOTV (Comp/LexComp). While character-based embeddings do help token-based models generalize, the contribution of modeling morphology is indispensable, above and beyond the contribution of char-based embeddings.

  Eval     Model                dev           test
  Morph-   morph gold           80.03 ± 0.4   79.10 ± 0.6
  Level    morph hybrid         78.51 ± 0.5   77.11 ± 0.7
           morph standard       72.79 ± 0.5   69.52 ± 0.6
           token-multi hybrid   75.70 ± 0.5   74.64 ± 0.3
  Token-   morph gold           80.30 ± 0.5   79.28 ± 0.6
  Level    morph hybrid         79.04 ± 0.5   77.64 ± 0.7
           morph standard       74.52 ± 0.7   73.53 ± 0.8
           token-multi          77.59 ± 0.4   77.75 ± 0.3
           token-single         78.15 ± 0.3   77.15 ± 0.6

Table 4: Test vs. Dev: Results with fastText for all Models. morph gold presents an ideal upper-bound.

6.4 Setting in the Greater Context

Test Set Results  Table 4 confirms our best results on the Test set. The trends are kept, though results on Test are lower than on Dev. The morph gold scenario still provides an upper bound on performance, but it is not realistic. In the realistic scenarios, morph hybrid generally outperforms all other alternatives. The only divergence is that in token-level evaluation, token-multi performs on a par with morph hybrid on the Test set.

Results on MD Tasks  While the Hybrid pipeline achieves superior performance on NER, it also improves the state-of-the-art on other tasks in the pipeline. Table 5 shows the Seg+POS results of our Hybrid pipeline scenario, compared with the Standard pipeline, which replicates the pipeline of More et al. (2019). Using the metrics defined by More et al. (2019), we show substantial improvements for the Hybrid pipeline over their results, also outperforming the Test results of Seker and Tsarfaty (2020).

20 For brevity we only show CharLSTM (vs. no char representation); there was no significant difference with CNN.
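The Hybrid constraint step described in Section 5 — keeping only the lattice paths whose per-token morpheme counts agree with the token-multi predictions — can be sketched as follows, with the lattice simplified to a list of candidate analyses per token (our own toy code, not YAP's lattice format):

```python
def prune_analyses(token_analyses, multi_labels):
    """For each token, keep only the candidate morphological analyses whose
    number of segments matches the number of labels in the token-multi
    prediction. This mimics pruning MA(x) with NER(x) on a simplified,
    per-token view of the lattice; the '^' separator is our assumption."""
    pruned = []
    for analyses, multi in zip(token_analyses, multi_labels):
        n_labels = len(multi.split("^"))
        kept = [a for a in analyses if len(a) == n_labels]
        pruned.append(kept or analyses)   # back off to all analyses if none fit
    return pruned

# Toy: one token with two competing segmentations; NER predicted two labels,
# so only the two-morpheme analysis survives.
result = prune_analyses([[["lebeit"], ["le", "beit"]]], ["O^B-ORG"])
```

MD is then run on the surviving analyses only, which is what narrows the search space in the Hybrid pipeline.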
Figure 10: Token-Level Eval. on Dev for Different OOTV Types, Char- and Word-Embeddings.

          Model                                    Seg+POS
  dev     Standard (More et al., 2019)             92.36
          Ptr-Network (Seker and Tsarfaty, 2020)   93.90
          Hybrid (This work)                       93.12
  test    Standard (More et al., 2019)             89.08
          Ptr-Network (Seker and Tsarfaty, 2020)   90.49
          Hybrid (This work)                       90.89

Table 5: Morphological Segmentation & POS scores.

  Model                                  Precision     Recall        F1
  Ben-Mordecai (2005)
  MEMM+HMM+REGEX                         84.54         74.31         79.10
  This work
  token-single +FT+CharLSTM              86.84 ± 0.5   82.60 ± 0.9   84.71 ± 0.5
  This work
  morph-Hybrid +FT+CharLSTM              86.93 ± 0.6   83.59 ± 0.8   85.22 ± 0.5

Table 6: NER Comparison with Ben-Mordecai (2005).
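The ± entries in Tables 4 and 6 are 0.95 confidence intervals over ten runs. Assuming a t-distribution (the paper does not spell out the exact formula), such an interval can be computed as:

```python
import math
import statistics

def mean_ci95(scores):
    """Mean and half-width of a 0.95 confidence interval over ten runs,
    assuming a t-distribution (t = 2.262 for df = 9). Our own sketch of
    the 'mean ± CI' reporting, not the authors' code."""
    assert len(scores) == 10  # the t critical value below is for df = 9
    t_crit = 2.262
    m = statistics.mean(scores)
    half = t_crit * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, half

runs = [78.2, 77.9, 78.5, 78.0, 78.3, 77.8, 78.6, 78.1, 78.4, 77.7]  # toy F1 scores
m, half = mean_ci95(runs)  # report as "mean ± half", e.g. roughly 78.15 ± 0.22 here
```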
Comparison with Prior Art. Table 6 presents our results on the Hebrew NER corpus of Ben-Mordecai (2005) compared to their model, which uses a hand-crafted, feature-engineered MEMM with regular-expression rule-based enhancements and an entity lexicon. Like Ben-Mordecai (2005), we performed three 75%-25% random train/test splits and used the same seven NE categories (PER, LOC, ORG, TIME, DATE, PERCENT, MONEY). We trained a token-single model on the original space-delimited tokens and a morpheme model on automatically segmented morphemes we obtained using our best segmentation model (Hybrid MD on our trained token-multi model, as in Table 5). Since their annotation includes only token-level boundaries, all of the results we report conform with token-level evaluation.
   As Table 6 shows, both models significantly outperform the previous state-of-the-art of Ben-Mordecai (2005), setting a new performance bar on this earlier benchmark. Moreover, we again observe an empirical advantage when explicitly modeling morphemes, even with the automatic, noisy segmentation that is used for the morpheme-based training.

7   Discussion: Joint Modeling Alternatives and Future Work

The present study provides the motivation and the necessary foundations for comparing morpheme-based and token-based modeling for NER. While our findings clearly demonstrate the advantages of morpheme-based modeling for NER in a morphologically rich language, it is clear that our proposed Hybrid architecture is not the only modeling alternative for linking NER and morphology.
   For example, a previous study by Güngör et al. (2018) addresses joint neural modeling of morphological segmentation and NER labeling, proposing a multi-task learning (MTL) approach to joint MD and NER in Turkish. They employ separate Bi-LSTM networks for the MD and NER tasks, with a shared loss to allow for joint learning. Their results indicate improved NER performance, with no improvement in the MD results. Contrary to our proposal, they view MD and NER as distinct tasks, assume a single NER label per token, and do not provide disambiguated morpheme-level boundaries for the NER task. More generally, they test only token-based NER labeling and do not attend to the question of input/output granularity in their models.
   A different approach to joint NER and morphology is to jointly predict the segmentation and the labels for each token in the input stream. This is the approach taken, for instance, by the lattice-based Pointer-Network of Seker and Tsarfaty (2020). As shown in Table 5, their results for morphological segmentation and POS tagging are on a par with our reported results and, at least in principle, it should be possible to extend the Seker and Tsarfaty (2020) approach to also yield NER predictions.
   However, our preliminary experiments with a lattice-based Pointer-Network for joint token segmentation and NER labeling show that this is not a straightforward task. Contrary to POS tags, which are constrained by the MA, every NER label can potentially go with any segment, and this leads to a combinatorial explosion of the search space represented by the lattice. As a result, the NER predictions are brittle to learn, and the complexity of the resulting model is computationally prohibitive.
   A different approach to joint sequence segmentation and labeling is to apply the neural model directly to the character sequence of the input stream. One such approach is the char-based labeling-as-segmentation setup proposed by Shao et al. (2017), who use a character-based Bi-RNN-CRF to output a single label per character that indicates both the word boundary (using BIES sequence labels) and the POS tag. This method is also used in their universal segmentation paper (Shao et al., 2018). However, as seen in the results of Shao et al. (2018), char-based labeling for segmenting Semitic languages lags far behind all other languages, precisely because morphological boundaries are not explicit in the character sequences.
   Additional proposals are those of Kong et al. (2015) and Kemos et al. (2019). Kong et al. (2015) proposed to solve, e.g., Chinese segmentation and POS tagging using dynamic programming with neural encoding: a Bi-LSTM encodes the character input, which is then fed to a semi-Markov CRF to obtain probabilities for the different segmentation options. Kemos et al. (2019) propose a similar approach to joint segmentation and tagging, but add convolution layers on top of the Bi-LSTM encodings to obtain segment features hierarchically, which are then fed to the semi-Markov CRF.
   Preliminary experiments we conducted confirm that char-based joint segmentation and NER labeling for Hebrew, whether using char-based labeling or a seq2seq architecture, still lags behind our reported results. We conjecture that this is due to the complex morpho-phonological and orthographic processes in Semitic languages. Going into char-based modeling nuances and offering a sound joint solution for a language like Hebrew is an important matter that merits its own investigation. Such work is feasible now given the new corpus; however, it is out of the scope of the current study.
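To make the char-based labeling-as-segmentation setup concrete, the following is a minimal illustrative sketch (with invented toy segments and tags, not the code of Shao et al.) of how segment boundaries and segment tags are folded into a single BIES-style label per character:

```python
# Sketch of char-based labeling-as-segmentation: each character
# receives one composite label encoding both its position in a
# segment (B/I/E/S) and the segment's tag. The segments and tags
# below are toy examples for illustration only.

def bies_labels(segments):
    """Map (segment, tag) pairs to one composite label per character."""
    labels = []
    for seg, tag in segments:
        if len(seg) == 1:
            labels.append(f"S-{tag}")       # single-character segment
        else:
            labels.append(f"B-{tag}")       # segment-initial character
            labels.extend(f"I-{tag}" for _ in seg[1:-1])
            labels.append(f"E-{tag}")       # segment-final character
    return labels

# A toy segmentation of the character stream "thecat" into two segments:
print(bies_labels([("the", "DET"), ("cat", "NOUN")]))
# -> ['B-DET', 'I-DET', 'E-DET', 'B-NOUN', 'I-NOUN', 'E-NOUN']
```

Joint segmentation and tagging then reduces to predicting one such composite label per input character, which is precisely where Semitic morphology becomes problematic: morpheme boundaries that are not realized in the character stream cannot be recovered by per-character labels.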
   All in all, the design of sophisticated joint modeling strategies for morpheme-based NER poses fascinating questions, for which our work provides a solid foundation (data, protocols, metrics, strong baselines). More work is needed to investigate joint modeling of NER and morphology in the directions portrayed in this section, yet it is beyond the scope of this paper, and we leave this investigation for future work.
   Finally, while the joint approach is appealing, we argue that the elegance of our Hybrid solution lies precisely in providing a clear and well-defined interface between MD and NER through which the two tasks can interact, while still keeping the distinct models simple, robust, and efficiently trainable. It also has the advantage of allowing us to seamlessly integrate sequence labelling with any lattice-based MA, in a plug-and-play, language-agnostic fashion, towards obtaining further advantages on both of these tasks.

8   Conclusion

This work addresses the modeling challenges of neural NER in MRLs. We deliver a parallel token-vs-morpheme NER corpus for Modern Hebrew, which allows one to assess NER modeling strategies in morphologically rich-and-ambiguous environments. Our experiments show that while NER benefits from morphological decomposition, downstream results are sensitive to segmentation errors. We thus propose a Hybrid architecture in which NER precedes and prunes the morphological decomposition. This approach greatly outperforms a Standard pipeline in realistic (non-gold) scenarios. Our analysis further shows that morpheme-based models better recognize OOVs that result from morphological composition. All in all, we deliver new state-of-the-art results for Hebrew NER and MD, along with a novel benchmark, to encourage further investigation into the interaction between NER and morphology.

Acknowledgments

We are grateful to the BIU-NLP lab members as well as 6 anonymous reviewers for their insightful remarks. We further thank Daphna Amit and Zef Segal for their meticulous annotation and profound discussions. This research is funded by an ISF Individual Grant (1739/26) and an ERC Starting Grant (677352), for which we are grateful.

References

Bea Alex, Claire Grover, Rongzhou Shen, and Mijail Kabadjov. 2010. Agile corpus annotation in practice: An overview of manual and automatic annotation of CVs. In Proceedings of the Fourth Linguistic Annotation Workshop, pages 29–37, Uppsala, Sweden. Association for Computational Linguistics.

Naama Ben-Mordecai. 2005. Hebrew named entity recognition. Master's thesis, Department of Computer Science, Ben-Gurion University.

Yassine Benajiba, Paolo Rosso, and José Miguel Benedí Ruiz. 2007. ANERsys: An Arabic named entity recognition system based on maximum entropy. In Computational Linguistics and Intelligent Text Processing, pages 143–153, Berlin, Heidelberg. Springer Berlin Heidelberg.

Darina Benikova, Chris Biemann, and Marc Reznicek. 2014. NoSta-D named entity annotation for German: Guidelines and dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 2524–2531, Reykjavik, Iceland. European Language Resources Association (ELRA).

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

N. Chinchor, E. Brown, L. Ferro, and P. Robinson. 1999. Named entity recognition task definition. Technical Report Version 1.4, The MITRE Corporation and SAIC.

Jason P. C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. CoRR, abs/1511.08308.

Bernard Comrie, Martin Haspelmath, and Balthasar Bickel. 2008. The Leipzig glossing rules: Conventions for interlinear morpheme-by-morpheme glosses. Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology & the Department of Linguistics of the University of Leipzig.

Paula Czarnowska, Sebastian Ruder, Edouard Grave, Ryan Cotterell, and Ann Copestake. 2019. Don't forget the long tail! A comprehensive analysis of morphological generalization in bilingual lexicon induction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 974–983, Hong Kong, China. Association for Computational Linguistics.

Kareem Darwish. 2013. Named entity recognition using cross-lingual resources: Arabic as an example. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1558–1567, Sofia, Bulgaria. Association for Computational Linguistics.

Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147, Copenhagen, Denmark. Association for Computational Linguistics.

Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, EMNLP '09, pages 141–150, Stroudsburg, PA, USA. Association for Computational Linguistics.

Karën Fort, Maud Ehrmann, and Adeline Nazarenko. 2009. Towards a methodology for named entities annotation. In Proceedings of the Third Linguistic Annotation Workshop (LAW III), pages 142–145, Suntec, Singapore. Association for Computational Linguistics.

Yoav Goldberg. 2014. Hebrew Wikipedia dependency parsed corpus, v.1.0. Technical Report.

Yoav Goldberg and Reut Tsarfaty. 2008. A single generative model for joint morphological segmentation and syntactic parsing. In Proceedings of ACL-08: HLT, pages 371–379, Columbus, Ohio. Association for Computational Linguistics.

Spence Green and Christopher D. Manning. 2010. Better Arabic parsing: Baselines, evaluations, and analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 394–402, Beijing, China. Coling 2010 Organizing Committee.

Onur Güngör, Suzan Üsküdarli, and Tunga Güngör. 2018. Improving named entity recognition by jointly learning to disambiguate morphological tags. CoRR, abs/1807.06683.

Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 573–580, Ann Arbor, Michigan. Association for Computational Linguistics.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2016. Neural morphological analysis: Encoding-decoding canonical segments. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 961–967, Austin, Texas. Association for Computational Linguistics.

Apostolos Kemos, Heike Adel, and Hinrich Schütze. 2019. Neural semi-Markov conditional random fields for robust character-based part-of-speech tagging. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2736–2743, Minneapolis, Minnesota. Association for Computational Linguistics.

Stav Klein and Reut Tsarfaty. 2020. Getting the ##life out of living: How adequate are word-pieces for modelling complex morphology? In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 204–209, Online. Association for Computational Linguistics.

Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2015. Segmental recurrent neural networks. arXiv preprint arXiv:1511.06018.