Neural Modeling for Named Entities and Morphology (NEMO2)
Dan Bareket (1,2) and Reut Tsarfaty (1)
(1) Bar Ilan University, Ramat-Gan, Israel
(2) Open Media and Information Lab (OMILab), The Open University of Israel, Israel
dbareket@gmail.com, reut.tsarfaty@biu.ac.il

arXiv:2007.15620v2 [cs.CL] 10 May 2021

Abstract

Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically-Rich Languages (MRLs) pose a challenge to this basic formulation, as the boundaries of Named Entities do not necessarily coincide with token boundaries; rather, they respect morphological boundaries. To address NER in MRLs we then need to answer two fundamental questions, namely: what are the basic units to be labeled, and how can these units be detected and classified in realistic settings, i.e., where no gold morphology is available. We empirically investigate these questions on a novel NER benchmark, with parallel token-level and morpheme-level NER annotations, which we develop for Modern Hebrew, a morphologically rich and ambiguous language. Our results show that explicitly modeling morphological boundaries leads to improved NER performance, and that a novel hybrid architecture, in which NER precedes and prunes morphological decomposition, greatly outperforms the standard pipeline, where morphological decomposition strictly precedes NER, setting a new performance bar for both Hebrew NER and Hebrew morphological decomposition tasks.

1 Introduction

Named Entity Recognition (NER) is a fundamental task in the area of Information Extraction (IE), in which mentions of Named Entities (NEs) are extracted and classified in naturally-occurring texts. This task is most commonly formulated as a sequence labeling task, where extraction takes the form of assigning each input token a label that marks the boundaries of the NE (e.g., B, I, O), and classification takes the form of assigning labels that indicate the entity type (PER, ORG, LOC, etc.).

Despite a common initial impression from the latest NER performance of neural models on the main English NER benchmarks, CoNLL 2003 (Tjong Kim Sang, 2003) and OntoNotes (Weischedel et al., 2013), the NER task in real-world settings is far from solved. Specifically, NER performance is shown to greatly diminish when moving to other domains (Luan et al., 2018; Song et al., 2018), when addressing the long tail of rare, unseen, and new user-generated entities (Derczynski et al., 2017), and when handling languages whose structure differs fundamentally from that of English. In particular, there is no readily available and empirically verified neural modeling strategy for NER in languages with complex word-internal structure, also known as morphologically-rich languages.

Morphologically-rich languages (MRLs) (Tsarfaty et al., 2010; Seddah et al., 2013; Tsarfaty et al., 2020) are languages in which substantial information concerning the arrangement of words into phrases and the relations between them is expressed at the word level, rather than in a fixed word order or a rigid sentence structure. The extended amount of information expressed at the word level, and the morpho-phonological processes creating these words, result in high token-internal complexity, which poses serious challenges to the basic formulation of NER as classification of raw, space-delimited tokens. Specifically, while NER in English is formulated as the sequence labeling of space-delimited tokens, in MRLs a single token may include multiple meaning-bearing units, henceforth morphemes, only some of which are relevant for the entity mention at hand.

In this paper we formulate two questions concerning neural modeling strategies for NER in MRLs, namely: (i) what should be the granularity of the units to be labeled: space-delimited tokens or finer-grain morphological segments? and (ii) how can we effectively encode, and accurately detect, the morphological segments that are relevant to NER, specifically in realistic settings, where gold morphological boundaries are not available?

To empirically investigate these questions we develop a novel parallel benchmark, containing parallel token-level and morpheme-level NER annotations for texts in Modern Hebrew, a morphologically rich and morphologically ambiguous language, which is known to be notoriously hard to parse (More et al., 2019; Tsarfaty et al., 2019). Our results show that morpheme-based NER is superior to token-based NER, which encourages a segmentation-first pipeline. At the same time, we demonstrate that token-based NER improves morphological segmentation in realistic scenarios, encouraging a NER-first pipeline. While these two findings may appear contradictory, we reconcile them with a hybrid architecture in which the token-based NER predictions precede and prune the space of morphological decomposition options, while the actual morpheme-based NER takes place only after the morphological decomposition.

We empirically show that the hybrid architecture we propose outperforms all token-based and morpheme-based model variants of Hebrew NER on our benchmark, and it further outperforms all previously reported results on Hebrew NER and morphological decomposition. Our error analysis further demonstrates that morpheme-based models generalize better; that is, they contribute to recognizing the long tail of entities unseen during training (out-of-vocabulary, OOV), in particular those unseen entities that turn out to be composed of previously seen morphemes.

The contribution of this paper is thus manifold. First, we define key architectural questions for neural NER modeling in MRLs and chart the space of modeling options. Second, we deliver a new and novel parallel benchmark that allows one to empirically compare and contrast the morpheme vs. token modeling strategies. Third, we show consistent advantages for morpheme-based NER, demonstrating the importance of morphologically-aware modeling. Next, we present a novel hybrid architecture which demonstrates an even further improved performance on both NER and morphological decomposition tasks. Our results for Hebrew present a new bar on these tasks, outperforming the reported state-of-the-art results on various benchmarks.[1]

[1] Data & code: https://github.com/OnlpLab/NEMO

2 Research Questions: NER for MRLs

In MRLs, words are internally complex, and word boundaries do not generally coincide with the boundaries of more basic meaning-bearing units. This fact has critical ramifications for sequence labeling tasks in MRLs in general, and for NER in MRLs in particular. Consider, for instance, the three-token Hebrew phrase in (1):[2]

(1)  טסנו מתאילנד לסין
     tasnu mithailand lesin
     flew.1PL from-Thailand to-China
     'we flew from Thailand to China'

It is clear that תאילנד/thailand (Thailand) and סין/sin (China) are NEs, and in English, each NE is its own token. In the Hebrew phrase, though, neither NE constitutes a single token. In either case, the NE occupies only one of two morphemes in the token, the other being a case-assigning preposition. This simple example demonstrates an extremely frequent phenomenon in MRLs such as Hebrew, Arabic or Turkish: the adequate boundaries for NEs do not coincide with token boundaries, and tokens must be segmented in order to obtain accurate NE boundaries.[3]

[2] Glossing conventions are in accord with the Leipzig Glossing Rules (Comrie et al., 2008).
[3] We use the term morphological segmentation (or segmentation) to refer to splitting raw tokens into morphological segments, each carrying a single Part-Of-Speech tag. That is, we segment away prepositions, determiners, subordination markers and multiple kinds of pronominal clitics, which attach to their hosts via complex morpho-phonological processes. Throughout this work, we use the terms morphological segment, morpheme, and segment interchangeably.

The segmentation of tokens and the identification of adequate NE boundaries is, however, far from trivial, due to complex morpho-phonological and orthographic processes in some MRLs (Vania et al., 2018; Klein and Tsarfaty, 2020). This means that the morphemes that compose NEs are not necessarily transparent in the character sequence of the raw tokens. Consider for example phrase (2):

(2)  המרוץ לבית הלבן
     hamerotz labayit halavan
     the-race to-house.DEF the-white
     'the race to the White House'

Here, the full form of the NE הבית הלבן/habayit halavan (the White House) is not present in the utterance; only the sub-string בית הלבן/bayit halavan ((the) White House) is present in (2), due to phonetic and orthographic processes suppressing the definite article ה/ha in certain environments. In this and many other cases, it is not only that NE boundaries do not coincide with token boundaries; they do not coincide with characters or sub-strings of the token either. This calls for accessing the more basic meaning-bearing units of the token, that is, for decomposing the tokens into morphemes.

Unfortunately, though, the morphological decomposition of surface tokens may be very challenging due to extreme morphological ambiguity. The sequence of morphemes composing a token is not always directly recoverable from its character sequence, and is not known in advance.[4] This means that for every raw space-delimited token there are many conceivable readings which impose different segmentations, yielding different sets of potential NE boundaries. Consider for example the token לבני (lbny) in different contexts:

(3)  (a)  השרה לבני
          hasara livni
          the-minister [Livni]PER
          'Minister [Livni]PER'
     (b)  לבני גנץ
          le-beny gantz
          for-[Benny Gantz]PER
          'for [Benny Gantz]PER'
     (c)  לבני היקר
          li-bni hayakar
          for-son.POSS.1SG the-dear
          'for my dear son'
     (d)  לבני חימר
          livney kheymar
          brick.CS clay
          'clay bricks'

[4] This ambiguity gets magnified by the fact that Semitic languages that use abjads, like Hebrew and Arabic, lack capitalization altogether and suppress all vowels (diacritics).

In (3a) the token לבני is completely consumed as a labeled NE. In (3b) לבני is only partly consumed by an NE, and in (3c) and (3d) the token is entirely outside of any NE context. In (3c) the token is composed of several morphemes, and in (3d) it consists of a single morpheme. These are only some of the possible decompositions of this surface token; other alternatives may still be available. As shown by Goldberg and Tsarfaty (2008); Green and Manning (2010); Seeker and Çetinoğlu (2015); Habash and Rambow (2005); More et al. (2019), and others, the correct morphological decomposition becomes apparent only in the larger (syntactic or semantic) context. The challenge, in a nutshell, is as follows: in order to detect NE boundaries accurately, we need to segment the raw tokens first; however, in order to segment tokens correctly, we need to know the greater semantic content, including, e.g., the participating entities. How can we break out of this apparent loop?

Finally, MRLs are often characterized by an extremely sparse lexicon, consisting of a long tail of out-of-vocabulary (OOV) entities unseen during training (Czarnowska et al., 2019). Even in cases where all morphemes are present in the training data, morphological compositions of seen morphemes may yield tokens and entities which were unseen during training. Take for example the utterance in (4), which the reader may inspect as familiar:

(4)  טסנו מסין לתאילנד
     tasnu misin lethailand
     flew.1PL from-China to-Thailand
     'we flew from China to Thailand'

Example (4) is in fact example (1) with a switched flight direction. This subtle change creates two new surface tokens, מסין and לתאילנד, which might not have been seen during training, even if example (1) had been observed. Morphological compositions of an entity with prepositions, conjunctions, definite markers, possessive clitics and more cause mentions of seen entities to have unfamiliar surface forms, which often fail to be accurately detected and analyzed.

Given the aforementioned complexities, in order to solve NER for MRLs we ought to answer the following fundamental modeling questions:

Q1. Units: What are the discrete units upon which we need to set NE boundaries in MRLs? Are they tokens? Characters? Morphemes? A representation containing multiple levels of granularity?

Q2. Architecture: When employing morphemes in NER, the classical approach is "segmentation-first". However, segmentation errors are detrimental, and downstream NER cannot recover from them. How is it best to set up the pipeline so that segmentation and NER can interact?

Q3. Generalization: How do the different modeling choices affect NER generalization in MRLs? How can we address the long tail of OOV NEs in MRLs? Which modeling strategy best handles pseudo-OOV entities that result from a previously unseen composition of already seen morphemes?
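The ambiguity underlying Q1 and Q2 can be made concrete in a few lines. The following sketch (our own toy representation, not the paper's code; segment strings are transliterated) records the four candidate readings of the surface token in example (3) as alternative segmentations:

```python
# Candidate segmentations of one surface token (transliterated),
# following example (3): a hypothetical toy lexicon, not a real analyzer.
CANDIDATE_ANALYSES = {
    "lbny": [
        ["livni"],        # (3a) proper name: [Livni]-PER
        ["le", "beny"],   # (3b) preposition + proper name
        ["li", "bni"],    # (3c) preposition + possessed noun
        ["livney"],       # (3d) single construct-state noun
    ],
}

def num_readings(token: str) -> int:
    """Number of conceivable segmentations for a surface token;
    unknown tokens default to the trivial single-segment reading."""
    return len(CANDIDATE_ANALYSES.get(token, [[token]]))
```

Each reading induces a different set of potential NE boundaries, which is exactly why segmentation cannot be resolved token-locally.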
3 Formalizing NER for MRLs

To answer the aforementioned questions, we chart and formalize the space of modeling options for neural NER in MRLs. We cast NER as a sequence labeling task and formalize it as f : X → Y, where x ∈ X is a sequence x_1, ..., x_n of n discrete strings from some vocabulary x_i ∈ Σ, and y ∈ Y is a sequence y_1, ..., y_n of the same length, where y_i ∈ Labels, and Labels is a finite set of labels composed of the BIOSE tags (a.k.a. BIOLU, as described in Ratinov and Roth (2009)). Every non-O label is also enriched with an entity type label. Our list of types is presented in Table 2.

| Nickname     | Input  | Lit.         | Output          |
|--------------|--------|--------------|-----------------|
| token-single | המרוץ  | the-race     | O               |
|              | לבית   | to-house.DEF | B_ORG           |
|              | הלבן   | the-white    | E_ORG           |
| token-multi  | המרוץ  | the-race     | O+O             |
|              | לבית   | to-house.DEF | O+B_ORG+I_ORG   |
|              | הלבן   | the-white    | I_ORG+E_ORG     |
| morpheme     | ה      | the          | O               |
|              | מרוץ   | race         | O               |
|              | ל      | to           | O               |
|              | ה      | the          | B_ORG           |
|              | בית    | house        | I_ORG           |
|              | ה      | the          | I_ORG           |
|              | לבן    | white        | E_ORG           |

Table 1: Input/output for token-single, token-multi and morpheme models for example (2) in Sec. 2.

3.1 Token-Based or Morpheme-Based?

Our first modeling question concerns the discrete units upon which to set the NE boundaries. That is, what is the formal definition of the input vocabulary Σ for the sequence labeling task?

The simplest scenario, adopted in most NER studies, assumes token-based input, where each token admits a single label; hence token-single:

NER_token-single : W → L

Here, W = {w* | w ∈ Σ} is the set of all possible token sequences in the language, and L = {l* | l ∈ Labels} is the set of all possible label sequences over the label set defined above. Each token gets assigned a single label, so the input and output sequences are of the same length. The drawback of this scenario is that, since the input for token-single incorporates no morphological boundaries, the exact boundaries of the NEs remain underspecified. This case is exemplified in the top rows of Table 1.

There is another conceivable scenario, where the input is again the sequence of space-delimited tokens, and the output consists of complex labels (henceforth multi-labels) reflecting, for each token, the labels of its constituent morphemes; henceforth, a token-multi scenario:

NER_token-multi : W → L*

Here, W = {w* | w ∈ Σ} is the set of sequences of tokens as in token-single. Each token is assigned a multi-label, i.e., a sequence (l* ∈ L) which indicates the labels of the token's morphemes, in order. The output is a sequence of such multi-labels, one multi-label per token. This variant incorporates morphological information concerning the number and order of labeled morphemes, but lacks the precise morphological boundaries. This is illustrated in the middle rows of Table 1. A downstream application may require (possibly noisy) heuristics to determine the precise NE boundaries of each individual label in the multi-label for an input token.

Another possible scenario is a morpheme-based scenario, assigning a label to each morphological segment:

NER_morph : M → L

Here, M = {m* | m ∈ Morphemes} is the set of sequences of morphological segments in the language, and L = {l* | l ∈ Labels} is the set of label sequences as defined above. The upshot of this scenario is that NE boundaries are precise. An example is given in the bottom rows of Table 1. But, since each token may contain many meaningful morphological segments, the length of the token sequence is not the same as the length of the segment sequence to be labeled, and the model assumes prior morphological segmentation, which in realistic scenarios is not necessarily available.

3.2 Realistic Morphological Decomposition

A major caveat with morpheme-based modeling strategies is that they often assume an ideal scenario of gold morphological decomposition of the space-delimited tokens into morphological segments (cf. Nivre et al. (2007); Pradhan et al. (2012)). But in reality, gold morphological decomposition is not known in advance; it has to be predicted automatically, and prediction errors may propagate to contaminate the downstream task. Our second modeling question therefore concerns the interaction between the morphological decomposition and NER tasks: how would it be best to set up the pipeline so that the predictions of the two tasks can interact?
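Before turning to decomposition, the relation between the morpheme and token-multi views of Sec. 3.1 can be made concrete. The sketch below (our own helper names, transliterated segments) derives the token-multi multi-labels of Table 1 from morpheme-level BIOSE labels, assuming the token-to-morpheme alignment is known:

```python
# Example (2) at morpheme granularity, as in the bottom rows of Table 1.
morph_labels = ["O", "O", "O", "B_ORG", "I_ORG", "I_ORG", "E_ORG"]
# How many morphemes each space-delimited token spans:
# hamerotz = ha+merotz, labayit = le+ha+bayit, halavan = ha+lavan.
token_lens = [2, 3, 2]

def to_token_multi(labels, lens):
    """Concatenate per-morpheme labels into one multi-label per token."""
    out, i = [], 0
    for n in lens:
        out.append("+".join(labels[i:i + n]))
        i += n
    return out
```

Running `to_token_multi(morph_labels, token_lens)` reproduces the middle rows of Table 1: `["O+O", "O+B_ORG+I_ORG", "I_ORG+E_ORG"]`. The reverse direction, recovering exact boundaries from a multi-label, is only possible once a segmentation is supplied.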
To answer this, we define morphological decomposition as consisting of two subtasks: morphological analysis (MA) and morphological disambiguation (MD). We view sentence-based MA as:

MA : W → P(M)

Here W = {w* | w ∈ Σ} is the set of possible token sequences as before, M = {m* | m ∈ Morphemes} is the set of possible morpheme sequences, and P(M) is the set of subsets of M. The role of MA is then to assign a token sequence w ∈ W all of its possible morphological decomposition options. We represent this set of alternatives in a dense structure that we call a lattice (exemplified in Figure 1).

Figure 1: Lattice for a partial list of analyses of the Hebrew tokens לבית הלבן corresponding to Table 1. Bold nodes are token boundaries. Light nodes are segment boundaries. Every path through the lattice is a single morphological analysis. The bold path is a single NE.

MD is the task of picking the single correct morphological path M ∈ M through the MA lattice of a given sentence:

MD : P(M) → M

Now, assume x ∈ W is a surface sentence in the language, with its morphological decomposition initially unknown and underspecified. In a Standard pipeline, MA strictly precedes MD:

MD_Standard : M = MD(MA(x))

The main problem here is that MD errors may propagate to contaminate the NER output.

We propose a novel Hybrid alternative, in which we inject a task-specific signal, in this case NER,[5] to constrain the search for M through the lattice:

MD_Hybrid : M = MD(MA(x) ∩ NER_token(x))

Here, the restriction MA(x) ∩ NER(x) indicates pruning the lattice structure MA(x) to contain only MD options that are compatible with the token-based NER predictions; only then do we apply MD to the pruned lattice.

[5] We can do this for any sequence labeling task in MRLs.

Both MD_Standard and MD_Hybrid are disambiguation architectures that result in a morpheme sequence M ∈ M. The latter benefits from the NER signal, while the former does not. The sequence M ∈ M can be used in one of two ways. We can use M as input to a morpheme model to output morpheme labels. Or, we can rely on the output of the token-multi model and align the token's multi-label with the segments in M.

In what follows, we want to empirically assess the effect of the different modeling choices (token-single, token-multi, morpheme) and disambiguation architectures (Standard, Hybrid) on the performance of NER in MRLs. To this end, we need a corpus that allows training and evaluating NER at both token-level and morpheme-level granularity.

4 The Data: A Novel NER Corpus

This work empirically investigates NER modeling strategies in Hebrew, a Semitic language known for its complex and highly ambiguous morphology. Ben-Mordecai (2005), the only previous work on Hebrew NER to date, annotated space-delimited tokens, basing their guidelines on the CoNLL 2003 shared task (Chinchor et al., 1999). Popular Arabic NER corpora also label space-delimited tokens (ANERcorp (Benajiba et al., 2007), AQMAR (Mohit et al., 2012), TWEETS (Darwish, 2013)), with the exception of the Arabic portion of OntoNotes (Weischedel et al., 2013) and ACE (LDC, 2008), which annotate NER labels on gold morphologically pre-segmented texts. However, these works do not provide a comprehensive analysis of the performance gaps between morpheme-based and token-based scenarios. In agglutinative languages such as Turkish, token segmentation is always performed before NER (Tür et al. (2003); Küçük and Can (2019)), reinforcing the need to contrast the token-based scenario, widely adopted for Semitic languages, with the morpheme-based scenarios in other MRLs.

Our first contribution is thus a parallel corpus for Hebrew NER: one version consists of gold-labeled tokens and the other consists of gold-labeled morphemes, for the same text. For this, we performed gold NE annotation of the Hebrew Treebank (Sima'an et al., 2001), based on the 6,143 morpho-syntactically analyzed sentences of the HAARETZ corpus, to create both token-level and morpheme-level variants, as illustrated in the topmost and lowest rows of Table 1, respectively.
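The two parallel variants can be illustrated with the running example of Table 1. The sketch below (our own field names and transliterations, not the corpus format) stores one sentence with gold NE labels at both granularities; extracting the mention surface form from each level shows why the two variants differ:

```python
# One sentence of example (2), annotated at both granularities
# (hypothetical record layout; segments transliterated).
sent = {
    "tokens":       ["hamerotz", "labayit", "halavan"],
    "token_labels": ["O", "B_ORG", "E_ORG"],
    "morphemes":    ["ha", "merotz", "le", "ha", "bayit", "ha", "lavan"],
    "morph_labels": ["O", "O", "O", "B_ORG", "I_ORG", "I_ORG", "E_ORG"],
}

def mention(units, labels):
    """Surface form of the (single) labeled mention: all non-O units."""
    return " ".join(u for u, l in zip(units, labels) if l != "O")
```

At the token level the mention is `"labayit halavan"`, which wrongly includes the fused preposition; at the morpheme level it is exactly `"ha bayit ha lavan"`, the White House.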
|                                | train   | dev    | test   |
|--------------------------------|---------|--------|--------|
| Sentences                      | 4,937   | 500    | 706    |
| Tokens                         | 93,504  | 8,531  | 12,619 |
| Morphemes                      | 127,031 | 11,301 | 16,828 |
| All mentions                   | 6,282   | 499    | 932    |
| Type: Person (PER)             | 2,128   | 193    | 267    |
| Type: Organization (ORG)       | 2,043   | 119    | 408    |
| Type: Geo-Political (GPE)      | 1,377   | 121    | 195    |
| Type: Location (LOC)           | 331     | 28     | 41     |
| Type: Facility (FAC)           | 163     | 12     | 11     |
| Type: Work-of-Art (WOA)        | 114     | 9      | 6      |
| Type: Event (EVE)              | 57      | 12     | 0      |
| Type: Product (DUC)            | 36      | 2      | 3      |
| Type: Language (ANG)           | 33      | 3      | 1      |

Table 2: Basic corpus statistics. Standard HTB splits.

Annotation Scheme  We started off with the guidelines of Ben-Mordecai (2005), from which we deviate in three main ways. First, we label NE boundaries and their types on sequences of morphemes, in addition to the space-delimited token annotations.[6] Secondly, we use the finer-grained entity categories list of ACE (LDC, 2008).[7] Finally, we allow nested entity mentions, as in Finkel and Manning (2009); Benikova et al. (2014).[8]

Annotation Cycle  As Fort et al. (2009) put it, examples and rules can never cover all possible cases, because of the specificity of natural language and the ambiguity of formulation. To address this we employed the cyclic approach of agile annotation as offered by Alex et al. (2010). Every cycle consisted of: annotation; evaluation and curation; clarification and refinements. We used WebAnno (Yimam et al., 2013) as our annotation interface.

The Initial Annotation Cycle was a two-stage pilot with 12 participants, divided into 2 teams of 6. The teams received the same guidelines, with the exception of the specification of entity boundaries. One team was guided to annotate the minimal string that designates the entity. The other was guided to tag the maximal string which can still be considered as the entity. Our agreement analysis showed that the minimal guideline generally led to more consistent annotations. Based on this result (as well as low-level refinements) from the pilot, we devised the full version of the guidelines.[9]

Annotation, Evaluation and Curation:  Every annotation cycle was performed by two annotators (A, B) and an annotation manager/curator (C). We annotated the full corpus in 7 cycles. We evaluated the annotation in two ways: manual curation and automatic evaluation. After each annotation step, the curator manually reviewed every sentence in which disagreements arose, as well as specific points of difficulty pointed out by the annotators. The inter-annotator agreement metric described below was also used to quantitatively gauge the progress and quality of the annotation.

Clarifications and Refinements:  At the end of each cycle we held a clarification talk between A, B and C, in which issues that came up during the cycle were discussed. Following that talk we refined the guidelines and updated the annotators, who went on to the next cycle. At the end we performed a final curation run to make sentences from earlier cycles comply with later refinements.[10]

Inter-Annotator Agreement (IAA)  IAA is commonly measured using the κ-statistic. However, Pyysalo et al. (2007) show that it is not suitable for evaluating inter-annotator agreement in NER. Instead, an F1 metric on entity mentions has in recent years been adopted for this purpose (Zhang, 2013). This metric allows for computing pair-wise IAA with the standard F1 score, by treating one annotator as gold and the other as the prediction. Our full-corpus pair-wise F1 scores are: IAA(A,B)=89, IAA(B,C)=92, IAA(A,C)=96. Table 2 presents the final corpus statistics.

Annotation Costs  The annotation took on average about 35 seconds per sentence, and thus a total of 60 hours for all sentences in the corpus, for each annotator. Six clarification talks were held between the cycles, lasting from thirty minutes to an hour, giving a total of about 130 work hours of expert annotators.[11]
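The mention-level pair-wise F1 used for IAA above can be sketched as follows; we assume (as is standard for this metric, though the paper does not spell it out) strict matching on (span, type) mention tuples, and the data below is a toy example of ours:

```python
# Pair-wise IAA: treat annotator A's mentions as gold, B's as predicted,
# and score strict F1 over (start, end, type) tuples.
def mention_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

ann_a = {(0, 2, "PER"), (5, 7, "ORG"), (9, 10, "GPE")}
ann_b = {(0, 2, "PER"), (5, 7, "ORG"), (9, 11, "GPE")}  # one span differs
```

Here two of three mentions agree exactly, so `mention_f1(ann_a, ann_b)` is 2/3. Note the metric is symmetric in who plays "gold", which is what makes it suitable for agreement.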
[6] A single NE is always continuous. Token-morpheme discrepancies do not lead to discontinuous NEs.
[7] Entity categories are listed in Table 2. We dropped the NORP category, since it introduced complexity concerning the distinction between adjectives and group names. LAW did not appear in our corpus.
[8] Nested labels are not modeled in this paper, but they are published with the corpus, to allow for further research.
[9] The complete annotation guide is publicly available at https://github.com/OnlpLab/NEMO-Corpus.
[10] A, B and C annotations are published to enable research on learning with disagreements (Plank et al., 2014).
[11] The corpus is available at https://github.com/OnlpLab/NEMO-Corpus.

5 Experimental Settings

Goal  We set out to empirically evaluate the representation alternatives for the input/output sequences (token-single, token-multi, morpheme) and the effect of the different architectures (Standard, Hybrid) on the performance of NER for Hebrew.
Figure 2: The token-single and token-multi models. The input and output correspond to rows 1-2 in Table 1. Triangles indicate string embeddings. Circles indicate char-based encoding.

Figure 3: The morpheme model. The input and output correspond to row 3 in Table 1. Triangles indicate string embeddings. Circles indicate char-based encoding.

Modeling Variants  All experiments use the corpus we just described and employ a standard Bi-LSTM-CRF architecture for implementing the neural sequence labeling task (Huang et al., 2015). Our basic architecture[12] is composed of an embedding layer for the input and a 2-layer Bi-LSTM followed by a CRF inference layer, for which we test three modeling variants.

Figures 2-3 present the variants we employ. Figure 2 shows the token-based variants, token-single and token-multi. The former outputs a single BIOSE label per token, and the latter outputs a multi-label per token: a concatenation of the BIOSE labels of the morphemes composing the token. Figure 3 shows the morpheme-based variant for the same input phrase. It has the same basic architecture, but now the input consists of morphological segments instead of tokens. The model outputs a single BIOSE label for each morphological segment in the input.

In all modeling variants, the input may be encoded in two ways: (a) string-level embeddings (token-based or morpheme-based), optionally initialized with pre-trained embeddings; (b) char-level embeddings, trained simultaneously with the main task (cf. Ma and Hovy (2016); Chiu and Nichols (2015); Lample et al. (2016)). For char-based encoding (of either tokens or morphemes) we experiment with CharLSTM, CharCNN, or NoChar, that is, no character embedding at all.

We pre-trained all token-based and morpheme-based embeddings on the Hebrew Wikipedia dump of Goldberg (2014). For morpheme-based embeddings, we decompose the input using More et al. (2019) and use the morphological segments as the embedding units.[13] We compare GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017). We hypothesize that since fastText uses sub-string information, it will be more useful for analyzing OOVs.

[12] Using the NCRF++ suite of Yang and Zhang (2018).
[13] Embeddings and the Wikipedia corpus are also available at: https://github.com/OnlpLab/NEMO

Hyper-parameters  Following Reimers and Gurevych (2017); Yang et al. (2018), we performed hyper-parameter tuning for each of our model variants. We performed the tuning on the dev set in a number of rounds of random search, independently for every input/output and char-embedding architecture. Table 3 shows our selected hyper-parameters.[14] The CharCNN window size is particularly interesting, as it was not treated as a hyper-parameter in Reimers and Gurevych (2017); Yang et al. (2018). However, given the token-internal complexity in MRLs, we conjecture that the window size over characters might have a crucial effect. In our experiments we found that a larger window (7) increased performance. For MRLs, further research into this hyper-parameter might be of interest.

[14] A few interesting empirical observations diverging from those of Reimers and Gurevych (2017); Yang et al. (2018) are worth mentioning. We found that a lower learning rate than the one recommended by Yang et al. (2018) (0.015) led to better results and fewer occurrences of divergence. We further found that raising the number of epochs from 100 to 200 did not result in over-fitting, and significantly improved NER results. We used for evaluation the weights from the best epoch.

| Parameter       | Value | Parameter           | Value |
|-----------------|-------|---------------------|-------|
| Optimizer       | SGD   | *LR (token-single)  | 0.01  |
| *Batch Size     | 8     | *LR (token-multi)   | 0.005 |
| LR decay        | 0.05  | *LR (morpheme)      | 0.01  |
| Epochs          | 200   | Dropout             | 0.5   |
| Bi-LSTM layers  | 2     | *CharCNN window     | 7     |
| *Word Emb Dim   | 300   | Char Emb Dim        | 30    |
| Word Hidden Dim | 200   | *Char Hidden Dim    | 70    |

Table 3: Summary of hyper-parameter tuning. The * indicates divergence from the NCRF++ proposed setup and empirical findings (Yang and Zhang, 2018).

Evaluation  Standard NER studies typically invoke the CoNLL evaluation script, which anchors NEs in token positions (Tjong Kim Sang, 2003). However, it is inadequate for our purposes, because we want to compare entities across token-based vs. morpheme-based settings. To this end, we use a revised evaluation procedure, which anchors the entity in its form rather than its index. Specifically, we report F1 scores on strict, exact match of the surface forms of the entity mentions. That is, the gold and predicted NE spans must exactly match in their form, boundaries, and entity type. In all experiments, we report both token-level and morpheme-level F-scores, for all models.

• Token-level evaluation. For the sake of backwards compatibility with previous work on Hebrew NER, we first define token-level evaluation. For token-single this is a straightforward calculation of F1 against gold spans. For token-multi and morpheme, we need to map the predicted label sequence of each token to a single label, and we do so using linguistically-informed rules we devise (as elaborated in Appendix A).[15]

• Morpheme-level evaluation. Our ultimate goal is to obtain precise boundaries of the NEs. Thus, our main metric evaluates NEs against the gold morphological boundaries. For morpheme and token-single models, this is a straightforward F1 calculation against gold spans. Note that for token-single we expect to pay a price for boundary mismatches. For token-multi, we know the number and order of labels, so we align the labels in the multi-label of the token with the morphemes in its morphological decomposition.[16]

[15] In the morpheme case we might encounter "illegal" label sequences in case of a prediction error. We employ similar linguistically-informed heuristics to recover from that (see Appendix A).
[16] In case of a misalignment (in the number of morphemes and labels) we match the label-morpheme pairs from the final one backwards, and pad unpaired morphemes with O labels.

For all experiments and metrics, we report the mean and confidence interval (0.95) over ten runs.

Input-Output Scenarios  We experiment with two kinds of input settings: token-based, where the input consists of the sequence of space-delimited tokens, and morpheme-based, where the input consists of morphological segments. For the morpheme input, there are three input variants:

(i) Morph-gold: the morphological sequence is produced by an expert (idealistic).
(ii) Morph-standard: the morphological sequence is produced by a standard segmentation-first pipeline (realistic).
(iii) Morph-hybrid: the morphological sequence is produced by the hybrid architecture we propose (realistic).

In the token-multi case we can perform morpheme-based evaluation by aligning the individual labels in the multi-label with the morpheme sequence of the respective token. Again we have three options as to which morphemes to use:

(i) Tok-multi-gold: the multi-label is aligned with morphemes produced by an expert (idealistic).
(ii) Tok-multi-standard: the multi-label is aligned with morphemes produced by a standard pipeline (realistic).
(iii) Tok-multi-hybrid: the multi-label is aligned with morphemes produced by the hybrid architecture we propose (realistic).

Pipeline Scenarios  Assume an input sentence x. In the Standard pipeline we use YAP,[17] the current state-of-the-art morpho-syntactic parser for Hebrew (More et al., 2019), for the predicted segmentation M = MD(MA(x)). In the Hybrid pipeline, we use YAP to first generate the complete morphological lattice MA(x). Then, to obtain MA(x) ∩ NER(x), we omit lattice paths where the number of morphemes in the token decomposition does not conform with the number of labels in the multi-label of NER_token-multi(x). Then, we apply YAP to obtain MD(MA(x) ∩ NER(x)) on the constrained lattice. In predicted morphology scenarios (either Standard or Hybrid), we use the same model weights as trained on the gold segments, but feed predicted morphemes as input.[18]

[17] For other languages this may be done using models for canonical segmentation as in Kann et al. (2016).
Figure 4: Token-level Eval. on Dev w/ Gold Segmen- Figure 6: Token-Level Evaluation in Realistic Sce- tation. CharCNN for morph, CharLSTM for tok. narios on Dev, comparing Gold, Standard and Hybrid Morphology. CharCNN for morph, CharLSTM for tok. Results for Gold, token-single and token-multi are taken from Fig 4. Figure 5: Morph-Level Eval. on Dev w/ Gold Segmen- tation. CharCNN for morph, CharLSTM for tok. Figure 7: Morph-Level Evaluation in Realistic Sce- pipeline, we use YAP to first generate complete narios on Dev, comparing Gold, Standard and Hybrid morphological lattices M A(x). Then, to obtain Morphology. CharCNN for morph, CharLSTM for M A(x) N ER(x) we omit lattice paths where tok. Results for Gold, token-single and token-multi are the number of morphemes in the token decompo- taken from Fig 5. sition does not conform with the number of labels in the multi-label of NERtoken-multi (x). Then, we apply YAP to obtain M D(M A(x) N ER(x)) terestingly, explicit modeling of morphemes leads on the constrained lattice. In predicted morphol- to better NER performance even when evaluated ogy scenarios (either Standard or Hybrid), we use against token-level boundaries. As expected, the the same model weights as trained on the gold seg- performance gaps between variants are smaller ments, but feed predicted morphemes as input.18 with fastText than they are with embeddings that are unaware of characters (GloVe) or with no pre- 6 Results training at all. We further pursue this in Sec. 6.3. Figure 5 shows the morpheme-level evaluation 6.1 The Units: Tokens vs. Morphemes for the same model variants as in Table 4. The Figure 4 shows the token-level evaluation for the most obvious trend here is the drop in the per- different model variants we defined. We see formance of the token-single model. This is ex- that morpheme models perform significantly better pected, reflecting the inadequacy of token bound- than the token-single and token-multi variants. 
In- aries for identifying accurate boundaries for NER. 18 We do not re-train the morpheme models with predicted Interestingly, morpheme and token-multi models segmentation, which might achieve better performance (e.g.. keep a similar level of performance as in token- jackknifing). We leave this for future work. level evaluation, only slightly lower. Their per-
Figure 8: Entity Mention Counts and Ratio by Cate- Figure 9: Token-Level Eval on Dev by OOTV Cate- gory and OOTV Category, for Dev Set. gory. Using fastText and CharLSTM. formance gap is also maintained, with morpheme • Lexical: Unknown mentions caused by an performing better than token-multi. An obvious unknown token which consists of a single caveat is that these results are obtained with gold morpheme. This is a strictly lexical unknown morphology. What happens in realistic scenarios? with no morphological composition (most English unknowns are in this category). 6.2 The Architecture: Pipeline vs. Hybrid • Compositional: Unknown mentions caused Figure 6 shows the token-level evaluation results by an unknown token which consists of mul- in realistic scenarios. We first observe a significant tiple known morphemes. These are un- drop for morpheme models when Standard pre- knowns introduced strictly by morphological dicted segmentation is introduced instead of gold. composition, with no lexical unknowns. This means that MD errors are indeed detrimen- tal for the downstream task, in a non-negligible • LexComp: Unknown mentions caused by an rate. Second, we observe that much of this perfor- unknown token consisting of multiple mor- mance gap is recovered with the Hybrid pipeline. phemes, of which (at least) one morpheme It is noteworthy that while morph hybrid lags be- was not seen during training. In such cases, hind morph gold, it is still consistently better than both unknown morphological composition token-based models, token-single and token-multi. and lexical unknowns are involved. Figure 7 shows morpheme-level evaluation re- We group NEs based on these categories, and sults for the same scenarios as in Table 6. All evaluate each group separately. We consider men- trends from the token-level evaluation persist, in- tions that do not fall into any category as Known. 
cluding a drop for all models with predicted seg- Figure 8 shows the distributions of entity men- mentation relative to gold, with the hybrid variant tions in the dev set by entity type and OOTV cat- recovering much of the gap. Again morph gold egory. OOTV categories that involve composition outperforms token-multi, but morph hybrid shows (Comp and LexComp) are spread across all cate- great advantages over all tok-multi variants. This gories but one, and in some they even make up performance gap between morph (gold or hybrid) more than half of all mentions. and tok-multi indicates that explicit morphological Figure 9 shows token-level evaluation19 with modeling is indeed crucial for accurate NER. fastText embeddings, grouped by OOTV type. We first observe that indeed unknown NEs that are due 6.3 Morphologically-Aware OOV Evaluation to morphological composition (Comp and Lex- As discussed in Section 2, morphological compo- Comp) proved the most challenging for all models. sition introduces an extremely sparse word-level We also find that in strictly Compositional OOTV “long-tail” in MRLs. In order to gauge this phe- 19 This section focuses on token-level evaluation, which is nomenon and its effects on NER performance, a permissive evaluation metric, allowing us to compare the we categorize unseen, out-of-training-vocabulary models on a more level playing field, where all models (in- (OOTV) mentions into 3 categories: cluding token-single) have an equal opportunity to perform.
mentions, morpheme-based models exhibit their Eval Model dev test Morph- morph gold 80.03 ± 0.4 79.10 ± 0.6 most significant performance advantage, support- Level morph hybrid 78.51 ± 0.5 77.11 ± 0.7 ing the hypothesis that explicit morphology helps morph standard 72.79 ± 0.5 69.52 ± 0.6 to generalize. We finally observe that token-multi token-multi hybrid 75.70 ± 0.5 74.64 ± 0.3 Token- morph gold 80.30 ± 0.5 79.28 ± 0.6 models perform better than token-single models Level morph hybrid 79.04 ± 0.5 77.64 ± 0.7 for these NEs (in contrast with the trend for non- morph standard 74.52 ± 0.7 73.53 ± 0.8 compositional NEs). This corroborates the hy- token-multi 77.59 ± 0.4 77.75 ± 0.3 token-single 78.15 ± 0.3 77.15 ± 0.6 pothesis that even partial modeling of morphology (as in token-multi compared to token-single) is bet- Table 4: Test vs. Dev: Results with fastText for all ter than none, leading to better generalization. Models. morph-gold presents an ideal upper-bound. String-level vs. Character-level Embeddings morpheme-based models, indicating that not all To further understand the generalization capacity morphological information is captured by these of different modeling alternatives in MRLs, we vectors. For fastText with char-based embed- probe into the interplay of string-based and char- dings the gap between token-multi and morpheme based embeddings in treating OOTV NEs. greatly diminishes, but is still well above token- Figure 10 presents 12 plots, each of which single. This suggests biasing the model to learn presents the level of performance (y-axes) for all about morphology (either via multi-labels or by models (x-axes). Token-based models are on the incorporating morphological boundaries) has ad- left of each x-axes, morpheme-based are on the vantages for analysing OOTV entities, beyond the right. We plot results with and without charac- contribution of char-based embeddings alone. ter embeddings,20 in orange and blue respectively. 
All in all, the biggest advantage of morpheme- The plots are organized in a large grid, with the based models over token-based models is their type of NE on the y-axes (Known, Lex, Comp, Lex- ability to generalize from observed tokens Comp), and the type of pre-training on the x-axes to composition-related OOTV (Comp/LexComp). (No pre-training, GloVe, fastText) . While character-based embeddings do help token- At the top-most row, plotting the accuracy for based models generalize, the contribution of mod- Known NEs, we see a high level of performance eling morphology is indispensable, above and be- for all pre-training methods, with not much dif- yond the contribution of char-based embeddings. ferences between the type of pre-training, with or without the character embeddings. Moving 6.4 Setting in the Greater Context further down to the row of Lexical unseen NEs, char-based representations lead to significant ad- Test Set Results Table 4 confirms our best re- vantages when we assume no pre-training, but sults on the Test set. The trends are kept, though with GloVe pre-training the performance substan- results on Test are lower than on Dev. The morph tially increases, and with fastText the differences gold scenario still provides an upperbound of the in performance with/without char-embeddings al- performance, but it is not realistic. For the realistic most entirely diminish, indicating the char-based scenarios, morph hybrid generally outperforms all embeddings are somewhat redundant in this case. other alternatives. The only divergence is that in The two lower rows in the large grid show the token-level evaluation, token-multi performs on a performance for Comp and LexComp unseen NEs, par with morph hybrid on the Test set. which are ubiquitous in MRLs. For Compositional Results on MD Tasks. While the Hybrid NEs, pre-training closes only part of the gap be- pipeline achieves superior performance on NER, tween token-based and morpheme-based models. 
it also improves the state-of-the-art on other tasks Adding char-based representations indeed helps in the pipeline. Table 5 shows the Seg+POS results the token-based models, but crucially does not of our Hybrid pipeline scenario, compared with close the gap with the morpheme-based variants. the Standard pipeline which replicates the pipeline Finally, for LexComp NEs at the lowest row, we of More et al. (2019). We use the metrics defined again see that adding GloVe pre-training and char- by More et al. (2019). We show substantial im- based embeddings does not close the gap with provements for the Hybrid pipeline over the results 20 For brevity we only show char LSTM (vs. no char repre- of More et al. (2019), and also outperforming the sentation), there was no significant difference with CNN. Test results of Seker and Tsarfaty (2020).
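Concretely, the form-anchored metric underlying all of the scores above can be sketched as follows. This is a minimal illustrative sketch, not the paper's released evaluation code; function and variable names are our own. Each mention is reduced to a (surface form, entity type) pair, so a prediction counts as a true positive only on an exact match of form, boundaries, and type, regardless of whether the system operated over tokens or morphemes.

```python
from collections import Counter

def strict_f1(gold_mentions, pred_mentions):
    """Strict, exact-match F1 over entity mentions.

    Each mention is a (surface_form, entity_type) pair, so the score is
    anchored in the mention's form rather than in token indices; this
    makes token-based and morpheme-based outputs directly comparable.
    """
    gold, pred = Counter(gold_mentions), Counter(pred_mentions)
    # A predicted mention is correct only if form AND type match exactly;
    # Counter intersection gives the multiset of exact matches.
    tp = sum((gold & pred).values())
    precision = tp / sum(pred.values()) if pred else 0.0
    recall = tp / sum(gold.values()) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("דן", "PER"), ("בר אילן", "ORG")]
pred = [("דן", "PER"), ("אילן", "ORG")]  # boundary error on the 2nd mention
print(round(strict_f1(gold, pred), 2))  # prints 0.5
```

Note how the boundary error in the second mention changes its surface form and is therefore penalized in both precision and recall, which is exactly the price token-single models pay under morpheme-level evaluation.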
Figure 10: Token-level evaluation on Dev for different OOTV types, char- and word-embeddings.

                                              Seg+POS
dev    Standard (More et al., 2019)           92.36
       Ptr-Network (Seker and Tsarfaty, 2020) 93.90
       Hybrid (this work)                     93.12
test   Standard (More et al., 2019)           89.08
       Ptr-Network (Seker and Tsarfaty, 2020) 90.49
       Hybrid (this work)                     90.89

Table 5: Morphological segmentation & POS scores.

Model                                     Precision     Recall        F1
Ben-Mordecai (2005), MEMM+HMM+REGEX       84.54         74.31         79.10
This work, token-single +FT+CharLSTM      86.84 ± 0.5   82.6 ± 0.9    84.71 ± 0.5
This work, morph-Hybrid +FT+CharLSTM      86.93 ± 0.6   83.59 ± 0.8   85.22 ± 0.5

Table 6: NER comparison with Ben-Mordecai (2005).

Comparison with Prior Art. Table 6 presents our results on the Hebrew NER corpus of Ben-Mordecai (2005), compared with their model, which uses a hand-crafted, feature-engineered MEMM with regular-expression rule-based enhancements and an entity lexicon. Like Ben-Mordecai (2005), we performed three 75%-25% random train/test splits and used the same seven NE categories (PER, LOC, ORG, TIME, DATE, PERCENT, MONEY). We trained a token-single model on the original space-delimited tokens and a morpheme model on automatically segmented morphemes obtained using our best segmentation model (Hybrid MD on our trained token-multi model, as in Table 5). Since their annotation includes only token-level boundaries, all of the results we report conform with token-level evaluation.

Both models significantly outperform the previous state-of-the-art of Ben-Mordecai (2005), setting a new performance bar on this earlier benchmark. Moreover, we again observe an empirical advantage when explicitly modeling morphemes, even with the automatic, noisy segmentation used for morpheme-based training.

7 Discussion: Joint Modeling Alternatives and Future Work

The present study provides the motivation and the necessary foundations for comparing morpheme-based and token-based modeling for NER. While our findings clearly demonstrate the advantages of morpheme-based modeling for NER in a morphologically rich language, our proposed Hybrid architecture is clearly not the only modeling alternative for linking NER and morphology.

For example, a previous study by Güngör et al. (2018) addresses joint neural modeling of morphological segmentation and NER labeling, proposing a multi-task learning (MTL) approach for joint MD and NER in Turkish. They employ separate Bi-LSTM networks for the MD and NER tasks, with a shared loss to allow for joint learning. Their results indicate improved NER performance, with no improvement in the MD results. Contrary to our proposal, they view MD and NER as distinct tasks, assume a single NER label per token, and do not provide disambiguated morpheme-level boundaries for the NER task. More generally, they test only token-based NER labeling and do not attend to the question of input/output granularity in their models.

A different approach to joint NER and morphology is to jointly predict the segmentation and labels for each token in the input stream. This is the approach taken, for instance, by the lattice-based Pointer-Network of Seker and Tsarfaty (2020). As shown in Table 5, their results for morphological segmentation and POS tagging are on a par with our reported results, and, at least in principle, it should be possible to extend their approach to also yield NER predictions. However, our preliminary experiments with a lattice-based Pointer-Network for token segmentation and NER labeling show that this is not a straightforward task. Contrary to POS tags, which are constrained by the MA, every NER label can potentially go with any segment, and this leads to a combinatorial explosion of the search space represented by the lattice. As a result, the NER predictions are brittle to learn, and the complexity of the resulting model is computationally prohibitive.

A different approach to joint sequence segmentation and labeling is to apply the neural model directly to the character sequence of the input stream. Such an approach is, for instance, the char-based labeling-as-segmentation setup proposed by Shao et al. (2017), who use a character-based Bi-RNN-CRF to output a single label per character, indicating both the word boundaries (via BIES sequence labels) and the POS tags. This method is also used in their universal segmentation paper (Shao et al., 2018). However, as seen in the results of Shao et al. (2018), char-based labeling for segmenting Semitic languages lags far behind all other languages, precisely because morphological boundaries are not explicit in the character sequence.

Additional proposals are those of Kong et al. (2015) and Kemos et al. (2019). Kong et al. (2015) proposed to solve, e.g., Chinese segmentation and POS tagging via dynamic programming with neural encoding: a Bi-LSTM encodes the character input, and a semi-Markov CRF on top yields probabilities for the different segmentation options. Kemos et al. (2019) propose a similar approach to joint segmentation and tagging, but add convolution layers on top of the Bi-LSTM encodings to obtain segment features hierarchically, which are then fed to the semi-Markov CRF.

Preliminary experiments we conducted confirm that char-based joint segmentation and NER labeling for Hebrew, whether using char-based labeling or a seq2seq architecture, still lags behind our reported results. We conjecture that this is due to the complex morpho-phonological and orthographic processes in Semitic languages. Going into char-based modeling nuances and offering a sound joint solution for a language like Hebrew is an important matter that merits its own investigation. Such work is feasible now given the new corpus; however, it is out of the scope of the current study.
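For concreteness, the labeling-as-segmentation idea discussed above can be illustrated with a short sketch. This is our own toy encoding, not the implementation of Shao et al.: each character receives a joint label encoding its position within a segment (BIES) together with the segment's tag, so that segmentation and labeling collapse into a single character-level sequence labeling task.

```python
def char_bies_labels(segments):
    """Encode (segment, tag) pairs as per-character BIES-tag labels.

    'B', 'I', 'E' mark the beginning, inside, and end of a multi-character
    segment; 'S' marks a single-character segment. The tag rides along on
    every character, so a char-level tagger can predict segmentation and
    labels jointly.
    """
    labels = []
    for seg, tag in segments:
        if len(seg) == 1:
            labels.append(f"S-{tag}")
        else:
            labels.append(f"B-{tag}")
            labels.extend(f"I-{tag}" for _ in seg[1:-1])
            labels.append(f"E-{tag}")
    return labels

# Toy example: the token "לבית" segmented as ל ("to", tag O)
# plus בית ("house", tag LOC).
print(char_bies_labels([("ל", "O"), ("בית", "LOC")]))
# → ['S-O', 'B-LOC', 'I-LOC', 'E-LOC']
```

The sketch also makes the failure mode visible: nothing in the character sequence itself marks where ל ends and בית begins, so the tagger must recover that boundary implicitly, which is exactly where such models struggle on Semitic languages.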
All in all, the design of sophisticated joint modeling strategies for morpheme-based NER poses fascinating questions, for which our work provides a solid foundation (data, protocols, metrics, strong baselines). More work is needed to investigate joint modeling of NER and morphology in the directions portrayed in this section, yet it is beyond the scope of this paper, and we leave this investigation for future work.

Finally, while the joint approach is appealing, we argue that the elegance of our Hybrid solution lies precisely in providing a clear and well-defined interface between MD and NER through which the two tasks can interact, while still keeping the distinct models simple, robust, and efficiently trainable. It also has the advantage of allowing us to seamlessly integrate sequence labeling with any lattice-based MA, in a plug-and-play, language-agnostic fashion, towards obtaining further advantages on both of these tasks.

8 Conclusion

This work addresses the modeling challenges of neural NER in MRLs. We deliver a parallel token-vs.-morpheme NER corpus for Modern Hebrew that allows one to assess NER modeling strategies in morphologically rich-and-ambiguous environments. Our experiments show that while NER benefits from morphological decomposition, downstream results are sensitive to segmentation errors. We thus propose a Hybrid architecture in which NER precedes and prunes the morphological decomposition. This approach greatly outperforms a Standard pipeline in realistic (non-gold) scenarios. Our analysis further shows that morpheme-based models better recognize OOVs that result from morphological composition. All in all, we deliver new state-of-the-art results for Hebrew NER and MD, along with a novel benchmark, to encourage further investigation into the interaction between NER and morphology.

Acknowledgments

We are grateful to the BIU-NLP lab members as well as six anonymous reviewers for their insightful remarks. We further thank Daphna Amit and Zef Segal for their meticulous annotation and profound discussions. This research is funded by an ISF Individual Grant (1739/26) and an ERC Starting Grant (677352), for which we are grateful.

References

Bea Alex, Claire Grover, Rongzhou Shen, and Mijail Kabadjov. 2010. Agile corpus annotation in practice: An overview of manual and automatic annotation of CVs. In Proceedings of the Fourth Linguistic Annotation Workshop, pages 29–37, Uppsala, Sweden. Association for Computational Linguistics.

Naama Ben-Mordecai. 2005. Hebrew named entity recognition. Master's thesis, Department of Computer Science, Ben-Gurion University.

Yassine Benajiba, Paolo Rosso, and José Miguel Benedí Ruiz. 2007. ANERsys: An Arabic named entity recognition system based on maximum entropy. In Computational Linguistics and Intelligent Text Processing, pages 143–153, Berlin, Heidelberg. Springer Berlin Heidelberg.

Darina Benikova, Chris Biemann, and Marc Reznicek. 2014. NoSta-D named entity annotation for German: Guidelines and dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 2524–2531, Reykjavik, Iceland. European Language Resources Association (ELRA).

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

N. Chinchor, E. Brown, L. Ferro, and P. Robinson. 1999. Named entity recognition task definition. Technical Report Version 1.4, The MITRE Corporation and SAIC.

Jason P. C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. CoRR, abs/1511.08308.

Bernard Comrie, Martin Haspelmath, and Balthasar Bickel. 2008. The Leipzig glossing rules: Conventions for interlinear morpheme-by-morpheme glosses. Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology & the Department of Linguistics of the University of Leipzig.

Paula Czarnowska, Sebastian Ruder, Edouard Grave, Ryan Cotterell, and Ann Copestake. 2019. Don't forget the long tail! A comprehensive analysis of morphological generalization in bilingual lexicon induction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 974–983, Hong Kong, China. Association for Computational Linguistics.

Kareem Darwish. 2013. Named entity recognition using cross-lingual resources: Arabic as an example. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1558–1567, Sofia, Bulgaria. Association for Computational Linguistics.

Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147, Copenhagen, Denmark. Association for Computational Linguistics.

Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, pages 141–150, Stroudsburg, PA, USA. Association for Computational Linguistics.

Karën Fort, Maud Ehrmann, and Adeline Nazarenko. 2009. Towards a methodology for named entities annotation. In Proceedings of the Third Linguistic Annotation Workshop (LAW III), pages 142–145, Suntec, Singapore. Association for Computational Linguistics.

Yoav Goldberg. 2014. Hebrew Wikipedia dependency parsed corpus, v.1.0. Technical Report V.1.0.

Yoav Goldberg and Reut Tsarfaty. 2008. A single generative model for joint morphological segmentation and syntactic parsing. In Proceedings of ACL-08: HLT, pages 371–379, Columbus, Ohio. Association for Computational Linguistics.

Spence Green and Christopher D. Manning. 2010. Better Arabic parsing: Baselines, evaluations, and analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 394–402, Beijing, China. Coling 2010 Organizing Committee.

Onur Güngör, Suzan Üsküdarli, and Tunga Güngör. 2018. Improving named entity recognition by jointly learning to disambiguate morphological tags. CoRR, abs/1807.06683.

Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 573–580, Ann Arbor, Michigan. Association for Computational Linguistics.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2016. Neural morphological analysis: Encoding-decoding canonical segments. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 961–967, Austin, Texas. Association for Computational Linguistics.

Apostolos Kemos, Heike Adel, and Hinrich Schütze. 2019. Neural semi-Markov conditional random fields for robust character-based part-of-speech tagging. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2736–2743, Minneapolis, Minnesota. Association for Computational Linguistics.

Stav Klein and Reut Tsarfaty. 2020. Getting the ##life out of living: How adequate are word-pieces for modelling complex morphology? In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 204–209, Online. Association for Computational Linguistics.

Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2015. Segmental recurrent neural networks. arXiv preprint arXiv:1511.06018.