Pooled Contextualized Embeddings for Named Entity Recognition

Alan Akbik    Tanja Bergmann    Roland Vollgraf
Zalando Research
Mühlenstraße 25, 10243 Berlin
{firstname.lastname}@zalando.de

Abstract

Contextual string embeddings are a recent type of contextualized word embedding that were shown to yield state-of-the-art results when utilized in a range of sequence labeling tasks. They are based on character-level language models which treat text as distributions over characters and are capable of generating embeddings for any string of characters within any textual context. However, such purely character-based approaches struggle to produce meaningful embeddings if a rare string is used in an underspecified context. To address this drawback, we propose a method in which we dynamically aggregate contextualized embeddings of each unique string that we encounter. We then use a pooling operation to distill a global word representation from all contextualized instances. We evaluate these pooled contextualized embeddings on common named entity recognition (NER) tasks such as CoNLL-03 and WNUT and show that our approach significantly improves the state-of-the-art for NER. We make all code and pre-trained models available to the research community for use and reproduction.

Figure 1: Example sentence that provides underspecified context ("Fung Permadi ( Taiwan ) v Indra", tagged B-PER E-PER S-LOC S-ORG). This leads to an underspecified contextual word embedding for the string "Indra" that ultimately causes a misclassification of "Indra" as an organization (ORG) instead of a person (PER) in a downstream NER task.

1   Introduction

Word embeddings are a crucial component in many NLP approaches (Mikolov et al., 2013; Pennington et al., 2014) since they capture latent semantics of words and thus allow models to better train and generalize. Recent work has moved away from the original "one word, one embedding" paradigm to investigate contextualized embedding models (Peters et al., 2017, 2018; Akbik et al., 2018). Such approaches produce different embeddings for the same word depending on its context and are thus capable of capturing latent contextualized semantics of ambiguous words.

Recently, Akbik et al. (2018) proposed a character-level contextualized embedding approach they refer to as contextual string embeddings. They leverage pre-trained character-level language models from which they extract hidden states at the beginning and end character positions of each word to produce embeddings for any string of characters in a sentential context. They showed these embeddings to yield state-of-the-art results when utilized in sequence labeling tasks such as named entity recognition (NER) or part-of-speech (PoS) tagging.

Underspecified contexts. However, such contextualized character-level models suffer from an inherent weakness when encountering rare words in an underspecified context. Consider the example text segment shown in Figure 1: "Fung Permadi (Taiwan) v Indra", from the English CoNLL-03 test data split (Tjong Kim Sang and De Meulder, 2003). If we consider the word "Indra" to be rare (meaning no prior occurrence in the corpus used to generate word embeddings), the underspecified context allows this word to be interpreted as either a person or an organization. This leads to an underspecified embedding that ultimately causes an incorrect classification of "Indra" as an organization in a downstream NER task.

Pooled Contextual Embeddings. In this paper, we present a simple but effective approach to address this issue. We intuit that entities are normally only used in underspecified contexts if they are expected to be known to the reader. That is, they are either more clearly introduced in an earlier sentence, or part of general in-domain knowledge a reader is expected to have.

Indeed, the string "Indra" in the CoNLL-03 data also occurs in the earlier sentence "Indra Wijaya (Indonesia) beat Ong Ewe Hock". Based on this, we propose an approach in which we dynamically aggregate contextualized embeddings of each unique string that we encounter as we process a dataset. We then use a pooling operation to distill a global word representation from all contextualized instances, which we use in combination with the current contextualized representation as a new word embedding. Our approach thus produces evolving word representations that change over time as more instances of the same word are observed in the data.

We evaluate our proposed embedding approach on the task of named entity recognition on the CoNLL-03 (English, German and Dutch) and WNUT datasets. In all cases, we find that our approach outperforms previous approaches and yields new state-of-the-art scores. We contribute our approach and all pre-trained models to the open source FLAIR framework (Akbik et al., 2019)[1], to ensure reproducibility of these results.

[1] https://github.com/zalandoresearch/flair

2   Method

Our proposed approach (see Figure 2) dynamically builds up a "memory" of contextualized embeddings and applies a pooling operation to distill a global contextualized embedding for each word. It requires an embed() function that produces a contextualized embedding for a given word in a sentence context (see Akbik et al. (2018)). It also requires a memory that records for each unique word all previous contextual embeddings, and a pool() operation to pool embedding vectors.

Figure 2: Example of how we generate our proposed embedding (emb_proposed) for the word "Indra" in the example text segment "Fung Permadi v Indra". Using the character language model, we extract a contextual string embedding (emb_context) for this word and retrieve from the memory all embeddings that were produced for this string in previous sentences (in the illustration, "Indra Wijaya beat Ong Ewe" and "And Indra said that..."). We pool and concatenate all local contextualized embeddings to produce the final embedding.

This is illustrated in Algorithm 1: to embed a word (in a sentential context), we first call the embed() function (line 2) and add the resulting embedding to the memory for this word (line 3). We then call the pooling operation over all contextualized embeddings for this word in the memory (line 4) to compute the pooled contextualized embedding. Finally, we concatenate the original contextual embedding together with the pooled representation, to ensure that both local and global interpretations are represented (line 5). This means that the resulting pooled contextualized embedding has twice the dimensionality of the original embedding.

Algorithm 1 Compute pooled embedding
Input: sentence, memory
1: for word in sentence do
2:   emb_context ← embed(word) within sentence
3:   add emb_context to memory[word]
4:   emb_pooled ← pool(memory[word])
5:   word.embedding ← concat(emb_pooled, emb_context)
6: end for

Crucially, our approach expands the memory each time we embed a word. Therefore, the same word in the same context may have different embeddings over time as the memory is built up.

Pooling operations. We experiment with different pooling operations. Per default, we use mean pooling to average a word's contextualized embedding vectors; we also experiment with min and max pooling to compute a vector consisting of all element-wise minimum or maximum values.
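To make Algorithm 1 concrete, the following is a minimal Python sketch of the memory-and-pooling logic, assuming a generic embed_fn(word, sentence) that returns a contextualized vector (e.g. a contextual string embedding); the class and function names here are illustrative, not the released implementation.

from collections import defaultdict

import numpy as np


class PooledEmbedder:
    """Sketch of Algorithm 1: pool all contextualized embeddings seen so far
    for each unique word and concatenate the pool with the current embedding."""

    def __init__(self, embed_fn, pooling="mean"):
        self.embed_fn = embed_fn          # embed_fn(word, sentence) -> 1-D np.ndarray
        self.pooling = pooling            # "mean", "min" or "max"
        self.memory = defaultdict(list)   # word string -> list of contextual embeddings

    def _pool(self, vectors):
        stacked = np.stack(vectors)
        if self.pooling == "mean":
            return stacked.mean(axis=0)
        if self.pooling == "min":
            return stacked.min(axis=0)    # element-wise minimum
        return stacked.max(axis=0)        # element-wise maximum

    def embed_sentence(self, sentence):
        pooled_embeddings = []
        for word in sentence:                                  # line 1
            emb_context = self.embed_fn(word, sentence)        # line 2
            self.memory[word].append(emb_context)              # line 3
            emb_pooled = self._pool(self.memory[word])         # line 4
            pooled_embeddings.append(
                np.concatenate([emb_pooled, emb_context]))     # line 5: twice the dimensionality
        return pooled_embeddings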
Approach                                        CoNLL-03 EN     CoNLL-03 DE     CoNLL-03 NL     WNUT-17
Pooled Contextualized Embeddings (min)          93.18 ± 0.09    88.27 ± 0.30    90.12 ± 0.14    49.07 ± 0.31
Pooled Contextualized Embeddings (max)          93.13 ± 0.09    88.05 ± 0.25    90.26 ± 0.10    49.05 ± 0.26
Pooled Contextualized Embeddings (mean)         93.10 ± 0.11    87.69 ± 0.27    90.44 ± 0.20    49.59 ± 0.41
Contextual String Emb. (Akbik et al., 2018)     92.86 ± 0.08    87.41 ± 0.13    90.16 ± 0.26    49.49 ± 0.75
best published:
BERT (Devlin et al., 2018)†                     92.8            -               -               -
CVT+Multitask (Clark et al., 2018)†             92.6            -               -               -
ELMo (Peters et al., 2018)†                     92.22           -               -               -
Stacked Multitask (Aguilar et al., 2018)†       -               -               -               45.55
Character-LSTM (Lample et al., 2016)†           90.94           78.76           81.74           -

Table 1: Comparative evaluation of the proposed approach with different pooling operations (min, max, mean) against current state-of-the-art approaches on four named entity recognition tasks († indicates reported numbers). The numbers indicate that our approach outperforms all other approaches on the CoNLL datasets and matches baseline results on WNUT.

Training downstream models. When training downstream task models (such as for NER), we typically make many passes over the training data. As Algorithm 2 shows, we reset the memory at the beginning of each pass over the training data (line 2), so that it is built up from scratch at each epoch.

Algorithm 2 Training
1: for epoch in epochs do
2:   memory ← map of word to list
3:   train and evaluate as usual
4: end for

This approach ensures that the downstream task model learns to leverage pooled embeddings that are built up (i.e. evolve) over time. It also ensures that pooled embeddings during training are only computed over training data. After training (i.e. during NER prediction), we do not reset embeddings and instead allow our approach to keep expanding the memory and evolving the embeddings.
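A minimal sketch of this training regime follows, reusing the illustrative PooledEmbedder from the earlier sketch; train_one_epoch and evaluate are hypothetical stand-ins for the usual downstream training and evaluation code, not part of the released implementation.

def train_ner_model(model, embedder, train_data, dev_data, epochs,
                    train_one_epoch, evaluate):
    """Sketch of Algorithm 2: the embedding memory is rebuilt from scratch at
    every epoch during training and never reset afterwards."""
    for epoch in range(epochs):                         # line 1
        embedder.memory.clear()                         # line 2: reset memory to an empty map
        train_one_epoch(model, embedder, train_data)    # line 3: train as usual ...
        evaluate(model, embedder, dev_data)             # ... and evaluate as usual
    # After training (i.e. during prediction) the memory keeps growing,
    # so the pooled embeddings continue to evolve as more data is processed.
    return model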
                                                                  final performance.
3        Experiments                                              Standard word embeddings. The default setup
                                                                  of Akbik et al. (2018) recommends contextual
We verify our proposed approach in four named                     string embeddings to be used in combination with
entity recognition (NER) tasks: We use the En-                    standard word embeddings. We use G LOV E em-
glish, German and Dutch evaluation setups of the                  beddings (Pennington et al., 2014) for the English
C O NLL-03 shared task (Tjong Kim Sang and                        tasks and FAST T EXT embeddings (Bojanowski
De Meulder, 2003) to evaluate our approach on                     et al., 2017) for all newswire tasks.
classic newswire data, and the WNUT-17 task on                    Baselines. Our baseline are contextual string em-
emerging entity detection (Derczynski et al., 2017)               beddings without pooling, i.e. the original setup
to evaluate our approach in a noisy user-generated                proposed in Akbik et al. (2018)2 . By compar-
data setting with few repeated entity mentions.                   ing against this baseline, we isolate the impact of
3.1       Experimental Setup                                      our proposed pooled contextualized embeddings.
We use the open source F LAIR framework in                           2
                                                                       Our reproduced numbers are slightly lower than we re-
all our experiments. It implements the stan-                      ported in Akbik et al. (2018) where we used the official
                                                                  C O NLL-03 evaluation script over BILOES tagged entities.
dard BiLSTM-CRF sequence labeling architec-                       This introduced errors since this script was not designed for
ture (Huang et al., 2015) and includes pre-trained                S-tagged entities.
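For orientation, the setup described above might look roughly as follows in FLAIR. This is a hedged sketch rather than the authors' training script; the corpus loader, class names and argument names (e.g. CONLL_03, PooledFlairEmbeddings and its pooling argument) may differ between FLAIR releases.

from flair.datasets import CONLL_03
from flair.embeddings import PooledFlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03()                 # assumes the CoNLL-03 data is available locally
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

# Stack classic GloVe word embeddings with pooled contextual string embeddings
embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),
    PooledFlairEmbeddings("news-forward", pooling="min"),
    PooledFlairEmbeddings("news-backward", pooling="min"),
])

# Standard BiLSTM-CRF sequence labeler with 256 hidden states and one layer
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner",
                        use_crf=True)

trainer = ModelTrainer(tagger, corpus)
trainer.train("resources/taggers/pooled-ner",
              learning_rate=0.1,      # selected from {0.01, 0.05, 0.1}
              mini_batch_size=32,     # selected from {8, 16, 32}
              anneal_factor=0.5,      # halve the learning rate ...
              patience=3,             # ... after 3 epochs without improvement
              max_epochs=150)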
Approach                                    CoNLL-03 EN     CoNLL-03 DE     CoNLL-03 NL     WNUT-17
Pooled Contextualized Embeddings (only)     92.42 ± 0.07    86.21 ± 0.07    88.25 ± 0.11    44.29 ± 0.59
Contextual String Embeddings (only)         91.81 ± 0.12    85.25 ± 0.21    86.71 ± 0.12    43.43 ± 0.93

Table 2: Ablation experiment using contextual string embeddings without any classic word embeddings. We find a more significant impact of pooling on evaluation numbers across all datasets, illustrating the need for capturing global in addition to contextualized semantics.

In addition, we list the best reported numbers for the four tasks. This includes the recent BERT approach using bidirectional transformers by Devlin et al. (2018), the semi-supervised multitask learning approach by Clark et al. (2018), the ELMo word-level language modeling approach by Peters et al. (2018), and the best published numbers for WNUT-17 (Aguilar et al., 2018) and German and Dutch CoNLL-03 (Lample et al., 2016).

3.2   Results

Our experimental results are summarized in Table 1 for each of the four tasks.

New state-of-the-art scores. We find that our approach outperforms all previously published results, raising the state-of-the-art for CoNLL-03 on English to 93.18 F1-score (↑0.32 pp vs. previous best), German to 88.27 (↑0.86 pp) and Dutch to 90.44 (↑0.28 pp). The consistent improvements against the contextual string embeddings baseline indicate that our approach is generally a viable option for embedding entities in sequence labeling.

Less pronounced impact on WNUT-17. However, we find no significant improvements on the WNUT-17 task on emerging entities. Depending on the pooling operation, we find results comparable to the baseline. This result is expected since most entities appear only a few times in this dataset, giving our approach little evidence to aggregate and pool. Nevertheless, since recent work has not yet experimented with contextual embeddings on WNUT, as a side result we report a new state-of-the-art of 49.59 F1 vs. the previous best reported number of 45.55 (Aguilar et al., 2018).

Pooling operations. Comparing the pooling operations discussed in Section 2, we generally find similar results. As Table 1 shows, min pooling performs best for English and German CoNLL, while mean pooling is best for Dutch and WNUT.

3.3   Ablation: Character Embeddings Only

To better isolate the impact of our proposed approach, we run experiments in which we do not use any classic word embeddings, but rather rely solely on contextual string embeddings. As Table 2 shows, we observe more pronounced improvements of pooling vis-à-vis the baseline approach in this setup. This indicates that pooled contextualized embeddings capture global semantics similar in nature to classical word embeddings.

4   Discussion and Conclusion

We presented a simple but effective approach that addresses the problem of embedding rare strings in underspecified contexts. Our experimental evaluation shows that this approach improves the state-of-the-art across named entity recognition tasks, enabling us to report new state-of-the-art scores for CoNLL-03 NER and WNUT emerging entity detection. These results indicate that our embedding approach is well suited for NER.

Evolving embeddings. Our dynamic aggregation approach means that embeddings for the same words will change over time, even when used in exactly the same contexts. Assuming that entity names are more often used in well-specified contexts, their pooled embeddings will improve as more data is processed. The embedding model thus continues to "learn" from data even after the training of the downstream NER model is complete and it is used in prediction mode. We consider this idea of constantly evolving representations a very promising research direction.

Future work. Our pooling operation makes the conceptual simplification that all previous instances of a word are equally important. However, we may find more recent mentions of a word, such as mentions within the same document or news cycle, to be more important for creating embeddings than mentions that belong to other documents or news cycles. Future work will therefore examine methods to learn weighted poolings of previous mentions. We will also investigate the applicability of our proposed embeddings to tasks besides NER.

Public release. We contribute our code to the FLAIR framework[3]. This allows full reproduction of all experiments presented in this paper, and allows the research community to use our embeddings for training downstream task models.

[3] The proposed embedding is added to FLAIR in release 0.4.1 as the PooledFlairEmbeddings class (see Akbik et al. (2019) for more details).
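To illustrate the evolving-embedding behaviour described above, the following hedged sketch embeds the same underspecified sentence twice with the PooledFlairEmbeddings class named in footnote 3; argument names may vary between FLAIR releases, and the printed distance is only meant to show that the vector for "Indra" changes as the memory grows.

import torch

from flair.data import Sentence
from flair.embeddings import PooledFlairEmbeddings

# Pooled contextual string embeddings with mean pooling
embedding = PooledFlairEmbeddings("news-forward", pooling="mean")

def indra_vector(text):
    sentence = Sentence(text)
    embedding.embed(sentence)
    # return a copy of the vector produced for the token "Indra"
    return next(token.embedding.clone() for token in sentence if token.text == "Indra")

# First occurrence: the memory for "Indra" holds a single contextual instance
v1 = indra_vector("Fung Permadi ( Taiwan ) v Indra")

# A better-specified mention expands the memory for "Indra" ...
indra_vector("Indra Wijaya ( Indonesia ) beat Ong Ewe Hock")

# ... so re-embedding the original underspecified sentence yields a different vector
v2 = indra_vector("Fung Permadi ( Taiwan ) v Indra")
print(torch.dist(v1, v2))   # non-zero distance: the embedding has evolved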
Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 732328 ("FashionBrain").

References

Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Mona Diab, Julia Hirschberg, and Thamar Solorio. 2018. Named entity recognition on code-switched data: Overview of the CALCS 2018 shared task. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 138–147.

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In NAACL 2019, Annual Conference of the North American Chapter of the Association for Computational Linguistics: System Demonstrations.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pages 1638–1649.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc V. Le. 2018. Semi-supervised sequence modeling with cross-view training. In EMNLP.

Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756–1765, Vancouver, Canada. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 338–348, Copenhagen, Denmark.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 142–147. Association for Computational Linguistics.