
                           CAMeL Tools: An Open Source Python Toolkit
                             for Arabic Natural Language Processing

               Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima Taji, Mai Oudah,
             Bashar Alhafni, Go Inoue, Fadhl Eryani, Alexander Erdmann, Nizar Habash
                          Computational Approaches to Modeling Language (CAMeL) Lab
                                     New York University Abu Dhabi, UAE
                    {oobeid,nasser.zalmout,salamkhalifa,dima.taji,mai.oudah,
                   alhafni,go.inoue,fadhl.eryani,ae1541,nizar.habash}@nyu.edu

                                                               Abstract
We present CAMeL Tools, a collection of open-source tools for Arabic natural language processing in Python. CAMeL Tools currently
provides utilities for pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. In
this paper, we describe the design of CAMeL Tools and the functionalities it provides.

Keywords: Arabic, Open Source, Morphology, Named Entity Recognition, Sentiment Analysis, Dialect Identification

1. Introduction

Over the last two decades, there have been many efforts to develop resources to support Arabic natural language processing (NLP). Some of these resources target specific NLP tasks such as tokenization, diacritization or sentiment analysis, while others attempt to target multiple tasks jointly. Most resources focus on a specific variety of Arabic such as Modern Standard Arabic (MSA) or Egyptian Arabic. These resources also vary in terms of the programming languages they use, the types of interfaces they provide, the data representations and standards they utilize, and the degree of public availability (e.g., open-source, commercial, unavailable). These variations make it difficult to write new applications that combine multiple resources.

To address many of these issues, we present CAMeL Tools, an open-source Python toolkit that supports Arabic and Arabic dialect pre-processing, morphological modeling, dialect identification, named entity recognition and sentiment analysis. CAMeL Tools provides command-line interfaces (CLIs) and application programming interfaces (APIs) covering these utilities.

The rest of the paper is organized as follows. We present some background on the difficulty of processing Arabic text (Section 2), and then discuss previous work on a variety of Arabic NLP tasks (Section 3). We describe the design and implementation of CAMeL Tools (Section 4) and provide more details about each of its components (Sections 6 to 9). Finally, we discuss some future additions to CAMeL Tools (Section 10).

2. Arabic Linguistics Background

Aside from obvious issues such as resource poverty, Arabic poses a number of challenges to NLP: orthographic ambiguity, morphological richness, dialectal variations, and orthographic noise (Habash, 2010). While these issues are not necessarily unique to Arabic, their combination makes Arabic processing particularly challenging.

Orthographic Ambiguity  Arabic is generally written using the Arabic Abjad script, which uses optional diacritical marks for short vowels and consonantal gemination. While the missing diacritics are not a major challenge to literate native adults, their absence is the main source of ambiguity in Arabic NLP.

Morphological Richness  Arabic has a rich inflectional morphology system involving gender, number, person, aspect, mood, case, state and voice, in addition to the cliticization of a number of pronouns and particles (conjunctions, prepositions, the definite article, etc.). This richness leads to MSA verbs with upwards of 5,400 forms.

Dialectal Variation  While MSA is the official language of culture and education in the Arab World, it is no one's native tongue. A number of different local dialects (such as Egyptian, Levantine, and Gulf) are the de facto daily languages of the Arab World, both off-line and on-line. Arabic dialects differ significantly in terms of their phonology, morphology and lexicon from each other and from MSA, to the point that using MSA tools and resources for processing dialects is not sufficient. For example, Khalifa et al. (2016a) report that using a state-of-the-art tool for MSA morphological disambiguation on Gulf Arabic returns a POS tag accuracy of about 72%, compared to 96% on MSA (Pasha et al., 2014).

Orthographic Inconsistency  MSA and Arabic dialects, as encountered online, show a lot of spelling inconsistency. Zaghouani et al. (2014) report that 32% of words in MSA comments online have spelling errors. Habash et al. (2018) presented 27 encountered ways to write an Egyptian Arabic word meaning 'he does not say it', e.g., mbyqwlhAš[1] (≈ 26,000 times), mbyqlhAš (≈ 1,000), and mbŵlhAš (less than 10). Habash et al. (2018) proposed a conventional orthography for dialectal Arabic (CODA), a set of guidelines for consistent spelling of Arabic dialects for NLP. In addition, dialectal Arabic text is also known to appear on social media in a non-standard romanization called Arabizi (Darwish, 2014).

[1] Arabic transliteration is presented in the Habash-Soudi-Buckwalter scheme (Habash et al., 2007).

(a) Input (HSB transliteration):  AlHmd llh ǍyTAlyA xrjt mn tSfyAt kÂs AlςAlm, Aðn dý frStnA w mAln___Aš Hjh̄ ...
(b) English:        Thank God Italy is out of the world cup, I think this is our chance and we do not have an excuse
(c) Preprocessing:  AlHmd llh AyTAlyA xrjt mn tSfyAt kAs AlEAlm , AZn dy frStnA wmAlnA$ Hjh ...
(d) Morphological Analysis of frStnA:
                        Analysis 1       Analysis 2        Analysis 3      ...
    Diac                furSitnA         faraS∼atnA        furSatunA
    Gloss               Opportunity      Compress          Holiday
    Lemma               furSah̄           raS∼              furSah̄
    D3 Tokenization     furSah̄ +nA       fa+ raS∼at +nA    furSah̄u +nA
    POS                 noun             verb              noun
    Dialect             Egyptian         MSA               MSA
(e) Disambiguation:  ✓ (selected analysis)
(f) Sentiment:       positive
(g) NER:             LOC: ǍyTAlyA 'Italy'
(h) Dialect ID:      Cairo: 95.0%, Doha: 1.9%, Jeddah: 1.5%, Aswan: 0.8%, Khartoum: 0.3%, Other: 0.5%, MSA: 0.0%

Table 1: An example showing the outputs of the different CAMeL Tools components

The above phenomena lead to highly complex readings that vary across many dimensions. Table 1(d) shows three analyses with different lemma readings and features for the word frStnA. Two of the three readings have the same POS (noun) but vary in terms of the English gloss and dialect. The third reading, a verb, has a different meaning, morphological analysis and tokenization.

3. Previous Work

Arabic NLP, and NLP in general, involves many tasks, some of which may be related to others. Tasks like morphological modeling, parsing, tokenization, segmentation, lemmatization, stemming and named entity recognition (NER) tend to serve as building blocks for higher level applications such as translation, spelling correction, and question-answering systems. Other tasks, such as dialect identification (DID) and sentiment analysis (SA), aim to compute different characteristics of input text, either for analytical purposes or as additional input for some of the previously mentioned tasks.

In this section we discuss those tasks currently addressed by CAMeL Tools. We also discuss some of the software suites that have inspired the design of CAMeL Tools.

3.1. NLP Tasks

We briefly survey the different efforts for a variety of Arabic NLP tasks. Specifically, we discuss those tasks that are relevant to CAMeL Tools at the time of publishing. These include morphological modeling, NER, DID and SA.

Morphological Modeling  Efforts on Arabic morphological modeling have varied over a number of dimensions, ranging from very abstract and linguistically rich representations paired with word-form derivation rules (Beesley et al., 1989; Beesley, 1996; Habash and Rambow, 2006; Smrž, 2007) to pre-compiled representations of the different components needed for morphological analysis (Buckwalter, 2002; Graff et al., 2009; Habash, 2007). Previous efforts also show a wide range in the depth of information that morphological analyzers can produce, from very shallow analyzers on one end (Vanni and Zajac, 1996) to analyzers producing form-based, functional, and lexical features as well as morpheme forms (Buckwalter, 2002; Smrž, 2007; Boudlal et al., 2010; Alkuhlani and Habash, 2011; Boudchiche et al., 2017).

Systems such as MADA (Habash et al., 2009) and MADAMIRA (Pasha et al., 2014) disambiguate the analyses that are produced by a morphological analyzer. Zalmout and Habash (2017) outline a neural model that follows the same approach. Alternatively, Farasa (Abdelali et al., 2016) relies on probabilistic models to produce high tokenization accuracy, and YAMAMA (Khalifa et al., 2016b) uses the same approach to produce MADAMIRA-like disambiguated analyses. For Arabic dialects, Zalmout et al. (2018) present a neural system that does morphological tagging and disambiguation for Egyptian Arabic.

Named Entity Recognition  While some Arabic NER systems have used rule-based approaches (Shaalan and Raza, 2009; Zaghouani, 2012; Aboaoga and Aziz, 2013), most recent Arabic NER systems integrate machine learning in their architectures (Benajiba and Rosso, 2008; AbdelRahman et al., 2010; Mohammed and Omar, 2012; Oudah and Shaalan, 2012; Darwish, 2013; El Bazi and Laachfoubi, 2019). Benajiba and Rosso (2007) developed ANERsys, one of the earliest NER systems for Arabic. They built their own linguistic resources, which have become standard in the Arabic NER literature: ANERcorp (an annotated corpus of Person, Location and Organization names) and ANERgazet (Person, Location and Organization gazetteers). In CAMeL Tools, we make use of all publicly available training data and compare our system to the work of Benajiba and Rosso (2008).

Dialect Identification  Salameh et al. (2018) introduced a fine-grained DID system covering the dialects of 25 cities from several countries across the Arab world. Elfardy and Diab (2012) presented a set of guidelines for token-level identification of dialectness. They later proposed a supervised approach for identifying whether a given sentence is prevalently MSA or Egyptian (Elfardy and Diab, 2013) using the Arabic Online Commentary Dataset (Zaidan and

                                       CAMeL Tools        MADAMIRA           Stanford CoreNLP             Farasa
Suite Type                             Arabic Specific    Arabic Specific    Multilingual                 Arabic Specific
Language                               Python             Java               Java with Python bindings    Java
CLI                                    ✓                  ✓                  ✓                            ✓
API                                    ✓                  ✓                  ✓                            ✓
Exposed Pre-processing                 ✓
Morphological Modeling                 ✓                  ✓
Morphological Disambiguation           ✓                  ✓                  ✓                            ✓
POS Tagging                            ✓                  ✓                  ✓                            ✓
Diacritization                         ✓                  ✓                                               ✓
Tokenization/Segmentation/Stemming     ✓                  ✓                  ✓                            ✓
Lemmatization                          ✓                  ✓                                               ✓
Named Entity Recognition               ✓                  ✓                  ✓                            ✓
Sentiment Analysis                     ✓
Dialect ID                             ✓
Parsing                                Work in progress                      ✓                            ✓

Table 2: Feature comparison of CAMeL Tools, MADAMIRA, Stanford CoreNLP and Farasa.

Callison-Burch, 2011). Their system (Elfardy and Diab, 2012) combines a token-level DID approach with other features to train a Naive Bayes classifier. Sadat et al. (2014) presented a bi-gram character-level model to identify the dialect of sentences in the social media context among dialects of 18 Arab countries. More recently, discriminating between Arabic dialects has been the goal of a dedicated shared task (Zampieri et al., 2017), encouraging researchers to submit systems to recognize the dialect of speech transcripts along with acoustic features for dialects of four main regions: Egyptian, Gulf, Levantine and North African, in addition to MSA. The dataset used in these tasks is different from the dataset we use in this work in its genre, size and the dialects covered.

Sentiment Analysis  Arabic SA is a well studied problem that has many proposed solutions. These solutions span a wide array of methods, such as developing lexicon-based conventional machine learning models (Badaro et al., 2014; Abdul-Mageed and Diab, 2014) or deep learning models (Abu Farha and Magdy, 2019; Badaro et al., 2018; Baly et al., 2017). More recently, fine-tuning large pre-trained language models has achieved state-of-the-art results on various NLP tasks. ElJundi et al. (2019) developed a universal language model for Arabic (hULMonA), which was pre-trained on a large corpus of Wikipedia sentences, and compared its performance to multilingual BERT (mBERT) (Devlin et al., 2018) on the task of Arabic SA after fine-tuning. They have shown that although mBERT was only trained on Modern Standard Arabic (MSA), it can still achieve state-of-the-art results on some datasets if it is fine-tuned on dialectal Arabic data. Furthermore, Antoun et al. (2020) pre-trained an Arabic-specific BERT (AraBERT) and were able to achieve state-of-the-art results on several downstream tasks.

3.2. Software Suites

While most efforts focus on individual tasks, there are few that provide multiple capabilities in the form of a unified toolkit. These can be classified as Arabic-specific, such as MADAMIRA (Pasha et al., 2014) and Farasa (Abdelali et al., 2016; Darwish and Mubarak, 2016), or multilingual, such as Stanford CoreNLP (Manning et al., 2014). Below we discuss the capabilities of the mentioned tools and provide a rough comparison in Table 2.

MADAMIRA provides a single, yet configurable, mode of operation revolving around its morphological analyzer. When disambiguation is enabled, features such as POS tagging, tokenization, segmentation, lemmatization, NER and base-phrase chunking become available. MADAMIRA was designed specifically for Arabic and supports both MSA and Egyptian; it primarily provides a CLI, a server mode, and a Java API.

Farasa is a collection of Java libraries and CLIs for MSA.[2] These include separate tools for diacritization, segmentation, POS tagging, parsing, and NER. This allows for a more flexible usage of components compared to MADAMIRA.

[2] http://qatsdemo.cloudapp.net/farasa/

Stanford CoreNLP is a multilingual Java library, CLI and server providing multiple NLP components with varying support for different languages. Arabic support is provided for parsing, tokenization, POS tagging, sentence splitting and NER (a rule-based system using regular expressions). An official Python library is also available (Qi et al., 2018), which provides official bindings to CoreNLP as well as neural implementations of some of their components. It is worth noting that the commonly used Natural Language Toolkit (NLTK) (Loper and Bird, 2002) provides Arabic support through unofficial bindings to Stanford CoreNLP.

4. Design and Implementation

CAMeL Tools is an open-source package consisting of a set of Python APIs with accompanying command-line tools that are thin wrappers around these APIs. In this section we discuss the design decisions behind CAMeL Tools and how it was implemented.

4.1. Motivation

There are two main issues with currently available tools that prompted us to develop CAMeL Tools and influenced its design:

Fragmented Packages and Standards  Tools tend to focus on one task with no easy way to glue together different packages. Some tools do provide APIs, making this easier, but others only provide command-line tools. This leads to overhead writing glue code, including interfaces to different packages and parsers for different output formats.

Lack of Flexibility  Most tools don't expose intermediate functionality. This makes it difficult to reuse intermediate output to reduce redundant computation. Additionally, functions for pre-processing input don't tend to be exposed to users, which makes it difficult to replicate any pre-processing performed internally.

4.2. Design Philosophy

Here we identify the driving principles behind the design of CAMeL Tools. These have been largely inspired by the designs of MADAMIRA, Farasa, CoreNLP and NLTK, and include our own requirements.

Arabic Specific  We want a toolkit that focuses solely on Arabic NLP. Language-agnostic tools don't tend to model the complexities of Arabic text very well, or indeed, other morphologically complex languages; doing so would increase software complexity and development time.

Flexibility  Provide flexible components to allow mixing and matching rather than providing large monolithic applications. We want CAMeL Tools to be usable both by application developers requiring only high-level access to components, and by researchers who might want to experiment with new implementations of components or their sub-components.

Modularity  Different implementations of the same component type should be slot-in replacements for one another. This allows for easier experimentation as well as allowing users to provide custom implementations as replacements for built-in components.

Performance  Components should provide close to state-of-the-art results within reasonable run-time performance and memory usage. However, we don't want this to come at the cost of code readability and maintainability, and we are willing to sacrifice some performance benefits to this end.

Ease of Use  Using individual components shouldn't take more than a few lines of code to get started with. We aim to provide meaningful default configurations that are generally applicable in most situations.

4.3. Implementation

CAMeL Tools is implemented in Python 3 and can be installed through pip.[3] We chose Python due to its ease of use and its pervasiveness in NLP and machine learning, along with libraries such as TensorFlow, PyTorch, and scikit-learn. We aim to be compatible with Python version 3.5 and later running on Linux, MacOS, and Windows. CAMeL Tools is in continuous development, with new features being added and older ones being improved. As such, it is difficult to accurately report on the performance of each component. In this paper, we report the state of CAMeL Tools at the time of publishing; updates will be published on a dedicated web page.[4]

At the moment, CAMeL Tools provides utilities for pre-processing, morphological analysis and disambiguation, DID, NER and Sentiment Analysis. In the following sections, we discuss these components in more detail.

[3] Installation instructions and documentation can be found on https://github.com/CAMeL-Lab/camel_tools
[4] https://camel-lab.github.io/camel_tools_updates/

5. Pre-processing Utilities

CAMeL Tools provides a set of pre-processing utilities that are common in Arabic NLP but get re-implemented frequently. Different tools and packages tend to do slightly different pre-processing steps that are often not well documented or exposed for use on their own. By providing these utilities as part of the package, we hope to reduce the overhead of writing Arabic NLP applications and ensure that pre-processing is consistent from one project to the other. In this section, we discuss the different pre-processing utilities that come with CAMeL Tools.

Transliteration  When working with Arabic text, it is sometimes convenient or even necessary to use alternate transliteration schemes. Buckwalter transliteration (Buckwalter, 2002) and its variants, Safe Buckwalter and XML Buckwalter, for example, map the Arabic character set into ASCII representations using a one-to-one map. These transliterations are used as input or output formats in Arabic NLP tools such as MADAMIRA and SAMA, and as the chosen encoding of various corpora such as the SAMA databases and the Columbia Arabic Treebank (CATiB) (Habash and Roth, 2009). Habash-Soudi-Buckwalter (HSB) (Habash et al., 2007) is another variant of the Buckwalter transliteration scheme that includes some non-ASCII characters whose pronunciation is easier to remember for non-Arabic speakers. We provide transliteration functions to and from the following schemes: Arabic script, Buckwalter, Safe Buckwalter, XML Buckwalter, and Habash-Soudi-Buckwalter. Additionally, we provide utilities for users to define their own transliteration schemes.

Orthographic Normalization  Due to Arabic's complex morphology, it is necessary to normalize text in various ways in order to reduce noise and sparsity (Habash, 2010). The most common of these normalizations are:

• Unicode normalization, which includes breaking up combined sequences (e.g., the Lam-Alif ligature into Lam followed by Alif), converting character variant forms (such as positional presentation forms) to a single canonical form, and converting extensions of the Arabic character set used for Persian and Urdu to the closest Arabic character (e.g., the Persian Kaf to the Arabic Kaf).

• Dediacritization, which removes Arabic diacritics; these occur infrequently in Arabic text and tend to be considered noise. They include short vowels, shadda (the gemination marker), and the dagger alif (e.g., mudar∼isah̄u to mdrsh̄).

• Removal of unnecessary characters, including those with no phonetic value, such as the tatweel (kashida) character.

• Letter variant normalization for letters that are often misspelled. This includes normalizing all the forms of Hamzated Alif (Â, Ǎ, Ā) to bare Alif (A), the Alif-Maqsura (ý) to Ya (y), the Ta-Marbuta (h̄) to Ha (h), and the non-Alif forms of Hamza (ŵ, ŷ) to the Hamza letter (’).

Table 1(c) illustrates how these pre-processing utilities can be applied to an Arabic sentence. Specifically, we apply Arabic to Buckwalter transliteration, letter variant normalization, and tatweel removal.
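The following minimal sketch illustrates how a few of these pre-processing utilities can be combined. The module paths and function names shown (CharMapper, dediac_ar, and the normalize_* helpers) reflect our understanding of the released camel_tools package and may differ slightly between versions.

    # A minimal sketch of transliteration, dediacritization, and letter
    # variant normalization. Module and function names are assumed from the
    # released camel_tools package and may differ across versions.
    from camel_tools.utils.charmap import CharMapper
    from camel_tools.utils.dediac import dediac_ar
    from camel_tools.utils.normalize import (
        normalize_alef_ar,
        normalize_alef_maksura_ar,
        normalize_teh_marbuta_ar,
    )

    ar2bw = CharMapper.builtin_mapper('ar2bw')   # Arabic script -> Buckwalter

    def preprocess(text):
        """Dediacritize, normalize common letter variants, and transliterate."""
        text = dediac_ar(text)                   # remove diacritics
        text = normalize_alef_ar(text)           # Hamzated Alif forms -> bare Alif
        text = normalize_alef_maksura_ar(text)   # Alif-Maqsura -> Ya
        text = normalize_teh_marbuta_ar(text)    # Ta-Marbuta -> Ha
        return ar2bw.map_string(text)            # transliterate to Buckwalter

A user-defined transliteration scheme can be plugged in the same way by constructing a CharMapper from a custom character map.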

6. Morphological Modeling

Morphology is one of the most studied aspects of the Arabic language. The richness of Arabic makes the modeling of morphology a complex task, justifying the focus put on morphological analysis functionalities.

6.1. Analysis, Generation, and Reinflection

As part of CAMeL Tools, we currently provide implementations of the CALIMAStar analyzer, generator, and reinflector described by Taji et al. (2018). These components operate on databases that are in the same format as the ALMORGEANA (Habash, 2007) database, which contains three lexicon tables for prefixes, suffixes, and stems, and three compatibility tables for prefix-suffix, prefix-stem, and stem-suffix compatibility. CAMeL Tools includes databases for MSA as well as the Egyptian and Gulf dialects.

Analysis  For our tool's purposes, we define analysis as the identification of all the different readings (analyses) of a word out of context. Each of these readings is defined by lexical features such as the lemma, gloss, and stem, and morphological features such as POS tags, gender, number, case, and mood. The analyzer expects a word as input, and tries to match a prefix, stem, and suffix from the database lexicon tables to the surface form of the word, while maintaining the constraints imposed by the compatibility tables. This follows the algorithm described by Buckwalter (2002) with some enhancements. These enhancements, along with the switch of encoding to Arabic script, allow our analyzer to better detect and handle foreign words, punctuation, and digits.

Generation  Generation is the task of inflecting a lemma for a set of morphological features. The algorithm we use for generation follows the description of the generation component in ALMOR (Habash, 2007). The generator expects minimally a lemma and a POS tag, and produces a list of inflected words with their full analyses. If inflectional features, such as person, gender, and case, are specified in the input, we limit the generated output to those values; we generate all possible values for the inflectional features that are not specified. Clitics, being optional features, are only generated when they are specified.

Reinflection  Reinflection differs from generation in that the input to the reinflector is a fully inflected word, with a list of morphological features. The reinflector is not limited to a specific lemma or POS tag, but rather uses the analyzer component to generate all possible analyses of the input word, including the possible lemmas and POS tags. Next, the reinflector uses the list of features it had as input, along with the features from the analyzer, to create a new list of features that can be input to the generator to produce a reinflection for every possible analysis of the input word.

In Table 1(d), we present an example of the morphological analysis which CAMeL Tools can provide. The analysis includes, but is not limited to, diacritization, the CAMeL Arabic Phonetic Inventory (CAPHI) representation, English gloss, tokenization, lemmatization, and POS tagging.

6.2. Disambiguation and Annotation

Morphological disambiguation is the task of choosing an analysis of a word in context. Tasks such as diacritization, POS tagging, tokenization and lemmatization can be considered forms of disambiguation. When dealing with Arabic, however, each of these tasks on its own does not fully disambiguate a word. In order to avoid confusion, we refer to the individual tasks on their own as annotation tasks. Traditionally, annotation tasks would be implemented independently of each other. We, however, chose to use the one-fell-swoop approach of Habash and Rambow (2005). This approach entails predicting a set of features for a given word, ranking its out-of-context analyses based on the predicted features, and finally choosing the top ranked analysis. Annotation can then be achieved by retrieving the relevant feature from the ranked analysis.

We provide two built-in disambiguators: a simple low-cost disambiguator based on a maximum likelihood estimation (MLE) model, and a neural network disambiguator that provides improved disambiguation accuracy using multitask learning. Table 1(e) shows how these morphological disambiguators can disambiguate a word.

MLE Disambiguator  We built a simple disambiguation model based on YAMAMA (Khalifa et al., 2016b), where the main component is a lookup model of a word and its most probable full morphological analysis. For a word that does not appear in the lookup model, we use the morphological analyzer from Section 6.1 to generate all possible analyses for the word and choose the top ranked analysis based on a pre-computed probabilistic score for the joint lemma and POS frequency. Both the lookup model and the pre-computed scores are based on the same training dataset for the given dialect.

Multitask Learning Disambiguator  We provide a simplified implementation of the neural multitask learning approach to disambiguation by Zalmout and Habash (2019). Instead of the LSTM-based (long short-term memory) language models used in the original system, we use unigram language models without sacrificing much accuracy. This is done to increase run-time performance and decrease memory usage.
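As a usage illustration, the sketch below runs the analyzer on a single word and the MLE disambiguator on a tokenized sentence. The class and module names used here (MorphologyDB, Analyzer, MLEDisambiguator, simple_word_tokenize) are assumptions based on the released package and may have different names or locations in other versions.

    # A minimal sketch of out-of-context analysis followed by in-context
    # disambiguation. Class and module names are assumptions based on the
    # released camel_tools package and may differ across versions.
    from camel_tools.morphology.database import MorphologyDB
    from camel_tools.morphology.analyzer import Analyzer
    from camel_tools.disambig.mle import MLEDisambiguator
    from camel_tools.tokenizers.word import simple_word_tokenize

    # Out-of-context analysis: enumerate all readings of a single word.
    db = MorphologyDB.builtin_db()                  # default built-in database
    analyzer = Analyzer(db)
    for analysis in analyzer.analyze('فرصتنا'):      # frStnA, the word from Table 1
        print(analysis['diac'], analysis['lex'], analysis['pos'])

    # In-context disambiguation: rank the analyses of each word in a sentence.
    mle = MLEDisambiguator.pretrained()
    tokens = simple_word_tokenize('الحمد لله')       # AlHmd llh
    for disambig in mle.disambiguate(tokens):
        top = disambig.analyses[0].analysis          # top ranked analysis
        print(disambig.word, top['lex'], top['pos'])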

We evaluated the CAMeL Tools disambiguators against MADAMIRA.[5] For the accuracy evaluation we used the Dev sets recommended by Diab et al. (2013) of the Penn Arabic Treebank (PATB parts 1, 2 and 3) (Maamouri et al., 2004) for MSA and the ARZ Treebank (Maamouri et al., 2014) for EGY.

Table 3 provides the evaluation on MSA of MADAMIRA using the SAMA database (Graff et al., 2009), the CAMeL Tools multitask learning system,[6] and the CAMeL Tools MLE system. Table 4 provides the evaluation on EGY of MADAMIRA using the CALIMA ARZ database (Habash et al., 2012), following the work described by Pasha et al. (2015) and Khalifa et al. (2016b), and of the CAMeL Tools MLE system.

We also compare to Farasa's tokenizer (Abdelali et al., 2016): the accuracy of their system's ATB tokenization on the same MSA Dev set was 98%. However, this number is not fair to report for the sake of this evaluation because Farasa follows slightly different tokenization conventions. For example, the Ta (t) in words such as mdrsthA 'her school' is not reverted to its original Ta-Marbuta (h̄) form, unlike the convention we have adopted for ATB tokenization.

[5] MADAMIRA: Released on April 03, 2017, version 2.1.
[6] The implementation we report on is an older version that uses TensorFlow 1.8. We are currently working on a new implementation using TensorFlow 2.1.

MSA        MADAMIRA    CAMeL Tools Multitask    CAMeL Tools MLE
DIAC       87.7%       90.9%                    78.4%
LEX        96.4%       95.4%                    95.7%
POS        97.1%       97.2%                    95.5%
FULL       85.6%       89.0%                    70.0%
ATB TOK    99.0%       99.4%                    99.0%

Table 3: Comparison of the performance of the CAMeL Tools multitask learning and MLE systems on MSA to MADAMIRA. The systems are evaluated on their accuracy in correctly predicting diacritics (DIAC), lemmas (LEX), part-of-speech (POS), the full set of predicted features (FULL), and the ATB tokenization.

EGY        MADAMIRA    CAMeL Tools MLE
DIAC       82.8%       78.9%
LEX        86.6%       87.8%
POS        91.7%       91.8%
FULL       76.4%       73.0%
BW TOK     93.5%       92.8%

Table 4: Comparison of the performance of the CAMeL Tools MLE system on Egyptian to MADAMIRA. The systems are evaluated on their accuracy in correctly predicting diacritics (DIAC), lemmas (LEX), part-of-speech (POS), the full set of predicted features (FULL), and the Buckwalter tag tokenization (BW TOK) (Khalifa et al., 2016b).

7. Dialect Identification

We provide an implementation of the Dialect Identification (DID) system described by Salameh et al. (2018). This system is able to distinguish between 25 Arabic city dialects and MSA. Table 1(h) shows an example of our system's output.

Approach  We train a Multinomial Naive Bayes (MNB) classifier that outputs 26 probability scores referring to the 25 cities and MSA. The model is fed with a suite of features covering word unigrams and character unigrams, bigrams and trigrams weighted by their Term Frequency-Inverse Document Frequency (TF-IDF) scores, combined with word-level and character-level language model scores for each dialect. We also train a secondary MNB model using the same setup that outputs six probability scores referring to the five city dialects and MSA for which the MADAR corpus provides additional training data. The output of the secondary model is used as additional features for the primary model.

Datasets  We train our system using the MADAR corpus (Bouamor et al., 2018), a large-scale collection of parallel sentences built to cover the dialects of 25 cities from the Arab World, in addition to MSA. It is composed of two sub-corpora: the first consisting of 1,600 training sentences and covering all 25 city dialects and MSA, and the other consisting of 9,000 training sentences and covering 5 city dialects and MSA. The first sub-corpus is used to train our primary model while the latter is used to train our secondary model. We use the full corpus to train our language models.

Experiments and Results  We evaluate our system using the test split of the MADAR corpus. Our system is able to distinguish between the 25 Arabic city dialects and MSA with an accuracy of 67.9% for sentences with an average length of 7 words, and with an accuracy of 90% for sentences consisting of 16 words. Our DID system is used in the back-end component of ADIDA (Obeid et al., 2019) to compute the dialect probabilities of a given input and display them as a point-map or a heat-map on top of a geographical map of the Arab world.
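To make the classification setup concrete, the following is a simplified sketch of the core DID approach described above (word and character n-gram TF-IDF features feeding a Multinomial Naive Bayes classifier), written with scikit-learn. It is not the CAMeL Tools implementation: it omits the language model scores and the secondary six-way model, and the data shown is a placeholder.

    # A simplified sketch of the DID approach: TF-IDF weighted word unigrams
    # and character 1-3 grams feeding a Multinomial Naive Bayes classifier.
    # Illustrative only; the actual system also uses language model scores
    # and a secondary 6-class model.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline, make_union

    train_sents = ['...']      # placeholder: MADAR training sentences
    train_labels = ['MSA']     # placeholder: one of the 25 city labels or MSA

    features = make_union(
        TfidfVectorizer(analyzer='word', ngram_range=(1, 1)),
        TfidfVectorizer(analyzer='char', ngram_range=(1, 3)),
    )
    classifier = make_pipeline(features, MultinomialNB())
    classifier.fit(train_sents, train_labels)

    # One probability per variety, as displayed in Table 1(h).
    probabilities = classifier.predict_proba(['...'])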

8. Named Entity Recognition

Our NER system targets the ENAMEX classes of Person, Location and Organization. We employ a light-weight machine learning approach and exploit large annotated datasets with the intention of being generalizable to different contexts.

Approach  We adopt the common IOB (inside, outside, beginning) NER tagging format, with a number of empirically determined features including morphological features and POS tags, in addition to word-level, gazetteer, and contextual features. Random Forests empirically outperformed all the other algorithms we explored.

Datasets  We train on a combination of Arabic NER annotated datasets, including ANERcorp (∼150K words) (Benajiba and Rosso, 2007) and the ACE Multilingual Training Corpora (2003-2005, 2007) (∼200K words) (Mitchell et al., 2003; Mitchell et al., 2005; Walker et al., 2006; Song et al., 2014). We also employ three sets of gazetteers totaling around 10K unique gazetteer entries, which were gathered from free online resources/tools and ANERgazet (Benajiba and Rosso, 2007).

Experiments and Results  We conducted a number of experiments to evaluate the performance of the proposed NER architecture, with Random Forests as the adopted ML algorithm. For training purposes, we use standard NE annotated datasets, including ANERcorp (∼150K words) and the ACE Multilingual Training Corpora (2003-2005, 2007)[7] (∼200K words). Each experiment includes a reference (i.e., gold-standard) dataset used for testing, including subsets of ANERcorp and ACE 2003 Newswire (NW).

Table 5 illustrates the results of training the model on 5/6 of ANERcorp and testing on the remaining 1/6, compared to the results of Benajiba and Rosso (2008)'s CRF-based system, which uses the same data split. The results show that our classifier significantly outperforms the CRF-based system in terms of average Precision, Recall and F-measure. Moreover, we train on the ACE corpora (w/o ACE 2003 NW) and ANERcorp and then test on ACE 2003 NW (∼28K words), and compare against the performance of the ANERcorp model, as illustrated in Table 6. The results show a dramatic improvement in performance.

[7] Available under licence from LDC with catalogue numbers LDC2004T09, LDC2005T09, LDC2006T06 and LDC2014T18, respectively.

             Per.   Loc.   Org.   Avg.    CRF-based Sys.
Precision    91%    90%    79%    87%     86%
Recall       93%    83%    74%    83%     69%
F-measure    92%    86%    76%    85%     76%

Table 5: The results of the proposed system when trained and tested on the ANERcorp dataset vs. the CRF-based system.

             Per.   Loc.   Org.   Avg.    ANERcorp Model
Precision    97%    94%    93%    95%     61%
Recall       95%    88%    81%    88%     40%
F-measure    96%    91%    86%    91%     47%

Table 6: The results of the proposed system when trained on the ACE corpora+ANERcorp datasets (w/o ACE 2003 NW) and tested on ACE 2003 NW.

9. Sentiment Analysis

We leverage large pre-trained language models to build an Arabic sentiment analyzer as part of CAMeL Tools. Our analyzer classifies an Arabic sentence as positive, negative, or neutral. Table 1(f) shows an example of how the CAMeL Tools sentiment analyzer takes an Arabic sentence as input and outputs its sentiment.

Approach  We used HuggingFace's Transformers (Wolf et al., 2019) to fine-tune multilingual BERT (mBERT) (Devlin et al., 2018) and AraBERT (Antoun et al., 2020) on the task of Arabic SA. The fine-tuning was done by adding a fully connected linear layer with a softmax activation function on top of the last hidden state. We report results on both fine-tuned models and we provide the best of the two as part of CAMeL Tools.

Datasets  To ensure that mBERT and AraBERT can generalize to classifying the sentiment of dialectal tweets, we used various datasets for fine-tuning and evaluation. The first dialectal dataset is the Arabic Speech-Act and Sentiment Corpus of Tweets (ArSAS) (Elmadany et al., 2018), where each tweet is annotated for positive, negative, neutral, or mixed sentiment. The second dataset is the Arabic Sentiment Tweets Dataset (ASTD) (Nabil et al., 2015), which is in both Egyptian Arabic and MSA; each tweet has a sentiment label of either positive, negative, neutral, or objective. The third dataset is the SemEval-2017 Task 4-A benchmark dataset (Rosenthal et al., 2017), which is in MSA; each tweet was annotated with a sentiment label of positive, negative, or neutral. Lastly, we used the Multi-Topic Corpus for Target-based Sentiment Analysis in Arabic Levantine Tweets (ArSenTD-Lev) (Baly et al., 2019), where each tweet was labeled as positive, very positive, neutral, negative, or very negative. All the tweets in the datasets were pre-processed using the utilities provided by CAMeL Tools to remove diacritics, URLs, and usernames.

Experiments and Results  We compare our results to Mazajak (Abu Farha and Magdy, 2019) and we tried to follow their approach in terms of evaluation and experimental setup. We discarded the objective tweets from the ASTD dataset and applied 80/20 random sampling to split the dataset into train/test sets. We also discarded the mixed class from the ArSAS dataset, kept the tweets with an annotation confidence of over 50%, and applied 80/20 random sampling to split the dataset into train/test sets. Additionally, we reduced the number of labels in the ArSenTD-Lev dataset to positive, negative, and neutral by turning the very positive labels to positive and the very negative labels to negative. Both mBERT and AraBERT were fine-tuned on ArSenTD-Lev and the train splits from SemEval, ASTD, and ArSAS (24,827 tweets) on a single GPU for 3 epochs with a learning rate of 3e-5, a batch size of 32, and a maximum sequence length of 128. For evaluation, we use the F1^PN score which was defined by SemEval-2017 and used by Mazajak; F1^PN is the macro F1 score over the positive and negative classes only, neglecting the neutral class. Table 7 shows our results compared to Mazajak.

           CAMeL Tools (AraBERT)   CAMeL Tools (mBERT)   Mazajak
ArSAS      92%                     89%                   90%
ASTD       73%                     66%                   72%
SemEval    69%                     60%                   63%

Table 7: CAMeL Tools sentiment analyzer performance using AraBERT and mBERT compared to Mazajak over three benchmark datasets.
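The following sketch shows the general shape of this fine-tuning setup with HuggingFace Transformers, using the public mBERT checkpoint and the hyperparameters reported above. The training loop is illustrative and deliberately simplified (a single toy batch, no data loading or evaluation); it is not the exact code used to build the released models.

    # A simplified sketch of the sentiment fine-tuning setup: a pre-trained
    # encoder with a linear classification head over the last hidden state,
    # trained with lr 3e-5, 3 epochs, and max sequence length 128.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL_NAME = 'bert-base-multilingual-cased'   # mBERT; AraBERT is loaded the same way

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=3)                 # positive / negative / neutral

    texts = ['...']                               # placeholder pre-processed tweets
    labels = torch.tensor([0])                    # placeholder label ids

    batch = tokenizer(texts, truncation=True, max_length=128,
                      padding=True, return_tensors='pt')
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    model.train()
    for _ in range(3):                            # 3 epochs over this toy batch
        output = model(**batch, labels=labels)
        output.loss.backward()
        optimizer.step()
        optimizer.zero_grad()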

10.    Conclusion and Future Work                            on Computational Natural Language Learning (CoNLL),
We presented CAMeL Tools, an open source set of tools               pages 30–38, Ann Arbor, Michigan.
for Arabic NLP providing utilities for pre-processing, mor-       Alkuhlani, S. and Habash, N. (2011). A Corpus for Mod-
phological modeling, dialect identification, Named Entity           eling Morpho-Syntactic Agreement in Arabic: Gender,
Recognition and Sentiment Analysis.                                 Number and Rationality. In Proceedings of the Confer-
CAMeL Tools is in a state of active development. We will            ence of the Association for Computational Linguistics
continue to add new API components and command-line                 (ACL), Portland, Oregon, USA.
tools while also improving existing functionality. Addition-      Antoun, W., Baly, F., and Hajj, H. (2020). Arabert:
ally, we will be adding more datasets and models to support         Transformer-based model for arabic language under-
more dialects. Some features we are currently working on            standing.
adding to CAMeL Tools include:                                    Badaro, G., Baly, R., Hajj, H., Habash, N., and El-Hajj, W.
                                                                    (2014). A large scale Arabic sentiment lexicon for Ara-
  • A transliterator to and from Arabic script and Arabizi          bic opinion mining. In Proceedings of the Workshop for
    inspired by Al-Badrashiny et al. (2014).                        Arabic Natural Language Processing (WANLP), pages
  • A spelling correction component supporting MSA and              165–173, Doha, Qatar.
    DA text inspired by the work of Watson et al. (2018)          Badaro, G., El Jundi, O., Khaddaj, A., Maarouf, A., Kain,
    and Eryani et al. (2020).                                       R., Hajj, H., and El-Hajj, W. (2018). EMA at SemEval-
                                                                    2018 task 1: Emotion mining for Arabic. In Proceedings
  • Additional morphological disambiguators for Arabic              of The 12th International Workshop on Semantic Eval-
    dialects, building on work by (Khalifa et al., 2020).           uation, pages 236–244, New Orleans, Louisiana, June.
                                                                    Association for Computational Linguistics.
  • A dependency parser based on the CAMeL Parser by
                                                                  Baly, R., Hajj, H., Habash, N., Shaban, K. B., and El-Hajj,
    Shahrour et al. (2016).
                                                                    W. (2017). A sentiment treebank and morphologically
  • More Dialect ID options including more cities and a             enriched recursive deep models for effective sentiment
    new model for country/region based identification.              analysis in arabic. ACM Transactions on Asian and Low-
                                                                    Resource Language Information Processing (TALLIP),
             Bibliographical References                             16(4):23.
                                                                  Baly, R., Khaddaj, A., Hajj, H., El-Hajj, W., and Sha-
Abdelali, A., Darwish, K., Durrani, N., and Mubarak, H. (2016). Farasa: A fast and furious segmenter for Arabic. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 11–16, San Diego, California.
Elmadany, A., Mubarak, H., and Magdy, W. (2018). ArSAS: An Arabic speech-act and sentiment corpus of tweets. In Hend Al-Khalifa, et al., editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France, May. European Language Resources Association (ELRA).
AbdelRahman, S., Elarnaoty, M., Magdy, M., and Fahmy, A. (2010). Integrated machine learning techniques for Arabic named entity recognition. International Journal of Computer Science Issues (IJCSI), 7:27–36.
Abdul-Mageed, M. and Diab, M. (2014). SANA: A large scale multi-genre, multi-dialect lexicon for Arabic subjectivity and sentiment analysis. In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 1162–1169, Reykjavik, Iceland.
Aboaoga, M. and Aziz, M. J. A. (2013). Arabic person names recognition by using a rule based approach. Journal of Computer Science, 9:922–927.
Abu Farha, I. and Magdy, W. (2019). Mazajak: An online Arabic sentiment analyser. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 192–198, Florence, Italy, August. Association for Computational Linguistics.
Al-Badrashiny, M., Eskander, R., Habash, N., and Rambow, O. (2014). Automatic transliteration of romanized Dialectal Arabic. In Proceedings of the Conference on Arabic Natural Language Processing (WANLP), pages 165–173, Doha, Qatar.
Badaro, G., El Jundi, O., Khaddaj, A., Maarouf, A., Kain, R., Hajj, H., and El-Hajj, W. (2018). EMA at SemEval-2018 task 1: Emotion mining for Arabic. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 236–244, New Orleans, Louisiana, June. Association for Computational Linguistics.
Baly, R., Hajj, H., Habash, N., Shaban, K. B., and El-Hajj, W. (2017). A sentiment treebank and morphologically enriched recursive deep models for effective sentiment analysis in Arabic. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 16(4):23.
Baly, R., Khaddaj, A., Hajj, H., El-Hajj, W., and Shaban, K. B. (2019). ArSenTD-Lev: A multi-topic corpus for target-based sentiment analysis in Arabic Levantine tweets.
Beesley, K., Buckwalter, T., and Newton, S. (1989). Two-Level Finite-State Analysis of Arabic Morphology. In Proceedings of the Seminar on Bilingual Computing in Arabic and English.
Beesley, K. (1996). Arabic Finite-State Morphological Analysis and Generation. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 89–94, Copenhagen, Denmark.
Benajiba, Y. and Rosso, P. (2007). ANERsys 2.0: Conquering the NER task for the Arabic language by combining the maximum entropy with POS-tag information. In Proceedings of the Workshop on Natural Language-Independent Engineering, 3rd Indian International Conference on Artificial Intelligence (IICAI-2007), pages 1814–1823.
Benajiba, Y. and Rosso, P. (2008). Arabic Named Entity Recognition using Conditional Random Fields. In Proceedings of the Workshop on HLT NLP within the Arabic World (LREC).
Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., Obeid, O., Khalifa, S., Eryani, F., Erdmann, A., and Oflazer, K. (2018). The MADAR Arabic Dialect Corpus and Lexicon. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.
Boudchiche, M., Mazroui, A., Bebah, M. O. A. O., Lakhouaja, A., and Boudlal, A. (2017). AlKhalil Morpho Sys 2: A robust Arabic morpho-syntactic analyzer. Journal of King Saud University - Computer and Information Sciences, 29(2):141–146.
Boudlal, A., Lakhouaja, A., Mazroui, A., Meziane, A., Bebah, M., and Shoul, M. (2010). Alkhalil Morpho Sys1: A morphosyntactic analysis system for Arabic texts. In Proceedings of the International Arab Conference on Information Technology, pages 1–6.
Buckwalter, T. (2002). Buckwalter Arabic morphological analyzer version 1.0. Linguistic Data Consortium (LDC) catalog number LDC2002L49, ISBN 1-58563-257-0.
Darwish, K. and Mubarak, H. (2016). Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the Language Resources and Evaluation Conference (LREC), Portorož, Slovenia.
Darwish, K. (2013). Named entity recognition using cross-lingual resources: Arabic as an example. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1558–1567.
Darwish, K. (2014). Arabizi Detection and Conversion to Arabic. In Proceedings of the Workshop for Arabic Natural Language Processing (WANLP), pages 217–224, Doha, Qatar.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
Diab, M., Habash, N., Rambow, O., and Roth, R. (2013). LDC Arabic treebanks and associated corpora: Data divisions manual. arXiv preprint arXiv:1309.5652.
El Bazi, I. and Laachfoubi, N. (2019). Arabic named entity recognition using deep learning approach. International Journal of Electrical and Computer Engineering, 9:2025–2032.
Elfardy, H. and Diab, M. (2012). Simplified guidelines for the creation of large scale dialectal Arabic annotations. In Proceedings of the Language Resources and Evaluation Conference (LREC), Istanbul, Turkey.
Elfardy, H. and Diab, M. (2013). Sentence Level Dialect Identification in Arabic. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), pages 456–461, Sofia, Bulgaria.
ElJundi, O., Antoun, W., El Droubi, N., Hajj, H., El-Hajj, W., and Shaban, K. (2019). hULMonA: The universal language model in Arabic. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 68–77, Florence, Italy, August. Association for Computational Linguistics.
Eryani, F., Habash, N., Bouamor, H., and Khalifa, S. (2020). A Spelling Correction Corpus for Multiple Arabic Dialects. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marseille, France.
Graff, D., Maamouri, M., Bouziri, B., Krouna, S., Kulick, S., and Buckwalter, T. (2009). Standard Arabic Morphological Analyzer (SAMA) Version 3.1. Linguistic Data Consortium LDC2009E73.
Habash, N. and Rambow, O. (2005). Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), pages 573–580, Ann Arbor, Michigan.
Habash, N. and Rambow, O. (2006). MAGEAD: A morphological analyzer and generator for the Arabic dialects. In Proceedings of the International Conference on Computational Linguistics and the Conference of the Association for Computational Linguistics (COLING-ACL), pages 681–688, Sydney, Australia.
Habash, N. and Roth, R. (2009). CATiB: The Columbia Arabic Treebank. In Proceedings of the Joint Conference of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 221–224, Suntec, Singapore.
Habash, N., Soudi, A., and Buckwalter, T. (2007). On Arabic Transliteration. In A. van den Bosch et al., editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods, pages 15–22. Springer, Netherlands.
Habash, N., Rambow, O., and Roth, R. (2009). MADA+TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Khalid Choukri et al., editors, Proceedings of the International Conference on Arabic Language Resources and Tools, Cairo, Egypt. The MEDAR Consortium.
Habash, N., Eskander, R., and Hawwari, A. (2012). A Morphological Analyzer for Egyptian Arabic. In Proceedings of the Workshop of the Special Interest Group on Computational Morphology and Phonology (SIGMORPHON), pages 1–9, Montréal, Canada.
Habash, N., Eryani, F., Khalifa, S., Rambow, O., Abdulrahim, D., Erdmann, A., Faraj, R., Zaghouani, W., Bouamor, H., Zalmout, N., Hassan, S., Alshargi, F., Alkhereyf, S., Abdulkareem, B., Eskander, R., Salameh, M., and Saddiki, H. (2018). Unified guidelines and resources for Arabic dialect orthography. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.
Habash, N. (2007). Arabic Morphological Representations for Machine Translation. In Antal van den Bosch et al., editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Kluwer/Springer.
Habash, N. Y. (2010). Introduction to Arabic natural language processing, volume 3. Morgan & Claypool Publishers.
Khalifa, S., Habash, N., Abdulrahim, D., and Hassan, S. (2016a). A Large Scale Corpus of Gulf Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), Portorož, Slovenia.
Khalifa, S., Zalmout, N., and Habash, N. (2016b). Yamama: Yet another multi-dialect Arabic morphological analyzer. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 223–227, Osaka, Japan.
Khalifa, S., Zalmout, N., and Habash, N. (2020). Morphological Disambiguation for Gulf Arabic: The Interplay between Resources and Methods. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marseille, France.
Loper, E. and Bird, S. (2002). NLTK: The natural language toolkit. In Proceedings of the Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 63–70, Philadelphia, Pennsylvania.
Maamouri, M., Bies, A., Buckwalter, T., and Mekki, W. (2004). The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. In Proceedings of the International Conference on Arabic Language Resources and Tools, pages 102–109, Cairo, Egypt.
Maamouri, M., Bies, A., Kulick, S., Ciul, M., Habash, N., and Eskander, R. (2014). Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology on Annotation and Tool Development. In Proceedings of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), pages 55–60.
Mitchell, A., Strassel, S., Przybocki, M., Davis, J., Doddington, G., and R, G. (2003). TIDES Extraction (ACE) 2003 multilingual training data. LDC2004T09. Linguistic Data Consortium.
Mitchell, A., Strassel, S., Huang, S., and Zakhary, R. (2005). ACE 2004 multilingual training corpus. LDC2005T09. Linguistic Data Consortium.
Mohammed, N. and Omar, N. (2012). Arabic Named Entity Recognition using artificial neural network. Journal of Computer Science, 8:1285–1293.
Nabil, M., Aly, M., and Atiya, A. (2015). ASTD: Arabic sentiment tweets dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2515–2519, Lisbon, Portugal, September. Association for Computational Linguistics.
Obeid, O., Salameh, M., Bouamor, H., and Habash, N. (2019). ADIDA: Automatic dialect identification for Arabic. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 6–11, Minneapolis, Minnesota, June. Association for Computational Linguistics.
Oudah, M. and Shaalan, K. (2012). A pipeline Arabic named entity recognition using a hybrid approach. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2159–2176.
Pasha, A., Al-Badrashiny, M., Diab, M., Kholy, A. E., Eskander, R., Habash, N., Pooleery, M., Rambow, O., and Roth, R. (2014). MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 1094–1101, Reykjavik, Iceland.
Pasha, A., Al-Badrashiny, M., Diab, M., Habash, N., Pooleery, M., and Rambow, O. (2015). MADAMIRA v2.0 User Manual.
Qi, P., Dozat, T., Zhang, Y., and Manning, C. D. (2018). Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 160–170, Brussels, Belgium, October. Association for Computational Linguistics.
Rosenthal, S., Farra, N., and Nakov, P. (2017). SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the International Workshop on Semantic Evaluation (SemEval), pages 501–516, Vancouver, Canada.
Sadat, F., Kazemi, F., and Farzindar, A. (2014). Automatic Identification of Arabic Dialects in Social Media. In Proceedings of the Workshop on Natural Language Processing for Social Media (SocialNLP), pages 22–27, Dublin, Ireland.
Salameh, M., Bouamor, H., and Habash, N. (2018). Fine-grained Arabic dialect identification. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 1332–1344, Santa Fe, New Mexico, USA.
Shaalan, K. and Raza, H. (2009). NERA: Named entity recognition for Arabic. Journal of the American Society for Information Science and Technology, 60:1652–1663.
Shahrour, A., Khalifa, S., Taji, D., and Habash, N. (2016). CamelParser: A system for Arabic syntactic analysis and morphological disambiguation. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 228–232.
Smrž, O. (2007). ElixirFM — Implementation of Functional Arabic Morphology. In Proceedings of the Workshop on Computational Approaches to Semitic Languages (CASL), pages 1–8, Prague, Czech Republic. ACL.
Song, Z., Maeda, K., Walker, C., and Strassel, S. (2014). ACE 2007 multilingual training corpus. LDC2014T18. Linguistic Data Consortium.
Taji, D., Khalifa, S., Obeid, O., Eryani, F., and Habash, N. (2018). An Arabic Morphological Analyzer and Generator with Copious Features. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology (SIGMORPHON), pages 140–150.
Vanni, M. and Zajac, R. (1996). The Temple translator's workstation project. In Proceedings of the TIPSTER Text Program Workshop, pages 101–106, Vienna, Virginia.
Walker, C., Strassel, S., Medero, J., and Maeda, K. (2006). ACE 2005 multilingual training corpus. LDC2006T06. Linguistic Data Consortium.
Watson, D., Zalmout, N., and Habash, N. (2018). Utilizing character and word embeddings for text normalization with sequence-to-sequence models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 837–843, Brussels, Belgium, October-November. Association for Computational Linguistics.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Zaghouani, W., Mohit, B., Habash, N., Obeid, O., Tomeh, N., Rozovskaya, A., Farra, N., Alkuhlani, S., and Oflazer, K. (2014). Large Scale Arabic Error Annotation: Guidelines and Framework. In Proceedings of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland.
Zaghouani, W. (2012). RENAR: A rule-based Arabic named entity recognition system. ACM Transactions on Asian Language Information Processing, 11:1–13.
Zaidan, O. F. and Callison-Burch, C. (2011). The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic With High Dialectal Content. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), pages 37–41.
Zalmout, N. and Habash, N. (2017). Don't throw those morphological analyzers away just yet: Neural morphological disambiguation for Arabic. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 704–713, Copenhagen, Denmark.
Zalmout, N. and Habash, N. (2019). Adversarial multitask learning for joint multi-feature and multi-dialect morphological modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1775–1786, Florence, Italy, July. Association for Computational Linguistics.
Zalmout, N., Erdmann, A., and Habash, N. (2018). Noise-robust morphological disambiguation for dialectal Arabic. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), New Orleans, Louisiana, USA.
Zampieri, M., Malmasi, S., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J., Scherrer, Y., and Aepli, N. (2017). Findings of the VarDial Evaluation Campaign 2017. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 1–15, Valencia, Spain.
