Arabic Offensive Language on Twitter: Analysis and Experiments

Hamdy Mubarak¹, Ammar Rashed², Kareem Darwish¹, Younes Samih¹, Ahmed Abdelali¹
¹Qatar Computing Research Institute, HBKU
²Özyeğin University
{hmubarak, kdarwish, ysamih, aabdelali}@hbku.edu.qa
ammar.rasid@ozu.edu.tr

Abstract

Detecting offensive language on Twitter has many applications, ranging from detecting and predicting bullying to measuring polarization. In this paper, we focus on building a large Arabic offensive tweet dataset. We introduce a method for building a dataset that is not biased by topic, dialect, or target. We produce the largest Arabic dataset to date, with special tags for vulgarity and hate speech. We thoroughly analyze the dataset to determine which topics, dialects, and gender are most associated with offensive tweets and how Arabic speakers use offensive language. Lastly, we conduct many experiments to produce strong results (F1 = 83.2) on the dataset using SOTA techniques.

1   Introduction

Disclaimer: Due to the nature of the paper, some examples herein contain highly offensive language and hate speech. They do not reflect the views of the authors in any way. This work is an attempt to help fight such speech.

Much recent interest has focused on the detection of offensive language and hate speech in online social media. Offensiveness is often associated with undesirable behaviors such as trolling, cyberbullying, online extremism, political polarization, and propaganda. Thus, offensive language detection is instrumental for a variety of applications, such as quantifying polarization (Barberá and Sood, 2015; Conover et al., 2011), detecting troll and propaganda accounts (Darwish et al., 2017), estimating the likelihood of hate crimes (Waseem and Hovy, 2016), and predicting conflicts (Chadefaux, 2014). In this paper, we describe our methodology for building a large dataset of Arabic offensive tweets. Given that roughly 1-2% of all Arabic tweets are offensive (Mubarak and Darwish, 2019), targeted annotation is essential to efficiently build a large dataset. Since our methodology does not use a seed list of offensive words, it is not biased by topic, target, or dialect. Using our methodology, we tagged a 10,000 Arabic tweet dataset for offensiveness, where offensive tweets account for roughly 19% of the tweets. Further, we labeled tweets as vulgar or hate speech. To date, this is the largest available dataset, which we plan to make publicly available along with annotation guidelines. We use this dataset to characterize Arabic offensive language and to ascertain the topics, dialects, and users' gender that are most associated with the use of offensive language. Though we suspect that there are common features that span different languages and cultures, some characteristics of Arabic offensive language are language and culture specific. Thus, we conduct a thorough analysis of how Arab users use offensive language. Next, we use the dataset to train strong Arabic offensive language classifiers using state-of-the-art representations and classification techniques. Specifically, we experiment with static and contextualized embeddings for representation, along with a variety of classifiers such as Transformer-based and Support Vector Machine (SVM) classifiers. The contributions of this paper are as follows:

• We built the largest Arabic offensive language dataset to date, which is also labeled for vulgar language and hate speech and is not biased by topic or dialect. We describe the methodology for building it along with annotation guidelines.

• We performed a thorough analysis to describe the peculiarities of Arabic offensive language.

• We experimented with SOTA classification techniques to provide strong results on detecting offensive language.
2   Related Work

Many recent papers have focused on the detection of offensive language, including hate speech (Agrawal and Awekar, 2018; Badjatiya et al., 2017; Davidson et al., 2017; Djuric et al., 2015; Kwok and Wang, 2013; Malmasi and Zampieri, 2017; Nobata et al., 2016; Yin et al., 2009). Offensive language can be categorized as vulgar, which includes explicit and rude sexual references and pornographic content, or hateful, which includes offensive remarks concerning people's race, religion, country, etc. (Jay and Janschewitz, 2008). Prior works have concentrated on building annotated corpora and training classification models. Concerning corpora, hatespeechdata.com attempts to maintain an updated list of hate speech corpora for multiple languages, including Arabic and English. Further, SemEval 2019 ran an evaluation task targeted at detecting offensive language, which focused exclusively on English (Zampieri et al., 2019). For SemEval 2020, the task was extended to include other languages, including Arabic (Zampieri et al., 2020). As for classification models, most studies used supervised classification at the word level (Kwok and Wang, 2013), at the character sequence level (Malmasi and Zampieri, 2017), or over word embeddings (Djuric et al., 2015). The studies used different classification techniques, including Naïve Bayes (Kwok and Wang, 2013), Support Vector Machines (SVM) (Malmasi and Zampieri, 2017), and deep learning (Agrawal and Awekar, 2018; Badjatiya et al., 2017; Nobata et al., 2016). The accuracy of the aforementioned systems ranged between 76% and 90%. Earlier work looked at the use of sentiment words as features as well as contextual features (Yin et al., 2009).

The work on Arabic offensive language detection is relatively nascent (Abozinadah, 2017; Alakrot et al., 2018; Albadi et al., 2018; Mubarak et al., 2017; Mubarak and Darwish, 2019). Mubarak et al. (2017) suggested that certain users are more likely to use offensive language than others, and they used this insight to build a list of offensive Arabic words and to construct a labeled set of 1,100 tweets. Abozinadah (2017) used supervised classification based on a variety of features, including user profile features, textual features, and network features, and reported an accuracy of nearly 90%. Alakrot et al. (2018) used supervised classification based on word n-grams to detect offensive language in YouTube comments; they improved classification with stemming and achieved a precision of 88%. Albadi et al. (2018) focused on detecting religious hate speech using a recurrent neural network.

Arabic is a morphologically rich language with a standard variety called Modern Standard Arabic (MSA), which is typically used in formal communication, and many dialectal varieties that differ from MSA in lexical selection, morphology, phonology, and syntactic structures. In MSA, words are typically derived from a set of thousands of roots by fitting a root into a stem template, and the resulting stem may accept a variety of prefixes and suffixes. Though word segmentation, which greatly improves word matching, is quite accurate for MSA (Abdelali et al., 2016), with accuracy approaching 99%, dialectal segmentation is not sufficiently reliable, with accuracy ranging between 91-95% for different dialects (Samih et al., 2017). Since dialectal Arabic is ubiquitous in Arabic tweets and many tweets have creative spellings of words, recent work on Arabic offensive language detection used character-level models (Mubarak and Darwish, 2019).

3   Data Collection

3.1   Collecting Arabic Offensive Tweets

Our target is to build a large Arabic offensive language dataset that is representative of its appearance on Twitter and is hopefully not biased to specific dialects, topics, or targets. One of the main challenges is that offensive tweets constitute a very small portion of overall tweets. To quantify their proportion, we took 3 random samples of tweets from different days, with each sample composed of 1,000 tweets, and we found that only 1-2% of them were offensive (including pornographic advertisements). This percentage is consistent with previously reported percentages (Mubarak et al., 2017). Thus, annotating random tweets is grossly inefficient. One way to overcome this problem is to use a seed list of offensive words to filter tweets. However, doing so is problematic, as it would skew the dataset to particular types of offensive language or to specific dialects. Offensiveness is often dialect and country specific.
After inspecting many tweets, we observed that many offensive tweets contain the vocative particle "yA" (meaning "O"),¹ which is mainly used in directing the speech to a specific person or group. The ratio of offensive tweets increases to 5% if a tweet contains one vocative particle and to 19% if it has at least two vocative particles. Users often repeat this particle for emphasis, as in "yA Amy yA Hnwnp" ("O my mother, O kind one"), which is endearing and non-offensive, and "yA klb yA q*r" ("O dog, O dirty one"), which is offensive. We decided to use this pattern to increase our chances of finding offensive tweets. One of the main advantages of the pattern "yA ... yA" is that it is not associated with any specific topic or genre, and it appears in all Arabic dialects. Though the use of offensive language does not necessitate the appearance of the vocative particle, the particle does not favor any specific offensive expressions and greatly improves our chances of finding offensive tweets.

¹Arabic words are given in their Buckwalter transliteration.

Using the Twitter APIs, we collected 660k Arabic tweets containing this pattern between April 15 and May 6, 2019. To increase diversity, we sorted the word sequences between the vocative particles and took the most frequent 10,000 unique sequences. For each word sequence, we took a random tweet containing that sequence. We then annotated those tweets, ending up with 1,915 offensive tweets, which represent roughly 19% of all tweets. Each tweet was labeled as offensive, which could additionally be labeled as vulgar and/or hate speech, or clean. We describe our annotation guidelines, which are compatible with the OffensEval2019 annotation guidelines (Zampieri et al., 2019), in greater detail below. For example, if a tweet has insults or threats targeting a group based on their nationality, ethnicity, gender, political affiliation, religious belief, or other common characteristics, it is considered hate speech. Collecting tweets in this manner allowed us to label offensive tweets efficiently, as they constitute a small percentage of tweets in general, while being far more generic than using a seed list of offensive words, which may greatly skew the distribution of offensive tweets. For future work, we plan to explore other methods for identifying offensive tweets with greater stylistic diversity.
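To make the sampling step concrete, the following sketch (illustrative Python written for this description, not the authors' released code) filters already-collected tweets for the "yA ... yA" pattern, groups them by the word sequence between the two particles, and keeps one random tweet per sequence for the most frequent sequences; the function and variable names, and the assumption that tweets arrive as plain Unicode strings, are ours.

```python
import random
import re
from collections import defaultdict

YA = "يا"  # the Arabic vocative particle "yA" ("O")
# Minimal word sequence between two standalone vocative particles: "yA <seq> yA".
PATTERN = re.compile(rf"(?<!\S){YA}\s+(\S+(?:\s+\S+)*?)\s+{YA}(?!\S)")

def sample_by_vocative_sequence(tweets, max_sequences=10_000, seed=0):
    """Group tweets by the word sequence between two vocative particles and
    keep one random tweet per sequence, for the most frequent sequences."""
    by_sequence = defaultdict(list)
    for text in tweets:
        match = PATTERN.search(text)
        if match:
            by_sequence[match.group(1)].append(text)
    # Keep the most frequent unique sequences (the paper uses the top 10,000).
    frequent = sorted(by_sequence, key=lambda s: len(by_sequence[s]), reverse=True)
    rng = random.Random(seed)
    return {seq: rng.choice(by_sequence[seq]) for seq in frequent[:max_sequences]}
```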
3.2   Annotating Tweets

We developed annotation guidelines jointly with an experienced annotator, who is a native Arabic speaker with good knowledge of various Arabic dialects, in accordance with the OffensEval2019 guidelines. Tweets were given one or more of the following four labels: offensive, vulgar, hate speech, or clean. Since the offensive label covers both vulgar and hate speech, and vulgarity and hate speech are not mutually exclusive, a tweet can be just offensive, or offensive and vulgar and/or hate speech. The annotation adhered to the following guidelines:

OFFENSIVE (OFF): Offensive tweets contain explicit or implicit insults or attacks against other people, or inappropriate language, such as: Direct threats or incitement, ex: "AHrqwA mqrAt AlmEArDp" ("burn opposition headquarters") and "h*A AlmnAfq yjb qtlh" ("kill this hypocrite"); and Insults and expressions of contempt, which include: Animal analogies, ex: "yA klb" ("O dog") and "kl tbn" ("eat hay"); Insult to family, ex: "yA rwH Amk" ("O mother's soul"); Sexually-related insults, ex: "yA dywv" ("O cuckold"); and Damnation, ex: "dynk Alq*r" ("your filthy religion").

CLEAN (CLN): Clean tweets do not contain vulgar or offensive language. We noticed that some tweets contain offensive words, yet the whole tweet should not be considered offensive given the intention of the user. This suggests that plain string matching without considering context may fail in some cases. Examples of such ambiguous cases include: Humor, ex: "yA Edwp AlfrHp hhh" ("O enemy of happiness hahaha"); Advice, ex: "lA tql lSAHbk yA xnzyr" ("don't say to your friend: you are a pig"); Condition, ex: "A*A EArDthm yqwlwn yA Emyl" ("if you disagree with them, they call you a spy"); Condemnation, ex: "lmA*A nsb bqwl: yA bqrp?" ("why do we insult others by saying: O cow?"); Self offense, ex: "tEbt mn lsAny Alq*r" ("I am tired of my dirty tongue"); Non-human targets, ex: "yA bnt Almjnwnp yA kwrp" ("O daughter of the crazy one, O football"); and Quotations from movies or stories, ex: "tAny yA zky! tAny yA fA$l" ("again smarty! again, O loser"). For ambiguous expressions, the annotator searched Twitter to observe real sample usages.

                 Tweets    Words
Offensive         1,915      38k
– Vulgar            225       4k
– Hate speech       506      13k
Clean             8,085     151k
Total            10,000     193k

Table 1: Distribution of offensive and clean tweets.

Table 1 shows the distribution of the annotated tweets. There are 1,915 offensive tweets, including 225 vulgar tweets and 506 hate speech tweets, and 8,085 clean tweets. To validate annotation quality, we asked three additional annotators to annotate two tweet sample sets. The first was a random sample of 100 tweets containing 50 offensive and 50 non-offensive tweets. The inter-annotator agreement (IAA) between the annotators, measured using Fleiss's kappa coefficient (Fleiss, 1971), was 0.92. The second consisted of general random samples of 100 tweets each from the dataset, and the IAA with the dataset labels was 0.97, 0.96, and 0.97. This high level of agreement gives more confidence in the quality of the annotation. The data can be downloaded from:
https://alt.qcri.org/resources/OSACT2020-sharedTask-CodaLab-Train-Dev-Test.zip
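For reference, the agreement statistic can be computed with standard tooling. The sketch below is illustrative only: it assumes labels are arranged with one row per tweet and one column per annotator, and the label names in the example are placeholders; statsmodels provides the Fleiss' kappa implementation used here.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def agreement(labels_per_tweet):
    """labels_per_tweet: array-like of shape (n_tweets, n_annotators) holding
    categorical labels (e.g. "OFF" / "CLN") assigned by each annotator."""
    ratings = np.asarray(labels_per_tweet)
    # aggregate_raters turns raw labels into a tweets-by-categories count table.
    table, _categories = aggregate_raters(ratings)
    return fleiss_kappa(table, method="fleiss")

# Toy example with three annotators and two tweets (labels are illustrative only).
print(agreement([["OFF", "OFF", "OFF"], ["CLN", "CLN", "OFF"]]))
```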

3.3   Statistics and User Demographics

Given the annotated tweets, we wanted to ascertain the distribution of the types of offensive language, the genres or topics where it is used, the dialects used, and the gender of the users using such language. Accordingly, the annotator manually examined and tagged all the offensive tweets.

Topic: Figure 1 shows the distribution of topics associated with offensive tweets. As the figure shows, sports and politics are the most dominant topics for offensive language, including vulgar and hate speech.

Figure 1: Topic distribution for offensive language and its sub-categories.

Dialect: We looked at MSA and four major dialects, namely Egyptian (EGY), Levantine (LEV), Maghrebi (MGR), and Gulf (GLF). Figure 2 shows that 71% of vulgar tweets were written in EGY, followed by GLF, which accounted for 13% of vulgar tweets. MSA was not used in any vulgar tweets. As for offensive tweets in general, EGY and GLF were used in 36% and 35% of the offensive tweets respectively. Unlike the case of vulgar language, 15% of the offensive tweets were written in MSA. For hate speech, GLF and EGY were again dominant, and MSA constituted 21% of the tweets. This is consistent with findings for other languages, e.g. English and Italian, where vulgarity was more frequently associated with colloquial language (Mattiello, 2005; Maisto et al., 2017).

Figure 2: Dialect distribution for offensive language and its sub-categories.

Gender: Figure 3 shows that the vast majority of offensive tweets, including vulgar and hate speech, were authored by males. Female Twitter users accounted for 14% of offensive tweets in general and 6% and 9% of vulgar and hate speech respectively. Figure 4 shows a detailed categorization of hate speech types, where the top three include insulting groups based on their political ideology, origin, and sport affiliation. Religious hate speech appeared in only 15% of all hate speech tweets.

Figure 3: Gender distribution for offensive language and its sub-categories.

Figure 4: Distribution of hate speech types (political ideology; origin, i.e. race, ethnicity, nationality; sport affiliation; religion; social class/job; gender; disability/diseases). Note: a tweet may have more than one type.

Next, we analyzed all tweets labeled as offensive to better understand how Arabic speakers use offensive language. Here is a breakdown of usage:

Direct name calling: The most frequent attack is to call a person an animal name, and the most used animals were "klb" ("dog"), "HmAr" ("donkey"), and "bhym" ("beast"). The second most common was insulting mental abilities using words such as "gby" ("stupid") and "EbyT" ("idiot"). Culturally, not all animal names are used as insults. For example, animals such as "Sqr" ("falcon"), "Asd" ("lion"), and "gzAl" ("gazelle") are typically used for praise. For other insults, people use: some bird names such as "djAjp" ("chicken"), "bwmp" ("owl"), and "grAb" ("crow"); insects such as "*bAbp" ("fly"), "SrSwr" ("cockroach"), and "H$rp" ("insect"); microorganisms such as "jrvwmp" ("microbe") and "THAlb" ("algae"); and inanimate objects such as "jzmp" ("shoes") and "sTl" ("bucket"), among other usages.

Figure 5: Tag cloud of words with the top valence scores in the offensive class, e.g. name calling (animals), curses, insults, etc.

Simile and metaphor: Users use simile and metaphor where they compare a person to: an animal, as in "zy Alvwr" ("like a bull"), "smEny nhyqk" ("let me hear your braying"), and "hz dylk" ("wag your tail"); a person with a mental or physical disability, such as "mngwly" ("Mongolian (Down syndrome)"), "mEwq" ("disabled"), and "qzm" ("dwarf"); or the opposite gender, such as "jy$ nwAl" ("Nawal's army"; Nawal is a female name) and "nAdy zyzy" ("Zizi's club"; Zizi is a female nickname).
Indirect speech: This includes: sarcasm, such as "A*kY AxwAtk" ("smartest one of your siblings") and "fylswf AlHmyr" ("the donkeys' philosopher"); questions, such as "Ayh kl AlgbA dh" ("what is all this stupidity"); and indirect speech, such as "AlnqA$ mE AlbhAym gyr mvmr" ("no use arguing with cattle").

Wishing Evil: This entails wishing death or major harm to befall someone, such as "rbnA yAxdk" ("May God take (kill) you").

4   Experiments

We conducted an extensive battery of experiments on the dataset to establish strong Arabic offensive language classification results. Though offensive tweets have finer-grained labels, where an offensive tweet could also be vulgar and/or hate speech, we conducted coarser-grained classification to determine whether a tweet was offensive or not. For classification, we experimented with several tweet representation and classification models. For tweet representations, we used: the count of positive and negative terms, based on a polarity lexicon; static embeddings, namely fastText and skip-gram; and deep contextualized embeddings, namely BERTbase-multilingual and AraBERT (Antoun et al., 2020).
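The lexical representation is a simple count-based feature vector. A minimal sketch is given below; the toy lexicon uses Buckwalter transliterations purely for illustration, and in practice the positive and negative word sets would be loaded from an Arabic polarity lexicon rather than hard-coded.

```python
def lexicon_counts(tokens, positive_words, negative_words):
    """Two-dimensional feature vector: counts of positive and negative lexicon
    terms appearing in a tokenized tweet."""
    pos = sum(token in positive_words for token in tokens)
    neg = sum(token in negative_words for token in tokens)
    return [pos, neg]

# Toy example; a real run would supply lexicon word sets loaded from file.
print(lexicon_counts(["yA", "klb"], positive_words={"jmyl"}, negative_words={"klb"}))
```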
Static Embeddings: We experimented with several pre-trained static embeddings that were trained on different corpora with different vector dimensionality. We compared pre-trained embeddings to embeddings that were trained on our dataset. For pre-trained embeddings, we used: fastText Egyptian Arabic pre-trained embeddings (Bojanowski et al., 2017) with a vector dimensionality of 300; AraVec skip-gram embeddings (Mohammad et al., 2017), trained on 66.9M Arabic tweets with 100-dimensional vectors; and Mazajak skip-gram embeddings (Abu Farha and Magdy, 2019), trained on 250M Arabic tweets with 300-dimensional vectors. Sentence embeddings were calculated by taking the mean of the embeddings of their tokens.
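A minimal sketch of this representation, paired with the RBF-kernel SVM described in Section 4.3, is shown below; it is illustrative only, the embedding file name is a placeholder, and any static embeddings in word2vec text format could be substituted.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.svm import SVC

# Pre-trained static embeddings in word2vec format (the file name is a placeholder).
vectors = KeyedVectors.load_word2vec_format("static_embeddings_300d.vec", binary=False)

def tweet_vector(tokens, kv):
    """Mean of the token embeddings; zeros if no token is in the vocabulary."""
    known = [kv[t] for t in tokens if t in kv]
    return np.mean(known, axis=0) if known else np.zeros(kv.vector_size)

def train_svm(tokenized_tweets, labels, kv):
    """Fit an SVM with a radial basis function kernel on mean-pooled tweet vectors."""
    X = np.vstack([tweet_vector(tokens, kv) for tokens in tokenized_tweets])
    return SVC(kernel="rbf").fit(X, labels)
```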
The importance of testing a character-level n-gram model like fastText lies in the agglutinative nature of the Arabic language. We trained a new fastText text classification model (Joulin et al., 2017) on our dataset with vectors of 40 dimensions, a learning rate of 0.5, and 2–10 character n-grams as features, for 30 epochs. These hyper-parameters were tuned using a 5-fold cross-validated grid search.
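Roughly, this configuration maps onto the fastText supervised API as follows; the snippet is a sketch, and the training file path, label format, and example text are assumptions rather than the authors' actual setup.

```python
import fasttext

# Training file in fastText format: one tweet per line, prefixed with its label,
# e.g. "__label__OFF <tweet text>". The file name is a placeholder.
model = fasttext.train_supervised(
    input="train.txt",
    dim=40,     # vector dimensionality
    lr=0.5,     # learning rate
    minn=2,     # minimum character n-gram length
    maxn=10,    # maximum character n-gram length
    epoch=30,
)
labels, probs = model.predict("يا كلب")  # predicted label(s) and probability
```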
Deep Contextualized Embeddings: We also experimented with pre-trained contextualized embeddings with fine-tuning for downstream tasks. Recently, deep contextualized language models such as BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), ULMFiT (Howard and Ruder, 2018), and OpenAI GPT (Radford et al., 2018) have achieved ground-breaking results in many NLP classification and language understanding tasks. In this paper, we fine-tuned BERTbase-multilingual (or simply BERT) and AraBERT embeddings to classify Arabic offensive language on Twitter, as this eliminates the need for feature engineering. Although Robustly Optimized BERT (RoBERTa) embeddings perform better than BERTlarge on the GLUE (Wang et al., 2018), RACE (Lai et al., 2017), and SQuAD (Rajpurkar et al., 2016) tasks, pre-trained multilingual RoBERTa models are not available. BERT is pre-trained on Wikipedia text from 104 languages, and AraBERT is trained on a large Arabic news corpus containing 8.5M articles composed of roughly 2.5B tokens. Both use identical architectures and come with hundreds of millions of parameters: each contains an encoder with 12 Transformer blocks, a hidden size of 768, and 12 self-attention heads, and both use BP sub-word segments. Following Devlin et al. (2019), classification consists of introducing a dense layer over the final hidden state h corresponding to the first token of the sequence, [CLS], and adding a softmax activation on top of BERT to predict the probability of label l: p(l|h) = softmax(Wh), where W is the task-specific weight matrix. During fine-tuning, all BERT/AraBERT parameters together with W are optimized end-to-end to maximize the log-probability of the correct labels.

4.3   Classification Models

We explored different classifiers. When using lexical features and pre-trained static embeddings, we primarily used an SVM classifier with a radial basis function kernel. Only when using the Mazajak embeddings did we also experiment with other classifiers such as AdaBoost and logistic regression; the SVM classifier performed the best on static embeddings, and we picked the Mazajak embeddings because they yielded the best results among all static embeddings. We used the scikit-learn implementations of these classifiers, e.g., libsvm for the SVM classifier. We also experimented with fastText, which trained embeddings on our data. When using contextualized embeddings, we fine-tuned BERT and AraBERT by adding a fully-connected dense layer followed by a softmax classifier, minimizing the binary cross-entropy loss function on the training data. For all experiments, we used the PyTorch implementation by HuggingFace (https://pytorch.org/; https://github.com/huggingface/transformers), as it provides pre-trained weights and vocabularies.
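A condensed sketch of this fine-tuning setup using the HuggingFace transformers API is shown below. The checkpoint name, sequence length, learning rate, and the simplified full-batch training loop are our assumptions for illustration; the authors' exact configuration may differ.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Checkpoint name is an assumption; "bert-base-multilingual-cased" works the same way.
NAME = "aubmindlab/bert-base-arabertv01"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME, num_labels=2)

def fine_tune(texts, labels, epochs=3, lr=2e-5):
    """End-to-end fine-tuning: a classification head (softmax over the [CLS]
    representation) is optimized jointly with all encoder parameters."""
    enc = tokenizer(texts, padding=True, truncation=True, max_length=64,
                    return_tensors="pt")
    y = torch.tensor(labels)  # 1 = offensive, 0 = clean
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        out = model(**enc, labels=y)  # cross-entropy loss over the two label logits
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
```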

4.4   Evaluation

For all of our experiments, we used 5-fold cross-validation with identical folds across experiments. Table 2 reports the results of using lexical features and static pre-trained embeddings with an SVM classifier, embeddings trained on our data with the fastText classifier, and BERT and AraBERT with a dense layer and softmax activation. As the results show, fine-tuned AraBERT yielded the best results overall, followed closely by Mazajak/SVM, with large improvements in precision over using BERT. The success of AraBERT was surprising given that it was not trained on social media text. Perhaps pre-training a Transformer model on social media text may improve results further. We suspect that the Mazajak/SVM combination performed better than BERT because the Mazajak embeddings, though static, were trained on in-domain data, as opposed to BERT. For completeness, we compared 7 other classifiers with SVM using Mazajak embeddings. As the results in Table 3 show, using SVM yielded the best results.

      Model/classifier          Prec.   Recall     F1
      Lexical features
      SVM                        68.5     35.3    46.6
      Pre-trained static embeddings
      fastText/SVM               76.7     43.5    55.5
      AraVec/SVM                 85.5     69.2    76.4
      Mazajak/SVM                88.6     72.4    79.7
      Embeddings trained on our data
      fastText/fastText          82.1     68.1    74.4
      Contextualized embeddings
      BERTbase-multilingual      78.3     74.0    76.0
      AraBERT                    84.6     82.4    83.2

Table 2: Classification performance with different features and models.

      Model                     Prec.   Recall     F1
      Decision Tree              51.2     53.8    52.4
      Random Forest              82.4     42.4    56.0
      Gaussian NB                44.9     86.0    59.0
      Perceptron                 75.6     67.7    66.8
      AdaBoost                   74.3     67.0    70.4
      Gradient Boosting          84.2     63.0    72.1
      Logistic Regression        84.7     69.5    76.3
      SVM                        88.6     72.4    79.7

Table 3: Performance of different classification models on Mazajak embeddings.
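The evaluation protocol behind these tables can be sketched as follows; the snippet is illustrative, it assumes binary labels with 1 = offensive over pre-computed tweet vectors, and the fold construction beyond using a fixed seed is an assumption, since the paper only states that identical folds were used across experiments.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

def evaluate(X, y, seed=0):
    """5-fold cross-validation with fixed folds, reporting precision, recall,
    and F1 for the positive (offensive) class."""
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_validate(SVC(kernel="rbf"), X, y, cv=folds,
                            scoring=("precision", "recall", "f1"))
    return {m: np.mean(scores[f"test_{m}"]) for m in ("precision", "recall", "f1")}
```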
4.5   Error Analysis

We inspected the tweets of one fold that were misclassified by the Mazajak/SVM model (36 false positives and 121 false negatives) to determine the most common errors. They were as follows:

Four false positive types:

• Gloating: ex. "yA hbydp" ("O you delusional one"), referring to fans of a rival sports team for thinking they could win.

• Quoting: ex. "lmA Hd ysb wyqwl yA klb" ("when someone swears and says: O dog").

• Idioms: ex. "yA fATr rmDAn yA xAsr dynk" ("O you who does not fast Ramadan, you have lost your faith"), which is a colloquial idiom.

• Implicit sarcasm: ex. "yA xAyn Ant EAwz t$kk fy Hb Al$Eb llrys" ("O traitor, (you) want to question people's love for the president"), where the author is mocking the president's popularity.

Two false negative types:

• Mixture of offensiveness and admiration: ex. calling a girl a puppy, "yA klbwbp" ("O puppy"), in a flirtatious manner.

• Implicit offensiveness: ex. calling for a cure while implying insanity: "wt$fy HkAm bldk mn AlmrD" ("and cure the rulers of your country from illness").

5   Conclusion and Future Work

In this paper we presented a systematic method for building an Arabic offensive language tweet dataset that does not favor specific dialects, topics, or genres. We developed detailed guidelines for tagging the tweets as clean or offensive, including special tags for vulgar tweets and hate speech. We tagged 10,000 tweets, which we plan to release publicly and which would constitute the largest available Arabic offensive language dataset. We characterized the offensive tweets in the dataset to determine the topics that elicit such language, the dialects that are most often used, the common modes of offensiveness, and the gender distribution of their authors. We performed this breakdown for offensive tweets in general and for vulgar and hate speech tweets separately. We believe that this is the first detailed analysis of its kind. Lastly, we conducted a large battery of experiments on the dataset, using cross-validation, to establish a strong system for Arabic offensive language detection. We showed that using an Arabic-specific BERT model (AraBERT) and static embeddings trained on tweets produced competitive results on the dataset.

For future work, we plan to pursue several directions. First, we want to explore target-specific offensive language, where attacks against an entity or a group may employ certain expressions that are only offensive within the context of that target and completely innocuous otherwise. Second, we plan to examine the effectiveness of cross-dialectal and cross-lingual learning of offensive language.

References

Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. 2016. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 11–16.

Ehab Abozinadah. 2017. Detecting Abusive Arabic Language Twitter Accounts Using a Multidimensional Analysis Model. Ph.D. thesis, George Mason University.

Ibrahim Abu Farha and Walid Magdy. 2019. Mazajak: An online Arabic sentiment analyser. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 192–198, Florence, Italy. Association for Computational Linguistics.

Sweta Agrawal and Amit Awekar. 2018. Deep learning for detecting cyberbullying across multiple social media platforms. In European Conference on Information Retrieval, pages 141–153. Springer.

Azalden Alakrot, Liam Murray, and Nikola S Nikolov. 2018. Towards accurate detection of offensive language in online communication in Arabic. Procedia Computer Science, 142:315–320.

Nuha Albadi, Maram Kurdi, and Shivakant Mishra. 2018. Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69–76. IEEE.

Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9–15.

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Pablo Barberá and Gaurav Sood. 2015. Follow your ideology: Measuring media ideology on social networks. In Annual Meeting of the European Political Science Association, Vienna, Austria. Retrieved from http://www.gsood.com/research/papers/mediabias.pdf.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Thomas Chadefaux. 2014. Early warning signals for war in the news. Journal of Peace Research, 51(1):5–18.

Michael Conover, Jacob Ratkiewicz, Matthew R Francisco, Bruno Gonçalves, Filippo Menczer, and Alessandro Flammini. 2011. Political polarization on Twitter. ICWSM, 133:89–96.

Kareem Darwish, Dimitar Alexandrov, Preslav Nakov, and Yelena Mejova. 2017. Seminar users in the Arabic Twitter sphere. In International Conference on Social Informatics, pages 91–108. Springer.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Eleventh International Conference on Web and Social Media (ICWSM), pages 512–515.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. 2015. Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web, pages 29–30. ACM.

Samhaa R. El-Beltagy. 2016. NileULex: A phrase and word level sentiment lexicon for Egyptian and modern standard Arabic. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2900–2905, Portorož, Slovenia. European Language Resources Association (ELRA).

Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics.

Timothy Jay and Kristin Janschewitz. 2008. The pragmatics of swearing. Journal of Politeness Research. Language, Behaviour, Culture, 4(2):267–288.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics.

Irene Kwok and Yuzhou Wang. 2013. Locate the hate: Detecting tweets against blacks. In Twenty-Seventh AAAI Conference on Artificial Intelligence.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations.

Alessandro Maisto, Serena Pelosi, Simonetta Vietri, Pierluigi Vitale, and Via Giovanni Paolo II. 2017. Mining offensive language on social media. CLiC-it 2017, 11-12 December 2017, Rome, page 252.

Shervin Malmasi and Marcos Zampieri. 2017. Detecting hate speech in social media. arXiv preprint arXiv:1712.06427.

Elisa Mattiello. 2005. The pervasiveness of slang in standard and non-standard English. Mots Palabras Words, 5:7–41.

Abu Bakr Mohammad, Kareem Eissa, and Samhaa El-Beltagy. 2017. AraVec: A set of Arabic word embedding models for use in Arabic NLP. Procedia Computer Science, 117:256–265.

Hamdy Mubarak and Kareem Darwish. 2019. Arabic offensive language classification on Twitter. In International Conference on Social Informatics, pages 269–276. Springer.

Hamdy Mubarak, Kareem Darwish, and Walid Magdy. 2017. Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145–153. International World Wide Web Conferences Steering Committee.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text.

Younes Samih, Mohamed Eldesouki, Mohammed Attia, Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, and Laura Kallmeyer. 2017. Learning from relatives: Unified dialectal Arabic segmentation. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 432–441.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.

Dawei Yin, Zhenzhen Xue, Liangjie Hong, Brian D Davison, April Kontostathis, and Lynne Edwards. 2009. Detection of harassment on Web 2.0. Proceedings of the Content Analysis in the WEB, 2:1–7.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. SemEval-2019 Task 6: Identifying and categorizing offensive language in social media (OffensEval). arXiv preprint arXiv:1903.08983.

Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin. 2020. SemEval-2020 Task 12: Multilingual offensive language identification in social media (OffensEval 2020). arXiv preprint arXiv:2006.07235.