Arabic Offensive Language on Twitter: Analysis and Experiments

Hamdy Mubarak¹, Ammar Rashed², Kareem Darwish¹, Younes Samih¹, Ahmed Abdelali¹
¹Qatar Computing Research Institute, HBKU
²Özyeğin University
{hmubarak, kdarwish, ysamih, aabdelali}@hbku.edu.qa
ammar.rasid@ozu.edu.tr

Abstract

Detecting offensive language on Twitter has many applications, ranging from detecting and predicting bullying to measuring polarization. In this paper, we focus on building a large Arabic offensive tweet dataset. We introduce a method for building a dataset that is not biased by topic, dialect, or target. We produce the largest Arabic dataset to date, with special tags for vulgarity and hate speech. We thoroughly analyze the dataset to determine which topics, dialects, and gender are most associated with offensive tweets and how Arabic speakers use offensive language. Lastly, we conduct many experiments to produce strong results (F1 = 83.2) on the dataset using SOTA techniques.

1   Introduction

Disclaimer: Due to the nature of the paper, some examples herein contain highly offensive language and hate speech. They do not reflect the views of the authors in any way. This work is an attempt to help fight such speech.

Much recent interest has focused on the detection of offensive language and hate speech in online social media. Offensiveness is often associated with undesirable behaviors such as trolling, cyberbullying, online extremism, political polarization, and propaganda. Thus, offensive language detection is instrumental for a variety of applications, such as quantifying polarization (Barberá and Sood, 2015; Conover et al., 2011), detecting troll and propaganda accounts (Darwish et al., 2017), estimating the likelihood of hate crimes (Waseem and Hovy, 2016), and predicting conflicts (Chadefaux, 2014). In this paper, we describe our methodology for building a large dataset of Arabic offensive tweets. Given that roughly 1-2% of all Arabic tweets are offensive (Mubarak and Darwish, 2019), targeted annotation is essential to efficiently build a large dataset. Since our methodology does not use a seed list of offensive words, it is not biased by topic, target, or dialect. Using our methodology, we tagged a 10,000 Arabic tweet dataset for offensiveness, where offensive tweets account for roughly 19% of the tweets. Further, we labeled tweets as vulgar or hate speech. To date, this is the largest available dataset, which we plan to make publicly available along with annotation guidelines. We use this dataset to characterize Arabic offensive language and to ascertain the topics, dialects, and users' gender that are most associated with the use of offensive language. Though we suspect that there are common features that span different languages and cultures, some characteristics of Arabic offensive language are language and culture specific. Thus, we conduct a thorough analysis of how Arab users use offensive language. Next, we use the dataset to train strong Arabic offensive language classifiers using state-of-the-art representations and classification techniques. Specifically, we experiment with static and contextualized embeddings for representation, along with a variety of classifiers such as Transformer-based and Support Vector Machine (SVM) classifiers. The contributions of this paper are as follows:

• We built the largest Arabic offensive language dataset to date, which is also labeled for vulgar language and hate speech and is not biased by topic or dialect. We describe the methodology for building it along with annotation guidelines.

• We performed a thorough analysis to describe the peculiarities of Arabic offensive language.

• We experimented with SOTA classification techniques to provide strong results on detecting offensive language.
2   Related Work

Many recent papers have focused on the detection of offensive language, including hate speech (Agrawal and Awekar, 2018; Badjatiya et al., 2017; Davidson et al., 2017; Djuric et al., 2015; Kwok and Wang, 2013; Malmasi and Zampieri, 2017; Nobata et al., 2016; Yin et al., 2009). Offensive language can be categorized as vulgar, which includes explicit and rude sexual references and pornographic content, or hateful, which includes offensive remarks concerning people's race, religion, country, etc. (Jay and Janschewitz, 2008). Prior works have concentrated on building annotated corpora and training classification models. Concerning corpora, hatespeechdata.com attempts to maintain an updated list of hate speech corpora for multiple languages, including Arabic and English. Further, SemEval 2019 ran an evaluation task targeted at detecting offensive language, which focused exclusively on English (Zampieri et al., 2019). For SemEval 2020, the task was extended to include other languages, including Arabic (Zampieri et al., 2020). As for classification models, most studies used supervised classification at the word level (Kwok and Wang, 2013), at the character sequence level (Malmasi and Zampieri, 2017), or over word embeddings (Djuric et al., 2015). The studies used different classification techniques, including Naïve Bayes (Kwok and Wang, 2013), Support Vector Machines (SVM) (Malmasi and Zampieri, 2017), and deep learning (Agrawal and Awekar, 2018; Badjatiya et al., 2017; Nobata et al., 2016). The accuracy of the aforementioned systems ranged between 76% and 90%. Earlier work looked at the use of sentiment words as features as well as contextual features (Yin et al., 2009).

The work on Arabic offensive language detection is relatively nascent (Abozinadah, 2017; Alakrot et al., 2018; Albadi et al., 2018; Mubarak et al., 2017; Mubarak and Darwish, 2019). Mubarak et al. (2017) suggested that certain users are more likely to use offensive language than others, and they used this insight to build a list of offensive Arabic words and to construct a labeled set of 1,100 tweets. Abozinadah (2017) used supervised classification based on a variety of features, including user profile features, textual features, and network features, and reported an accuracy of nearly 90%. Alakrot et al. (2018) used supervised classification based on word n-grams to detect offensive language in YouTube comments; they improved classification with stemming and achieved a precision of 88%. Albadi et al. (2018) focused on detecting religious hate speech using a recurrent neural network.

Arabic is a morphologically rich language with a standard variety called Modern Standard Arabic (MSA), which is typically used in formal communication, and many dialectal varieties that differ from MSA in lexical selection, morphology, phonology, and syntactic structures. In MSA, words are typically derived from a set of thousands of roots by fitting a root into a stem template, and the resulting stem may accept a variety of prefixes and suffixes. Though word segmentation, which greatly improves word matching, is quite accurate for MSA (Abdelali et al., 2016), with accuracy approaching 99%, dialectal segmentation is not sufficiently reliable, with accuracy ranging between 91-95% for different dialects (Samih et al., 2017). Since dialectal Arabic is ubiquitous in Arabic tweets and many tweets have creative spellings of words, recent work on Arabic offensive language detection used character-level models (Mubarak and Darwish, 2019).

3   Data Collection

3.1   Collecting Arabic Offensive Tweets

Our target is to build a large Arabic offensive language dataset that is representative of its appearance on Twitter and is hopefully not biased to specific dialects, topics, or targets. One of the main challenges is that offensive tweets constitute a very small portion of overall tweets. To quantify their proportion, we took 3 random samples of tweets from different days, with each sample composed of 1,000 tweets, and we found that only 1-2% of them were offensive (including pornographic advertisements). This percentage is consistent with previously reported percentages (Mubarak et al., 2017). Thus, annotating random tweets is grossly inefficient. One way to overcome this problem is to use a seed list of offensive words to filter tweets. However, doing so is problematic, as it would skew the dataset to particular types of offensive language or to specific dialects. Offensiveness is often dialect and country specific.
After inspecting many tweets, we observed that many offensive tweets contain the vocative particle "yA" (meaning "O"),¹ which is mainly used in directing the speech to a specific person or group. The ratio of offensive tweets increases to 5% if a tweet contains one vocative particle and to 19% if it has at least two vocative particles. Users often repeat this particle for emphasis, as in "yA Amy yA Hnwnp" ("O my mother, O kind one"), which is endearing and non-offensive, and "yA klb yA q*r" ("O dog, O dirty one"), which is offensive. We decided to use this pattern to increase our chances of finding offensive tweets. One of the main advantages of the pattern "yA ... yA" is that it is not associated with any specific topic or genre, and it appears in all Arabic dialects. Though the use of offensive language does not necessitate the appearance of the vocative particle, the particle does not favor any specific offensive expressions and greatly improves our chances of finding offensive tweets.

¹Arabic words are given in their Buckwalter transliteration.

Using the Twitter APIs, we collected 660k Arabic tweets containing this pattern between April 15 and May 6, 2019. To increase diversity, we sorted the word sequences between the vocative particles and took the most frequent 10,000 unique sequences. For each word sequence, we took a random tweet containing that sequence. We then annotated those tweets, ending up with 1,915 offensive tweets, which represent roughly 19% of all tweets. Each tweet was labeled as offensive, which could additionally be labeled as vulgar and/or hate speech, or clean. We describe our annotation guidelines, which are compatible with the OffensEval2019 annotation guidelines (Zampieri et al., 2019), in greater detail below. For example, if a tweet has insults or threats targeting a group based on their nationality, ethnicity, gender, political affiliation, religious belief, or other common characteristics, it is considered hate speech. Collecting tweets in this manner allowed us to label offensive tweets efficiently, as they constitute a small percentage of tweets in general, while being far more generic than using a seed list of offensive words, which may greatly skew the distribution of offensive tweets. For future work, we plan to explore other methods for identifying offensive tweets with greater stylistic diversity.
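To make the sampling step concrete, the following sketch (illustrative Python written for this description, not the authors' released code) filters already-collected tweets for the "yA ... yA" pattern, groups them by the word sequence between the two particles, and keeps one random tweet per sequence for the most frequent sequences; the function and variable names, and the assumption that tweets arrive as plain Unicode strings, are ours.

```python
import random
import re
from collections import defaultdict

YA = "يا"  # the Arabic vocative particle "yA" ("O")
# Minimal word sequence between two standalone vocative particles: "yA <seq> yA".
PATTERN = re.compile(rf"(?<!\S){YA}\s+(\S+(?:\s+\S+)*?)\s+{YA}(?!\S)")

def sample_by_vocative_sequence(tweets, max_sequences=10_000, seed=0):
    """Group tweets by the word sequence between two vocative particles and
    keep one random tweet per sequence, for the most frequent sequences."""
    by_sequence = defaultdict(list)
    for text in tweets:
        match = PATTERN.search(text)
        if match:
            by_sequence[match.group(1)].append(text)
    # Keep the most frequent unique sequences (the paper uses the top 10,000).
    frequent = sorted(by_sequence, key=lambda s: len(by_sequence[s]), reverse=True)
    rng = random.Random(seed)
    return {seq: rng.choice(by_sequence[seq]) for seq in frequent[:max_sequences]}
```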
3.2   Annotating Tweets

We developed annotation guidelines jointly with an experienced annotator, who is a native Arabic speaker with good knowledge of various Arabic dialects, in accordance with the OffensEval2019 guidelines. Tweets were given one or more of the following four labels: offensive, vulgar, hate speech, or clean. Since the offensive label covers both vulgar and hate speech, and vulgarity and hate speech are not mutually exclusive, a tweet can be just offensive, or offensive and vulgar and/or hate speech. The annotation adhered to the following guidelines:

OFFENSIVE (OFF): Offensive tweets contain explicit or implicit insults or attacks against other people, or inappropriate language, such as: Direct threats or incitement, ex: "AHrqwA mqrAt AlmEArDp" ("burn opposition headquarters") and "h*A AlmnAfq yjb qtlh" ("kill this hypocrite"); and Insults and expressions of contempt, which include: Animal analogies, ex: "yA klb" ("O dog") and "kl tbn" ("eat hay"); Insult to family, ex: "yA rwH Amk" ("O mother's soul"); Sexually-related insults, ex: "yA dywv" ("O cuckold"); and Damnation, ex: "dynk Alq*r" ("your filthy religion").

CLEAN (CLN): Clean tweets do not contain vulgar or offensive language. We noticed that some tweets contain offensive words, yet the whole tweet should not be considered offensive given the intention of the user. This suggests that plain string matching without considering context may fail in some cases. Examples of such ambiguous cases include: Humor, ex: "yA Edwp AlfrHp hhh" ("O enemy of happiness hahaha"); Advice, ex: "lA tql lSAHbk yA xnzyr" ("don't say to your friend: you are a pig"); Condition, ex: "A*A EArDthm yqwlwn yA Emyl" ("if you disagree with them, they call you a spy"); Condemnation, ex: "lmA*A nsb bqwl: yA bqrp?" ("why do we insult others by saying: O cow?"); Self offense, ex: "tEbt mn lsAny Alq*r" ("I am tired of my dirty tongue"); Non-human targets, ex: "yA bnt Almjnwnp yA kwrp" ("O daughter of the crazy one, O football"); and Quotations from movies or stories, ex: "tAny yA zky! tAny yA fA$l" ("again smarty! again, O loser"). For ambiguous expressions, the annotator searched Twitter to observe real sample usages.

                 Tweets    Words
Offensive         1,915      38k
– Vulgar            225       4k
– Hate speech       506      13k
Clean             8,085     151k
Total            10,000     193k

Table 1: Distribution of offensive and clean tweets.

Table 1 shows the distribution of the annotated tweets. There are 1,915 offensive tweets, including 225 vulgar tweets and 506 hate speech tweets, and 8,085 clean tweets. To validate annotation quality, we asked three additional annotators to annotate two tweet sample sets. The first was a random sample of 100 tweets containing 50 offensive and 50 non-offensive tweets. The inter-annotator agreement (IAA) between the annotators, measured using Fleiss's kappa coefficient (Fleiss, 1971), was 0.92. The second consisted of general random samples of 100 tweets each from the dataset, and the IAA with the dataset labels was 0.97, 0.96, and 0.97. This high level of agreement gives more confidence in the quality of the annotation. The data can be downloaded from:
https://alt.qcri.org/resources/OSACT2020-sharedTask-CodaLab-Train-Dev-Test.zip
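For reference, the agreement statistic can be computed with standard tooling. The sketch below is illustrative only: it assumes labels are arranged with one row per tweet and one column per annotator, and the label names in the example are placeholders; statsmodels provides the Fleiss' kappa implementation used here.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def agreement(labels_per_tweet):
    """labels_per_tweet: array-like of shape (n_tweets, n_annotators) holding
    categorical labels (e.g. "OFF" / "CLN") assigned by each annotator."""
    ratings = np.asarray(labels_per_tweet)
    # aggregate_raters turns raw labels into a tweets-by-categories count table.
    table, _categories = aggregate_raters(ratings)
    return fleiss_kappa(table, method="fleiss")

# Toy example with three annotators and two tweets (labels are illustrative only).
print(agreement([["OFF", "OFF", "OFF"], ["CLN", "CLN", "OFF"]]))
```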

3.3   Statistics and User Demographics

Given the annotated tweets, we wanted to ascertain the distribution of the types of offensive language, the genres or topics where it is used, the dialects used, and the gender of the users using such language. Accordingly, the annotator manually examined and tagged all the offensive tweets.

Topic: Figure 1 shows the distribution of topics associated with offensive tweets. As the figure shows, sports and politics are the most dominant topics for offensive language, including vulgar and hate speech.

Figure 1: Topic distribution for offensive language and its sub-categories.

Dialect: We looked at MSA and four major dialects, namely Egyptian (EGY), Levantine (LEV), Maghrebi (MGR), and Gulf (GLF). Figure 2 shows that 71% of vulgar tweets were written in EGY, followed by GLF, which accounted for 13% of vulgar tweets. MSA was not used in any vulgar tweets. As for offensive tweets in general, EGY and GLF were used in 36% and 35% of the offensive tweets respectively. Unlike the case of vulgar language, 15% of the offensive tweets were written in MSA. For hate speech, GLF and EGY were again dominant, and MSA constituted 21% of the tweets. This is consistent with findings for other languages, e.g. English and Italian, where vulgarity was more frequently associated with colloquial language (Mattiello, 2005; Maisto et al., 2017).

Figure 2: Dialect distribution for offensive language and its sub-categories.

Gender: Figure 3 shows that the vast majority of offensive tweets, including vulgar and hate speech, were authored by males. Female Twitter users accounted for 14% of offensive tweets in general and 6% and 9% of vulgar and hate speech respectively. Figure 4 shows a detailed categorization of hate speech types, where the top three include insulting groups based on their political ideology, origin, and sport affiliation. Religious hate speech appeared in only 15% of all hate speech tweets.

Figure 3: Gender distribution for offensive language and its sub-categories.

Figure 4: Distribution of hate speech types (political ideology; origin, i.e. race, ethnicity, nationality; sport affiliation; religion; social class/job; gender; disability/diseases). Note: a tweet may have more than one type.

Next, we analyzed all tweets labeled as offensive to better understand how Arabic speakers use offensive language. Here is a breakdown of usage:

Direct name calling: The most frequent attack is to call a person an animal name, and the most used animals were "klb" ("dog"), "HmAr" ("donkey"), and "bhym" ("beast"). The second most common was insulting mental abilities using words such as "gby" ("stupid") and "EbyT" ("idiot"). Culturally, not all animal names are used as insults. For example, animals such as "Sqr" ("falcon"), "Asd" ("lion"), and "gzAl" ("gazelle") are typically used for praise. For other insults, people use: some bird names such as "djAjp" ("chicken"), "bwmp" ("owl"), and "grAb" ("crow"); insects such as "*bAbp" ("fly"), "SrSwr" ("cockroach"), and "H$rp" ("insect"); microorganisms such as "jrvwmp" ("microbe") and "THAlb" ("algae"); and inanimate objects such as "jzmp" ("shoes") and "sTl" ("bucket"), among other usages.

Figure 5: Tag cloud of words with the top valence scores in the offensive class, e.g. name calling (animals), curses, insults, etc.

Simile and metaphor: Users use simile and metaphor where they compare a person to: an animal, as in "zy Alvwr" ("like a bull"), "smEny nhyqk" ("let me hear your braying"), and "hz dylk" ("wag your tail"); a person with a mental or physical disability, such as "mngwly" ("Mongolian (Down syndrome)"), "mEwq" ("disabled"), and "qzm" ("dwarf"); or the opposite gender, such as "jy$ nwAl" ("Nawal's army"; Nawal is a female name) and "nAdy zyzy" ("Zizi's club"; Zizi is a female nickname).
Indirect speech: This includes: sarcasm, such as "A*kY AxwAtk" ("smartest one of your siblings") and "fylswf AlHmyr" ("the donkeys' philosopher"); questions, such as "Ayh kl AlgbA dh" ("what is all this stupidity"); and indirect speech, such as "AlnqA$ mE AlbhAym gyr mvmr" ("no use arguing with cattle").

Wishing Evil: This entails wishing death or major harm to befall someone, such as "rbnA yAxdk" ("May God take (kill) you").

4   Experiments

We conducted an extensive battery of experiments on the dataset to establish strong Arabic offensive language classification results. Though offensive tweets have finer-grained labels, where an offensive tweet could also be vulgar and/or hate speech, we conducted coarser-grained classification to determine whether a tweet was offensive or not. For classification, we experimented with several tweet representation and classification models. For tweet representations, we used: the count of positive and negative terms, based on a polarity lexicon; static embeddings, namely fastText and skip-gram; and deep contextualized embeddings, namely BERTbase-multilingual and AraBERT (Antoun et al., 2020).
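The lexical representation is a simple count-based feature vector. A minimal sketch is given below; the toy lexicon uses Buckwalter transliterations purely for illustration, and in practice the positive and negative word sets would be loaded from an Arabic polarity lexicon rather than hard-coded.

```python
def lexicon_counts(tokens, positive_words, negative_words):
    """Two-dimensional feature vector: counts of positive and negative lexicon
    terms appearing in a tokenized tweet."""
    pos = sum(token in positive_words for token in tokens)
    neg = sum(token in negative_words for token in tokens)
    return [pos, neg]

# Toy example; a real run would supply lexicon word sets loaded from file.
print(lexicon_counts(["yA", "klb"], positive_words={"jmyl"}, negative_words={"klb"}))
```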
Static Embeddings: We experimented with several pre-trained static embeddings that were trained on different corpora with different vector dimensionality. We compared pre-trained embeddings to embeddings that were trained on our dataset. For pre-trained embeddings, we used: fastText Egyptian Arabic pre-trained embeddings (Bojanowski et al., 2017) with a vector dimensionality of 300; AraVec skip-gram embeddings (Mohammad et al., 2017), trained on 66.9M Arabic tweets with 100-dimensional vectors; and Mazajak skip-gram embeddings (Abu Farha and Magdy, 2019), trained on 250M Arabic tweets with 300-dimensional vectors. Sentence embeddings were calculated by taking the mean of the embeddings of their tokens.
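A minimal sketch of this representation, paired with the RBF-kernel SVM described in Section 4.3, is shown below; it is illustrative only, the embedding file name is a placeholder, and any static embeddings in word2vec text format could be substituted.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.svm import SVC

# Pre-trained static embeddings in word2vec format (the file name is a placeholder).
vectors = KeyedVectors.load_word2vec_format("static_embeddings_300d.vec", binary=False)

def tweet_vector(tokens, kv):
    """Mean of the token embeddings; zeros if no token is in the vocabulary."""
    known = [kv[t] for t in tokens if t in kv]
    return np.mean(known, axis=0) if known else np.zeros(kv.vector_size)

def train_svm(tokenized_tweets, labels, kv):
    """Fit an SVM with a radial basis function kernel on mean-pooled tweet vectors."""
    X = np.vstack([tweet_vector(tokens, kv) for tokens in tokenized_tweets])
    return SVC(kernel="rbf").fit(X, labels)
```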
The importance of testing a character-level n-gram model like fastText lies in the agglutinative nature of the Arabic language. We trained a new fastText text classification model (Joulin et al., 2017) on our dataset with vectors of 40 dimensions, a learning rate of 0.5, and 2–10 character n-grams as features, for 30 epochs. These hyper-parameters were tuned using a 5-fold cross-validated grid search.
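Roughly, this configuration maps onto the fastText supervised API as follows; the snippet is a sketch, and the training file path, label format, and example text are assumptions rather than the authors' actual setup.

```python
import fasttext

# Training file in fastText format: one tweet per line, prefixed with its label,
# e.g. "__label__OFF <tweet text>". The file name is a placeholder.
model = fasttext.train_supervised(
    input="train.txt",
    dim=40,     # vector dimensionality
    lr=0.5,     # learning rate
    minn=2,     # minimum character n-gram length
    maxn=10,    # maximum character n-gram length
    epoch=30,
)
labels, probs = model.predict("يا كلب")  # predicted label(s) and probability
```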
Deep Contextualized Embeddings: We also experimented with pre-trained contextualized embeddings with fine-tuning for downstream tasks. Recently, deep contextualized language models such as BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), ULMFiT (Howard and Ruder, 2018), and OpenAI GPT (Radford et al., 2018) have achieved ground-breaking results in many NLP classification and language understanding tasks. In this paper, we fine-tuned BERTbase-multilingual (or simply BERT) and AraBERT embeddings to classify Arabic offensive language on Twitter, as this eliminates the need for feature engineering. Although Robustly Optimized BERT (RoBERTa) embeddings perform better than BERTlarge on the GLUE (Wang et al., 2018), RACE (Lai et al., 2017), and SQuAD (Rajpurkar et al., 2016) tasks, pre-trained multilingual RoBERTa models are not available. BERT is pre-trained on Wikipedia text from 104 languages, and AraBERT is trained on a large Arabic news corpus containing 8.5M articles composed of roughly 2.5B tokens. Both use identical architectures and come with hundreds of millions of parameters: each contains an encoder with 12 Transformer blocks, a hidden size of 768, and 12 self-attention heads, and both use BP sub-word segments. Following Devlin et al. (2019), classification consists of introducing a dense layer over the final hidden state h corresponding to the first token of the sequence, [CLS], and adding a softmax activation on top of BERT to predict the probability of label l: p(l|h) = softmax(Wh), where W is the task-specific weight matrix. During fine-tuning, all BERT/AraBERT parameters together with W are optimized end-to-end to maximize the log-probability of the correct labels.

4.3   Classification Models

We explored different classifiers. When using lexical features and pre-trained static embeddings, we primarily used an SVM classifier with a radial basis function kernel. Only when using the Mazajak embeddings did we also experiment with other classifiers such as AdaBoost and logistic regression; the SVM classifier performed the best on static embeddings, and we picked the Mazajak embeddings because they yielded the best results among all static embeddings. We used the scikit-learn implementations of these classifiers, e.g., libsvm for the SVM classifier. We also experimented with fastText, which trained embeddings on our data. When using contextualized embeddings, we fine-tuned BERT and AraBERT by adding a fully-connected dense layer followed by a softmax classifier, minimizing the binary cross-entropy loss function on the training data. For all experiments, we used the PyTorch implementation by HuggingFace (https://pytorch.org/; https://github.com/huggingface/transformers), as it provides pre-trained weights and vocabularies.
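A condensed sketch of this fine-tuning setup using the HuggingFace transformers API is shown below. The checkpoint name, sequence length, learning rate, and the simplified full-batch training loop are our assumptions for illustration; the authors' exact configuration may differ.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Checkpoint name is an assumption; "bert-base-multilingual-cased" works the same way.
NAME = "aubmindlab/bert-base-arabertv01"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME, num_labels=2)

def fine_tune(texts, labels, epochs=3, lr=2e-5):
    """End-to-end fine-tuning: a classification head (softmax over the [CLS]
    representation) is optimized jointly with all encoder parameters."""
    enc = tokenizer(texts, padding=True, truncation=True, max_length=64,
                    return_tensors="pt")
    y = torch.tensor(labels)  # 1 = offensive, 0 = clean
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        out = model(**enc, labels=y)  # cross-entropy loss over the two label logits
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
```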

4.4   Evaluation

For all of our experiments, we used 5-fold cross-validation with identical folds across experiments. Table 2 reports the results of using lexical features and static pre-trained embeddings with an SVM classifier, embeddings trained on our data with the fastText classifier, and BERT and AraBERT with a dense layer and softmax activation. As the results show, fine-tuned AraBERT yielded the best results overall, followed closely by Mazajak/SVM, with large improvements in precision over using BERT. The success of AraBERT was surprising given that it was not trained on social media text. Perhaps pre-training a Transformer model on social media text may improve results further. We suspect that the Mazajak/SVM combination performed better than BERT because the Mazajak embeddings, though static, were trained on in-domain data, as opposed to BERT. For completeness, we compared 7 other classifiers with SVM using Mazajak embeddings. As the results in Table 3 show, using SVM yielded the best results.

      Model/classifier          Prec.   Recall     F1
      Lexical features
      SVM                        68.5     35.3    46.6
      Pre-trained static embeddings
      fastText/SVM               76.7     43.5    55.5
      AraVec/SVM                 85.5     69.2    76.4
      Mazajak/SVM                88.6     72.4    79.7
      Embeddings trained on our data
      fastText/fastText          82.1     68.1    74.4
      Contextualized embeddings
      BERTbase-multilingual      78.3     74.0    76.0
      AraBERT                    84.6     82.4    83.2

Table 2: Classification performance with different features and models.

      Model                     Prec.   Recall     F1
      Decision Tree              51.2     53.8    52.4
      Random Forest              82.4     42.4    56.0
      Gaussian NB                44.9     86.0    59.0
      Perceptron                 75.6     67.7    66.8
      AdaBoost                   74.3     67.0    70.4
      Gradient Boosting          84.2     63.0    72.1
      Logistic Regression        84.7     69.5    76.3
      SVM                        88.6     72.4    79.7

Table 3: Performance of different classification models on Mazajak embeddings.
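The evaluation protocol behind these tables can be sketched as follows; the snippet is illustrative, it assumes binary labels with 1 = offensive over pre-computed tweet vectors, and the fold construction beyond using a fixed seed is an assumption, since the paper only states that identical folds were used across experiments.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

def evaluate(X, y, seed=0):
    """5-fold cross-validation with fixed folds, reporting precision, recall,
    and F1 for the positive (offensive) class."""
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_validate(SVC(kernel="rbf"), X, y, cv=folds,
                            scoring=("precision", "recall", "f1"))
    return {m: np.mean(scores[f"test_{m}"]) for m in ("precision", "recall", "f1")}
```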
4.5   Error Analysis

We inspected the tweets of one fold that were misclassified by the Mazajak/SVM model (36 false positives and 121 false negatives) to determine the most common errors. They were as follows:

Four false positive types:

• Gloating: ex. "yA hbydp" ("O you delusional one"), referring to fans of a rival sports team for thinking they could win.

• Quoting: ex. "lmA Hd ysb wyqwl yA klb" ("when someone swears and says: O dog").

• Idioms: ex. "yA fATr rmDAn yA xAsr dynk" ("O you who does not fast Ramadan, you have lost your faith"), which is a colloquial idiom.

• Implicit sarcasm: ex. "yA xAyn Ant EAwz t$kk fy Hb Al$Eb llrys" ("O traitor, (you) want to question people's love for the president"), where the author is mocking the president's popularity.

Two false negative types:

• Mixture of offensiveness and admiration: ex. calling a girl a puppy, "yA klbwbp" ("O puppy"), in a flirtatious manner.

• Implicit offensiveness: ex. calling for a cure while implying insanity: "wt$fy HkAm bldk mn AlmrD" ("and cure the rulers of your country from illness").

5   Conclusion and Future Work

In this paper we presented a systematic method for building an Arabic offensive language tweet dataset that does not favor specific dialects, topics, or genres. We developed detailed guidelines for tagging the tweets as clean or offensive, including special tags for vulgar tweets and hate speech. We tagged 10,000 tweets, which we plan to release publicly and which would constitute the largest available Arabic offensive language dataset. We characterized the offensive tweets in the dataset to determine the topics that elicit such language, the dialects that are most often used, the common modes of offensiveness, and the gender distribution of their authors. We performed this breakdown for offensive tweets in general and for vulgar and hate speech tweets separately. We believe that this is the first detailed analysis of its kind. Lastly, we conducted a large battery of experiments on the dataset, using cross-validation, to establish a strong system for Arabic offensive language detection. We showed that using an Arabic-specific BERT model (AraBERT) and static embeddings trained on tweets produced competitive results on the dataset.

For future work, we plan to pursue several directions. First, we want to explore target-specific offensive language, where attacks against an entity or a group may employ certain expressions that are only offensive within the context of that target and completely innocuous otherwise. Second, we plan to examine the effectiveness of cross-dialectal and cross-lingual learning of offensive language.

References

Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. 2016. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 11–16.

Ehab Abozinadah. 2017. Detecting Abusive Arabic Language Twitter Accounts Using a Multidimensional Analysis Model. Ph.D. thesis, George Mason University.

Ibrahim Abu Farha and Walid Magdy. 2019. Mazajak: An online Arabic sentiment analyser. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 192–198, Florence, Italy. Association for Computational Linguistics.

Sweta Agrawal and Amit Awekar. 2018. Deep learning for detecting cyberbullying across multiple social media platforms. In European Conference on Information Retrieval, pages 141–153. Springer.

Azalden Alakrot, Liam Murray, and Nikola S Nikolov. 2018. Towards accurate detection of offensive language in online communication in Arabic. Procedia Computer Science, 142:315–320.

Nuha Albadi, Maram Kurdi, and Shivakant Mishra. 2018. Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69–76. IEEE.

Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9–15.

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Pablo Barberá and Gaurav Sood. 2015. Follow your ideology: Measuring media ideology on social networks. In Annual Meeting of the European Political Science Association, Vienna, Austria. Retrieved from http://www.gsood.com/research/papers/mediabias.pdf.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Thomas Chadefaux. 2014. Early warning signals for war in the news. Journal of Peace Research, 51(1):5–18.

Michael Conover, Jacob Ratkiewicz, Matthew R Francisco, Bruno Gonçalves, Filippo Menczer, and Alessandro Flammini. 2011. Political polarization on Twitter. ICWSM, 133:89–96.

Kareem Darwish, Dimitar Alexandrov, Preslav Nakov, and Yelena Mejova. 2017. Seminar users in the Arabic Twitter sphere. In International Conference on Social Informatics, pages 91–108. Springer.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Eleventh International Conference on Web and Social Media (ICWSM), pages 512–515.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. 2015. Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web, pages 29–30. ACM.

Samhaa R. El-Beltagy. 2016. NileULex: A phrase and word level sentiment lexicon for Egyptian and modern standard Arabic. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2900–2905, Portorož, Slovenia. European Language Resources Association (ELRA).

Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics.

Timothy Jay and Kristin Janschewitz. 2008. The pragmatics of swearing. Journal of Politeness Research. Language, Behaviour, Culture, 4(2):267–288.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics.

Irene Kwok and Yuzhou Wang. 2013. Locate the hate: Detecting tweets against blacks. In Twenty-Seventh AAAI Conference on Artificial Intelligence.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations.

Alessandro Maisto, Serena Pelosi, Simonetta Vietri, Pierluigi Vitale, and Via Giovanni Paolo II. 2017. Mining offensive language on social media. CLiC-it 2017, 11-12 December 2017, Rome, page 252.

Shervin Malmasi and Marcos Zampieri. 2017. Detecting hate speech in social media. arXiv preprint arXiv:1712.06427.

Elisa Mattiello. 2005. The pervasiveness of slang in standard and non-standard English. Mots Palabras Words, 5:7–41.

Abu Bakr Mohammad, Kareem Eissa, and Samhaa El-Beltagy. 2017. AraVec: A set of Arabic word embedding models for use in Arabic NLP. Procedia Computer Science, 117:256–265.

Hamdy Mubarak and Kareem Darwish. 2019. Arabic offensive language classification on Twitter. In International Conference on Social Informatics, pages 269–276. Springer.

Hamdy Mubarak, Kareem Darwish, and Walid Magdy. 2017. Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145–153. International World Wide Web Conferences Steering Committee.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text.

Younes Samih, Mohamed Eldesouki, Mohammed Attia, Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, and Laura Kallmeyer. 2017. Learning from relatives: Unified dialectal Arabic segmentation. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 432–441.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.

Dawei Yin, Zhenzhen Xue, Liangjie Hong, Brian D Davison, April Kontostathis, and Lynne Edwards. 2009. Detection of harassment on Web 2.0. Proceedings of the Content Analysis in the WEB, 2:1–7.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. SemEval-2019 Task 6: Identifying and categorizing offensive language in social media (OffensEval). arXiv preprint arXiv:1903.08983.

Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin. 2020. SemEval-2020 Task 12: Multilingual offensive language identification in social media (OffensEval 2020). arXiv preprint arXiv:2006.07235.