JOURNAL OF INFORMATION AND COMMUNICATION TECHNOLOGY

Journal of ICT, 20, No. 1 (January) 2021, pp: 1-20

 http://jict.uum.edu.my

How to cite this article:
Nezhad, Z. B., & Deihimi, A. M. (2021). Sarcasm detection in Persian. Journal of Information
and Communication Technology, 20(1), 1-20. https://doi.org/10.32890/jict.20.1.2021.6249

 SARCASM DETECTION IN PERSIAN

1Zahra Bokaee Nezhad & 2Mohammad Ali Deihimi

1Department of Computer Engineering, Zand University, Iran
2Department of Electronics Engineering, Bahonar University, Iran

Corresponding author: zbokaee@gmail.com
m.a.deihimi@gmail.com

Received: 25/4/2019 Revised: 13/5/2020 Accepted: 14/5/2020 Published: 4/11/2020

 ABSTRACT

Sarcasm is a form of communication where the individual states the opposite of
what is implied. Therefore, detecting a sarcastic tone is somewhat complicated
due to its ambiguous nature. On the other hand, identification of sarcasm is
vital to various natural language processing tasks such as sentiment analysis
and text summarisation. However, research on sarcasm detection in Persian
is very limited. This paper investigated the sarcasm detection technique on
Persian tweets by combining deep learning-based and machine learning-based
approaches. Four sets of features that cover different types of sarcasm were
proposed, namely deep polarity, sentiment, part of speech, and punctuation
features. These features were utilised to classify the tweets as sarcastic and non-
sarcastic. In this study, the deep polarity feature was proposed by conducting
a sentiment analysis using deep neural network architecture. In addition, to
extract the sentiment feature, a Persian sentiment dictionary was developed,
which consisted of four sentiment categories. The study also used a new
Persian proverb dictionary in the preparation step to enhance the accuracy of
the proposed model. The performance of the model is analysed using several


standard machine learning algorithms. The results of the experiment showed
that the method outperformed the baseline method and reached an accuracy of
80.82%. The study also examined the importance of each proposed feature set
and evaluated its added value to the classification.

Keywords: Sarcasm detection, natural language processing, machine learning,
sentiment analysis, classification.

 INTRODUCTION

Twitter has become one of the biggest destinations for people to put forward
their opinions. There is a great opportunity for companies and organisations to
notice users’ ideas (Rajadesingan, Zafarani, & Liu, 2015). Sentiment analysis
has emerged as a field of study in identifying opinionative data in the Web and
classifying them according to their polarity. Sentiment analysis can present
reasonable opportunities for marketers to generate market intelligence on
consumer attitudes and help organisations to fulfil their purposes (Al-Otaibi
et al., 2018). Nevertheless, there are some problems with using sentiment
analysis tools. One of these problems is the existence of sarcasm in the user’s
view, which can lead to misclassification of sentiment analysis (Maynard &
Greenwood, 2014). In fact, sarcasm is one of the considerable challenges
in sentiment analysis. It is an indirect way of conveying a message and can be
expressed through different channels such as direct conversation, speech, and
text (Seyed Sadeqi & Ehsanjou, 2018). In direct conversation, facial expression
and body gesture help to recognise sarcasm. In speech, sarcasm can be derived
from changes in tone. In text, it is far more difficult to identify sarcasm;
however, there are some methods that help to reveal it.

Cambridge Dictionary describes sarcasm as “the use of remarks that mean
the opposite of what they say made to hurt someone’s feelings or to criticise
something humorously”. Consider the following tweet, “Yay! It’s a holiday
weekend, and I’m on call for work! Couldn’t be luckier!” Although this tweet
includes the words “yay” and “lucky” with positive sentiments, the expression
has a negative sentiment (Liu et al., 2014). This shows that detecting the
sentiment of a tweet seems complicated when the tweet is sarcastic. Sarcasm
may create problems for not only sentiment analysis approaches but also many
other natural language processing (NLP) tasks such as review summarisation
systems. Consider the following tweet, “I’m very pleased to waste my four
hours on such a pathetic movie!”. Although this tweet has the word “pleased”
with a positive sentiment, the whole emotion of the tweet is negative.
Accordingly, if a movie review summarisation system does not employ a


sarcasm detection model, it may recognise this tweet as a positive review
(Maynard & Greenwood, 2014). It is demonstrated that the state-of-the-art
approaches of sentiment analysis can be highly enhanced when there is an
ability to detect sarcastic statements (Hazarika et al., 2018). To compound the
problem, in Persian, people tend to use sarcasm in their daily conversations
for criticising and censoring, especially on political topics (Hokmi, 2017). To
the best of the researchers’ knowledge, there is no work on sarcasm detection
in Persian. Therefore, they aim to present a model that performs the task of
sarcasm detection in Persian. The proposed approach considers different types
of sarcasm and it is evaluated on the first sarcastic Persian dataset.
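To make the failure mode concrete, the sketch below shows how a naive lexicon-based scorer assigns a positive label to the sarcastic holiday tweet quoted above. The word lexicon here is invented for illustration and is not a real sentiment resource.

```python
# Toy word-polarity lexicon; the entries are illustrative only.
POLARITY = {"yay": 1, "lucky": 1, "luckier": 1, "pleased": 1,
            "waste": -1, "pathetic": -1}

def naive_sentiment(text):
    """Sum known word polarities; a positive total counts as positive."""
    score = sum(POLARITY.get(w.strip("!.,'\"").lower(), 0)
                for w in text.split())
    return "positive" if score > 0 else "negative"

# The sarcastic tweet is scored positive, although its real sentiment is negative.
print(naive_sentiment("Yay! It's a holiday weekend, and I'm on call "
                      "for work! Couldn't be luckier!"))  # -> positive
```

This is exactly the misclassification that a sarcasm detection step is meant to prevent.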

The rest of the article is structured as follows. The next section describes
related works. It is followed by the architecture of the proposed model. The
results are shown in the fourth section, while the fifth section concludes this
work and proposes possible directions for future works.

 RELATED WORKS

Although several studies have been conducted to detect sarcasm in English,
there has been no attempt to address this problem in Persian. However, there
is a growing body of research on sarcasm recognition in other languages, and
the subject has gained attention rapidly. Twitter sarcasm
detection techniques can be classified into five categories, namely pattern-
based approach, context-based approach, deep learning-based approach,
machine learning-based approach, and lexicon-based approach (Bharti et al.,
2017).

Pattern-based approach: Bouazizi and Otsuki (2016) used a pattern-based
approach to detect sarcasm on Twitter. They utilised three patterns for sarcasm
in tweets: (1) sarcasm as Wit, which is used by capital letter words, punctuation
marks, or sarcasm-related emoticons to show the sense of humour; (2) sarcasm
as Evasion, which is used when a person wants to avoid giving a clear answer;
and (3) sarcasm as Whimper, in which the anger of a person is shown. Then,
four sets of features were extracted as sentiment-related features, punctuation-
related features, syntactic and semantic features, and pattern features. They
reached an accuracy of 83.1%.

Context-based approach: Schifanella et al. (2016) built a complex
classification model that worked over an entire tweet sequence rather than
one tweet at a time. They deployed features based on the integration between
linguistic and contextual features.


Deep learning-based approach: Some studies proposed automated sarcasm
detection using deep neural network architecture. Son et al. (2019) put forward
a hybrid deep learning model based on soft Attention-Based Bidirectional
Long Short-Term Memory (sAtt-BLSTM) and convolutional neural network
(ConvNet). They applied Global Vector (GloVe) for word representation.
Additionally, they used feature maps generated by sAtt-BLSTM as well as
punctuation-based features. Felbo et al. (2017) proposed the DeepMoji model
based on the occurrence of emoji. They used Long Short-Term Memory
(LSTM) as well as attention mechanisms and obtained acceptable results.
Ghosh and Veale (2016) built a model combining a Recurrent Neural Network
(RNN) with a ConvNet. Their model produced better results compared to a
recursive support vector machine (SVM).

Machine learning-based approach: Many studies have been conducted to
detect sarcasm based on machine learning approach. Rajadesingan et al. (2015)
proposed a model to detect sarcasm by behavioural features using users’ past
tweets. They employed theories from behavioural and psychological studies to
construct their features. They used Naïve Bayes and SVM classifiers. Blamey
et al. (2012) suggested a new feature to capture properties of a figurative
language like emotional scenario and unexpectedness with ambiguity and
polarity. Suhaimin et al. (2019) presented a framework to support sentiment
analysis by using sarcasm detection and classification. The framework
comprised six modules: pre-processing, feature extraction, feature selection,
initial sentiment classification, sarcasm detection and classification, and actual
sentiment classification. The framework was evaluated using a nonlinear
SVM and Malay social media data. The best average F-measure score of
90.5% was recorded using their framework. Suhaimin et al. (2017) proposed a
feature extraction process to detect sarcasm using Malay social media data as
bilingual texts. They considered four categories of features using NLP, namely
lexical, pragmatic, prosodic, and syntactic. They investigated the use of the
idiosyncratic feature to capture peculiar and odd comments found in the texts.
A nonlinear SVM was utilised for classification. Their results demonstrated that
the combination of syntactic, pragmatic, and prosodic features produced the
best performance with an F-measure score of 85.2%. Lunando and Purwarianti
(2013) presented an Indonesian sarcasm detection model by using unigram,
number of interjections, words, negativity, and question words as extracted
features. For term recognition, they used translated SentiWordnet, which led
to undetected terms and very low accuracy. Rahayu et al. (2018) proposed
another method on Indonesian tweets based on two extracted features,
namely interjection and punctuation. They also used two different weighting
algorithms. Their proposed model outperformed the model proposed by
Lunando and Purwarianti (2013). Nevertheless, the paucity of sophisticated
features in their work leaves room for improvement.

 4
w accuracy. Rahayu et al. (2018) proposed another method on Indonesian tweets based on two ext
atures, namely interjection and punctuation. They also used two different weighting algorithms.
 Journal of ICT, 20, No. 1 (January) 2021, pp: 1-20
oposed model outperformed the model proposed by Lunando and Purwarianti (2013). Neverthele
ucity of sophisticated features on their work leave more room for improvement.
Lexicon-based approach: Parmar et al. (2018) utilised lexical and hyperbole
features to improve sentiment analysis results. They employed MapReduce to
reduce the execution time through a parallel processing platform. They
suggested five parts for their proposed model. In the first part, different
Twitter application programming interfaces (APIs) were used to retrieve
tweets. Afterwards, the results were stored in the Hadoop Distributed File
System (HDFS). Secondly, all data were preprocessed to remove noisy data
such as Uniform Resource Locators (URLs). The third part was word
relationships, which were derived by the part-of-speech (POS) tagging
method. Subsequently, all the phrases were stored in a parsed file. As the
fourth step, sentiment analysis was conducted on all phrases. The last step
was a feature-based composite approach (FBCA). In this step, the algorithm
used hyperbole and lexical features with punctuation and negation features to
detect sarcastic tweets.

Riloff et al. (2013) developed two bags of lexicons using bootstrap techniques.
These lexicons consisted of positive sentiments and negative situations. They
attempted to identify sarcasm in tweets for any positive sentiment in a
negative situation. However, their method had some limitations since it could
not identify sarcasm across multiple sentences.

In the present study, the researchers will investigate the sarcasm detection
technique on Persian tweets by combining deep learning-based and machine
learning-based approaches. They aim to find both word-level and sentence-
level inconsistencies. Four different sets of features are extracted to cover
different types of sarcasm in Persian. In addition, this study presents the first
Persian proverb dictionary to avoid misclassification, which will be further
explained in the next section.

PROPOSED MODEL

The existence of sarcasm in tweets can lead to a state of ambiguity. It can
also decrease the accuracy of some NLP tasks such as sentiment analysis
and text summarisation (Bharti et al., 2017). Consider the following tweet,
"گوشیم خورد شد، بهتر از این نمیشه!!!" (My phone is broken, how better can
this be???). Although it is a sarcastic tweet, without deploying a sarcasm
detection model, the sentiment analysis model may be misled. Therefore, it is
crucial to develop a model to detect sarcasm.

Figure 1 represents the architecture of the proposed model. The model
consists of four key components, namely (1) Data Preprocessing, (2) Data
Preparation, (3) Feature Extraction, and (4) Classification. From each tweet,
the researchers extracted a set of features in a way that covered different types
of sarcasm. Then, they used several machine learning algorithms to perform
the classification.

Figure 1. Architecture of Persian sarcasm detection model.
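The four components can be sketched as a simple processing chain. The skeleton below is hypothetical: the function bodies are placeholders standing in for the stages named in Figure 1, since the paper does not publish its implementation.

```python
def preprocess(tweet):
    # (1) Data Preprocessing: text filtering, emoji and proverb dictionaries.
    return tweet.strip()

def prepare(tweet):
    # (2) Data Preparation: normalisation, tokenisation, lemmatisation, POS tagging.
    return tweet.split()

def extract_features(tokens):
    # (3) Feature Extraction: deep polarity, sentiment, POS, punctuation features.
    return {"n_tokens": len(tokens),
            "n_exclamations": sum(t.count("!") for t in tokens)}

def classify(features):
    # (4) Classification: the paper compares several standard ML classifiers;
    # this threshold rule is only a stand-in for a trained model.
    return "sarcastic" if features["n_exclamations"] >= 2 else "non-sarcastic"

label = classify(extract_features(prepare(preprocess("Great!! Just great!!"))))
```

In the actual model, stage (4) is trained on the feature sets described in the following sections rather than a hand-written rule.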

Data Collection

One problem of NLP tasks in Persian is that there is no standard dataset for
sentiment analysis nor sarcasm detection (Gelbukh, 2009). Therefore, the
present study created a Persian dataset for sarcasm detection using the
Tweepy API. To collect sarcastic tweets, the researchers queried the API for
tweets containing the hashtags #طعنه or #کنایه (both mean sarcasm). Non-
sarcastic tweets were retrieved based on political hashtags. Then, these tweets
were manually checked and cleaned up by removing noisy and irrelevant
tweets. In this step, to avoid bias, ten different computer scientists and native
speakers were employed for checking labels. Ultimately, the study collected
1,200 sarcastic tweets containing the hashtags #طعنه or #کنایه and 1,300
non-sarcastic tweets. In this dataset, sarcastic and non-sarcastic tweets were
labelled 0 and 1, respectively.
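A minimal sketch of this collection step, assuming Tweepy's `Cursor` pagination over `API.search_tweets` (the method is named `search` in Tweepy versions before 4.0), together with the dataset's labelling convention:

```python
SARCASM_TAGS = ["#طعنه", "#کنایه"]  # both mean "sarcasm"

def build_query(tags):
    """OR-combine hashtags into a single Twitter search query string."""
    return " OR ".join(tags)

def label_tweet(text, is_sarcastic):
    """Per the dataset convention above: sarcastic -> 0, non-sarcastic -> 1."""
    return {"text": text, "label": 0 if is_sarcastic else 1}

def collect_sarcastic(api, n=1200):
    # `api` is an authenticated tweepy.API instance.
    import tweepy
    return [label_tweet(status.text, True)
            for status in tweepy.Cursor(api.search_tweets,
                                        q=build_query(SARCASM_TAGS),
                                        lang="fa").items(n)]
```

Non-sarcastic tweets would be collected the same way with political hashtags and `is_sarcastic=False`.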
Data Preprocessing

In this section, tweets were preprocessed to clean and transfer them for
feature extraction. This step includes the following process:
Text Filtering: The researchers filtered non-Persian tweets, URLs, retweets,
mentions, and some special characters such as ‘^%#-+’. In addition, all
hashtags were removed, and informal words were replaced with formal words
using Dehkhoda – the largest comprehensive Persian dictionary (Dehkhoda, 1931).
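A filtering pass along these lines can be sketched with regular expressions. The exact rules and character set below are assumptions for illustration, not the authors' published code, and the Dehkhoda-based informal-to-formal substitution is omitted.

```python
import re

def filter_tweet(text):
    """Strip URLs, retweet markers, mentions, hashtags, and special characters."""
    text = re.sub(r"https?://\S+", "", text)   # URLs
    text = re.sub(r"(?:^|\s)RT\b", " ", text)  # retweet markers
    text = re.sub(r"@\w+", "", text)           # mentions
    text = re.sub(r"#\S+", "", text)           # hashtags
    text = re.sub(r"[\^%#\-+]", "", text)      # special characters like ^%#-+
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(filter_tweet("RT @user check https://t.co/x #tag hi^"))  # -> check hi
```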

Emoji Dictionary: People usually use emoji during their daily conversations
in microblogs such as Twitter and Facebook. As a result, the present
researchers created an emoji dictionary containing all Twitter emojis. These
emojis were manually labelled as happy, sad, and neutral. Based on the
dictionary, each emoji in a retrieved tweet was replaced by its relative label.

Proverb Dictionary: Persian speakers mostly use slang expressions as well
as several common proverbs in their conversations (Hokmi, 2017). 1,000
common Persian slang expressions and proverbs were collected using the
Dehkhoda Dictionary. The study also used the website created by Ehsan
(2013) to find some other slangs such as: "به جهنم" (The hell with it),
"سرویس کردی دهنمو" (bite me), "دلم مثل سیر و سرکه میجوشه" (I have a
butterfly in my stomach), etc. Finally, each tweet was scanned to replace its
proverb or slang with the direct meaning.
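Both dictionary look-ups amount to string substitution over each tweet. The sketch below is illustrative: the emoji labels and the single proverb entry are tiny hypothetical samples, not the full dictionaries built by the authors.

```python
# Hypothetical sample entries; the authors' dictionaries cover all Twitter
# emojis and 1,000 slang expressions and proverbs.
EMOJI_LABELS = {"😂": "happy", "😢": "sad", "😐": "neutral"}
PROVERBS = {"به جهنم": "the hell with it"}

def apply_dictionaries(text):
    """Replace emojis with sentiment labels and proverbs with direct meanings."""
    for emoji, tag in EMOJI_LABELS.items():
        text = text.replace(emoji, f" {tag} ")
    for proverb, meaning in PROVERBS.items():
        text = text.replace(proverb, meaning)
    return " ".join(text.split())
```

After this step, downstream feature extraction sees the sentiment label or literal meaning instead of the symbol or idiom.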
 Data Preparation Data Preparation Data Preparation
 Data Preparation Data Preparation Data Preparation
 paration
eparation
 eparation Data Preparation
In this section, four types of preparation were performed on the dataset, namely (1) Normalisation, (2) Word Tokenisation, (3) Lemmatisation, and (4) Part-of-Speech (POS) Tagging.
Normalisation: This step converted a list of words into a uniform sequence. With normalisation, the researchers aimed to overcome profound challenges in Persian, including: (1) the existence of various prefixes such as "ب" (b), "بر" (bar), "پس" (pas), "فرا" (fara), etc.; (2) different encoding forms for some characters like "ی" (y) and "ک" (k); (3) the use of half-space, as in "آن‌ها" and "سخت‌تر"; and (4) the existence of a wide range of suffixes such as "ها" (ha), "ترین" (tarin), "یان" (yan), etc. (Mohtaj, Roshanfekr, Zafarian, & Asghari, 2018).
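The character-level part of this step can be sketched as follows. This is an illustrative simplification, an assumption of how such normalisation works rather than the study's actual code (the study used Hazm), and the half-space handling covers only the plural suffix "ها".

```python
# Illustrative sketch of Persian normalisation (not the study's code):
# unify Arabic vs. Persian code points (challenge 2) and insert the
# half-space (zero-width non-joiner) before the plural suffix (challenge 3).
CHAR_MAP = {
    "\u064A": "\u06CC",  # Arabic Yeh -> Persian Yeh
    "\u0643": "\u06A9",  # Arabic Kaf -> Persian Kaf
}

def normalize(text: str) -> str:
    for arabic, persian in CHAR_MAP.items():
        text = text.replace(arabic, persian)
    # "آن ها" (plain space) -> "آن‌ها" (half-space before plural suffix)
    return text.replace(" \u0647\u0627", "\u200c\u0647\u0627")
```

In practice Hazm's Normalizer performs these substitutions (and many more) in one call.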
Tokenisation: This step was conducted to break each tweet down into its constitutive words. Let us consider an example: "من یک برنامه نویس هستم" (I am a programmer). It converts to "من", "یک", "برنامه نویس", "هستم".
Lemmatisation and Stemming: This step was done to decrease the size of the dataset and to find the root of all words.
Part-of-Speech (POS) Tagging: It is a process of converting tweets into lists of tuples, where each tuple has the form (word, POS tag). The POS tag signifies whether the word is a noun, adjective, verb, and so on.
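The resulting representation can be illustrated with a toy tagger; the tiny lexicon and the noun fallback below are assumptions made for illustration, not the behaviour of Hazm's actual POSTagger.

```python
# Toy illustration of the (word, POS tag) tuple representation.
# The lexicon and the "N" (noun) fallback are illustrative assumptions.
def pos_tag(tokens: list[str], lexicon: dict[str, str]) -> list[tuple[str, str]]:
    # Pair each token with its tag; unknown words fall back to "N".
    return [(tok, lexicon.get(tok, "N")) for tok in tokens]

tagged = pos_tag(["من", "برنامه نویس", "هستم"], {"من": "PRO", "هستم": "V"})
```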
 All preparation steps were done using Hazm, a python library for digesting
 Persian text.

 Feature Extraction

 In this section, four sets of features were extracted, namely (1) Sentiment
 Feature, (2) Deep Polarity Feature, (3) POS Feature, and (4) Punctuation
 Feature.

 These features were extracted in a way that covered different types of Persian
 sarcasm. Based on these features, all of the retrieved tweets were represented
 by feature vectors. This section explains how to represent a tweet as a feature
 vector to train a classifier for sarcasm identification.
Deep Polarity Features

This section focuses on sentence-level inconsistency. Let us consider the tweet: "گوشیم خراب شد چه قدر خوشبختم!!!" (My phone is broken, how lucky I am!!!). There is a sentence-level inconsistency between the first and second parts of the tweet (i.e. my phone is broken and how lucky I am!!!). We considered each tweet with more than n tokens as a multiple sentence tweet (MST). Based on the authors' observation of about 1,000 retrieved tweets, the proper values for n are 6, 12, and 18, respectively. The best value for n was evaluated in the Analysis and Results section. Each MST was divided into two parts. Thus, for the mentioned tweet, it is as follows:
tweet: My phone is broken how lucky I am!!!
n > 6 or n > 12 or n > 18? The tweet is an MST, so divide it into two equal parts; otherwise, skip it.
n = 8, hence we have:
Part 1: My phone is broken
Part 2: how lucky I am!!!
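The MST check and split above can be sketched as follows (one assumption made here: an odd-length MST puts the extra token in the first half):

```python
# Sketch of the MST test and the two-way split described above.
def split_mst(tokens: list[str], n: int = 6):
    if len(tokens) <= n:              # not an MST: skip it
        return None
    mid = (len(tokens) + 1) // 2      # two (roughly) equal parts
    return tokens[:mid], tokens[mid:]

parts = split_mst("My phone is broken how lucky I am".split())
# parts -> (['My', 'phone', 'is', 'broken'], ['how', 'lucky', 'I', 'am'])
```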
If there is any sentiment inconsistency between the first and second parts of the tweet, it hints that the tweet is sarcastic. To detect this, first, the present research combined two deep neural network models, an LSTM and a convolutional neural network, as proposed by Roshanfekr et al. (2017). Then, the combined deep learning model was applied to each MST. Afterwards, the deep model was applied to the first and second parts of each tweet, respectively. Consequently, two new binary features, dpf1 and dpf2, were introduced (which stand for deep polarity feature).
The feature dpf1 is activated if the first and second parts of the MST do not have the same sentiment. The feature dpf2 is activated if the first or second part of the MST does not have the same sentiment as the whole MST.

For example, in this tweet: "My phone is broken and how lucky I am!!!", the sentiment analysis model detected positive sentiment for the whole tweet and negative sentiment for the first part. Thus, the feature dpf2 was activated.
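The dpf1/dpf2 logic can be sketched as below. The keyword-based `sentiment` stub stands in for the combined LSTM/CNN model and is purely an assumption for illustration, not the paper's classifier.

```python
# dpf1/dpf2 as described above. sentiment() is a toy placeholder for the
# combined deep model, keyed on a single word for demonstration only.
def sentiment(text: str) -> str:
    return "negative" if "broken" in text else "positive"

def deep_polarity_features(part1: str, part2: str) -> tuple[int, int]:
    whole = sentiment(part1 + " " + part2)
    s1, s2 = sentiment(part1), sentiment(part2)
    dpf1 = int(s1 != s2)                    # the two parts disagree
    dpf2 = int(s1 != whole or s2 != whole)  # a part disagrees with the whole
    return dpf1, dpf2
```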
Sentiment Feature
This section concentrates on word-level inconsistency, which refers to the coexistence of negative and positive words within the same tweet. For example: "چه خوبه که انقدر بدبختم" (I love my misfortune). There is a word-level inconsistency between "خوب" (love) and "بدبخت" (misfortune). To identify such inconsistency, a sentiment dictionary was built by using two popular Persian dictionaries, Moein (1972) and Dehkhoda (1931). The sentiment dictionary consisted of 2,500 emotional words along with their polarities and scores. There were four different polarities: positive, high positive, negative, and high negative. The scores were integers from 2 (i.e. high positive) to -2 (i.e. high negative). If any word within the tweet did not exist in the sentiment dictionary, its score was set to 0. Table 1 represents these four polarities with related examples.
Table 1

Different Polarities with Examples

Polarity        Example
Positive        خوشحال (happy), خوب (good)
High positive   عالی (excellent), شگفت‌آور (wonderful)
Negative        ناراحت (sad), بدبخت (miserable)
High negative   چندش (disgusting), افتضاح (awful)
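A lookup over such a dictionary might look like the following sketch; the entries mirror Table 1, but this tiny dictionary is only a stand-in for the 2,500-word resource described above.

```python
# Sketch of the score lookup: 2 = high positive ... -2 = high negative;
# out-of-vocabulary words score 0. A tiny stand-in for the real dictionary.
SENTIMENT_DICT = {
    "خوب": 1,      # good       -> positive
    "عالی": 2,     # excellent  -> high positive
    "بدبخت": -1,   # miserable  -> negative
    "افتضاح": -2,  # awful      -> high negative
}

def word_scores(tokens: list[str]) -> list[int]:
    return [SENTIMENT_DICT.get(tok, 0) for tok in tokens]
```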

Noticeably, due to the lack of a rich Persian lexicon, the researchers built the semantic dictionary on their own. First, they selected common emotional words along with their polarity from Dehkhoda (1931) and Moein (1972). Then, for each emotional word, a score between -2 and 2 was assigned based on the word's translation in SentiStrength.

Using the semantic dictionary, four auxiliary features were extracted by counting the number of positive, high positive, negative, and high negative words in each tweet. These features were named pw, PW, nw, and NW. Then, in line with Bouazizi and Ohtsuki's (2015) work, the ratio of the tweet p(t) is defined in Equation 1: