A framework for text mining on Twitter: a case study on joint comprehensive plan of action (JCPOA) between 2015 and 2019

Quality & Quantity
https://doi.org/10.1007/s11135-021-01239-y


Rashid Behzadidoost1,2 · Mahdieh Hasheminezhad1,2 · Mohammad Farshi1,2 ·
Vali Derhami3 · Farinaz Alamiyan‑Harandi3

Accepted: 2 September 2021
© The Author(s), under exclusive licence to Springer Nature B.V. 2021

Abstract
In the big data era, effective frameworks to collect, retrieve, and manage data are a necessity. Because not all tweets are hashtagged by users, retrieving them is a complicated task. To address this issue, we present a rule-based expert system classifier that uses the well-known concept of the fingerprint in the judicial sciences. Using defined rules, this expert system first takes a fingerprint from the tweets of an emerging topic. To make the fingerprint robust, a rule-based search then updates it with its neighboring features. To detect the unhashtagged tweets of the topic, each tweet in question is matched against the generated fingerprint. Neither the Twitter Streaming API nor the REST API provides access to old Twitter data. To address this issue, we present a hybrid approach of Web scraping and the Twitter Streaming API. Compared with similar works, the presented framework offers (1) a novel two-class classification using an expert system approach that can intelligently and robustly detect most tweets of an emerging topic even when they do not carry the topic's hashtag, and (2) a practical method for extracting old Twitter data. We also performed comparative text mining on 195,649 collected Persian and English tweets about the JCPOA. The JCPOA is one of the most important international treaties on the nuclear program, concluded between the Islamic Republic of Iran and the USA, China, France, Russia, Germany, and the United Kingdom.

Keywords Text mining · Topic detection · Sentiment analysis · Fingerprint · Twitter ·
JCPOA · Iran deal

1 Introduction

Twitter is a popular social network on which anyone can spread information. Nowadays, Twitter is used as a medium for reporting international, national, and regional disputes. Through its trends, one can quickly discover the most important discussions in the world. Twitter receives 500 million messages daily and has a vast audience (Bose et al. 2019).

* Mahdieh Hasheminezhad
 hasheminezhad@yazd.ac.ir
Extended author information available on the last page of the article


Famous people such as politicians, celebrities, and official media outlets post their information on Twitter. These famous users have millions of followers, so millions of users view their propagated posts. This shows that Twitter plays an important role in spreading people's thoughts and influencing their opinions. Investigating data obtained from people's opinions on different topics of interest leads to knowledge acquisition. This knowledge can be used to answer the following questions. (1) Who has had the most impact on the disputes? (2) What was the most important discussion? (3) Which sentiments (positive or negative) are common among people?
 A huge amount of data is generated in English, but a great deal is generated in other languages. By analyzing English alone, only part of the big raw data can be covered. Therefore, performing analyses in other languages is necessary for extracting useful information from the generated tweets.
 The use of technology in human life is increasing rapidly, and these technologies have created huge amounts of data in a variety of areas. Preparing data is one of the most important steps in data analysis. Grover et al. (2020) noted that data is the new oil nowadays; they believed that the more data, the more insight. For data analysis on Twitter, the more data available, the more reliable the obtained results. Twitter holds a large volume of up-to-date big data from different people and nations, so it is one of the best sources for knowledge acquisition, and data scientists can take advantage of it.
 For a comprehensive analysis of tweets on a topic of interest, it is ideal to access the maximum number of that topic's tweets. However, not all tweets are hashtagged, and so far there is no method to extract the maximum number of tweets of a topic of interest. Intelligent topic detection can provide access to most of a topic's tweets. The next factor for a higher-quality analysis of Twitter data is access to old Twitter data, because some topics are discussed over a long period. This study presents a framework that can access old Twitter data and intelligently detect the most relevant tweets of the topic of interest even when the tweets do not carry the topic's hashtag.

1.1 Literature review

Text mining is the process of extracting valuable information from textual documents (Vijayarani et al. 2015). It is vital for business owners and governments because, by analyzing up-to-date big data, one can receive good feedback and make the right decisions. Sentiment analysis is a subcategory of text mining that identifies the sentiment of a text, ordinarily as positive, neutral, or negative (Mostafa 2013). It is a recently emerged field of research sought out by data scientists, and it is the process of extracting information such as opinions, beliefs, and attitudes of people (Agarwal et al. 2015). A great deal of research has been done in this area. In one notable work, Bae et al. (2013) performed text mining on tweets of South Korea's 2012 presidential election. Kušen and Strembeck (2018) performed sentiment analysis for the 2016 presidential election in Austria; they indicated that the winner had neutral tweets, while the losing candidate had more emotional tweets of both positive and negative types. Mostafa (2013) performed sentiment analysis on tweets about famous brands and showed a generally positive tendency toward world-famous brands. Zhang et al. (2012) developed an expert system that performs sentiment analysis in Chinese, examining customers' sentiment about the features of a product. Another notable study was done by Öztürk and Ayvaz (2018), who performed comparative sentiment analysis on tweets about Syrian refugees in English and Turkish.


To determine the sentiment of a text in question, some approaches focus on the whole document, some on sentences, and some on the words of a sentence (Hatzivassiloglou and McKeown 1997). A great deal of research has been done on sentiment classification (Dashtipour et al. 2016). One of the early works (Go et al. 2009) used distant supervision, with an accuracy of 82.2%. Wang et al. (2012) conducted sentiment classification on a large data set of 36 million tweets about the 2012 United States presidential election, with a reported accuracy of 59%. Saleena et al. (2018) conducted a comparative sentiment classification and illustrated that an ensemble method outperforms machine learning algorithms such as SVM (support vector machine) and Naïve Bayes. Behzadidoost and Hasheminezhad (2019) performed text mining on tweets of the yellow vest movement; they visualized the most important words with Wordcloud and performed sentiment analysis on French tweets using a machine translation approach.
 A huge amount of non-English data is generated by social network users. By analyzing English data alone, one can access only part of the knowledge obtainable from tweets about the topic of interest; therefore, it is necessary to perform analyses in other languages. Persian is an Indo-European language spoken by over 100 million people worldwide, and due to its linguistic complexity and scarce resources, researchers have paid less attention to it (Shamsfard 2011). Recently, Asgarian et al. (2018) built a Wordnet for Persian using a machine translation approach. Several frameworks exist for text mining on Twitter. One of the best frameworks to collect and manage tweets was the work of Carvalho et al. (2017), who showed that all previous works are limited to accessing no more than one percent of current Twitter data. Perera et al. (2010) described a software architecture that could access tweets of specific users using the Twitter Streaming API and Python. Marcus et al. (2011) presented the TwitInfo platform, which performed sentiment analysis on tweets collected in real time. Another system that uses the Streaming API for collecting tweets was the work of Oussalah et al. (2013), who collected tweets continuously in real time and analyzed them semantically and spatially. These works access only one percent of current Twitter data and therefore cannot access old data. Carvalho et al. (2017) noted that there are application tools such as www.twitonomy.com, www.warble.com, www.socioviz.net, and www.discovertext.com that focus on eye-catching reports and graphics to illustrate relevant tweets. The most powerful of these tools was the commercial service DiscoverText, which could access 100% of tweets; Carvalho et al. (2017) stated that although DiscoverText can access all tweets, it is quite an expensive solution.
 Examining Carvalho's method (Carvalho et al. 2017) reveals that it is not efficient: it needs powerful resources to collect tweets quickly, and otherwise it may take days to weeks to collect them. In one case study using Carvalho's method, 14 million tweets were collected, yet the most frequent tweet dates were only two months old, whereas access to tweets from years earlier was expected. Collecting irrelevant tweets was another drawback: most of the collected tweets were on different topics, and tweets on the same topic were rare.
 In this study, using web scraping and the Twitter REST API, old Twitter data on the desired topic can easily be collected. The advantages of this work over that of Carvalho et al. (2017) are the following: (1) access to tweets from any desired time; (2) accuracy in analyzing the topic of interest, since the provided framework can extract most of the relevant tweets over time, while the method of Carvalho et al. (2017) seldom accesses most of them; (3) speed, since, depending on hardware and connection speed, the method of Carvalho et al. (2017) may take days to weeks to collect old tweets.


 Currently, it is not possible to index tweets by their content. The only way to index tweets is the hashtag mechanism (Carvalho et al. 2017), which depends entirely on the user's will; as a result, just 16% of tweets are hashtagged (Mazzia and Juett 2009). Mining more tweets on the topic of interest provides more insight, so retrieving the remaining 84% of unhashtagged tweets is essential. Topic detection tries to assign the correct topic to a document according to its content, and the hashtag of a tweet can be regarded as its topic. Automatically assigning a suitable hashtag to a tweet poses several challenges (Carvalho et al. 2017): (1) it is a text classification task with a large number of topics, or even unknown topics; (2) the text of each tweet is at most 280 characters, which makes classification difficult; (3) a very large number of tweets must be considered. Carvalho et al. (2017) distinguished between topic detection and topic classification. Topic classification, generally known in NLP as text categorization, finds the correct topic or topics for each document from a closed set of general categories such as sports, politics, religion, or music; the set of tweets therefore very often belongs to one or more of the categories (Carvalho et al. 2017). Topic detection, by contrast, tries to discover the topic of a textual document from a broad set of possible topics that is not predetermined. In this approach, topics are so unique that unhashtagged tweets quite probably do not belong to any of the current trending topics. Most previous works on topic detection aimed to detect hot, emerging, or trending topics, while only a few attempted to find the topic of tweets that lack the hashtag of their emerging topic.
 Lee et al. (2011) studied topic classification, using text-based and network-based methods to classify 768 Twitter trends into 18 classes; the accuracies of these methods were 65% and 70%, respectively. Yüksel et al. (2019) employed a transformer encoder for the classification of Turkish tweets. A great deal of research has been done on detecting emerging topics. Extensive text analysis techniques such as bursty keywords (Mathioudakis and Koudas 2010), aging theory (Cataldi et al. 2010), and a non-negative matrix factorization framework (Saha and Sindhwani 2012) have been utilized to detect the topics of tweets. Xu et al. (2019) performed emerging research topic detection based on a dynamic influence model and a citation influence model. Choi and Park (2019) performed topic detection on Twitter using high-utility pattern mining; on three data sets, their method outperformed BNgram, SFPM, G-based, D-pivot, and LDA. Winarko et al. (2019) compared the document pivot and BN-grams methods for detecting trending topics in Indonesian tweets, showing that BN-grams outperformed document pivot on the six studied data sets. Chen et al. (2017) conducted novel graph-based topic detection using Markov decision processes, which outperformed LDA and KeyGraph. An unsupervised approach was presented by Cigarrán et al. (2016), who showed that formal concept analysis works better than traditional classification and clustering approaches. Petkos et al. (2014) utilized a document-pivot algorithm for topic detection on Twitter, taking advantage of URLs and replies to tweets. For bursty topic detection on Twitter, Kalpana et al. (2017) proposed a framework using HOSVD (higher-order singular value decomposition) and achieved an accuracy of 90%. For event detection on Twitter, Lee et al. (2011) assigned weights to individual words using the EDCoW (Event Detection with Clustering of Wavelet-based Signals) method.


 The fingerprint is an efficient and well-known method in forensic science for determining who owns a print (Homem and Carvalho 2011). In computer science, a fingerprint means that a large amount of data related to a topic is mapped to a compressed block of data. Just as a real fingerprint identifies its owner, in computer science one can identify the original data from its compact block. For example, fingerprints can be used to avoid comparing and transferring extensive data, to identify changes to a file, or to identify a newspaper editor (Homem and Carvalho 2011). Little research has been done on detecting the topic of unhashtagged tweets. Carvalho et al. (2017) stated that the most effective way to detect the topic of unhashtagged tweets is the fuzzy fingerprint. The concept of the fuzzy fingerprint was first introduced and described by Homem and Carvalho (2011), who employed it to identify the authors of Portuguese newspaper articles. Topic detection on Twitter with the fuzzy fingerprint method was first introduced by Rosa et al. (2014), who compared it with the support vector machine and K-nearest-neighbor algorithms and showed that the fuzzy fingerprint outperformed both in terms of speed and performance. In another work (Carvalho et al. 2017), employing fuzzy fingerprints, the authors could intelligently retrieve the topic of most tweets in question even though the tweets lacked the topic's hashtag.
 In this research, we aim to establish a two-class classifier using an expert system such that the unhashtagged tweets of the topic of interest are intelligently retrieved. Keyword extraction is the basis of many fields, such as topic detection, sentiment analysis, and information retrieval, and extracting relevant keywords facilitates detecting the main topic of a document. The main difference between our work and the most similar work on topic detection (Carvalho et al. 2017) is the method of keyword extraction. The method of Carvalho et al. (2017) takes only the most frequent words, but a document's keywords are not necessarily among its most frequent words, so it fails to capture the maximum number of relevant keywords. Using a fingerprint of tweets, the topic of an unhashtagged tweet can be detected if the tweet in question contains at least one of the keywords in the fingerprint; it is therefore essential to provide an approach that extracts as many relevant keywords as possible. The more keywords available for a topic of interest, the greater the chance that a tweet in question finds its correct topic. To address this issue, our approach employs a rule-based search to extract the neighbor words of the top k words (the most frequent words). For each of the top k words, some candidate words are extracted; if the candidate words satisfy certain conditions, they are added to the set of top k words.
 This study considers a real case study of the JCPOA to investigate the presented framework. The JCPOA (also known as the Iran Deal) is an international agreement reached between Iran and the P5+1 countries on 14 July 2015. P5+1 refers to the five permanent members of the UN Security Council (the P5), namely China, Russia, France, the United Kingdom, and the United States, plus Germany. The JCPOA is one of the most important international treaties, under which Iran pledged to suspend part of its nuclear activities for several years in exchange for the lifting of sanctions. News of the deal promptly topped international media outlets, and social network users posted numerous tweets about the deal in various languages. Following the election of a new US president in 2016, the United States withdrew from the agreement in 2018.
 One of the objectives of this study is to perform comparative text mining in two different languages on a topic that has been tweeted about over time. To the best of our knowledge, the JCPOA is one of the topics that meets these conditions. Another reason for choosing the international agreement of the JCPOA was its special political and economic impact on the Middle East and a group of world powers (the P5+1 countries).


Fig. 1  The proposed framework for text mining on Twitter

Since a comprehensive analysis of the JCPOA has not yet been conducted, the results of this study can be useful for other researchers and decision-makers. A search on Google Scholar shows hundreds of theoretical studies that directly or indirectly discuss the JCPOA.
 In summary, our main contributions are as follows: (1) This paper proposes a novel two-class classification using an expert system whose accuracy is 91.2667% for English and 96.4667% for Persian. The proposed expert system is language-independent, so it can easily be extended to cover other languages. (2) We present a practical method for extracting old Twitter data from any desired time. (3) We perform comparative text mining on the tweets collected by the presented framework. The investigated real case study can be educationally useful for non-expert users in computer science. Additionally, new labeled data sets for the presented expert system are created. The rest of the paper is structured as follows: Sect. 2 explains the proposed framework; Sect. 3 gives a descriptive analysis of the results; and Sect. 4 presents conclusions.

2 Proposed framework

In this section, the components of the proposed framework are described; the framework is depicted in Fig. 1. The proposed framework collects relevant tweets either through the Twitter Streaming API or through web scraping, and it easily extracts the desired old tweets. To this end, a web scraper built with Selenium1 simulates

1 https://www.seleniumhq.org/.


Twitter's search. The web scraper searches for tweets in the desired time range, and the id of each found tweet is stored in a MongoDB collection. The Twitter REST API2 then fetches the content of each tweet using the collected ids: it takes an attribute of a tweet (usually its id) and returns the full details of the tweet in a specific format. Using this strategy, one can access old tweets.
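
For illustration, the following minimal Python sketch outlines this hybrid flow. The credentials, collection names, and search URL are placeholders, the Selenium DOM-extraction logic is elided, and the Tweepy 3.x binding of API.statuses_lookup is assumed.

```python
import tweepy
from pymongo import MongoClient
from selenium import webdriver

# Hypothetical credentials and collection names, for illustration only.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)
id_store = MongoClient()["jcpoa"]["tweet_ids"]

# Steps 1-2: a Selenium-driven browser loads Twitter's search page for a
# term and date range; tweet ids are read from the DOM and stored (elided).
driver = webdriver.Firefox()
driver.get("https://twitter.com/search?q=JCPOA%20since%3A2015-07-14%20until%3A2015-08-14")
# ... scroll the results and insert each tweet id into id_store ...

# Steps 3-4: hydrate the stored ids in batches of up to 100 via the REST API.
stored_ids = [doc["_id"] for doc in id_store.find()]
for i in range(0, len(stored_ids), 100):
    for status in api.statuses_lookup(stored_ids[i:i + 100]):
        print(status.id, status.created_at, status.text)
```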
 In the 7th step, topic detection, sentiment analysis, and user influence analysis are performed; this study focuses on topic detection. An expert system approach is used to detect the topic of unhashtagged tweets. The output of its training phase is a set of words with their scores, where the higher a word's score, the more important the word. In the testing phase, each tweet in question matches its words against the words generated in the training phase; if the score of the matched words satisfies the defined threshold, the tweet belongs to the topic, and otherwise it does not. To perform text mining in the case study, the relevant tweets about the JCPOA posted on 14–21 July 2015 and from 14th July to 14th August of each year between 2015 and 2019 were collected. To understand how the opinions of tweeters changed on each JCPOA anniversary relative to the previous year, the tweets of one month (14th July to 14th August) of each year are extracted. Moreover, in the first days after an event, many tweets are usually spread about it; these are typically emotional and rarely offer deep insight, so the tweets of 14–21 July 2015 are analyzed separately. After preprocessing, various analytical methods are applied to the collected tweets. English is widely tweeted (Neubig and Duh 2013; Saha and Menezes 2016), and tweeters in English come from different countries, so considering English tweets means analyzing the views of users in several countries. Based on our observations, most of the discussions in the tweets were about Iran; thus, it is also necessary to mine the opinions of Persian tweeters.

2.1 Data management

The framework uses MongoDB to manage the tweets. MongoDB3 can easily manage and analyze big unstructured data, and the retrieval and storage of the tweets are performed through it. MongoDB supports many advanced queries; for instance, the "distinct" command eliminates duplicated tweets. MongoDB can analyze big data, but not all of it at the same time, so the Python packages Wordcloud4 and Matplotlib5 are used for analysis. The programmer determines the amount of data handled in both packages, so apart from system resources there is no limit on data analysis. To connect to MongoDB, the Pymongo6 package, written in Python, is employed.
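
As an illustration of the deduplication query, the sketch below applies Pymongo's distinct to a hypothetical collection; the connection string and field names are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumed local instance
tweets = client["jcpoa"]["tweets"]                  # hypothetical collection

# "distinct" returns each unique value once, which drops tweets that were
# collected more than once (e.g., because they matched several search terms).
unique_texts = tweets.distinct("text")
print(len(unique_texts), "unique tweets")
```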

2 API.statuses_lookup.
3 https://www.mongodb.com/.
4 https://pypi.org/project/wordcloud/.
5 https://pypi.org/project/matplotlib/.
6 https://pypi.org/project/pymongo/.


2.2 Tweet extraction

In the presented framework, there are two ways to extract tweets. The first is the Twitter Streaming API, which can only extract the past week's tweets. The second is a hybrid approach that employs web scraping and the Twitter REST API7; this second method can circumvent the limits of the official Twitter API. As shown in Fig. 1, in the first step the user gives the terms of the topic of interest to the scraper, and for each term the scraper searches Twitter within the desired time range. The second step stores each tweet's id; the stored ids are passed to the third step, and the full attributes of each tweet are extracted by API.statuses_lookup in the fourth step. API.statuses_lookup requires the unique id of each tweet to extract its content, which is the main reason the id of each tweet is stored in the second step. The fifth and sixth steps employ the Twitter Streaming API for tweet extraction; they are optional because steps 1 to 4 suffice. However, web scraping is not efficient in terms of speed: tweet extraction by the web scraper (steps 1 to 4) requires several steps, while the Twitter Streaming API (steps 5 to 6) is a systematic process supported by Twitter. Thus, although the proposed approach can extract old tweets, it is not as fast as the Twitter Streaming API. The Twitter Streaming API is a perfect solution if only the current week's tweets are needed. This study aims to mine different opinions, so there is no need to collect retweets (Öztürk and Ayvaz 2018): a retweet is a copy of another tweet and does not indicate any new opinion. Therefore, the scraper should not extract retweets; in detail, to ignore retweets, the command "include:retweets" is not sent to the scraper.
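
A minimal sketch of the optional streaming path (steps 5 and 6), assuming the Tweepy 3.x StreamListener interface; the track terms are illustrative, and retweets are skipped because they carry no new opinion.

```python
import tweepy

class NoRetweetListener(tweepy.StreamListener):  # Tweepy 3.x interface
    def on_status(self, status):
        # Skip retweets: a retweet copies another tweet and adds no new opinion.
        if hasattr(status, "retweeted_status") or status.text.startswith("RT @"):
            return
        print(status.id, status.text)

# auth is a configured tweepy.OAuthHandler (credentials omitted).
# stream = tweepy.Stream(auth=auth, listener=NoRetweetListener())
# stream.filter(track=["JCPOA", "irandeal"])
```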

2.3 Data collection and preprocessing

To the best of our knowledge, there was no prepared data set for the JCPOA before this study. The presented framework collects the JCPOA tweets spread on 14–21 July 2015 and from 14th July to 14th August of each year between 2015 and 2019. It was observed that several Persian hashtags, together with "Irannucleardeal", "Irandeal", "Iran_talks", and "JCPOA", were the most common keywords in the first week of the agreement on 14–21 July 2015. Likewise, "Irandeal", "JCPOA", "Irannucleardeal", and a Persian hashtag were the most common keywords between 2015 and 2019. There are more keywords in both languages; this study considers the most well-known ones. Let W = {w_1, w_2, …, w_m} be the set of words that the user gives to the scraper, and let T_i = {t_1, t_2, …, t_k} be the set of words in tweet T_i, for i = 1, …, n. To collect as many tweets as possible, W is given to the scraper, which searches for each w_j, j = 1, …, m. T_i may contain more than one word of W, so for the next w_j the scraper will receive the same tweet again; this is why duplicate tweets arise.
 As preprocessing for sentiment analysis, all duplicated tweets are removed; as preprocessing for data visualization and word frequency, besides removing duplicates, all numbers, web addresses, emoticons, stop words, special characters, and one-character lexical items are removed. For the visualization of the most important discussions, keywords are considered, since keywords lead to rapid detection of the main discussed topics. As stop words and special characters are widely used in any textual document, it is reasonable to remove them; 'in' and '.' are instances of a stop word and a special character, respectively. Web addresses are written to give the source of a tweet, emoticons its sentiment, and numbers the scale of the objects mentioned in it.
7 API.statuses_lookup.


Fig. 2  The comparison of the number of collected tweets in English and Persian

Table 1  Details of collected English and Persian tweets on 14–21 July 2015

                Persian tweets                            English tweets
Total    Total without duplication          Total    Total without duplication
5220     5122                               38463    35272

Also, one-character lexical items other than special characters are usually alphabetic characters such as 'a'. None of these items are keywords; they are not only ineffective for detecting the main topics but also slow their detection, so they are removed in the preprocessing step. The preprocessing for sentiment analysis differs because, in the Vader method, lexical items such as stop words and special characters are rated. In total, 123567 tweets in English and 72082 tweets in Persian were collected; Fig. 2 depicts the details of the collected tweets for the five years, and Table 1 shows the details of the collected tweets of the first week.
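
The preprocessing rules above can be sketched as follows; the regular expressions and the stop-word list are illustrative simplifications, not the exact pipeline.

```python
import re

STOP_WORDS = {"in", "the", "of", "to", "a"}  # illustrative subset only

def preprocess(text):
    # Remove web addresses, numbers, and special characters (this last rule
    # also strips most emoticons built from punctuation).
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # Drop stop words and one-character lexical items such as 'a'.
    tokens = [w for w in text.lower().split()
              if w not in STOP_WORDS and len(w) > 1]
    return " ".join(tokens)

print(preprocess("Read this: https://t.co/x 5 reasons the #IranDeal matters :)"))
# -> "read this reasons irandeal matters"
```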


Fig. 3  Structure of topic detection

2.4 Analytics

In the 7th step of Fig. 1, topic detection, sentiment analysis, and user influence analysis are performed on the collected tweets.

2.5 Intelligent topic detection

This section describes the structure of the proposed two-class classifier, which can intelligently detect the topic of unhashtagged tweets even when they do not carry the topic's hashtag. Figure 3 depicts the structure of topic detection.


2.5.1 Training

Formally, the training of the expert system is shown in Pseudocode 1; the following items describe it.

• Data collection
   To the best of our knowledge, there had been no labeled data set from Twitter for this type of topic detection, so we created new labeled data sets for the experiments. It should be pointed out that neither the training data nor the testing data contains duplicate tweets, because analyzing duplicates yields no new result. Step 1 of Fig. 3 gives more information about data collection.
• Data preprocessing
   Step 2 of Fig. 3 gives more information about data preprocessing. Features used to train the classifier must be unique to the topic; simply put, the features should indicate that they belong to one topic. A set of words rarely belongs to just one topic, so in practice words that are as unique as possible are desired. Features such as stop words, web addresses, and special characters, which are not unique, are removed. To prepare the features for the next steps, they are tokenized using the Unigram method; Unigram8 considers the words of a document one by one.
• Weighting
   This is the most important step in training the classifier: it assigns a score to each extracted unique feature. Steps 3 and 4 of Fig. 3 depict the processes of this part.
   In Pseudocode 1, TopicMain is the tweet set of the topic of interest whose unhashtagged tweets are sought. TL is the set of all topics except TopicMain and is called the topic library. C is the set of extracted candidate words that will be added to the primary fingerprint9, TopicWords. Formally, TopicWords is represented by Eq. (1). The value S_K, 0 ≤ S_K ≤ 1, is the score of word W_K, and K ∈ ℕ, K ≥ 20, is the cardinality of TopicWords. The initial value of S_K equals the frequency of W_K in TopicMain; this frequency is obtained by the function WordFrequency, which takes a set of words and returns the frequency of each word.

   TopicWords = {W_1 : S_1, W_2 : S_2, …, W_K : S_K}.  (1)

   Let ST store the neighbor words of each word in TopicWords. For each word in TopicWords, the empirical parameter γ ∈ ℕ determines the number of its next and previous neighbor words, and, based on the frequency of each word in ST, the empirical threshold θ ∈ ℕ determines which words are stored in C. For TopicMain and each topic in TL, a set of top K words is extracted as features. Each tweet has at most 280 characters, so few unique features can be extracted from a single tweet; to extract more unique features, all collected tweets of each topic are considered. The unique features of each topic are its preprocessed top K words, i.e., the most frequent words in its set of tweets. It should be noted that the topic of unhashtagged tweets is retrieved for TopicMain, and there is a greater chance of detecting them if the fingerprint is as large as possible.

8 The split function in Python can be used to apply the Unigram method.
9 A fingerprint whose number of elements has not been increased is called a primary fingerprint.


 Therefore, the neighbor words of TopicWords only are extracted. For each topic in TL, the number of its elements equals the number of elements in the primary fingerprint.
   At first, to employ the two-class classifier, a set of top K words is extracted from the preprocessed, tokenized tweets as the features of TopicMain. Lines 2–11 of Pseudocode 1 extract the top K words, which are considered the primary fingerprint. In the next step, for each word in TopicWords, if the word is in the word list of a tweet TW ∈ TopicMain, its next and previous neighbor words are stored in ST. Let Z be the cardinality of C. For U = 1, …, Z, if the frequency of C_U exceeds θ, then C_U is kept in C; otherwise it is ignored. Formally, C is represented by Eq. (2), where E is the number of words that satisfy the threshold.

   C = {W_1 : S_1, W_2 : S_2, …, W_E : S_E}  (2)

   Based on Pseudocode 1, the initial scores of TopicWords equal their frequencies in TopicMain, while the frequencies of the extracted words in C are their frequencies among the candidate words; therefore, for each word in C, its frequency is taken from TopicMain (line 35). In the next step, C ∪ TopicWords is considered the fingerprint, so the updated form of TopicWords is represented by Eq. (3). In this case, the number of elements in TopicWords is D = K + E.

   TopicWords = {W_1 : S_1, W_2 : S_2, …, W_D : S_D}  (3)

   Lines 12–37 extract the neighbor words of TopicWords. In lines 38–49, the top K words of TL are extracted. Lines 50–53 order TopicWords by the product of ITF10 and Weighting: ITF discounts the words of TopicWords that are common to other topics (TL), and Weighting reorders each word by the product of its frequency in TopicWords and its ITF. ITF takes two parameters, NT ∈ ℕ and Frequency, which are the number of topics and the frequency of each word, respectively. We consider NT + 1 because all extracted top K words of the topics are important and none of their scores should be zero. The main reason for using ITF is that words of TopicWords that occur in many topics must be scored lower than other words, because they are not very unique to the topic; the output of the ITF formula satisfies 0 ≤ ITF ≤ 1. Weighting takes three parameters, TopicWords, ITF, and P ∈ ℕ, which are the frequent words of TopicMain, the ordered words of TopicWords, and the total number of words in TopicMain, respectively; to normalize the score of each word in TopicWords, we divide it by P. In line 54, a Softmax function normalizes the scores obtained in line 53 to the range 0 to 1. The Softmax function takes two parameters, Weighting and e; it divides the obtained score of each word by the sum of all scores and returns a set of words with their scores such that a larger score in Weighting yields a larger normalized score; e is Euler's number. The effect of this distribution appears in the testing phase, because words with larger scores have a larger impact on assigning a tweet to its topic.

10 Inverse Topic Frequency.

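
Since Pseudocode 1 is shown as a figure, the following Python sketch renders one reading of the training phase described above. The function names, the whitespace tokenization, and the exact ITF formula (here a log-scaled inverse topic frequency) are illustrative assumptions.

```python
import math
from collections import Counter

def top_k_words(tweets, k, s_min):
    # Most frequent words with at least s_min characters (assumes the
    # tweets are already preprocessed and whitespace-tokenizable).
    counts = Counter(w for t in tweets for w in t.split() if len(w) >= s_min)
    return dict(counts.most_common(k))

def neighbor_candidates(tweets, topic_words, gamma, theta):
    # Collect the gamma previous/next neighbors of every fingerprint word,
    # keeping only candidates whose frequency exceeds the threshold theta.
    st = Counter()
    for t in tweets:
        toks = t.split()
        for i, w in enumerate(toks):
            if w in topic_words:
                lo, hi = max(0, i - gamma), min(len(toks), i + gamma + 1)
                st.update(toks[lo:i] + toks[i + 1:hi])
    return [w for w, f in st.items() if f > theta and w not in topic_words]

def train_fingerprint(topic_main, topic_library, k=20, s_min=4, gamma=4, theta=10):
    # topic_main: list of preprocessed tweet strings for TopicMain;
    # topic_library: dict mapping each other topic name to its tweet list (TL).
    words = top_k_words(topic_main, k, s_min)              # primary fingerprint
    all_freq = Counter(w for t in topic_main for w in t.split())
    for c in neighbor_candidates(topic_main, words, gamma, theta):
        words[c] = all_freq[c]                             # frequency from TopicMain
    library_words = [top_k_words(tws, k, s_min) for tws in topic_library.values()]
    nt, p = len(topic_library), sum(all_freq.values())
    scored = {}
    for w, f in words.items():
        in_topics = sum(w in lw for lw in library_words)
        # Assumed ITF form: 1 when the word is unique, approaching 0 when
        # it appears in every other topic; bounded in [0, 1].
        itf = math.log((nt + 1) / (in_topics + 1), nt + 1) if nt else 1.0
        scored[w] = (f / p) * itf                          # Weighting step
    z = sum(math.exp(v) for v in scored.values())          # Softmax normalization
    return {w: math.exp(v) / z for w, v in scored.items()}
```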

2.5.2 Testing

A mechanism is needed that, using the data trained in Pseudocode 1, detects the topic of unhashtagged tweets. Formally, Pseudocode 2 shows the testing mechanism, and Step 5 of Fig. 3 depicts the process of the testing phase. At first, the TestTweets are preprocessed. If the similarity score of the matched words between TestTweets_h, for h = 1, …, n, where n is the cardinality of TestTweets, and TopicWords is at least the threshold, then TestTweets_h belongs to the topic; the considered threshold is 0.10. In the pseudocode, TopicWords is the fingerprinted set of words of TopicMain with their scores, TestTweets are the tweets in question, S_D is the score of each word W_D, STopic is the set of words of TopicWords, and K is the cardinality of the primary fingerprint. Similarity_Score takes two parameters, Tweet11 and TopicWords, and returns a number between 0 and 1: a returned value of 1 means the tweet belongs to the topic of interest with the highest degree, and 0 means it does not belong to the topic. Line 5 computes the similarity score for each tokenized tweet obtained in line 4, and lines 6–11 determine whether each tweet belongs to TopicMain.
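
A minimal sketch of the testing phase, assuming that Similarity_Score sums the normalized scores of the fingerprint words occurring in the tweet; the function names are illustrative.

```python
def similarity_score(tweet, topic_words):
    # Sum the normalized scores of fingerprint words occurring in the tweet.
    tokens = set(tweet.split())
    return sum(score for word, score in topic_words.items() if word in tokens)

def belongs_to_topic(test_tweets, topic_words, threshold=0.10):
    # A tweet is assigned to TopicMain when its matched-word score reaches
    # the 0.10 threshold used in the text.
    return [similarity_score(t, topic_words) >= threshold for t in test_tweets]
```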

2.5.3 Sentiment analysis methods

To perform sentiment analysis on the collected English tweets, the Vader method (Hutto and Gilbert 2014) is employed. Vader is a rule-based, lexicon-based method specially designed for social networks; its advantage is the use of a collection of slang terms, acronyms, and emoticons that are heavily used in social networks. The performance of the Vader method is better than eleven other methods available in this field, such as SentiWordNet, ANEW, and LIWC (Hutto and Gilbert 2014). The output of the Vader method is a value between −1 and 1. When Hutto and Gilbert (2014) presented the Vader method, they defined thresholds to determine the sentiment of a given text;
11 A tweet in question whose topic is sought.


Table 2  The distribution of labeled data sets

The Persian data set
Training tweets                          Testing tweets
Topic      Topic library                 Relevant to the topic    Irrelevant to the topic
1500       23570                         1500                     1500

The English data set
Training tweets                          Testing tweets
#irandeal topic    Topic library         Relevant to the topic    Irrelevant to the topic
1500               12535                 1500                     1500

if the output for the given text is at least 0.05, it is positive; if it is at most −0.05, it is negative; otherwise, it is neutral. In this work, we use the same thresholds. To perform sentiment analysis in Persian, we employ a provided REST API12, because there is no rule-based method like Vader for Persian; it is a sentiment classifier that assigns one of the labels positive, negative, or neutral to the given text.
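
With the vaderSentiment package, the thresholds above translate into the following sketch; the example sentence is illustrative.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def label(text):
    # Thresholds from Hutto and Gilbert (2014): >= 0.05 positive,
    # <= -0.05 negative, otherwise neutral.
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(label("The deal is great news!"))  # -> positive
```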

2.5.4 Data visualization

Data visualization is part science and part art (Xyntarakis and Antoniou 2019). It is important for engineers, analysts, policy-makers, and decision-makers because it enables them to make better decisions; a good visualization must provide insight to decision-makers. Luo (2019) noted that visualization approaches mainly consist of creating intuitive tables and diagrams, and pointed out that graphical visualization is more appropriate for complex tasks. In this research, graphical visualization, including word clouds, line charts, and bar charts, has been used. We use the Python packages Wordcloud and Persian_Wordcloud13 to depict the texts and the Python package Matplotlib for the other depictions.
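
A minimal sketch of a word-cloud depiction using the Wordcloud and Matplotlib packages; the word frequencies are placeholders.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

frequencies = {"irandeal": 320, "jcpoa": 280, "sanctions": 150}  # placeholders
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(frequencies)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```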

2.5.5 User influence

Rogers (2010) says influential users are those who persuade others to do what they want. Knowing the power of user influence on social networks has received much attention, because one can find out who has the most positive or the most negative impact on the topic of interest; this knowledge can be very useful for politicians, marketers, and sociologists. PageRank, the number of connections in the network, and the number of followers are some existing metrics for computing user influence. PageRank is like a black box, and the main idea behind the method is not transparent; moreover, the network obtained for a huge data set is very complicated, and it is hard to extract meaningful information from it. Cha et al. (2010) pointed out that followers may not be active on the topic of interest, so the follower-count method does not seem to work well for all purposes.
 A mention is a word that contains another account's username, marked with the "@" symbol before the username. We extract the top K mentioned users from the tweets of the

12 https://text-mining.ir/api/SentimentAnalyzer/SentimentClassifier.
13 https://pypi.org/project/persian_wordcloud/.


Persian and English tweets of each year and count the occurrences of the mentioned users: the more occurrences, the more influence.
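
The mention-counting metric can be sketched as follows; the regular expression and the example tweets are illustrative.

```python
import re
from collections import Counter

def top_mentioned(tweets, k=10):
    # A mention is a word marked with "@" before the username; the more
    # often a user is mentioned, the more influential we take them to be.
    mentions = Counter()
    for tweet in tweets:
        mentions.update(m.lower() for m in re.findall(r"@\w+", tweet))
    return mentions.most_common(k)

print(top_mentioned(["Thanks @JZarif and @StateDept", "@JZarif on the deal"]))
# -> [('@jzarif', 2), ('@statedept', 1)]
```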

3 Results

This section describes the results of sentiment analysis, data visualization, topic detection, and the detection of the most influential users. The analyses are made separately for each language. For topic detection, the best results are shown in bold.

3.1 Data and results of intelligent topic detection

The tweets that carry the hashtag of interest are used for training. It should be noted that none of the testing tweets carries the hashtag that the training tweets have.

3.1.1 Data

For evaluation, the classifier uses two different data sets, English and Persian. The English data set has 17035 labeled tweets extracted from Twitter, covering 20 trending topics in addition to the understudied #irandeal topic (TopicMain); the trending topics form the topic library (TL). It consists of 1500 tweets for the #irandeal topic, 12535 tweets for the topic library, and 3000 tweets for testing (TestTweets), of which 1500 belong to the topic of interest (TopicMain) and 1500 belong to various topics. Likewise, the Persian data set has 28070 tweets: 1500 tweets for the understudied Persian topic, 23570 for the topic library, which includes 12 trending topics, and 3000 tweets for testing, of which 1500 belong to the topic of interest and 1500 belong to various topics. Table 2 shows more information about the collected data.

3.1.2 Results

The following standard criteria are employed to evaluate the proposed classifier. In these equations, TP/FP abbreviates True Positive/False Positive and refers to the number of tweets that are correctly/incorrectly assigned to the considered topic; TN/FN abbreviates True Negative/False Negative and refers to the number of tweets that are correctly/incorrectly rejected as belonging to the considered topic. Precision is the fraction of relevant tweets of the considered topic among the retrieved tweets, while recall is the fraction of the total number of relevant tweets that were actually retrieved. The F-measure is the weighted harmonic mean of precision and recall, and accuracy is the number of correct predictions divided by the total number of predictions.

   Precision = TP / (TP + FP)  (4)


Table 3  Result of topic detection for the Persian data set using the presented expert system

Precision    Recall    F-measure    Accuracy    K    S
 95.3146 94.9333 95.1236 95.1333 20 1
 92.5546 96.1333 94.31 94.2 40 1
 90.184 98.0 93.9297 93.6667 60 1
 85.3998 98.2667 91.3825 90.7333 80 1
 83.7288 98.8 90.6422 89.8 100 1
 81.5385 98.9333 89.3976 88.2667 120 1
 95.3146 94.9333 95.1236 95.1333 20 2
 92.5546 96.1333 94.31 94.2 40 2
 90.184 98.0 93.9297 93.6667 60 2
 85.3009 98.2667 91.3259 90.6667 80 2
 83.7288 98.8 90.6422 89.8 100 2
 81.5385 98.9333 89.3976 88.2667 120 2
 95.3209 95.0667 95.1936 95.2 20 3
 92.5546 96.1333 94.31 94.2 40 3
 90.0735 98.0 93.8697 93.6 60 3
 84.6951 98.1333 90.9203 90.2 80 3
 83.2584 98.8 90.3658 89.4667 100 3
 81.3596 98.9333 89.29 88.1333 120 3
 92.7461 95.4667 94.0867 94.0 20 4
 90.75 96.8 93.6774 93.4667 40 4
 88.3693 98.2667 93.0556 92.6667 60 4
 83.8418 98.9333 90.7645 89.9333 80 4
 83.3147 99.2 90.5661 89.6667 100 4
 81.4004 99.2 89.4231 88.2667 120 4
 94.7299 95.8667 95.2949 95.2667 20 5
 92.7017 96.5333 94.5787 94.4667 40 5
 87.2902 97.0667 91.9192 91.4667 60 5
 85.8314 97.7333 91.3965 90.8 80 5
 84.0365 98.2667 90.5962 89.8 100 5
 82.294 98.5333 89.6845 88.6667 120 5
 97.5443 95.3333 96.4261 96.4667 20 6
 89.5191 96.8 93.0173 92.7333 40 6
 88.7668 96.9333 92.6705 92.3333 60 6
 86.7141 97.4667 91.7765 91.2667 80 6
 82.8249 97.7333 89.6636 88.7333 100 6
 81.8994 97.7333 89.1185 88.0667 120 6

   Recall = TP / (TP + FN)  (5)

   F-measure = 2 × (Precision × Recall) / (Precision + Recall)  (6)


Table 4  Result of topic detection for the English data set using the presented expert system

Precision    Recall    F-measure    Accuracy    K    S
 95.0966 85.3333 89.9508 90.4667 20 1
 88.664 87.6 88.1288 88.2 40 1
 87.7763 90.0 88.8742 88.7333 60 1
 85.8573 91.4667 88.5733 88.2 80 1
 85.2798 93.4667 89.1858 88.6667 100 1
 83.5697 94.2667 88.5965 87.8667 120 1
 95.0966 85.3333 89.9508 90.4667 20 2
 88.664 87.6 88.1288 88.2 40 2
 87.7763 90.0 88.8742 88.7333 60 2
 85.8573 91.4667 88.5733 88.2 80 2
 85.2798 93.4667 89.1858 88.6667 100 2
 83.5697 94.2667 88.5965 87.8667 120 2
 94.7137 86.0 90.1468 90.6 20 3
 88.251 88.1333 88.1921 88.2 40 3
 87.3711 90.4 88.8597 88.6667 60 3
 85.8032 91.8667 88.7315 88.3333 80 3
 85.2479 94.0 89.4103 88.8667 100 3
 83.787 94.4 88.7774 88.0667 120 3
 95.314 86.8 90.8583 91.2667 20 4
 88.518 88.4 88.459 88.4667 40 4
 87.7102 90.4 89.0348 88.8667 60 4
 86.4764 92.9333 89.5887 89.2 80 4
 85.6448 93.8667 89.5675 89.0667 100 4
 84.0666 94.2667 88.8749 88.2 120 4
 89.078 83.7333 86.323 86.7333 20 5
 87.5 86.8 87.1486 87.2 40 5
 86.587 86.9333 86.7598 86.7333 60 5
 86.3226 89.2 87.7377 87.5333 80 5
 85.1198 90.0 87.4919 87.1333 100 5
 82.3458 90.8 86.3665 85.6667 120 5
 83.6364 49.0667 61.8488 69.7333 20 6
 82.9474 52.5333 64.3265 70.8667 40 6
 83.2692 57.7333 68.1889 73.0667 60 6
 83.4601 58.5333 68.8088 73.4667 80 6
 79.2028 60.9333 68.8772 72.4667 100 6
 76.6193 64.6667 70.1374 72.4667 120 6

   Accuracy = (TP + TN) / (TP + TN + FP + FN)  (7)
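
For reference, the four criteria can be computed as in the following sketch; the counts in the example are made-up numbers, not results from the tables.

```python
def metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy

# Made-up counts for a 3000-tweet test set (1500 relevant, 1500 irrelevant).
print(metrics(tp=1423, fp=70, tn=1430, fn=77))
```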
Experiments were run with different values of K and S: a word is considered only if it has at least S characters. The values of γ and θ are 4 and 10, respectively. The result


Table 5  Result of topic detection for the Persian data set using SVM

Precision    Recall    F-measure    Accuracy    Number of features    S
 74.4954 52.6667 39.0989 52.6667 20 1
 75.4459 54.4667 42.6449 54.4667 40 1
 76.5348 57.3333 47.9098 57.3333 60 1
 76.8823 58.3333 49.6474 58.3333 80 1
 78.959 64.3333 59.1792 64.3333 100 1
 86.8667 82.3333 81.773 82.3333 120 1
 74.4954 52.6667 39.0989 52.6667 20 2
 75.4459 54.4667 42.6449 54.4667 40 2
 76.5348 57.3333 47.9098 57.3333 60 2
 76.8823 58.3333 49.6474 58.3333 80 2
 78.959 64.3333 59.1792 64.3333 100 2
 86.8667 82.3333 81.773 82.3333 120 2
 73.2706 52.5333 38.9273 52.5333 20 3
 75.2367 54.0 41.7445 54.0 40 3
 75.9545 55.7333 45.0257 55.7333 60 3
 76.2741 56.6 46.6043 56.6 80 3
 78.3711 62.6667 56.6706 62.6667 100 3
 78.6046 63.3333 57.6857 63.3333 120 3
 73.2706 52.5333 38.9273 52.5333 20 4
 75.4745 54.5333 42.7724 54.5333 40 4
 75.8007 55.3333 44.2835 55.3333 60 4
 77.134 59.0667 50.8919 59.0667 80 4
 77.7037 60.7333 53.6325 60.7333 100 4
 78.0474 61.7333 55.2219 61.7333 120 4
 24.9833 49.9333 33.3037 49.9333 20 5
 81.4991 71.0 68.3638 71.0 40 5
 82.5666 73.5333 71.5613 73.5333 60 5
 78.3246 62.5333 56.4656 62.5333 80 5
 78.3478 62.6 56.5682 62.6 100 5
 78.7929 63.8667 58.4865 63.8667 120 5
 69.1489 63.0 59.7705 63.0 20 6
 91.3246 90.2667 90.204 90.2667 40 6
 91.1741 90.0667 89.9994 90.0667 60 6
 90.5737 89.2 89.1078 89.2 80 6
 90.2008 88.8 88.7016 88.8 100 6
 90.4357 89.0667 88.9733 89.0667 120 6

of the expert system is compared with two popular machine learning algorithms, SVM14 and KNN15. SVM is a quick and efficient binary classifier that works geometrically:

14 Support Vector Machine.
15 K Nearest Neighbor.


Table 6  Result of topic detection for the Persian data set using KNN

Precision    Recall    F-measure    Accuracy    Number of neighbors    S
 88.0546 85.2667 84.9918 85.2667 20 1
 87.4898 83.9333 83.543 83.9333 40 1
 87.4332 83.6 83.1691 83.6 60 1
 86.5032 81.6667 81.0386 81.6667 80 1
 85.517 80.1333 79.3508 80.1333 100 1
 85.8661 80.4667 79.7028 80.4667 120 1
 88.0546 85.2667 84.9918 85.2667 20 2
 87.4898 83.9333 83.543 83.9333 40 2
 87.4332 83.6 83.1691 83.6 60 2
 86.5032 81.6667 81.0386 81.6667 80 2
 85.517 80.1333 79.3508 80.1333 100 2
 85.8661 80.4667 79.7028 80.4667 120 2
 89.0492 86.8 86.6071 86.8 20 3
 87.8093 84.4667 84.1156 84.4667 40 3
 86.7487 82.4 81.8634 82.4 60 3
 86.3597 81.4 80.7433 81.4 80 3
 85.6934 80.4667 79.7244 80.4667 100 3
 85.4873 79.7333 78.8771 79.7333 120 3
 90.2758 88.5333 88.408 88.5333 20 4
 88.2979 85.2667 84.9693 85.2667 40 4
 88.8762 86.0 85.7362 86.0 60 4
 88.9178 85.8667 85.5841 85.8667 80 4
 88.6745 85.4667 85.1589 85.4667 100 4
 89.4136 86.6667 86.4302 86.6667 120 4
 89.5177 87.2667 87.0827 87.2667 20 5
 89.153 86.5333 86.3042 86.5333 40 5
 89.3123 86.6 86.3648 86.6 60 5
 89.4136 86.6667 86.4302 86.6667 80 5
 90.2682 88.0 87.8286 88.0 100 5
 90.4435 88.2667 88.1066 88.2667 120 5
 62.2306 57.8667 53.7402 57.8667 20 6
 69.8299 65.1333 62.9389 65.1333 40 6
 86.5291 83.8 83.4917 83.8 60 6
 84.7256 78.6 77.6127 78.6 80 6
 83.1224 75.5333 74.0467 75.5333 100 6
 76.53 58.4 49.8285 58.4 120 6

positive and negative samples are separated by a hyperplane (Amarappa and Sathyanarayana 2014). In the KNN algorithm, an instance is classified by a majority vote of its neighbors and is assigned to the class most common among its k nearest neighbors (Guo et al. 2003). The results in Tables 3 and 4 indicate that the best accuracy of the expert system on the Persian and English data sets is 96.4667% and 91.2667%, respectively.


Table 7  Result of topic detection for the English data set using SVM

Precision    Recall    F-measure    Accuracy    Number of features    S
 82.7797 73.7333 71.7868 73.7333 20 1
 83.0396 74.3333 72.5232 74.3333 40 1
 83.3333 75.0 73.3333 75.0 60 1
 83.1602 75.1333 73.5316 75.1333 80 1
 83.6022 75.6 74.0554 75.6 100 1
 83.2742 74.8667 73.172 74.8667 120 1
 82.7797 73.7333 71.7868 73.7333 20 2
 83.0396 74.3333 72.5232 74.3333 40 2
 83.3333 75.0 73.3333 75.0 60 2
 83.1602 75.1333 73.5316 75.1333 80 2
 83.6022 75.6 74.0554 75.6 100 2
 83.2742 74.8667 73.172 74.8667 120 2
 82.7797 73.7333 71.7868 73.7333 20 3
 83.0396 74.3333 72.5232 74.3333 40 3
 83.3333 75.0 73.3333 75.0 60 3
 83.2829 75.4 73.8517 75.4 80 3
 83.1858 74.6667 72.9293 74.6667 100 3
 83.3968 75.4 73.8336 75.4 120 3
 83.0688 74.9333 73.2906 74.9333 20 4
 83.3037 74.9333 73.2527 74.9333 40 4
 83.5721 75.5333 73.9755 75.5333 60 4
 83.3361 75.2667 73.6735 75.2667 80 4
 83.6323 75.6667 74.1352 75.6667 100 4
 83.906 76.2667 74.85 76.2667 120 4
 75.7555 52.9333 39.5397 52.9333 20 5
 76.959 57.2667 47.7195 57.2667 40 5
 77.0758 57.6667 48.4236 57.6667 60 5
 77.0758 57.6667 48.4236 57.6667 80 5
 76.153 56.2667 46.0018 56.2667 100 5
 76.2981 56.6667 46.7241 56.6667 120 5
 76.6525 56.2 45.8025 56.2 20 6
 76.3713 55.2 43.9507 55.2 40 6
 76.3528 55.1333 43.8253 55.1333 60 6
 75.8267 55.4 44.4079 55.4 80 6
 75.5965 56.1333 45.8372 56.1333 100 6
 75.2546 55.4 44.4898 55.4 120 6

Tables 3 and 4 also indicate that the best performance of the expert system was obtained at K = 20, with S = 6 for Persian and S = 4 for English. Compared with SVM and KNN, the expert system's accuracy is about 6% higher on the Persian data set and about 4% higher on the English data set. Tables 5 and 6 depict the results of topic detection using SVM and KNN for Persian, and Tables 7 and 8 depict the results using SVM and KNN for English.
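
The exact SVM and KNN pipelines are not spelled out in the text; one plausible scikit-learn setup that mirrors the feature and neighbor ranges of Tables 5–8 is sketched below, with placeholder data names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

def baselines(train_texts, train_labels, test_texts, n_features=20, n_neighbors=20):
    # train_texts/train_labels/test_texts are placeholders for the labeled
    # data sets of Table 2; n_features and n_neighbors mirror the 20-120
    # ranges swept in Tables 5-8.
    vectorizer = TfidfVectorizer(max_features=n_features)
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)
    svm = LinearSVC().fit(x_train, train_labels)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(x_train, train_labels)
    return svm.predict(x_test), knn.predict(x_test)
```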


Table 8  Result of topic detection for the English data set using KNN

Precision    Recall    F-measure    Accuracy    Number of neighbors    S
 81.5564 77.4667 76.7121 77.4667 20 1
 82.2961 77.3333 76.4278 77.3333 40 1
 81.2722 75.4 74.1883 75.4 60 1
 80.1167 74.0 72.6092 74.0 80 1
 77.5242 72.0 70.5209 72.0 100 1
 75.7109 70.3333 68.6965 70.3333 120 1
 81.5564 77.4667 76.7121 77.4667 20 2
 82.2961 77.3333 76.4278 77.3333 40 2
 81.2722 75.4 74.1883 75.4 60 2
 80.1167 74.0 72.6092 74.0 80 2
 77.5242 72.0 70.5209 72.0 100 2
 75.7109 70.3333 68.6965 70.3333 120 2
 82.1506 77.7333 76.9413 77.7333 20 3
 83.6271 79.3333 78.6519 79.3333 40 3
 82.0022 76.0667 74.903 76.0667 60 3
 80.4339 75.1333 74.0013 75.1333 80 3
 78.731 73.5333 72.2796 73.5333 100 3
 75.3319 69.1333 67.1221 69.1333 120 3
 82.14 78.0667 77.349 78.0667 20 4
 84.3328 80.9333 80.4494 80.9333 40 4
 85.4168 82.5333 82.1704 82.5333 60 4
 85.4219 83.2667 83.0082 83.2667 80 4
 84.2938 82.8667 82.6865 82.8667 100 4
 84.2284 83.2 83.0729 83.2 120 4
 25.0 50.0 33.3333 50.0 20 5
 87.339 87.2667 87.2605 87.2667 40 5
 86.4422 86.3333 86.3231 86.3333 60 5
 82.4667 82.1333 82.0873 82.1333 80 5
 79.5352 79.3333 79.298 79.3333 100 5
 78.6318 78.6 78.5941 78.6 120 5
 24.9666 49.8667 33.274 49.8667 20 6
 75.8304 65.6 61.8196 65.6 40 6
 75.0671 73.2667 72.7779 73.2667 60 6
 71.3335 71.3333 71.3333 71.3333 80 6
 66.2577 65.6667 65.3518 65.6667 100 6
 62.2514 60.9333 59.8536 60.9333 120 6

3.2 Sentiment analysis in Persian and English

Sentiment analysis is performed on non-duplicate tweets in Persian (Fig. 4) and English (Fig. 5) between 2015 and 2019. In addition, sentiment analysis for the first week of the agreement is performed; Fig. 7 depicts the results for each language.


Fig. 4  The results of sentiment analysis in Persian between 2015 and 2019

Fig. 5  The results of sentiment analysis in English between 2015 and 2019

Fig. 6  The process of changing the sentiments in Persian and English tweets from 14th July to 14th August between 2015 and 2019

 Based on Fig. 4, for Persian tweets there is an increase in negative opinions over the considered years, and positive opinions are always higher than neutral ones. The results of sentiment analysis in English are depicted in Fig. 5; between 2017 and 2019, negative opinions are dominant. To investigate the changes in sentiments over time, we employ a line chart (Fig. 6).


Fig. 7  Results of sentiment analysis from 2015/7/14 to 2015/7/21

Fig. 8  Depiction of the most frequent words for Persian tweets


Fig. 9  Depiction of the most frequent words for English tweets

 Based on Fig. 6, in contrast to Persian tweets, the numbers of positive, negative, and neutral English tweets are close to one another. For Persian tweets, negative sentiments are always predominant during 2015–2019. For English tweets, neutral and negative sentiments were jointly predominant in 2015, neutral sentiments were predominant in 2016, and negative sentiments were predominant in 2017–2019. The sentiment analysis for the first week (Fig. 7) indicates that positive sentiments were predominant in Persian tweets and negative sentiments were predominant in English tweets. Comparing the tweets of 14–21 July 2015 with those of 14th July to 14th August 2015–2019 indicates that the predominant sentiment of the Persian views turned from positive to negative.

3.3 Data visualization

Word cloud  The most important discussions in textual documents can be identified from the visual depiction produced by Wordcloud. For each language, based on counting

Fig. 10  Depiction of the most frequent words for Persian tweets during the first week

the occurrences of the words of each year, we created a list of 500 words for the five years (the top 100 words of each year). Figures 8 and 9 depict the results of applying Wordcloud to the lists of Persian and English words.
 Among the 500 words of each language, just 196 and 118 words are distinct in Persian and English, respectively. Based on the frequency of the words in Fig. 8, we conclude that the main discussed topics in Persian tweets are "foreign policy", "nuclear negotiations", "economy", "government officials", and "war". Likewise, the main discussed topics in English tweets are "foreign policy", "nuclear negotiations", "United States government officials", "sanctions", and "Iranians". Wordcloud was also applied to the first week of the agreement, considering only the top 200 most frequent words in each language. For Persian tweets, Fig. 10 shows a view of satisfaction with the agreement, with words such as "جشن" (celebration), "شادی" (joy), "رقص" (dance), "شادمانی" (happiness), and "جشن توافق" (celebration of the deal). For the English tweets, Fig. 11 shows that, in contrast to the Persian tweets, they do not express a sense of satisfaction. However, as there are many words about the agreement, such as "Irandeal", "Jcpoa", and "Irannucleardeal", one can conclude that most tweets are highlighting the deal.
