Sentiment Analysis and Time-series Analysis for the COVID-19 vaccine Tweets - Gowtham Kumar Sandaka Bala Namratha Gaekwade

Bachelor of Science in Computer Science
 May 2021

 Sentiment Analysis and Time-series Analysis for the COVID-19 vaccine Tweets

 Gowtham Kumar Sandaka
 Bala Namratha Gaekwade

 Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden
This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial
fulfilment of the requirements for the degree of Bachelor of Science in Computer Science. The thesis is
equivalent to 10 weeks of full-time studies.

 The authors declare that they are the sole authors of this thesis and that they have not used any
sources other than those listed in the bibliography and identified as references. They further declare that
they have not submitted this thesis at any other institution to obtain a degree.

 Contact Information:
 Author(s):
 Gowtham Kumar Sandaka
 E-mail: gosa20@student.bth.se

 Bala Namratha Gaekwade
 E-mail: baga20@student.bth.se

 University advisor:
 Suejb Memeti
 Department of Computer Science (DIDA)

 Faculty of Computing
 Blekinge Institute of Technology
 SE-371 79 Karlskrona, Sweden

 Internet: www.bth.se
 Phone: +46 455 38 50 00
 Fax: +46 455 38 50 57
ABSTRACT
Background: The informal, opinion-rich nature of social media brings many advantages to realistic
sentiment analysis applications. Sentiment analysis is the process of extracting opinions and emotions
from data. As a research topic, sentiment analysis of Twitter data has received much attention in recent
years. In this study, we build a model that performs sentiment analysis to classify the sentiments
expressed in a Twitter dataset of public tweets, with the aim of raising awareness of the public's
concerns.

Objectives: The main goal of this thesis is to develop a model that performs sentiment analysis on
Twitter data regarding the COVID-19 vaccine and determines the polarity of the sentiments in the data,
showing their distribution as positive, negative, or neutral. A literature study and an experiment are
set up to identify a suitable approach for developing such a model. Time-series analysis is performed
to obtain the daily sentiments over the timeline and a daily trend analysis with the events associated
with particular dates.

Methods: A systematic literature review is performed to identify the most suitable approach for
sentiment analysis of COVID-19 vaccine tweets. Then, based on the literature study results, an
experimental model is developed to obtain the distribution of sentiments in the analyzed data and to
identify the daily sentiments over the timeline.

Results: From the literature study, VADER is identified as the most suitable approach for performing
the sentiment analysis. The KDE distribution is determined for each sentiment obtained by the VADER
sentiment analyzer. The daily sentiments over the timeline are generated to support trend analysis of
the Twitter data on the COVID-19 vaccine.

Conclusions: This research identifies, through the study of the results, the best-suited approach for
sentiment analysis of Twitter data concerning the selected dataset. The VADER model produces the best
sentiment polarity scores for the sentiment analysis of the selected Twitter dataset. The time-series
analysis shows how the daily sentiments and their counts fluctuate. The seasonal decomposition outcomes
describe how the world is reacting to the current COVID-19 situation, and the daily trend analysis
elaborates on people's everyday sentiments.
Keywords: COVID-19 vaccine, Sentiment analysis, Time-based analysis, Twitter data, VADER.

ACKNOWLEDGEMENTS
Most importantly, praise and thanks to God, the Almighty, for showering us blessings throughout our
research work, allowing us to accomplish the research favorably. We want to express our heartiest
gratitude to our supervisor, Suejb Memeti, for allowing us to conduct research and for providing
appropriate support throughout this process. Working and studying under his supervision was a
wonderful experience and honor. We are grateful for everything he has done for us. Ultimately, we
would like to express our heartfelt gratitude to everyone who assisted us, directly or indirectly, in
carrying out this research.

CONTENTS
 ABSTRACT ................................................................................................................................................ III

 ACKNOWLEDGEMENTS ....................................................................................................................... IV

 CONTENTS ..................................................................................................................................................V

 LIST OF FIGURES ...................................................................................................................................... 0

 LIST OF TABLES ........................................................................................................................................ 1

 LIST OF EQUATIONS ............................................................................................................................... 2

 1 INTRODUCTION ............................................................................................................................... 3
 1.1 AIMS AND OBJECTIVES: ................................................................................................................. 4
 1.2 RESEARCH QUESTIONS: ................................................................................................................. 4
 1.3 OVERVIEW:.................................................................................................................................... 4
 2 BACKGROUND .................................................................................................................................. 5
 2.1 NATURAL LANGUAGE PROCESSING: .............................................................................................. 5
 2.1.1 Sentiment Analysis .................................................................................................................... 5
 2.1.2 Lexicon-based Approach .......................................................................................................... 5
 2.1.3 Machine Learning Approach .................................................................................................... 6
 2.1.4 Natural Language Toolkit (NLTK): .......................................................................................... 6
 2.1.5 Valence Aware Dictionary and sEntiment Reasoner (VADER): .............................................. 6
 2.2 KERNEL DENSITY ESTIMATION (KDE): ......................................................................................... 6
 2.3 CUMULATIVE DISTRIBUTION FUNCTION (CDF):............................................................................ 7
 2.4 TIME SERIES ANALYSIS: ................................................................................................................ 7
 2.4.1 Autocorrelation for Time Series Analysis ................................................................................. 7
 2.4.2 Seasonal Decomposition of the Timeline series ....................................................................... 7
 2.5 ETHICAL CONSIDERATIONS ........................................................................................................... 8
 3 RELATED WORK .............................................................................................................................. 9

 4 METHOD ........................................................................................................................................... 11
 4.1 LITERATURE REVIEW .................................................................................................................. 11
 4.2 EXPERIMENT................................................................................................................................ 11
 4.2.1 Setting up Software Environment ........................................................................................... 11
 4.2.2 Data collection ....................................................................................................................... 12
 4.2.3 Data Overview ........................................................................................................................ 12
 4.2.4 Pre-processing the data .......................................................................................................... 13
 4.2.5 Analysing Tweet sentiment ..................................................................................................... 13
 4.2.6 The KDE Distribution for Analyzed data ............................................................................... 14
 4.2.7 Sentiments in Word cloud ....................................................................................................... 14
 4.2.8 Distribution of Daily Sentiments over each partition of the time series analysis ................... 14
 4.2.9 Autocorrelation analysis and the Decomposition of sentiments into systematic components 14
 4.2.10 Daily Trend Analysis with Events Associated to that Particular Date ............................... 15
 5 RESULTS AND ANALYSIS ............................................................................................................ 16
 5.1 LITERATURE REVIEW RESULTS ................................................................................................... 16
 5.2 DATASET COLLECTED USING TWITTER API. ................................................................................ 17
 5.3 RESULTS AFTER PRE-PROCESSING THE DATA .............................................................................. 17
 5.4 RESULTS USING VADER ............................................................................................................. 18
 5.5 RESULTS FOR THE KDE DISTRIBUTION FOR ANALYZED DATA .................................................... 21
 5.6 RESULTS FOR SENTIMENTS IN WORD CLOUD ............................................................................... 22
 5.7 RESULTS FOR DISTRIBUTION OF DAILY SENTIMENTS OVER EACH PARTITION OF THE TIMELINE .. 23
 5.8 RESULTS FOR AUTOCORRELATION ANALYSIS AND THE DECOMPOSITION OF SENTIMENTS INTO SYSTEMATIC COMPONENTS ............................ 24

 5.9 RESULTS FOR DAILY TREND ANALYSIS WITH EVENTS ASSOCIATED TO THAT PARTICULAR DATE ......... 25
6 DISCUSSION ..................................................................................................................................... 27

7 CONCLUSION AND FUTURE WORK ......................................................................................... 28
 7.1 CONCLUSION ............................................................................................................................... 28
 7.2 FUTURE WORK ............................................................................................................................. 28
8 REFERENCES .................................................................................................................................... 2

LIST OF FIGURES
Figure 5.1: Overall sentiments Distribution ......................................................................................... 20
Figure 5.2: Funnel-chart of Sentiment classification Distribution ....................................................... 20
Figure 5.3: Normal Distribution of Sentiments Across Our Tweets ..................................................... 21
Figure 5.4: CDF of Sentiments Across Our Tweets .............................................................................. 21
Figure 5.5: Word cloud of the top positive and the negative sentiments .............................................. 23
Figure 5.6: Distribution of Daily Sentiments Over the Timeline of Each Partition.............................. 24
Figure 5.7: Autocorrelation of positive and negative sentiments ......................................................... 24
Figure 5.8: Decomposition of sentiments into Trends, Level, Seasonality and Residuals .................... 25
Figure 5.9: Daily Trend Analysis with Events Associated to that Particular Date .............................. 26

LIST OF TABLES
Table 5.1: Results for the Literature Review ......................................................................................... 16
Table 5.2: Dataset Overview ................................................................................................................. 17
Table 5.3: Result after pre-processing the tweets ................................................................................. 18
Table 5.4: Sentiment score of tweets using the Vader ........................................................................... 19
Table 5.5: Overall sentiment polarity for every tweet ........................................................................... 19
Table 5.6: Trigram of 15 sentences one of the top ten positive tweets .................................................. 22
Table 5.7: Trigram of 15 sentences one of the top ten negative tweets ................................................. 23
Table 5.8: Mean and the SD of the sentiments in each partition .......................................................... 23

LIST OF EQUATIONS
Equation 2.1: probability density function .............................................................................................. 6
Equation 2.2: Gaussian kernel function .................................................................................................. 7
Equation 2.3: CDF of the standard normal distribution ......................................................................... 7

1 INTRODUCTION
The approval of COVID-19 vaccines such as the Pfizer-BioNTech and Moderna vaccines sent waves of
excitement and relief across the world. However, research by the World Health Organization (WHO)
suggests that some people remain hesitant about receiving a COVID-19 vaccine; the WHO noted in 2019
that vaccine hesitancy was one of the greatest threats to global health[1]. Much of the information on
the COVID-19 vaccine is available online through various social media platforms, and one popular
platform in this regard is Twitter. Twitter is widely used in sentiment analysis, and with the vast
amount of information available on it, it is an effective basis for both sentiment analysis and
time-series analysis[2].

Nowadays, social media sites such as Twitter, Facebook, and YouTube are valuable sources of what is
referred to as social data. Events that happen in everyday life are discussed on these platforms, and
anyone is free to give their thoughts about them. Furthermore, social media platforms are enormous
sources of information for businesses to monitor public perception and obtain feedback about the
products they manufacture. Microblogging services have become eminent and consistently used platforms.
However, due to the informal language used, non-textual content, dialects, and acronyms, classifying
data extracted from such microblogging websites is a challenging task[3].

Compared with other social media platforms, Twitter is chosen because it provides abundant information
that is widely used in sentiment analysis and time-series analysis, making it more effective for this
purpose than other platforms. Twitter is a well-known microblogging service that lets users share,
deliver, and interpret real-time, short, and simple messages called tweets. In social media data, users
type multiple punctuation marks, acronyms, and emoticons to express their sentiments[4]. Twitter data
enables researchers to obtain large samples of user-generated content, thereby garnering insights to
inform early response strategies. Many modern studies have used sentiment analysis to obtain and
classify Twitter opinions on various topics, including predictions, reviews, elections, and
marketing[5]. In Twitter trend classification, tweets are classified into generic classes such as
Sports, Education, and News. Tweets have unique features such as #hashtags, @targets, emoticons,
special symbols, URLs, and colloquial terms[3], [6].

Sentiment analysis these days is considered one of the most in-demand research topics in the area of
Natural Language Processing (NLP). Some interesting scientific and commercial applications of
sentiment analysis include opinion mining, recommender systems and event detection. Sentiment
analysis has been handled as a Natural Language Processing task at many levels of granularity[7].
Numerous applications of sentiment analysis can be found in social media, news articles, and product
reviews. The results of sentiment analysis are employed in public market research and decision-making.
To perform sentiment analysis, we consider a dataset extracted from Twitter using the Twitter API and
the tweepy Python package, from which the sentiments are predicted.
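As an illustrative sketch of this collection step (the query keywords are examples and the tweepy calls shown in comments are placeholders, not the exact script used in this thesis):

```python
def build_search_query(keywords, exclude_retweets=True, lang="en"):
    """Compose a query string for the Twitter standard search API."""
    query = " OR ".join(keywords)
    if exclude_retweets:
        query += " -filter:retweets"  # drop retweets to avoid near-duplicate texts
    return query + f" lang:{lang}"

# With the tweepy package (credentials are placeholders), the query could be used as:
#   auth = tweepy.OAuth1UserHandler(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
#   api = tweepy.API(auth)
#   for tweet in tweepy.Cursor(api.search_tweets,
#                              q=build_search_query(["#CovidVaccine", "covid vaccine"]),
#                              tweet_mode="extended").items(1000):
#       process(tweet.full_text)

print(build_search_query(["#CovidVaccine", "covid vaccine"]))
```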

The sentiment analysis approach has been applied to the data we collected, and a detailed explanation
is given. The literature study shows that many researchers are working on sentiment analysis of Twitter
data. Compared with those research works, our thesis performs sentiment analysis on the Twitter data as
well as a time-based analysis of the Twitter trends over the timeline of the COVID-19 vaccine.

As a whole, this thesis walks the reader through a systematic literature review to select the most
suitable way to perform sentiment analysis, and then performs sentiment analysis and time-series
analysis of the sentiments in the COVID-19 vaccine tweets. Sentiment analysis (SA) is the intellectual
process of extracting a user's feelings and emotions, and is one of the pursued fields of Natural
Language Processing (NLP). Time-based analysis works on a sequence of observations collected at
constant time intervals and develops models to evaluate the observed time series. In our current work,
the Valence Aware Dictionary and sEntiment Reasoner (VADER)[4] assesses tweet polarity and classifies
tweets using multi-class sentiment analysis. From the data available, we can view how sentiment about
the COVID-19 vaccine developed over the past year.

1.1 Aims and Objectives:
The thesis aims to build a model that can evaluate a set of tweets and deliver them based on their meaning
into positive, neutral, and negative sentiments and show the distribution of the daily sentiments over the
timeline series COVID-19 vaccine.

 Objectives:

 1. Investigate which algorithms are used for the optimal implementation of sentiment
    analysis.
 2. Select the algorithm best suited for sentiment analysis classification.
 3. Prepare a model that can make accurate sentiment predictions on COVID-19 vaccine
    tweets.
 4. Perform time-series analysis on the predicted sentiments.

1.2 Research Questions:
RQ1: Which approach can be used to obtain optimal results for sentiment analysis classification?

RQ2: How can we distribute the daily sentiments over the timeline series?

1.3 Overview:
The thesis is divided into several sections. The overview explains the methods and approaches used to
perform the sentiment analysis and time-series analysis in this thesis. The first section is Introduction,
which introduces our thesis study, aim and objectives, and research questions. The second section
includes discussing the background and various approaches used like Natural Language Processing
approach, sentiment analysis and approaches, tools used to perform sentiment analysis, and time-series
analysis approach. The third section includes previous surveyed work and various research done by the
researchers related to this thesis. This particular section contains methods for answering research
questions. It incorporates experimental work such as environment setup, data collection, how the data
collected is pre-processed, analyzing tweet sentiments and time series analysis on predicted sentiments
and other tools used in the experiment, and experimental setup information. In the next section, the
procured results are exhibited like (1) After pre-processing and analyzing the data, the results are shown,
(2) Sentiments obtained in the word cloud are exhibited, and (3) Time-series analysis outcomes are
explained in detail. The discussion section contains a discussion of the research questions and a
discussion on results obtained. In the last section, we sum up the conclusion of the thesis based on the
outcomes and future work to improve further.

2 BACKGROUND
The fundamental understanding of the topic related to sentiment analysis is explained in this chapter.
The study includes levels of sentiment analysis as well as approaches to sentiment analysis. In this thesis,
the various techniques that are used to implement sentiment analysis are explained.

2.1 Natural Language Processing:
Natural Language Processing (NLP) is a set of theories and computational approaches for evaluating and
modelling naturally occurring texts at one or more levels of linguistic analysis, with the goal of
achieving human-like language understanding for various activities and applications. It integrates
statistical, machine learning, and deep learning models with the rule-based modelling of human language
from computational linguistics. Used collectively, these methods allow digital computers to process
natural language as textual or speech data and 'fully comprehend' its meaning, including the speaker's
or writer's intention and mood[8], [9].

2.1.1 Sentiment Analysis
Sentiment analysis is a method of forecasting people's feelings or emotions toward an entity or a subject.
Sentiment analysis uses algorithms to classify various samples of related text into overall positive and
negative categories. Most approaches to sentimental analysis include one of two forms: polarity-based,
which organize texts as positive or negative or valence-based, considering the strength of the context.
For example, in a polarity-based approach, the words excellent and good would be treated the same. In
contrast, the excellent would be regarded as favorable as good in the valence-based approach. The
Natural Language Processing methods and lexicons with annotated word polarities can carry out this
research. The widespread availability of user-generated information on the internet most likely resulted
in effective sentiment analysis research[3].

2.1.2 Lexicon-based Approach
As the name implies, the lexicon-based approach utilizes lexicon or dictionaries. In this step, to calculate
the orientation for a document, semantic orientation or polarity of words or phrases are used. Unlike a
machine learning approach, the lexicon-based method does not necessarily require storing a large corpus
of data. It evaluates the orientation of a document using a lexicon or dictionaries. Semantic Orientation
(SO) is the measure of subjectivity and opinion in the text, and it captures the polarity and strength of
words or phrases. All these sentences govern the complete sentiment orientation of the document.
Opinion lexicon is manually or automatically created. The manual approach for developing the opinion
lexicon can be time-consuming. It must combine with other automatic methods—these details majorly
into two categories of manual lexicons: common lexicon and category-specific lexicon. The advantage
of using the lexicon-based approach is that the list of sentiment words and their polarity can be searched
very quickly[10]. The standard lexicon comprises default sentiment words with the same sentiment
value, split words, negation words, and blind negation words. A comprehensive, high-quality lexicon is
often essential for fast, accurate sentiment analysis on large scales[6].
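As a toy illustration of the lexicon-based idea, the sketch below sums word polarities from a tiny hand-made lexicon and flips the sign after a negation word. The word scores and the negation rule are invented for the example and are far simpler than a production lexicon such as VADER's:

```python
# Illustrative lexicon; real lexicons contain thousands of scored entries.
LEXICON = {"good": 1.9, "excellent": 3.0, "bad": -2.5, "terrible": -3.1}
NEGATIONS = {"not", "never", "no"}

def lexicon_score(text):
    """Sum word polarities, flipping the sign of the word after a negation."""
    score, negate = 0.0, False
    for word in text.lower().split():
        if word in NEGATIONS:
            negate = True
            continue
        if word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
        negate = False
    return score

print(lexicon_score("not good"))  # the negation flips "good" to a negative score
```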


2.1.3 Machine Learning Approach
Machine Learning is one of the common approaches for sentiment analysis, which is short compared to
the lexicon-based approach. Classified training data is needed when using a machine learning approach.
It will then train the algorithm on this data to structure it to predict the classification of previously unseen
data. A few examples of standard algorithms used for this purpose are Naïve Bayes and Support Vector
Machines. Investigating and testing all these machine learning algorithms is time-consuming and slow.
For the better scope of the report, the lexicon-based approach is chosen. The machine learning approach
will not be explored further[10].

2.1.4 Natural Language Toolkit (NLTK):
NLTK is a free, open-source python package that includes several tools for programming and data
classification. Linguists, engineers, students, educators, researchers, and developers who work with
textual data in natural language processing and text analytics will benefit from NLTK. NLTK makes it
simple to access the interfaces of over 50 corpora and lexical resources. It collects text processing
libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning[4].

2.1.5 Valence Aware Dictionary and sEntiment Reasoner (VADER):
The implementation is majorly carried out with the application VADER (Valence Aware Dictionary and
sEntiment Reasoner). VADER, an open-source application, is a text sentiment analysis model sensitive
to both polarity (positive/negative) and intensity (strength) of emotion. This VADER gives us whether
a statement is positive, negative or neutral and how much we will get positive, negative or neutral[11].
VADER includes a sentiment lexicon that has been systematically built and some syntactic rules to
develop sentiment analysis. VADER is constructed especially for tweets and contains both abbreviations
and emojis. Emojis are emotional tokens that are commonly used on the internet. It is an entirely free
open-source tool. VADER also takes into consideration word order and degree modifiers[12].

The VADER lexicon-based approach has the advantage of quickly searching the list of sentiment words and
their polarities. The dictionary might contain errors, and these can be eliminated with a manual check,
although this would be time-consuming. One disadvantage of the VADER-based approach is that it is
domain-independent. For example, the word "quiet" would be positive when describing a car but negative
when describing a speakerphone[6].
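VADER summarizes each text with a normalized compound score in [-1, 1]. A minimal sketch of the thresholding convention commonly recommended for VADER (the ±0.05 cut-offs) for mapping that score to one of the three labels used in this thesis:

```python
def classify_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    The +/-0.05 cut-offs are the thresholds commonly recommended for VADER;
    anything in between is treated as neutral.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(classify_compound(0.62))  # a clearly positive compound score
```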

2.2 Kernel Density Estimation (KDE):
Kernel density estimation (KDE)[13], [14] is a non-parametric technique for converting observed data,
such as the data underlying histograms or scatter plots, into a smoothed probability density function,
controlled by a smoothing parameter h, also known as the bandwidth, as in Equation 2.1. Based on a
random sample x_1, ..., x_n, the kernel density estimate of a univariate probability density function
(pdf) f (Thomas H., Christoph W., 2007) is

    \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)

    Equation 2.1: probability density function

In Equation 2.1, n denotes the sample size, h the kernel bandwidth, and K the kernel function. The
bandwidth h determines the smoothness of the estimated density: a smaller h leads to a more variable
estimate, while a larger h leads to a smoother one. The density of a random variable is computed using
kernel functions. The kernel function used here is the Gaussian kernel[15], defined as (Che and Wang,
2014)

    K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right)

    Equation 2.2: Gaussian kernel function
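Equations 2.1 and 2.2 translate directly into code; a minimal sketch (the sample values are illustrative):

```python
import math

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2*pi) (Equation 2.2)."""
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def kde(sample, h, x):
    """Kernel density estimate f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h) (Equation 2.1)."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)

# A small h tracks the sample closely; a large h gives a smoother estimate.
sample = [-1.2, -0.4, 0.0, 0.3, 1.1]
print(kde(sample, h=0.5, x=0.0))
```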

2.3 Cumulative Distribution Function (CDF):
The normal cumulative distribution function (CDF) returns the probability that a normally distributed
random variable takes a value less than or equal to a specified value.

The Φ function denotes the CDF of the standard normal distribution:

    \Phi(x) = P(X \le x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt

    Equation 2.3: CDF of the standard normal distribution
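The integral in Equation 2.3 has no elementary closed form, but it can be evaluated exactly through the error function from the standard library, using the identity Φ(x) = (1 + erf(x/√2)) / 2:

```python
import math

def std_normal_cdf(x):
    """Phi(x) from Equation 2.3, via the error function:
    Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# By symmetry, Phi(0) = 0.5: half the standard normal mass lies below zero.
print(std_normal_cdf(0.0))
```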

2.4 Time Series Analysis:
A time series is a series of data points indexed in time. The fact that time-series data is ordered
makes it unique in the data space because it often displays serial dependence. Serial dependence occurs
when the value of a data point is statistically dependent on another data point at another time.
However, this attribute of time-series data violates one of the fundamental assumptions of many
statistical analyses: that the data points are statistically independent[16].

2.4.1 Autocorrelation for Time Series Analysis
Autocorrelation is a type of serial dependence. Autocorrelation occurs when a time series is linearly
related to a delayed version of itself. Correlation, on the other hand, is simply when two independent
variables are linearly related.

Furthermore, some time-series forecasting methods (specifically, regression modelling) rely on the
assumption that the residuals (the differences between the fitted model and the data) are free of
autocorrelation. Perhaps the most compelling aspect of autocorrelation analysis is how it can assist us
in uncovering hidden patterns in our data and selecting the appropriate forecasting methods. In
particular, we can use it to identify seasonality and trend in time-series data[17].
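As a sketch, the sample autocorrelation at a given lag can be computed directly; the repeating example series is illustrative and shows how a periodic pattern produces a strong autocorrelation at a lag equal to its period:

```python
def autocorrelation(series, lag):
    """Sample autocorrelation: correlation of a series with a lagged copy of itself."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    numerator = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    denominator = sum(d * d for d in dev)  # lag-0 term, so autocorrelation(x, 0) == 1
    return numerator / denominator

# A series that repeats every 3 steps correlates strongly with itself at lag 3.
daily = [1.0, 2.0, 3.0] * 4
print(autocorrelation(daily, 3))
```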

2.4.2 Seasonal Decomposition of the Timeline series
Seasonal decomposition is an essential technique for all types of time-series analysis, especially for
seasonal adjustment. It constructs several component series from an observed time series, such that the
original can be reconstructed by adding or multiplying the components. Each component has a particular
characteristic or type of behaviour[18]. For example, time series are usually decomposed into:

 • Level: The average value in the series.
 • Trend: The increasing or decreasing value in the series.
 • Seasonality: The repeating short-term cycle in the series.
 • Noise: The random variation in the series.
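A minimal additive decomposition along these lines can be sketched as follows. The moving-average trend estimate and the constructed example series are simplifying assumptions for illustration; libraries such as statsmodels provide a more complete seasonal_decompose:

```python
def decompose_additive(series, period):
    """Split a series into trend + seasonal + residual (assumes an odd period)."""
    n = len(series)
    half = period // 2
    # Trend: centered moving average over one full seasonal cycle.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period
    # Seasonality: average the detrended values at each phase of the cycle,
    # then centre the pattern so it sums to (approximately) zero.
    phases = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            phases[t % period].append(series[t] - trend[t])
    means = [sum(p) / len(p) for p in phases]
    offset = sum(means) / period
    pattern = [m - offset for m in means]
    seasonal = [pattern[t % period] for t in range(n)]
    # Residual (noise): whatever the trend and seasonality do not explain.
    residual = [series[t] - trend[t] - seasonal[t] if trend[t] is not None else None
                for t in range(n)]
    return trend, seasonal, residual

# Linear trend plus a repeating [3, 0, -3] pattern of period 3.
obs = [t + [3.0, 0.0, -3.0][t % 3] for t in range(12)]
trend, seasonal, residual = decompose_additive(obs, 3)
```

For the constructed series, the recovered trend is the underlying line, the seasonal component is the repeating pattern, and the interior residuals are zero.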

2.5 Ethical Considerations
On the internet, user privacy is becoming increasingly difficult to manage. Big companies like Facebook
and Google have mocked in this regard. Information of a personal nature in the tweets will not be
presented in this report to respect the authors' privacy, making the data anonymous.

3 RELATED WORK

Kaur and Sharma[19] investigated sentiments regarding coronavirus disease (COVID-19) and the feelings
of various individuals concerning the virus. The Twitter API was used to acquire coronavirus-related
tweets, which were then analyzed using machine learning processes and methods to determine positive,
negative, and neutral emotion. The NLTK library was used for pre-processing the fetched tweets, and
TextBlob was used for the tweet analysis. The results for positive, negative, and neutral sentiments
are displayed using various visualizations. In comparison with this research, they used ML methods for
sentiment analysis, whereas we also use a lexicon-based approach and additionally perform time-series
analysis in our study.

Alhaji et al.[20] used a Naive Bayes machine learning model to perform sentiment analysis on Arabic
tweets using Python's NLTK library. The hashtag tweets were related to seven government-urged public
health initiatives, and in total 53,127 tweets were examined in the study. For all but one measure, the
results showed fewer negative tweets than positive tweets.

According to Prabhakar Kaila et al. [14], the raw data collected, tweets related to #coronavirus, were
well suited to sentiment analysis experiments concerning the novel COVID-19 outbreak. They used the
Latent Dirichlet Allocation (LDA) method to analyze the information gathered in the document-term
matrix built from the datasets. The LDA technique found that the vast majority of the information about
the COVID-19 infection expressed both negative sentiments, such as fear, and positive sentiments,
such as trust.

Medford et al. [21] created a list of COVID-19-related hashtags to search for specific tweets over the
two weeks from January 14 to January 28, 2020. The API extracted the tweets, which were saved as
plain text. Frequently correlated keywords, such as infection-prevention practices, vaccination, and
racial prejudice, were identified and analyzed. Sentiment analysis was then used to determine each
tweet's emotional valence (positive, negative, or neutral) and dominant emotion (anger, disgust, fear,
joy, sadness, or surprise). Finally, relevant topics in the tweets were studied over time using an
unsupervised machine learning method.

Gilbert et al. [6] developed VADER, a simple rule-based model for general sentiment analysis, and
compared its efficiency with 11 typical state-of-the-art benchmarks, including Affective Norms for
English Words (ANEW), Linguistic Inquiry and Word Count (LIWC), the General Inquirer,
SentiWordNet, and machine-learning techniques based on Naive Bayes, Maximum Entropy, and
Support Vector Machine (SVM) algorithms. The study covers the construction, validation, and testing
of VADER. The researchers used qualitative and quantitative methods to build and validate a sentiment
lexicon suited to the social media domain. According to the findings, VADER retains the advantages of
traditional sentiment lexicons such as LIWC while being more sensitive to sentiment expressions in
social media contexts and generalizing better to other domains.

Cherish Kay Pastor et al. [22] examined the sentiments of Filipinos, especially in Luzon, resulting from
the extreme community quarantine imposed during the COVID-19 pandemic. The researchers also
investigated, based on users' tweets, how the extreme community quarantine and other effects of the
pandemic influenced daily life. Natural Language Processing methods were used to extract user
sentiments from the collected tweets, and opinions were treated as data to be analyzed. A qualitative
method was also used to evaluate the effect of the extreme community quarantine in the Luzon area.

In this study, Dubey, A. D. [23] collected and analyzed tweets from a total of twelve countries.
The tweets were collected between March 11 and March 31, 2020. Furthermore, all tweets were related
to the novel COVID-19 disease. The goal of the work was to learn how people in those countries were
reacting to the outbreak. Unsurprisingly, specific measures had to be taken while obtaining the data and
performing tasks such as pre-processing and removing irrelevant information from the tweets. The
experiments showed that most people in these societies were thinking positively and believed things
would get better, although there were also signs of grief and pain. Nevertheless, four countries, primarily
from the European continent, did not trust the handling of the situation, because the outbreak and
pandemic had affected huge populations.

Comparing the previous studies, most researchers employed Python's NLTK package and used the
Twitter API to extract coronavirus-related tweets. Both machine learning approaches and the VADER
sentiment analysis approach were used to perform sentiment analysis, and other methods, such as
Latent Dirichlet Allocation (LDA), were also utilized. In this thesis, based on the systematic literature
review, we use the VADER sentiment analyzer from Python's NLTK library to perform sentiment
analysis, use the Twitter API to extract the dataset from Twitter, and conduct a time-series analysis to
follow the daily sentiments of the public and the tweet count per day.

4 METHOD
We used two research methods in this study: a literature review and an experiment. First, we conducted
a systematic literature review in which we carefully analyzed the literature and chose the approach based
on the results. Next, we conducted an experiment for research question 2, in which we determined the
distribution of sentiments.

4.1 Literature Review
A systematic literature study was conducted to address RQ1, following Marcus Gustafsson and Eric
Gilbert's guidelines. This literature review focuses on identifying appropriate approaches for sentiment
analysis. Our research involved several steps, which are as follows:

 1. Identifying the keywords: We have identified the following keywords: sentiment analysis,
 time-based analysis, and COVID-19 vaccine.

 2. Formulating the search strings: We chose significant keywords from those mentioned
    above to develop the search string.

 3. Locating the literature: We searched various digital database platforms, such as DiVA,
    Google Scholar, IEEE, and ResearchGate, using the search string.

 4. Following the inclusion and exclusion criteria for selection: To limit our research, inclusion
    and exclusion criteria were applied to the collected literature, such as articles and conference
    papers.
 Inclusion Criteria:
 • Papers on approaches to sentiment analysis.
 • All articles must be written in English.
 Exclusion Criteria:
 • Incomplete articles.
 • Articles that are not written in English are not considered.

 5. Evaluating and selecting the literature: After applying the inclusion and exclusion criteria,
    the selection was refined by carefully assessing the gathered literature.

 6. Summarizing the literature: The overall findings of the gathered literature were outlined
    and summarized for analysis.

4.2 Experiment
An experiment was set up based on the results of RQ1 obtained from the SLR (Systematic Literature
Review), in which we identified the appropriate approach for performing sentiment analysis on Twitter
data. The experiment develops a model for categorizing sentiments as positive, negative, or neutral and
addresses RQ2 by predicting the distribution of daily sentiments over a time series.

4.2.1 Setting up Software Environment
Python was used in the creation of this model. The following Python libraries are used to develop the
machine learning models in this experiment:

 • Python v3.9: A high-level scripting language that is interpreted, interactive, and object-
   oriented. Python is intended to be highly readable, with fewer syntactical constructions than
   other programming languages.

 • NLTK v3.6.2: A Python package that works with human language data and provides a
   simple interface to lexical resources such as WordNet and to text processing libraries.
   Classification, tokenization, stemming, tagging, parsing, and semantic reasoning are all
   performed using these resources.

 • Pandas v1.0.1: A Python package that functions as a data analysis tool and works with data
   structures. Pandas covers the entire data analysis workflow in Python, eliminating the need to
   switch to a more domain-specific language like R.

 • Tweepy v3.10.0: Establishes a connection to the Twitter API and retrieves tweets from
   Twitter. This module is used to stream real-time tweets directly from Twitter.

 • NumPy v1.18.1: The basic Python computing package. It extends Python with
   multi-dimensional arrays and matrices and provides an extensive collection of high-level
   mathematical functions that operate on them.

 • Scikit-learn v0.22.1: A simple and efficient tool for data mining and data analysis.

 • Matplotlib v3.1.3: A Python library that generates plots, histograms, power spectra, bar
   charts, etc. In this work, the matplotlib.pyplot module is used to plot the metrics.

4.2.2 Data collection
As mentioned, Twitter is the social media platform chosen for collecting the datasets. The work
proceeds in the following stages. First, a connection is established between Python and the Twitter
microblog. Twitter makes its data available through public APIs that may be accessed via URLs, and
Python's tweepy package provides access to this data. Importing the required libraries, such as Tweepy,
is therefore the initial implementation step. The data collected from Twitter were primarily alphabetic
tweets, although users nowadays also include emotional signs, such as laughing or sad emoticons and
even emojis, to express their feelings. The data collection was performed for one week, and each day's
data was kept in a separate CSV file. The targeted information was the content and the timestamps of
the tweets. The main operation retrieves the tweets from Twitter and passes them to a method that
performs the sentiment analysis using Python's library. The methods used to extract Twitter data
allowed real-time access to publicly available raw tweets. To gather the data, we used the Twitter API,
which enables users to officially download tweets and save them in a suitable file format such as
CSV [24]. We collected a total of 7,313 tweets concerning the COVID-19 vaccine, published on
Twitter's public message board. To gather the most relevant tweets, we retrieved them using keywords
such as #Pfizer & BioNTech vaccine, #corona vaccine 2020, and #COVID-19 vaccine 2020 [1].
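A minimal sketch of this collection step is shown below. It assumes the Tweepy v3.10-era interface (where `api.search` exists) and uses placeholder credentials and an example hashtag; none of these specifics come from the thesis itself:

```python
import csv

def collect_tweets(query, out_path, limit=100):
    """Stream tweets matching `query` from the Twitter API and write
    (timestamp, text) rows to a CSV file, one file per collection day."""
    import tweepy  # Tweepy v3.10-style interface

    # Placeholder credentials issued by the Twitter developer portal.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)

    rows = [["date", "text"]]
    for tweet in tweepy.Cursor(api.search, q=query, lang="en",
                               tweet_mode="extended").items(limit):
        rows.append([tweet.created_at, tweet.full_text])
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

# Example call (requires valid credentials):
# collect_tweets("#PfizerBioNTech", "tweets_day1.csv")
```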
4.2.3 Data Overview
The extracted dataset consists of various fields, as shown in Table 5.2. The fields, which mostly describe
user details and activities, are listed here. There are 16 fields in total and 7,783 tweets. The fields are
user_id, user_name, user_location, user_description, user_created, user_followers, user_friends,
user_favourites, user_verified, date, text, hashtags, source, retweets, favourites, and is_retweet. The
essential fields are user_id, user_name, date, text, and hashtags, as these are the ones chiefly required
for analyzing the data for the sentiment analysis.
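The field selection can be sketched with pandas; the one-row DataFrame below merely stands in for the real CSV export:

```python
import pandas as pd

# The 16 fields described for the collected dataset.
columns = ["user_id", "user_name", "user_location", "user_description",
           "user_created", "user_followers", "user_friends",
           "user_favourites", "user_verified", "date", "text",
           "hashtags", "source", "retweets", "favourites", "is_retweet"]

# A tiny synthetic row stands in for the real data.
df = pd.DataFrame([[0] * len(columns)], columns=columns)

# Keep only the fields chiefly required for the sentiment analysis.
essential = df[["user_id", "user_name", "date", "text", "hashtags"]]
```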

4.2.4 Pre-processing the data
On Twitter, a tweet is a microblog message limited to 140 characters. Most tweets include text as well
as embedded URLs, photos, usernames, and emoticons, and they often contain misspellings. In the
model presented in this study, unstructured data on the novel coronavirus (2019-nCoV) are extracted
from Twitter and then subjected to text cleaning (so-called screening or filtering) and, finally,
classification operations [25]. Hence, we carried out a series of pre-processing steps to remove irrelevant
information from the tweets. Before analyzing the text, HTML characters, slang words, stop words,
punctuation, URLs, etc. need to be removed [26]. Attached words also need to be split during
cleansing [26]. The reason is that the cleaner the data, the more suitable they are for mining and feature
extraction, which improves the accuracy of the results. The tweets were also pre-processed to eliminate
duplicate tweets and retweets from the dataset, which led to a final sample of 7,783 tweets; each tweet
was processed to extract its central message. To pre-process this data, we used Python's Natural
Language Toolkit (NLTK). First, a regular expression (regex) was run in Python to detect and discard
special characters in the tweets, such as URLs ("http://url"), retweet markers (RT), user mentions (@),
and unwanted punctuation. Because hashtags (#) often describe the subject of a tweet and contain
helpful information related to its topic, the hashtag text was kept as part of the tweet, but the "#" symbol
was removed [5].

 import re

 cleaned_tweets = []
 for tweet in tweets:
     # Remove links (regex r'http\S+') and @mentions (regex
     # r'@[A-Za-z0-9]+') from the tweet text, stored in tweet[0]
     cleaned_tweet = re.sub(r"http\S+|@[A-Za-z0-9]+", "", tweet[0])
     # Store in a new list of lists with cleaned tweets
     cleaned_tweets.append([cleaned_tweet, tweet[1]])

Then the tweets were converted to lowercase, stop words (i.e., words that carry little meaning, such as
is, a, the, he, they, etc.) were removed, the tweets were tokenized into individual words (tokens), and
the tokens were stemmed using the Porter stemmer. Once the pre-processing steps were complete, the
dataset was ready for sentiment classification [4].
4.2.5 Analysing Tweet sentiment
The sentiments expressed in the tweets were classified by the VADER Sentiment Analyzer applied to
the dataset. First, we created a sentiment intensity analyzer (SIA) to categorize our dataset and used its
polarity-scores method, which returns positive, negative, neutral, and compound scores for each
pre-processed tweet. The compound value is a useful metric for measuring the overall sentiment of a
given tweet. The compound score is computed by summing the valence scores of each word in the
lexicon, adjusted according to the rules, and then normalized to lie between -1 and +1. Threshold values
on the compound score categorize tweets as positive, negative, or neutral [4], [6]. The threshold values
used in this study are as follows:

Classification of sentiments:

Positive sentiment: compound value > 0.000001, assign score = 1

Neutral sentiment: (compound value >= -0.000001) and (compound value <= 0.000001), assign score = 0

Negative sentiment: compound value < -0.000001, assign score = -1
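The thresholds translate directly into a small classification function. The sketch below assumes the compound score has already been computed; with NLTK installed (and the vader_lexicon data downloaded), it comes from the analyzer's polarity_scores method:

```python
def classify(compound):
    """Map a VADER compound score to a sentiment label and numeric
    score, using the thresholds adopted in this study."""
    if compound > 0.000001:
        return "positive", 1
    if compound < -0.000001:
        return "negative", -1
    return "neutral", 0

# Obtaining the compound score itself (requires the vader_lexicon data):
#   from nltk.sentiment.vader import SentimentIntensityAnalyzer
#   sia = SentimentIntensityAnalyzer()
#   label, score = classify(sia.polarity_scores(tweet)["compound"])
```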

4.2.6 The KDE Distribution for Analyzed data
Throughout the current study, a tweet with a compound value greater than the threshold is classified as
positive, whereas a tweet with a compound value less than the threshold is classified as negative. In the
remaining cases, the tweet is deemed neutral. We divide the total sentiments into three categories based
on their sentiment values: positive, neutral, and negative. The length of the model input is determined
by the sentiment value, which is crucial for model growth. Following that, we present a summary
distribution of all sentiments. Kernel density estimates are computed before the distribution is shown.

Seaborn, a Python data visualization library built on Matplotlib, provides a high-level interface for
drawing KDE plots and is used to implement the KDE graph [13], [14]. The CDF (Cumulative
Distribution Function) is then used to observe the significant variations in the strength of positive,
neutral, and negative feelings in the data, measured on the basis of the sentiment values. The CDF
returns the proportion of the distribution that is less than or equal to the specified random variable.
Therefore, according to the CDF of the standard normal distribution, the overall sentiments are
distributed into positive, neutral, and negative according to their sentiment values and density.
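The KDE/CDF step can be illustrated without plotting. seaborn.kdeplot performs a kernel density estimate like the one below internally; the sentiment scores here are synthetic stand-ins, and the bandwidth is an arbitrary choice:

```python
import numpy as np

def kde_pdf(x, data, bandwidth=0.1):
    """Gaussian kernel density estimate of 1-D `data`, evaluated at `x`."""
    diffs = (x[:, None] - data[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2).sum(axis=1)
    return weights / (len(data) * bandwidth * np.sqrt(2.0 * np.pi))

# Synthetic compound scores standing in for the analyzed tweets.
scores = np.array([-0.8, -0.4, 0.0, 0.0, 0.1, 0.5, 0.6, 0.9])
grid = np.linspace(-1.5, 1.5, 601)
pdf = kde_pdf(grid, scores)
cdf = np.cumsum(pdf) * (grid[1] - grid[0])   # numerical CDF of the KDE
```

In the thesis, seaborn.kdeplot(scores) draws the same density curve directly.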

4.2.7 Sentiments in Word cloud
In this analysis, the most frequently occurring words are found among the sentiments distributed above,
covering both the positive and the negative sentiments in the tweets. The words are represented in a
word cloud weighted by their probability within the sentences, which helps highlight the most
commonly cited words; words with a higher likelihood appear larger in the cloud. The word cloud is
created using the 'WordCloud' package for each of the topmost positive and negative sentiments [27].
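The underlying computation is a word-frequency count over the classified tweets, which the WordCloud package then renders with word size proportional to frequency. A sketch with invented tweets:

```python
from collections import Counter

# Invented examples standing in for the tweets classified as positive.
positive_tweets = [
    "great vaccine news today",
    "vaccine rollout great progress",
    "hopeful news on vaccine",
]

# Count how often each word occurs across the tweets.
freqs = Counter(word for tweet in positive_tweets for word in tweet.split())
top = freqs.most_common(3)

# With the wordcloud package installed, the cloud itself is rendered by:
#   from wordcloud import WordCloud
#   cloud = WordCloud().generate_from_frequencies(freqs)
```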

4.2.8 Distribution of Daily Sentiments over each partition of the time series
 analysis
Using a time series overview of daily Twitter volume, the sample timeline is split into smaller time
intervals. Time series analysis identifies the peaks in Twitter activity to explain the underlying process
at work in a specific period. Using continuous data as input, this type of study detects shifts in
situational information about a topic over time. Time series analysis describes events in real time and
has been applied in various fields such as economics, the environment, science, and medicine. We
analyzed where the changes happened and identified their times of occurrence using multiple methods,
including autocorrelation and the seasonal decomposition of sentiments. To break the data into separate
time series, we illustrated both sudden shifts in relative volume and the associated events.

Firstly, we calculate the mean and standard deviation (SD) for the positive and negative sentiments by
dividing the timeline into three partition periods, distributing the daily sentiments over the timeline for
each partition. After partitioning the tweets, we build a model to show the SD and mean for the positive
and negative sentiments.
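With pandas, the partitioning reduces to a groupby over a partition index; the daily counts below are invented for illustration:

```python
import pandas as pd

# Synthetic daily positive-sentiment counts over nine days.
daily = pd.DataFrame({
    "day": range(9),
    "positive": [12, 15, 11, 20, 22, 18, 30, 28, 32],
})

# Split the timeline into three equal partitions (three days each) and
# compute the mean and standard deviation of sentiment per partition.
daily["partition"] = daily["day"] // 3
stats = daily.groupby("partition")["positive"].agg(["mean", "std"])
```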

4.2.9 Autocorrelation analysis and the Decomposition of sentiments into
 systematic components
To examine the lags in the built model, we perform an autocorrelation analysis. We used the
pandas.Series.autocorr() function, which returns the Pearson correlation coefficient between the series
and a lagged copy of itself. The Pearson correlation coefficient measures the linear correlation of two
variables and ranges from -1 to 1, with 0 representing no linear correlation, values above 0 a positive
correlation, and values below 0 a negative correlation. A positive correlation coefficient indicates that
the two variables change in the same direction, while a negative coefficient indicates that they change
in opposite directions. We used lag = 1 (data(t) vs. data(t-1)) and lag = 2 (data(t) vs. data(t-2)). We then
used an autocorrelation plot to measure the autocorrelation function (ACF) values against various lag
sizes. As the lag value increases, fewer and fewer observations are compared. A general guideline is
that the total number of observations (T) should be at least 50 and the largest lag value (k) should remain
small relative to T. Since we have 60 observations, we have only considered the first 20 values of the
ACF [16], [17], [28].
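A minimal sketch of the lagged-correlation step, using an artificial perfectly alternating series so the expected coefficients are obvious:

```python
import pandas as pd

# An artificial series alternating between high and low sentiment.
s = pd.Series([1.0, -1.0] * 10)

# Series.autocorr computes the Pearson correlation between the series
# and a copy of itself shifted by `lag` positions.
r1 = s.autocorr(lag=1)   # adjacent values move in opposite directions
r2 = s.autocorr(lag=2)   # two steps apart, the pattern repeats exactly
```

Here r1 is -1.0 and r2 is +1.0, the two extremes of the Pearson coefficient.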

Then we visualize the data using a time-series decomposition that splits the series into distinct
components: trend, seasonality, and a residual (noise) component. We use the seasonal_decompose()
function, which returns a result object containing arrays for the observed, trend, seasonal, and residual
parts of the decomposition [28].
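In the thesis this is done with statsmodels' seasonal_decompose(); the numpy sketch below reproduces the additive mechanics on a synthetic weekly series (moving-average trend, per-position seasonal means, residual as what remains), to show what the returned arrays contain:

```python
import numpy as np

period, n = 7, 28
t = np.arange(n)
series = 0.5 * t + np.tile([2.0, 1.0, 0.0, -1.0, -2.0, -1.0, 1.0], 4)

# Trend: centred moving average over one full period.
trend = np.convolve(series, np.ones(period) / period, mode="same")

# Seasonal: average detrended value at each position within the cycle.
detrended = series - trend
pattern = np.array([detrended[i::period].mean() for i in range(period)])
seasonal = np.tile(pattern, n // period)

# Residual (noise): whatever trend and seasonality do not explain.
residual = series - trend - seasonal
```

The first and last few trend values are distorted by the moving-average window reaching past the data, which is why statsmodels marks the corresponding positions as NaN.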

4.2.10 Daily Trend Analysis with Events Associated to that Particular Date
After performing the seasonal decomposition and the autocorrelation analysis, the daily sentiments are
derived from the data. We reduced our dataset to the "user_name", "date", "text", and "hashtags" fields
and created new data with an additional "count" field (a plain counter). We then grouped the data on
the date field to see the daily analysis of the tweets in our data.
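The counting step can be sketched with a groupby on the date field; the rows below are placeholders:

```python
import pandas as pd

# Placeholder rows with the four retained fields.
df = pd.DataFrame({
    "user_name": ["a", "b", "c", "d"],
    "date": ["2020-12-12", "2020-12-12", "2020-12-13", "2020-12-13"],
    "text": ["t1", "t2", "t3", "t4"],
    "hashtags": [[], [], [], []],
})

# Add a plain counter column, then aggregate on the date field to get
# the number of tweets per day.
df["count"] = 1
per_day = df.groupby("date")["count"].sum()
```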

5 RESULTS AND ANALYSIS

5.1 Literature Review Results
A Systematic Literature Review (SLR) was conducted to answer RQ1: Which approach can be used to
obtain optimal results for sentiment analysis classification? The goal of the SLR was to identify the
most suitable approach to facilitate accurate sentiment analysis results.

 • VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text.
   Findings: Sentiment scores from VADER and 11 other highly regarded sentiment-analysis
   tools and techniques are compared; VADER gives the best performance across all metrics
   and reaches its peak performance on large datasets [6].

 • Sentiment Analysis for Tweets in Swedish. Findings: The standard approach for sentiment
   analysis is introduced briefly for comparison. A machine learning approach needs classified
   training data on which an algorithm is trained to predict the classification of unseen data.
   Investigating and testing machine learning algorithms was considered too time-consuming
   for the scope of that report; instead, the VADER sentiment analyzer was chosen [10].

 • Using VADER sentiment and SVM for predicting customer response sentiment. Findings:
   Although this paper belongs to a different domain, it was considered because it compares the
   accuracy of algorithms. Machine learning algorithms and lexicon-based approaches, such as
   Support Vector Machines (SVMs) and VADER, are compared, and the study concludes that
   VADER gives higher accuracy than ML algorithms like SVM [29].

 • A Review of Social Media Posts from UniCredit Bank in Europe: A Sentiment Analysis
   Approach. Findings: VADER was chosen for sentiment analysis because it works best and
   provides optimal sentiment analysis on short documents like tweets (social media text) [30].

 • A Comprehensive Study on Lexicon Based Approaches for Sentiment Analysis. Findings:
   This paper belongs to a different domain, but it compares the accuracy of lexicon-based
   approaches such as VADER, TextBlob, and NLTK. VADER gives high accuracy on the
   metrics precision, recall, and F1 score [31].

 • Hybrid approach: naive Bayes and sentiment VADER for analyzing the idea of mobile
   unboxing video comments. Findings: This paper uses a hybrid approach combining VADER
   and naive Bayes to perform the sentiment analysis. VADER's lexicon approach for social
   media text has a good impact on the Naive Bayes classifier in predicting the sentiments [32].

 Table 5.1: Results for the Literature Review

Numerous works using machine learning algorithms and lexicon-based approaches in the sentiment
analysis domain were identified in the Systematic Literature Review (SLR). Most of the articles
included a comparison between machine learning techniques and lexicon-based techniques; as per [6],
[29], [31], [32], Twitter datasets require a comparison between algorithms to identify the most suitable
one. VADER emerged as the most widely used technique for obtaining optimal results in sentiment
analysis classification.

5.2 Dataset Collected Using the Twitter API
Table 5.2 shows an overview of the dataset collected from the Twitter API. This dataset includes the
essential fields id, user_name, date, text, and hashtags, which are the ones chiefly required for analyzing
the data for the sentiment analysis.

 1. User: ###### ###; Date: 20-12-2020 06:06:44; Text: "Same folks said daikon paste could
    treat a cytokine storm #PfizerBioNTech https://t.co/xeHhIMg1kF";
    Hashtags: ['PfizerBioNTech']

 2. User: ###### #### ######; Date: 12-12-2020 20:17:19; Text: "Explain to me again why we
    need a vaccine @BorisJohnson @MattHancock #whereareallthesickpeople #PfizerBioNTech…";
    Hashtags: ['whereareallthesickpeople', 'PfizerBioNTech']

 3. User: ########## #######; Date: 12-12-2020 20:04:29; Text: "There have not been many
    bright days in 2020 but here are some of the best 1. #BidenHarris winning #Election2020…";
    Hashtags: ['BidenHarris', 'Election2020']

 4. User: #### ######; Date: 12-12-2020 20:01:16; Text: "Covid vaccine; You getting it?
    #CovidVaccine #covid19 #PfizerBioNTech #Moderna";
    Hashtags: ['CovidVaccine', 'covid19', 'PfizerBioNTech', 'Moderna']

 5. User: ######### ###; Date: 12-12-2020 19:30:33; Text: "#CovidVaccine States will start
    getting #COVID19Vaccine Monday, #US";
    Hashtags: ['CovidVaccine', 'COVID19Vaccine', 'US', 'pakustv', 'NYC', 'Healthcare',
    'GlobalGoals']

 Table 5.2: Dataset Overview

5.3 Results after Pre-processing the Data
The pre-processing strategies applied to the dataset produced the results summarized in Table 5.3. This
process significantly reduced the number of words per tweet and the size of the vocabulary. In light of
this result, the pre-processing phase was essential for cleaning the data and removing unnecessary
words.

 0. Text: "The same folks said daikon paste could trea..."
    Tokenized: [same, folks, said, daikon, paste, could, ...]
    No_stopwords: [folks, said, daikon, paste, could, treatcytok...]
    Stemmed_porter: [folk, said, daikon, past, could, treatcytokin...]
    Stemmed_snowball: [folk, said, daikon, past, could, treatcytokin...]
    Lemmatized: [folk, said, daikon, paste, could, treatcytoki...]

 1. Text: "while the world has been on the wrong side of ..."
    Tokenized: [while, the, world, has, been, on, the, wrong, ...]
    No_stopwords: [world, wrong, side, history, year, hopefully, ...]
    Stemmed_porter: [world, wrong, side, histori, year, hope, bigg...]
    Stemmed_snowball: [world, wrong, side, histori, year, hope, bigg...]
    Lemmatized: [world, wrong, side, history, year, hopefully, ...]

 2. Text: "Russian vaccine is created to last 2 4 years"
    Tokenized: [russian, vaccine, is, created, to, last, 2, 4, ...]
    No_stopwords: [russian, vaccine, created, last, 2, 4, years]
    Stemmed_porter: [russian, vaccin, creat, last, 2, 4, year]
    Stemmed_snowball: [russian, vaccin, creat, last, 2, 4, year]
    Lemmatized: [russian, vaccine, created, last, 2, 4, year]

 3. Text: "facts are immutable senator even when you re n..."
    Tokenized: [facts, are, immutable, senator, even, when, y...]
    No_stopwords: [facts, immutable, senator, even, ethically, s...]
    Stemmed_porter: [fact, immut, senat, even, ethic, sturdi, enou...]
    Stemmed_snowball: [fact, immut, senat, even, ethic, sturdi, enou...]
    Lemmatized: [fact, immutable, senator, even, ethically, st...]

 4. Text: "explain to me again why we need vaccine"
    Tokenized: [explain, to, me, again, why, we, needvaccine]
    No_stopwords: [explain, needvaccine]
    Stemmed_porter: [explain, needvaccin]
    Stemmed_snowball: [explain, needvaccin]
    Lemmatized: [explain, needvaccine]

 Table 5.3: Result after pre-processing the tweets

5.4 Results using VADER
The results of the Twitter sentiment analysis using the NLTK and VADER sentiment analysis tools
described in Section 4.2.5 are discussed in this section. Table 5.4 shows, for each tweet, the positive,
negative, neutral, and compound sentiment scores obtained with the VADER Sentiment Analyzer.
