SENTIMENT ANALYSIS AND CLASSIFICATION OF ASUU WHATSAPP GROUP POST USING DATA MINING


Abubakar Ahmad¹, Mukhtar Abubakar², Olowojebutu Akinyemi O.³
¹ ² Department of Computer Science and Information Technology, Federal University Dutsin-Ma, Katsina State.
³ Department of Computer Science, Federal Polytechnic Ado-Ekiti, Ekiti State.
Corresponding author: abubakarahmad82@gmail.com; aahmad1@fudutsinma.edu.ng; +2348160532490

Abstract
Present-day technology is driving rapid growth in online communication, and social networking sites
such as WhatsApp, Twitter, Facebook and Instagram are becoming an ever greater channel of
communication for internet users. The data they generate is large in volume, velocity and variety, and it
can serve as a source for various analyses and for understanding the opinions, views or emotions of
people. In this paper, we adopted the text classification technique, a supervised machine learning
method, and used Python code in Jupyter Notebook to analyze and visualize the different views that
lecturers expressed on a WhatsApp group. The aim is to determine whether the views, opinions or
emotions expressed are relevant to the group or not; we classified the posts into relevant, irrelevant,
compliments and others. The results show that, of more than sixteen thousand messages posted over a
period of fourteen months, only 8.7% were relevant, a small share compared with the 43.3% found to be
irrelevant. The dataset was collected from the WhatsApp group chat FUDMA ASUU MATTERS (FAM),
a chat group of lecturers from the Academic Staff Union of Universities (ASUU), Federal University
Dutsin-Ma, Katsina State, Nigeria. The research recommends that only relevant information that
conforms to a group's objective be exchanged, and it discourages unnecessary complimentary and
personal conversation on a social group.

Keywords: Data Mining, Sentiment Analysis, WhatsApp, Word Cloud, Python libraries

INTRODUCTION
The advent of the internet and its associated technologies has continued to disrupt the way information is
exchanged. Social media sites such as WhatsApp, Twitter, Facebook and blogs, among others, are

JOURNAL OF CONFLICT RESOLUTION AND SOCIAL ISSUES      VOL. 1 NO. 2, JANUARY 2021    ISSN 2756-6625

becoming important platforms where users share opinions on particular topics. This creates a need to
analyze such views and sentiments, for example to determine whether the information shared is relevant
or to predict the likelihood of an event. In most instances, members of such groups exchange messages
that are out of place in the group, such as views and opinions that have nothing to do with the purpose for
which the group was created. For instance, jokes, spam, forwarded messages, greetings, blank emoticons,
devotional messages, social event wishes, personal comments and other irrelevant posts dominate most
social network groups. This compels groups to set rules and regulations to guide the exchange of
information, with limited or no compliance by recalcitrant members. The process of identifying and
extracting sentiment or opinion from text is called opinion mining. Opinion mining is a type of natural
language processing that tracks people's opinions about a particular event or subject (Harshal et al.,
2018).

Opinion mining is the process of computationally identifying and categorizing opinions expressed in a
piece of text, especially in order to determine whether the writer’s attitude towards a particular topic,
product, etc., is positive, negative, or neutral (Oxford, 2019). Its purpose is to determine the viewpoint of
an individual or a group of people towards a topic or an event by analyzing their views on social
networking sites. This is achieved through knowledge discovery, which is synonymous with data or text
mining in databases: a way of discovering hidden patterns or useful knowledge in data. The mining of
huge datasets, also called big data, is a trending area of research involving methods at the intersection of
machine learning, statistics and mathematics. Companies, organizations and individuals leverage mining
techniques to look for patterns in large batches of data in order to improve their business and marketing
strategies, learn about customers, increase sales and decrease costs, among other benefits (Court, 2015).

Datasets are subjected to classification, a process of sorting and categorizing data into distinct types,
forms or classes. Classification separates and categorizes data according to the requirements of the
dataset, so that a target class can be assigned to each case in the data. The algorithms that drive this
process are termed ‘classifiers’. In machine learning and statistics, classification is a supervised learning
approach in which a program learns from the data given to it and then uses that experience to classify new
observations.

The goal of this paper is to determine whether the views expressed by members are relevant to the main
objective for which the group was set up. To achieve this, we analyzed the dataset obtained from the FAM
WhatsApp group, extracted the thoughts and opinions that lecturers expressed in their posts, and
classified them into four categories. First, any post that was directly or indirectly associated or connected
with ASUU was termed relevant. Second, any post that was unconnected with ASUU was classified as
irrelevant. The third category, compliments, covers posts that express pleasantries, such as congratulatory
messages, thank-you messages and commendations. The fourth and last category, others, covers media
messages, pictures, deleted messages, empty messages, website links, unrecognized text and all other
messages that do not fall into the first three categories. The results were obtained and visualized using
Python code in Jupyter Notebook.

The remainder of this paper is organized as follows. Section 2 reviews literature related to this paper.
Section 3 describes the methodology for WhatsApp sentiment analysis used in this research. Section 4
presents the experimental results obtained on the FAM WhatsApp dataset. Finally, section 5 concludes
and offers some recommendations.

RELATED WORK
Much work has been done to date on sentiment analysis, particularly on classifying Twitter messages as
positive, negative or neutral and on prediction. Sentiment analysis has been handled as a natural language
processing task at many levels of granularity: word level, sentence level and document level (Liu, 2010).
Sentiment classification techniques can broadly be grouped into the machine learning (ML) approach, the
lexicon-based approach and the hybrid approach (Diana & Adam, 2011).

The ML text classification approach is further split into supervised and unsupervised learning methods.
The supervised methods exploit a large number of labeled training documents and use classifiers such as
decision tree, linear, rule-based and probabilistic classifiers, while the unsupervised methods are used
when it is difficult to find labeled training documents (Li & Tsai, 2011).

The lexicon-based approach relies on a sentiment lexicon, a collection of known and precompiled
sentiment terms. It is divided into the dictionary-based approach and the corpus-based approach. The
former finds opinion seed words and then searches a dictionary for their synonyms and antonyms, while
the latter uses statistical or semantic methods to find sentiment polarity. The hybrid approach combines
the machine learning and lexicon-based approaches to obtain better classification results (Mudinas et al.,
2012).

Concept-level sentiment analysis systems have also been developed. One example is pSenti, which
integrates lexicon-based and machine learning-based approaches to opinion mining. On two real-world
datasets (CNET software reviews and IMDB movie reviews), the system achieved higher accuracy in
sentiment polarity classification and sentiment strength detection than pure lexicon-based systems. The
hybrid approach was also reported to outperform state-of-the-art systems such as SentiStrength (Mudinas
et al., 2012).

In an effort to bridge the cognitive and affective gap between word-level natural language data and
concept-level sentiments, Cambria et al. (2012) introduced SenticNet 2, a publicly available semantic and
affective resource for opinion mining and sentiment analysis. The system exploits both artificial
intelligence and Semantic Web technologies, which allows it to be embedded in real-world applications
in order to effectively combine and compare structured and unstructured information.

Kontopoulos et al. (2013) proposed the use of ontology-based techniques for a more efficient sentiment
analysis of Twitter posts. Working in the domain of smartphones, they designed an architecture that gives
a more detailed analysis of post opinions regarding a specific topic. Other techniques that are categorized
as neither ML nor lexicon-based include Formal Concept Analysis (FCA), a mathematical approach for
structuring, analyzing and visualizing data, and Fuzzy Formal Concept Analysis (FFCA), developed to
deal with uncertain and unclear information (Walaa et al., 2014).
In other studies, multiple techniques were combined to improve performance. Thakare and Sachin
(2016) combined the lexicon-based and machine learning approaches and achieved improved precision
and recall on Twitter data. Generally, the bag-of-words model has been used for mining sentiments
online; in this approach, individual words are considered instead of complete sentences. Traditional
machine learning algorithms such as Support Vector Machines, Naive Bayes and Maximum Entropy are
therefore commonly used to solve such classification problems, with features such as unigrams, n-grams
and part-of-speech (POS) tags (Paridhi et al., 2018).

Other researchers proposed enhancements to existing approaches, such as an ensemble model that aims to
improve the performance metrics of algorithms like Naïve Bayes, SVM and the Linear Regression model
(Mohanavalli et al., 2018). Bhattacharjee et al. (2019) proposed term-specific TF-IDF boosting for
detecting rumours in social networks. More recently, Darwich et al. (2019) presented a comprehensive
review of notable research works that focus on the corpus-based approach to sentiment lexicon
generation. The authors arrived at eight reasons in favor of the corpus-based approach over the
dictionary-based approach and concluded that corpus-based techniques are a vital part of any modern
sentiment analysis system.

In this research, the authors adopted the supervised machine learning method of text classification in a
different way. Most existing studies focus on polarity classification of sentiments, i.e. whether sentiments
are positive or negative, or neutral in some cases. Here, instead, a novel dataset of human-annotated posts
was formed by manually labeling every post in the dataset as relevant, irrelevant, compliments or others.
This differs from the approach taken in existing research.

METHODOLOGY
In this paper, we use the text classification method. This technique depends on the existence of labeled
training documents, so the authors labeled the posts manually, which was a labor-intensive and
time-consuming process. However, manual labeling, especially when done by experts, is more credible
than labels from resources such as SentiWordNet (Baccianella et al., 2010), which are compiled by
running various existing classifiers on previously unlabeled documents. There are many kinds of
supervised classifiers in the literature. Some of the most frequently used in sentiment analysis are decision
tree, linear, rule-based and probabilistic classifiers. Since the focus of this paper is not prediction,
classifiers are not required.

The text classification steps in this research are: collection of data from the FAM WhatsApp group,
transformation of the data by manual labeling, and then exploration, analysis and visualization of the data
using Python in Jupyter Notebook. The process is depicted in Figure 1 below. The different phases are
explained in the following subsections.

Figure 1: Text Classification Technique showing different Processes

Data Collection Process
The first step was to gather the data. WhatsApp provides a feature for exporting chats in .txt format,
which was used on the chat group (FAM). To export the data file, open the FAM group page, click on
settings, select export data and then select without media. Media files were excluded to narrow the scope
of the work; exporting them would also have meant a larger volume of data and more time spent on data
collection. The text file was then transferred into Excel format for further cleansing. Figure 2 shows how
the file looks before conversion to Excel format.

Figure 2: FAM WhatsApp Group Chats exported to Notepad

Data Preparation and Transformation
In this step, transformation and tokenization are done. The plain text file was imported into an Excel
sheet, then parsed and tokenized into a tabular structure with date, time, phone number (author) and
message as the table headings. A Remarks column was added to the table; it holds one of four class
attributes (Relevant, Irrelevant, Compliments and Others). Each post expresses a different kind of view or
emotion, so each post was read and manually assigned to one of these classes. Relevant covers any post
associated or connected with ASUU, while any post that has nothing to do with ASUU is termed
Irrelevant. Compliments refers to all forms of congratulatory messages, commendations and thank-you
messages. Finally, all messages that fall outside the three aforementioned classes are classified as Others;
examples are blank posts, video posts, voice messages, deleted posts, pictures, symbols, website links
(http) and unrecognized text. Figure 3 shows how the table is structured in Excel format.

Figure 3: FAM WhatsApp Group Chats in Excel with additional “Remarks” Column indicating Labeled Posts

The Excel sheet splits each line into date, time, phone, remarks and message tokens, and these five tokens
form the five columns of the data frame. The data available to this research covers the period from 17th
August 2018, when the group was created, to 21st October 2019, when the group migrated from
WhatsApp to Telegram. A total of sixteen thousand, one hundred and fifty (16,150) messages were
generated and transformed to form the dataset for this research.
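The splitting of exported lines into columns can also be sketched in Python rather than Excel. The sketch below is illustrative only and is not the authors' procedure: it assumes one common layout of an exported line (date, time, a dash, the sender, a colon, then the text), which in practice varies by device and locale, and the function name `parse_chat` and the sample lines are hypothetical.

```python
import re

# Assumed layout of one exported line (it varies by device and locale):
#   "17/08/2018, 09:05 - +234 800 000 0000: message text"
STAMP_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}(?:\s?[AaPp][Mm])?) - (.*)$"
)

def parse_chat(lines):
    """Split raw export lines into date, time, phone, remarks and message."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        stamped = STAMP_RE.match(line)
        if stamped is None:
            # No timestamp: the line continues the previous message.
            if records:
                records[-1]["message"] += "\n" + line
            continue
        date, time, rest = stamped.groups()
        author, sep, message = rest.partition(": ")
        if not sep:
            # A timestamped line without "sender: " is a system notice; skip it.
            continue
        records.append({"date": date, "time": time, "phone": author,
                        "remarks": "",  # left empty for manual labeling
                        "message": message})
    return records

# Made-up sample lines, not taken from the FAM dataset.
sample = [
    "17/08/2018, 09:00 - Group created",
    "17/08/2018, 09:05 - +234 800 000 0000: NEC meeting holds on Friday.",
    "Venue: Senate chamber.",
    "17/08/2018, 09:07 - +234 800 000 0001: Congratulations, Prof!",
]
rows = parse_chat(sample)
```

Each record carries an empty `remarks` field, so the manual labeling step described above would still be done by hand after parsing.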

Data Exploration Process
In this phase, the dataset was converted to .csv format and uploaded to Jupyter Notebook for exploration,
analysis and presentation. Details of the visualization are presented in the ‘Results and Discussion’
section of the paper. The implementation tools are explained in the following subsection.
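As a rough illustration of the exploration step, the labeled .csv file can be read and the posts counted per label. This is a minimal sketch using only the Python standard library; the column names and the five sample rows are assumptions made for the example, not the FAM data, and `count_labels` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for the labeled export; column names and rows are assumed.
SAMPLE_CSV = """date,time,phone,remarks,message
17/08/2018,09:05,+234 800 000 0000,Relevant,NEC meeting holds on Friday.
17/08/2018,09:07,+234 800 000 0001,Compliments,"Congratulations, Prof!"
17/08/2018,09:10,+234 800 000 0002,Irrelevant,Happy weekend everyone
17/08/2018,09:12,+234 800 000 0003,Others,<Media omitted>
17/08/2018,09:15,+234 800 000 0002,Irrelevant,Good morning all
"""

def count_labels(csv_file):
    """Count how many posts carry each value of the 'remarks' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["remarks"].strip() for row in reader)

label_counts = count_labels(io.StringIO(SAMPLE_CSV))
```

With Pandas, the same tally could be obtained by loading the file with `pandas.read_csv` and calling `value_counts()` on the remarks column.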

Data Mining Platform and Tools
Python was adopted for this research not only because of its popularity and simplicity but also because of
its large collection of libraries that serve purposes such as data input, data transformation, data
exploration and data visualization. A Python package is a directory of Python scripts. Each script is a
module that can define functions, methods or new Python types for a particular functionality. NumPy is
one such package; it was imported to handle the multidimensional arrays and functions needed for the
classification of chats into days, hours, minutes and seconds. Pandas was used for data extraction,
manipulation, cleaning and analysis, and it was well suited to the kind of data used in this research.
Another package used is Matplotlib; matplotlib.pyplot is a plotting library for 2D graphics, and it
produced the various graphs plotted in this work. Furthermore, Seaborn, a widely used data visualization
library, was also imported. Being a higher-level library built on top of Matplotlib, it was used to extend
the plots and improve their appearance.

RESULTS AND DISCUSSION
In analyzing the dataset, the topics, opinions or views captured within the group’s many posts were
inspected. There is no formal taxonomy of topics within WhatsApp, so it was necessary to manually
inspect and classify the chats under study; hence the manual annotation of the sixteen thousand one
hundred and fifty (16,150) posts into a set of categories. The analysis shows that, of the total messages,
only 1,397 (8.7%) were found to be relevant. A much larger number, 6,988 posts or 43.3% of the total,
were found to be irrelevant. A further 2,539 messages (15.7%) were complimentary messages, and the
remaining 5,226 messages (32.4%) were classified as others. Figures 4 and 5 summarize the findings.

Figure 4: Bar Chart showing Classification and Distribution of Messages

Figure 4 is a bar chart showing that most of the messages posted by members in the ASUU WhatsApp
group were irrelevant and unrelated to the group.
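The percentage shares follow directly from the reported message counts and the 16,150 total. The short calculation below recomputes them, rounded to one decimal place; the variable names are illustrative.

```python
# Message counts per category as reported in the analysis.
counts = {
    "Relevant": 1397,
    "Irrelevant": 6988,
    "Compliments": 2539,
    "Others": 5226,
}

total = sum(counts.values())  # the four counts add up to the full dataset

# Share of each category as a percentage of all messages, to one decimal.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} messages ({share}%)")
```

The resulting `shares` values are the kind of input a bar or pie chart, for example via `matplotlib.pyplot.bar`, would take.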

Figure 5: Pie Chart showing percentage Classification and Distribution of Messages

Figure 5 is a pie chart showing that relevant messages, those related to the group, make up the smallest
percentage of the messages posted by members in the ASUU WhatsApp group. To corroborate these
findings, we used a word cloud, a graphic of words in which the most used words appear larger than the
rest. There are five million, two hundred and sixty-eight thousand, four hundred and eleven (5,268,411)
words in the more than sixteen thousand messages posted over a period of fourteen months. These words
are the building blocks of the sentences contained in the posts. In Figure 6, the most used words are
‘Nigeria’, ‘Will’, ‘One’, etc. These words have less relevance to the group than words like ‘Education’,
‘University’, ‘Academics’ and ‘Students’.
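A word cloud is driven by word frequencies, so the underlying counts can be sketched with the standard library alone. The example below is a simplified illustration, not the code used in the study: the stop-word list is a small hypothetical one, and the three messages are made up. Filtering filler words such as "will" and "one" is one way to keep them from dominating the picture.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for",
              "our", "will", "one"}

def word_frequencies(messages, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs across all messages."""
    counts = Counter()
    for text in messages:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts

# Made-up messages, not taken from the FAM dataset.
messages = [
    "The university will reopen for students on Monday",
    "ASUU meeting on university funding",
    "Happy birthday to one of our own",
]
frequencies = word_frequencies(messages)
top_word = frequencies.most_common(1)[0]
```

A plotting library would then scale each word by its count; here `top_word` simply holds the most frequent word and its frequency.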

Figure 6: A Word Cloud showing the most used words in FAM Chat group

For comparison, consider the word cloud of one of the authors, who sent a total of one hundred and
thirty-three (133) messages over the period under review. The most used words in this word cloud are a
clear indication of how relevant the author’s messages are: ‘University’, ‘Student’, ‘Education’, ‘ASUU’
and ‘Lecturer’ are the most used words, indicating that the author posted only relevant information to the
group.

Figure 7: A Word Cloud of most used words by an author in FAM Chat group

Summary of Findings
In this study, an attempt was made to determine whether the views, opinions or emotions expressed or
shared by participating members in FAM are relevant to the group or not. The results show that, of more
than sixteen thousand messages, only 8.7% were relevant, a small share compared with the 43.3% found
to be irrelevant. A further 15.7% of the total messages were classified as complimentary messages,
indicating that members exchange pleasantries more than they post relevant information to the group.
The remaining 32.4% were grouped as others, a category created for this research to hold non-text
messages such as voice, video, links and empty messages.

Future Work
Future work should extend the scope of this research to ascertain the level of involvement and
participation of members in the chat group, and to find interesting insights such as the most used emoji,
the sentiment score of each person, the most active time of the day, and the most active and inactive
users. The scope of the dataset could also be expanded to include multimedia messages such as voice,
video and links for a more detailed analysis.
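Two of these insights, the most active hour of the day and the most active members, need only the timestamp and author columns. The sketch below shows one possible starting point; it assumes day/month/year dates and 24-hour times, and the records, member names and the function name `activity_summary` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Made-up records in the assumed date/time layout (day/month/year, 24-hour).
posts = [
    {"date": "17/08/2018", "time": "09:05", "author": "member_a"},
    {"date": "17/08/2018", "time": "09:40", "author": "member_b"},
    {"date": "18/08/2018", "time": "21:15", "author": "member_a"},
    {"date": "19/08/2018", "time": "09:02", "author": "member_a"},
]

def activity_summary(posts):
    """Count posts per hour of the day and per author."""
    by_hour = Counter()
    by_author = Counter()
    for post in posts:
        stamp = datetime.strptime(
            f"{post['date']} {post['time']}", "%d/%m/%Y %H:%M"
        )
        by_hour[stamp.hour] += 1
        by_author[post["author"]] += 1
    return by_hour, by_author

by_hour, by_author = activity_summary(posts)
busiest_hour = by_hour.most_common(1)[0]
most_active = by_author.most_common(1)[0]
```

Members who never appear in `by_author` would be the inactive users, provided a full membership list is available to compare against.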

CONCLUSION
In conclusion, the export capability of the WhatsApp application combined with Python in Jupyter
Notebook provides a powerful tool for analyzing chat datasets. This work discussed the WhatsApp
application, its capabilities and how data from the application can be exported, prepared, transformed
into a labeled dataset and later visualized. The analysis was done in Jupyter Notebook, and the Python
libraries imported include NumPy, Pandas, Matplotlib and Seaborn. The text classification technique was
used. The results show that, out of more than sixteen thousand messages posted, only 8.7% were found to
be relevant to the given WhatsApp group.

RECOMMENDATIONS
Judging from the outcome of this research, the authors make the following recommendations to members
of the FAM WhatsApp group in particular and to other social groups in general. In all circumstances,
members are encouraged to:
    • Keep to the purpose of the group. Send or share only relevant messages.
    • Extend pleasantries such as greetings, wishes, thanks, praise or acknowledgements via individual
      WhatsApp accounts instead of the group.
    • Avoid one-on-one (personal) conversations in the group. Switch to the individual member's
      account for such purposes.

REFERENCES
Bhattacharjee, U., Srijith, P. K. & Maunendra, D. (2019). Term Specific TF-IDF Boosting for Detection
        of Rumours in Social Networks. In D. o. Engineering (Ed.), Proceedings of the Sixth Social
        Networking Workshop, SN@COMSNETS 2019, Bengaluru, 116, pp. 10–19. IIT Hyderabad,
        India.
Cambria, E., Havasi, C., & Hussain, A. (2012). SenticNet 2: A semantic and affective resource for opinion
        mining and sentiment analysis. In Proceedings of the Twenty-Fifth International Florida Artificial
        Intelligence Research Society Conference (pp. 202-208). Florida: SCRIBD.
Court, D. (2015). Marketing & Sales Big Data, Analytics, and the Future of Marketing & Sales.
        McKinsey & Company.
Darwich, M., Mohd, N., Shahrul, A., Omar, N., & Osman, N. (2019). Corpus-Based Techniques for
        Sentiment Lexicon Generation: A Review. Journal of Digital Information Management, 17, 296.
        10.6025/jdim/2019/17/5/296-305.
Diana, M. & Adam, F. (2011). Automatic detection of political opinions in tweets. Proceedings of the 8th
        international conference on the semantic web, European Semantic Web Conference (ESWC’11)
        (pp. 88–99). Springer.
Harshal, K., Kalyani, G. & Tanmay, S. (2018, March 03). A review on: Sentiment polarity analysis on
        Twitter data from different Events. International Research Journal of Engineering and
        Technology (IRJET), 05 (03 | Mar-2018), Page 1479.
Kontopoulos, E., Berberidis, C., Dergiades, T. & Bassiliades, N. (2013). Ontology-based sentiment
        analysis of twitter posts. Expert Systems with Applications, 40(10), 4065-4074.
Li, S. & Tsai, F. (2011). Noise control in document classification based on fuzzy formal concept analysis.
        In: Presented at the IEEE. International Conference on Fuzzy Systems (FUZZ). IEEE.
Liu, B. (2010). Sentiment Analysis and Subjectivity. In Handbook of Natural Language Processing,
        Second Edition. Taylor and Francis Group, Boc.

Mohanavalli, S., Karthika, S., Srividya, K.R., & Uthayan, N. S. (2018). Categorisation of Tweets Using
        Ensemble Classification Methods. International Journal of Engineering & Technology, 7 (3.12),
        722-725.
Mudinas, A., Zhang, D. & Levene, M. (2012). Combining lexicon and learning based approaches for
        concept-level sentiment analysis. Presented at WISDOM’12, Beijing, China.
Neumann, G. (2006). A Hybrid Machine Learning Approach for Information Extraction from Free Text.
        From Data and Information Analysis to Knowledge Engineering (pp. 390 - 397). Springer, Berlin,
        Heidelberg.
Oxford. (2019). Oxford online dictionary. accessed, 12:53PM,16th, October 2019:
        https://www.lexico.com/en/definition/sentiment_analysis.
Paridhi, P. N., Dinesh D. P. & Yogesh, S. P. (2018). Sentiment Classification of Twitter Data: A Review.
        International Research Journal of Engineering and Technology (IRJET). 05, pp. 929 - 931. p-
        ISSN: 2395-0072: ISO 9001:2008 Certified Journal.
Walaa, M., Ahmed, H. & Hoda, K. (2014). Sentiment analysis algorithms and applications: A survey. Ain
        Shams Engineering Journal, 5, 1093 - 1113.
