Team Sigmoid at CheckThat!2021 Task 3a: Multiclass fake news detection with Machine Learning - CEUR ...

Page created by Irene Franklin
 
CONTINUE READING
Team Sigmoid at CheckThat!2021 Task 3a: Multiclass fake news
detection with Machine Learning
Abdullah Al Mamun Sardar1, Shahalu Akter Salma1, Md. Sanzidul Islam1, Md. Arid Hasan1
and Touhid Bhuiyan1
1
    Daffodil International University, Dhaka, Bangladesh

                 Abstract
                 Fake news is affecting our lives since the internet has become popular. Particularly, in this era
                 of social media it is very easy to spread and be affected by fake news. In this work we have
                 developed machine learning models which can classify a news claim into four classes. This
                 work has been done under the competition of CheckThat!2021 task-3a. We have conducted
                 our experiment on Check that lab’s dataset. Our work has been done only on linguistic features.
                 We have experimented both with traditional Machine Learning algorithms and Deep Learning
                 algorithms. LSTM outperformed other traditional machine learning algorithms and with Adam
                 optimizer LSTM gave a f1-macro score of 26.07%.

                 Keywords 1
                 Fake News Detection, Support Vector Machine, Multinomial Naïve Bayes, LSTM.

1. Introduction

   The internet has become one of the biggest part of our life. We are using it every day in our day to
day life. The usage of the internet and number of users are being increased every day. According to
datareportal.com [16] the number of internet user by April, 2021 is 4.72 billion (which is 60% of the
world’s population). Every day we read news articles, blogs and news content in various forms (e.g.,
images and videos). So it is easy to get distracted and deceived by any sort false representation of an
event or event which does not exist but depicted with a verified style. Fake news is affecting our life
and bringing damage to so many people. In 2016 US presidential Election fake news played a vital role.
Alexandre Bovet et. al. 2019 [22] showed 25% of the news in the time of US presidential election was
biased. In [15], data shows that the number of blog posts appear only in WordPress is 70 million each
month. These findings only show that the number of potential fake news are not so little. And we can
easily be misled by those false news. During the COVID-19 pandemic we have seen so many fake news
spread in different communities all around the world. In India several fake news spread during this
pandemic which created confusion about COVID-19 in the community. In [17] summarized the
COVID-19 related fake news in India. Another fake news spread in India which claim the people who
are taking COVID-19 vaccine may die within two years [18]. In Bangladesh several fake news created
a huge confusion among students during this pandemic. Several fake news claimed the higher secondary
examination will be taken place soon in May-June, 2020. In 2019, A mother was killed in Bangladesh
as a result of a fake news which claims children’s head are being used in the construction of Padma
Bridge but later investigation did not find any clue of this claim and they also found that the mother
was innocent [20]. In October, 2020 a fake news claimed that France footballer Paul Pogba left France
National Football team [19]. Several studies have been done to classify and categorize fake news. B
Bharali et. al. 2017 [21] addressed six categories of fake news: “1. Disinformation, 2. Propaganda 3.

1
 CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
EMAIL: mamun35-1930@diu.edu.bd(A. 1); shahalu35-2315@diu.edu.bd (A. 2); sanzid.cse@diu.edu.bd (A. 3); arid.cse0325.c@diu.edu.bd
(A. 4); headcse@diu.edu.bd (A. 5)
ORCID: (A. 2); 0000-0002-9563-2419 (A. 3); 0000-0001-7916-614X (A. 4)
             ©️ 2021 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
Hoaxes 4. Satire/Parody 5. Inaccurate 6. Partisanship”.
   To detect fake news automatically we need to use machine learning techniques because the amount
of articles appearing on the internet every single day is impossible to classify with traditional
approaches. The sub-domain of machine learning which deals with text classification is called Natural
Language Processing. Using NLP techniques, we can detect whether a given news article is fake or real
or some further classification. Researchers have done a lot of work in this field. Some have worked
only with text and some others have worked with additional entities like news sources and user opinions.
After analyzing those previous work, we have come up with two research questions to conduct our
study.
RQ1: Can we identify the impact factor in linguistic analysis based fake news detection?
RQ2: Between traditional machine learning and deep learning which one gives better performance?
The rest of the paper is structured as follows: In the literature review section we have discussed how
others have conducted their research and which approach they have found as best performer. Next, the
research methodology section will describe step by step how we have conducted our experiment. And
then in the result analysis section we will compare the result of different methodologies we have used.
The rest three sections are conclusion, acknowledgement and reference.

2. Literature Review
   Fake news detection got wide attention in the machine learning research community after the US
election in 2016. T S Reshmi et. al. 2021 [4] has done a work on fake news detection using source
information. They addressed some common features (Lexical, Syntactic, Visual, Statistical, Users, Post,
Network) which are being used in content based classification. Rohit Kumar et. al. 2021 [1] worked on
fake news detection by using deep neural network. They have experimented on real world dataset
(Buzzfeed and Politifact). They have divided their classification process into three parts. One for text
based classification, the other two are social context based and combination of these two. They have
found their best performing model with deep neural network. Anshika Choudhary et. al. 2020 [2]
worked on linguistic features to detect whether a news is fake or real. They have considered four
linguistic features in their study as follows: syntax-based, sentiment-based, grammatical and
readability-based evidence. With traditional machine learning based ensemble methods they got an
accuracy of 72% but sequential neural network based model outperforms the previous one and got an
accuracy of 86%. Thomas Felber et. al. 2021 [3] worked on a content based experiment. They have
considered unique word count, average word per post and average character per post. They have
experimented with different machine learning algorithms and with support vector machine (SVM) they
got the highest accuracy which is 95.70%. Elena Shushkevich et. al. 2021 [5] worked on an ensemble
method which performed better than a single algorithm based model. Mohammad Hadi Goldani et. al.
2020 [6] worked on fake news detection with a capsule neural network. Their dataset contains two types
of news. One which texts are small in length and the other which texts are medium or large in length.
They have implemented three types of word embedding techniques (Static, Non-static and multichannel
word embedding). Gautam Kishore Shahi et. al. 2020 [7] build a dataset for fake news detection. The
specialty of their dataset is that it contains more than one language (English, Spanish, French,
Portuguese, Hindi, Turkish, Italian, Chinese, Croatian, Telugu). Typical fake news datasets are most of
the time built on only one language. They have done a benchmark study on their dataset and with the
BERT based classification model they got a F1-score of 76%. The table given below is a summary of
some published work on fake news detection tasks.

Table 1
Summary of some published work on fake news detection
   Ref      Year       Contributions                           Dataset               Models
    [8]       2019        Description of different types   Kaggle    Fake  BM25,               Vector
                       of dataset for fake news news dataset, Fake Space                       Model,
                       detection.   Comparison        on news   challenge,
different      dataset       with LIAR, Univ. of             Language      model,
                        experimental result. Different Washington                    LSTM
                        ways of conducting experiments. dataset on fake
                                                           news
    [9]        2018         A new way of doing fake           Their       own           MMFD(multi
                        news        detection       tasks. collected dataset         source multi class
                        Considering so many things other                             fake news detection)
                        than just news content.
    [10]       2020         Experimented on existing          LIAR, FEVER,             SVM,     NBC,
                        dataset and has done a review on FAKENEWSNET                 LSTM, VSM, RTE,
                        those experiments, so that new                               CNN, RST
                        researchers can find some insight.
    [11]       2019         Has done a comparative study      LIAR                     SVM, LR, DT,
                        on different available dataset.                              Naïve Bayes, k-NN,
                        Has shown some impact factors.                               C-LSTM, CNN, Bi-
                                                                                     LSTM,         HAN,
                                                                                     Convolutional HAN
    [12]       2018        Developed an ensemble          LIAR                         Ensemble Model
                        model combining CNN and                                      (CNN and LSTM)
                        LSTM classifier to do multiclass
                        classification
    [13]       2020        Developed an annotation        Their        own      Annotation
                        framework to solve the problem created dataset       model
                        of data collection.
    [14]       2021        Done exploratory analysis on   Their        self-    Data     analysis
                        COVID-19 tweets whether a collected dataset          techniques
                        tweet is fake or real considering
                        various perspective

3. Research Methodology
There are various types of research methodologies in Natural Language Processing for fake news
detection. To detect fake news, researchers usually follow some particular methodologies. Some use
only a content and context based approach, some have taken into account the user opinion on social
media on the same news from various users and some other researchers considered the source of the
news [2] [1] [4]. However, in this research we have only worked with a linguistic based approach as
our dataset only contains news titles and main news content. We have conducted our research in the
following steps.

    3.1. Data Analysis
    As this work has been done under CheckThat!2021, they have provided the dataset. The dataset
contains four columns: public_id, title, text and our rating. ‘our rating’ basically contains the classes we
will classify. A snippet of the dataset has been given below.
Figure 1: Dataset snippet for this research

Our rating contains four classes: False, Partially false, True and Other. A distribution of these classes
in the figure below:

Figure 2: Distribution of four classes

When it comes to news content based fake news detection, some researchers considered only main news
content, not the title and some researchers took account of both title and main news content. In our
research we have applied both approaches and shown the performance analysis in the result analysis
section.

    3.2. Text Preprocessing
    In the preprocessing step, first, we have removed number and punctuation from our dataset. Then
we removed one and two-character length words. We used python regular expressions for this task.
Machine Learning algorithms consider “Word” and “word” as separate entities and this is a problem in
experiment because they both have the same meaning. To avoid this problem, we have converted the
whole dataset into lowercase character. After doing these steps we have removed stop words. We have
used the nltk library to remove stopwords. After then we did word level tokenization on both title and
text using nltk.word_tokenize. For stemming we have used Portstemmer() and for lemmatizaiton we
used WordNetLemmatizer(). After completing the preprocessing steps we have started feature
extraction which we will discuss in the next section.

    3.3. Feature Extraction
    Feature extraction is mandatory for machine learning tasks. It helps to reduce the training time by
reducing dimensionality. In the feature extraction phase we have used two feature extraction techniques
which are common in natural language processing. We used TF-IDF and CountVectorizer in our work.
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure commonly used in
natural language processing which evaluates how relevant a word is to a document in a collection of
documents. This is done by multiplying two metrics: how many times a word appears in a document,
and the inverse document frequency of the word across a set of documents [23]. CountVectorizer is
used to transform a given text into a vector on the basis of the frequency (count) of each word that
occurs in the entire text.

    3.4. Research Model
    In this research we have followed a research model which has been presented below. After getting
the dataset the first thing we did is preprocessing it and then we have extracted features using some
common feature extraction technique and then used some traditional machine learning algorithm and a
deep learning algorithm (LSTM) to classify the news into one of the four classes given in the ‘our
rating’.

     News Text          Preprocessing           Feature Extraction         ML/DL Algorithms

                                                                                     Result
Figure 3: Research Model throughout this experiment

4. Result Analysis

   In our work we have experimented with both traditional ML algorithms and deep learning
algorithms. We have conducted our experiment in two different ways. In the first approach, we have
conducted the experiment by concatenating both news title and main news content. And in the second
approach we have experimented only with news content without taking the news title into account. For
both approach we took 80% for training and 20% for testing. In the first approach, by using count
vectorizer and tf-idf transformer in the pipeline we got a f1-macro score of 35% with logistic regression.
And by limiting the maximum feature to 1000, we were able to increase the performance by 5%.
Support Vector Machine with linear carnal gave 38% f1-macro score. In the first approach, among all
traditional machine learning algorithms which we have experimented, Multinomial Naïve Bayes
classifier gave the best f1-macro score (43%) on the training dataset.
   In second approach, Random Forest Classifier and XGBClassifier performed better than the first
approach and the rest four algorithm did not do well than the first approach. A performance comparison
has been given below.

Table 2
Performance Comparison for approach – one and two:
                         Algorithm                F1-macro                  F1-macro
                                                   score                     score
                                                    (One)                     (Two)

                        Logistic Regression                 40%                38%
                      Multinomial Naïve Bayes               43%                42%
                      Support Vector Machine                38%                35%
                      Decision Tree Classifier              36%                31%
                      Random Forest Classifier              31%                36%
                           XGBClassifier                    34%                36%
We have got our best performing algorithm with deep learning. We applied our second approach with
LSTM and it performed way better than traditional ML algorithms, with softmax activation function
and adam optimizer we got a validation accuracy of 99%. But we have found our model was over fitted
after the CheckThat!2021 result publication and the performance was very poor (26.07% f1-macro
score). We are working on our LSTM based model to overcome this problem. For task-3a, the best
score was 83.76% and the least score was 13.47%.

5. Conclusion

     In this work we tried to build a machine learning model to classify fake news. We have
experimented with both machine learning and deep learning based models. Our Machine Learning
based model was outperformed by deep learning based model on training data. That’s why we submitted
it to the completion but for the overfitting problem it did not perform well on test data. We are working
on our best performing deep learning model to make it better and improve the performance.

6. Acknowledgements
   This is unbelievable support we’ve got from some of the faculty members and seniors of DIU NLP
and ML Research Lab to continue our whole research flow from the beginning. We acknowledge Dr.
Firoj Alam for guiding us and informing the secretes of research workshops. Also, we’re thankful to
Daffodil International University for the workplace support and the academic collaboration in some
cases. Dr. Touhid Bhuiyan and Dr. Sheak Rashed Haider Noori also supported us with guidance,
motivation, and advocating in institutional supports. Lastly, for sure we’re thankful to our God always
for every fruitful work with our given knowledge.

7. References

       [1] Kaliyar, Rohit Kumar, Anurag Goswami, and Pratik Narang. "DeepFakE: improving fake
           news detection using tensor decomposition-based deep neural network." The Journal of
           Supercomputing 77.2 (2021): 1015-1037.
       [2] Choudhary, Anshika, and Anuja Arora. "Linguistic feature based learning model for fake
           news detection and classification." Expert Systems with Applications 169 (2021): 114171.
       [3] Felber, Thomas. "Constraint 2021: Machine Learning Models for COVID-19 Fake News
           Detection Shared Task." arXiv preprint arXiv:2101.03717 (2021).
       [4] Reshmi, T. S., S. Daniel Madan Raja, and J. Priya. "Fake News Detection Using Source
           Information and Bayes Classifier." IOP Conference Series: Materials Science and
           Engineering. Vol. 1084. No. 1. IOP Publishing, 2021.
       [5] Shushkevich, Elena, and John Cardiff. "TUDublin team at Constraint@ AAAI2021--
           COVID19 Fake News Detection." arXiv preprint arXiv:2101.05701 (2021).
       [6] Goldani, Mohammad Hadi, Saeedeh Momtazi, and Reza Safabakhsh. "Detecting fake news
           with capsule neural networks." Applied Soft Computing 101 (2021): 106991.
       [7] Shahi, Gautam Kishore, and Durgesh Nandini. "FakeCovid--A Multilingual Cross-domain
           Fact Check News Dataset for COVID-19." arXiv preprint arXiv:2006.11343 (2020).
       [8] Ghosh, Souvick, and Chirag Shah.                   "Towards automatic        fake    news
           classification." Proceedings of the Association for Information Science and
           Technology 55.1 (2018): 805-807.
       [9] Karimi, Hamid, et al. "Multi-source multi-class fake news detection." Proceedings of the
           27th international conference on computational linguistics. 2018.
       [10]     Oshikawa, Ray, Jing Qian, and William Yang Wang. "A survey on natural language
           processing for fake news detection." arXiv preprint arXiv:1811.00770 (2018).
[11]     Khan, Junaed Younus, et al. "A benchmark study on machine learning methods for
    fake news detection." arXiv preprint arXiv:1905.04749 (2019).
[12]     Roy, Arjun, et al. "A deep ensemble framework for fake news detection and
    classification." arXiv preprint arXiv:1811.04670 (2018).
[13]     Shahi, Gautam Kishore. "AMUSED: An Annotation Framework of Multi-modal Social
    Media Data." arXiv preprint arXiv:2010.00502 (2020).
[14]     Shahi, Gautam Kishore, Anne Dirkson, and Tim A. Majchrzak. "An exploratory study
    of covid-19 misinformation on twitter." Online Social Networks and Media 22 (2021):
    100104.
[15]     https://hostingtribunal.com/blog/blog-posts-per-day/#gref
[16]     https://datareportal.com/global-digital-overview
[17]     https://www.dw.com/en/india-covid-misinformation/a-57414876
[18]     https://www.hindustantimes.com/india-news/will-you-die-within-2-yrs-after-taking-
    vaccine-centre-busts-fake-news-post-101621961181280.html
[19]     https://www.thedailystar.net/sports/football/news/fake-unacceptable-news-pogba-
    slams-rumors-him-quitting-france-1984513
[20]     https://bdnews24.com/bangladesh/2019/07/24/no-one-listened-to-what-she-said-
    witnesses-recount-lynching-of-a-mother-in-bangladesh
[21]     Bharali, Bharati, and Anupa Lahkar Goswami. "Fake news: Credibility, cultivation
    syndrome and the new age media." Media Watch 9.1 (2017): 118-130.
[22]     Bovet, Alexandre, and Hernán A. Makse. "Influence of fake news in Twitter during the
    2016 US presidential election." Nature communications 10.1 (2019): 1-14.
[23]     https://monkeylearn.com/blog/what-is-tf-idf/
You can also read