Comparison of Three Machine Learning Models for the Detection of Emails Spam

Page created by Ronnie Stanley
 
CONTINUE READING
Comparison of Three Machine Learning Models for the Detection of Emails Spam
Comparison of Three Machine Learning Models for
the Detection of Emails Spam
Shadi Ahmed Khattab
 Al-Balqa' Applied University
Raed Naser Alkaied
 Al-Balqa' Applied University
Mohammad Khaleel Sallam Ma'aitah (  mohammad.maaitah@neu.edu.tr )
 Near East University https://orcid.org/0000-0001-5770-8825

Research Article

Keywords: Machine learning, Naïve bayes, K-Nearest neighbor (KNN), Support vector machine

Posted Date: January 23rd, 2023

DOI: https://doi.org/10.21203/rs.3.rs-1445507/v1

License:   This work is licensed under a Creative Commons Attribution 4.0 International License.
Read Full License

                                               Page 1/11
Comparison of Three Machine Learning Models for the Detection of Emails Spam
Abstract
Recently, machine learning has been applied into different major areas such as text classification,
machine translation, and spam detection. The great performance of machine learning algorithms into
several fields provided the humans with opportunities to tackle some of their hard jobs to be handled by
machine learning systems. These tasks seem effortless for machines, and need less time as the amount
of texts or spams need to be classified is huge. Hence, in his paper, we propose three different models for
the task of emails spam detection. The three models are trained and validated on a public spam dataset.
Experimentally, the models performed differently and it was seen that the Naïve Bayes outperformed the
other machine learning algorithms in terms of accuracy and other evaluation metrics.

1. Introduction
Emails exchange is a method of correspondence among the individuals, and billions of cell and laptop
phone users share various emails. In any case, such kind of correspondence is unsafe because of
absence of legitimate message sifting techniques. The spam is the major reason of this weakness, as it
threats the emails exchange between users [1]. Spams are unsolicited junk mails or messages, which are
undesirable for beneficiaries which are shipped off the users without their earlier consent.

The email users invest the vast majority of their significant energy and time in arranging these spam sent
emails [2]. Different duplicates of same message are sent ordinarily which not just influence an
association monetarily, yet in addition aggravates the accepting users. Spam messages are not just
encroaching the user's messages yet they are moreover delivering huge measure of undesirable
information and along these lines influencing the organization's ability and utilization [3].

Mostly, emails have a very similar structure which consists of body of the email and its corresponding
subject. A run of the mill spam mail can be ordered by filtering its substance. The cycle of spam mail
detection depends on the supposition that the substance of the spam mail is not the same as the real or
ham mail. For model words identified with the ad of any item, support of administrations, dating related
content and so on. The cycle of spam email detection can be extensively sorted into two methodologies:
machine learning and knowledge engineering approach [4]. The machine learning approach is to train
machine learning algorithms to classify emails into spam or ham, and it showed more effective results
than humans and knowledge engineering approaches [3, 4]. Moreover, machine learning showed
promising outcomes in different engineering areas such as images classification [5, 6], prediction [7], and
natural language processing [8].

In this study, we apply three different machine learning techniques for the task of classifying emails
messages into spam or ham. The three algorithms classify the emails based on their content and other
parameters. Naïve Bayes (NB), K-nearest neighbor (KNN), and support vector machine (SVM) are
employed to classify the emails into spam and ham. The three models are trained using the Spambase
Dataset [9] and their performance is compared and discussed.

                                                  Page 2/11
2. Related Works
Emails spam detection is an essential classification task in the cybersecurity field due to the importance
of preventing frauds. Several researches have been conducted for the detections of spams. Machine
learning have been extensively applied to solve such problem. Idris [10] proposed the detection of emails
spam using neural network and negative selection algorithm. In this work, the performance of neural
network for such task was compared with support vector machine and it was found the neural network
using backpropagation achieved a higher accuracy than that reached by the svm.

Another research for spam classification using neural networks was proposed by Edstrom [11]. In their
work, it was found that number of hidden layers doesn’t improve the accuracy of the network. A single
hidden layer neural network was seen to achieve the best performance of 94,6% accuracy.

Sharma and Bhardwaj [12] developed a hybrid bagged approach for the classification of emails spams.
Their system is a combination of Naïve Bayes and J48 algorithm trained to classify emails into spam or
non-spam. Their experiments showed that their proposed system achieved a classification accuracy of
87.5%, which is lower than the one our system has reached. Moreover, Pandey et al., [13] proposed a
system based on SVM and NB to classify emails. In their paper, SVM achieved an accuracy of 91% while
NB reached 92%.

3. Materials And Methods
This section discusses the dataset used for training and testing the three models in addition to the
research methods and materials of the paper.

Dataset: the SpamBase dataset [9] is considered to be used for training and testing the three different
employed models. This dataset consists of 4601 instances of both spam and non-spam emails. A
learning scheme of 50:50 is considered for training and testing such models in which 50% is used for
training and 50% for testing.

Table 1 shows a sample of some spam and non-spam instances of the dataset. Note that the two
classes are labeled as “1” for spam and “0” for non-spam.

                                                 Page 3/11
Table 1
                         a sample of spam and non-spam instances of the used dataset
    Sample       Class    Content
    number

    1            Spam     Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of
                          Australia? Text MQUIZ to 82277. B

    2            Ham      As I entered my cabin my PA said, '' Happy B'day Boss !!''. I felt special. She askd
                          me 4 lunch. After lunch she invited me to her apartment. We went there.

    3            Spam     Today’s Voda numbers ending with 7634 are selected to receive a å£350 reward.
                          If you have a match, please call 08712300220 quoting claim code 7684 standard
                          rates apply.

Naïve Bayes (NB): Naïve Bayes is a probabilistic machine learning algorithm for binary or multiclass
classification tasks. Such algorithm is based on Bayes’s theorem [14] and it works by assuming that the
occurrence of a certain feature is independent of the occurrence of other features. Baye’s theorem is used
to determine the probability of hypothesis with prior knowledge. Baye's theorem is utilized to determine
the likelihood of theory with earlier knowledge.

The working formula of Baye’s theorem is:

                                                         P (B|A) P (A)
                                          P (A|B) =
                                                              P (B)

1

Where P      ⃓
          (AB)      is probability of hypothesis A, give that B is true. P     ⃓
                                                                             (BA)   is the likelihood hypothesis B,
given that A is true. P(A) and P(B) are the probabilities of hypotheses A and B, independently.

K-Nearest Neighbor (KNN)

K-nearest neighbors is a basic and simple algorithm that stores every accessible instance and predict
classes of the new cases depending on a distance measure (e.g., Euclidean distance measures) [15]. For
such algorithm, a new case is classified by the majority voting of its neighbors. This case is then
assigned to its k-nearest neighbors by measuring its corresponding distances to all its neighbors.
Different distance measure can be used to compute distance, however, in this work, the Euclidean
distance is used and it is as follows

Support Vector Machine (SVM)

                                                      Page 4/11
SVM is a machine learning algorithm that can be for both classification and regression problems. This
algorithm works mainly by finding a hyperplane in N-dimensional space to classify data points in
different classes [16]. The idea is to find the best plane that has the maximum margin. In other words, the
plane in which the distance between classes and data points is the maximum. It is then important to
maximize this margin. SVM algorithm can maximize the margin between data points and hyperplane by
computing the minimizing cost function, i.e., Hinge loss. Hinge loss, Exponential loss, Logit loss and
many other types of loss can be used to train the SVM. However, in this work, the hinge loss is used and it
is defined as following

Where θ is the angle between the two vectors, x and y.

Metrics for evaluation the models performance

several metrics are used in this work to evaluate the performance the three employed models for
classifying email messages. These metrics include the accuracy, ROC, specificity, sensitivity.

                                                              TP + TN
                              Accuracy (Acc) =
                                                    TP + FN + TN + FP

4

                                                                 TP
                                    Sensitivity (Sn) =
                                                              TP + FN

5

                                                                TN
                                    Specificity (Sp) =
                                                              TN + FP

6

Where TP stands for true negative, and it indicates the number of correctly predicted positive classes. TN
stands for true negative, and it indicates the number of correctly predicted negative classes. FP is the
false positive, and it shows the incorrectly predicted positive data, while FN is the false-negative, and it
indicates the number of incorrectly predicted negative data. AUC is the area under the Receiver Operating
Characteristic curve (ROC), which is a graph shows the performance of the network at tresholds. ROC
plots the True positive rate versus the false positive rate

4. Results Analysis
In this section, the training and testing performance of the employed models are discussed.
                                                  Page 5/11
Preprocessing: In this work, a machine learning inspired approach is proposed for spam mail detection. In
spam mail identification framework, the first step is to collect the data which are unstructured in nature.
Thus, a preprocessing is required [1, 2]. Therefore, to decrease the calculations and to obtain good results,
email information should be pre-processed. Hence, data are first processed by deleting stop words.
Moreover, word tokenization is likewise performed in order to secure important data. Finally, data are fed
into the machine learning algorithms (NB, KNN, SVM) to be classified as spam or non-spam. Figure 1
shows the preprocessing stage of emails filtering.

Classification Results: In this work, the three employed models (BN, KNN, SVM) are trained and tested
using data from the SpamBase dataset. The learning scheme used to train and test the models is 50:50
i.e., 50% of images are used for training, while the remaining 50% are used as a held out test set for
evaluating the network’s performance. Note that the network is evaluated by calculating its training and
testing accuracy and loss using the formulas in Eqs. (3) and (4). Figure 1 shows the percentage of spam
and non-spam (Ham) content in the used dataset.

                                               Table 2
                        shows the evaluation metrics of the models in classifying
                                  emails into messages or non-spam.
                                            NB           KNN (K = 1)    SVM

                        Accuracy            97%          89%            94%

                        Sensitivity         93%          80%            83%

                        Specificity         87%          71%            78%

                        AUC                 93%          79%            81%

                        Processing time     0.494 (s)    0.630 (s)      0.86.032 (s)

Discussion
In this paper, Naive Bayes (NB), K nearest neighbors (KNN) and support vector machine (SVM) are the
classifiers used to detect emails spams. The models are validated with 50% of the data and the
experimental results are shown in Table 1. According to that able, the KNN achieved the highest accuracy
among them all. It achieved an accuracy of 97%, sensitivity of 93%, specificity of 87%, and area under
curve (AUC) 0f 93%. In terms of computation time, Table 1 shows that the NB required a shorter time to
achieve such accuracy than SVM and KNN.

On the hand, the other models achieved accuracies of 89% and 94% for KNN and SVM, respectively. The
AUC graph of the three models is shown in Fig. 3. Moreover, the receiving operating characteristic (ROC)
is shown in Fig. 4. This ROC curve represents the measure of severability of the two classes: spam and
non-spam. Hence, it is noted that the NB outperformed the other models in terms of AUC as it achieved a
better degree of separability of the two classes.

                                                  Page 6/11
Figure 3 shows the area under of curve of the Naïve Bayes (NB) which the model that outperformed all
other models.

Comparison with other related works: Researches have been extensively conducted to develop machine
learning based detection system for emails spam. Sharma and Bhardwaj [13] proposed a hybrid bagged
approach for the classification of emails spams. Their experimental results showed achieved a
classification accuracy of 87.5%, which is lower than the one our system has reached. Moreover, Pandey
et al., [14] applied SVM and NB to classify emails spam. In their paper, it was seen that SVM achieved an
accuracy of 91% while NB reached 92%. Table 3 shows the comparison of our work with the discussed
paper, and it is seen that our proposed NB outperformed the two other systems.

                                                Table 3
                                  Results comparison with earlier works
                                                                Accuracy

                                Hybrid bagged approach [13]     87.5%

                                SVM [14]                        91%

                                NB [14]                         92%

                                Our work                        97%

Conclusion
In this paper, a comparisons of three machine learning algorithms for the detection of emails spam is
presented. This study shows that different models can behave differently over the same spam dataset, in
which the performance can differ. The three models are trained and validated on a public dataset
(SpamBase dataset) and different evaluation metrics are used to evaluate the performance of the
models. The performance of the models was compared and it was noticed that the Naive Bayes model
outperformed all other models in terms of accuracy, sensitivity, specificity, and area under curve.
Moreover, the processing time of all algorithms was computed and the support vector machine (SVM)
required the longest time for processing.

Finally, the NB that achieved highest accuracy was compared to related works and researches. It was
seen that our NB model was capable of reaching higher accuracy than other related models such as
hybrid bagged approach, SVM, and NB.

Declarations
Ethical approval
No animal or humen subject were used in this research

                                                 Page 7/11
Conflict of interest
There is no Conflict of interest

Authorship contributions
All authors contributed equally to this work

References
  1. GuangJun L, Nazir S, Khan HU, Haq AU (2020) Spam Detection Approach for Secure Mobile Message
    Communication Using Machine Learning Algorithms. Security and Communication Networks, 2020
  2. Crawford M, Khoshgoftaar TM, Prusa JD, Richter AN, Najada A, H (2015) Survey of review spam
     detection using machine learning techniques. J Big Data 2(1):23
  3. Guzella TS, Caminhas WM (2009) "A review of machine learning approaches to spam filtering".
     Expert Syst Appl 36(7):10206–10222
  4. Mohamad M, Selamat A (2015) "An evaluation on the efficiency of hybrid feature selection in spam
     email classification", In: Proc. of 2015 International Conference on Computer, Communications, and
    Control Technology (I4CT), Kuching, Sarawak, Malaysia, pp.227–231,
  5. Helwan A, El-Fakhri G, Sasani H, Uzun Ozsahin D (2018) Deep networks in identifying CT brain
     hemorrhage. J Intell Fuzzy Syst 35(2):2215–2228
  6. Abiyev RH, Ma’aitah MKS (2018) Deep convolutional neural networks for chest diseases detection.
     Journal of healthcare engineering, 2018
  7. Helwan A, Ozsahin DU, Abiyev R, Bush J (2017) One-year survival prediction of myocardial infarction.
     Int J Adv Comput Sci Appl 8(6):173–178
  8. Arik SO, Pfister T (2019) Tabnet: Attentive interpretable tabular learning. arXiv preprint
    arXiv:1908.07442
  9. https://archive.ics.uci.edu/ml/datasets/spambase
10. Idris I (2011) E-mail spam classification with artificial neural network and negative selection
    algorithm. Int J Comput Sci Communication Networks 1(3):227–231
11. Edstrom A (2016) Detecting Spam with Artificial Neural Networks
12. Rish I (2001), August An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on
    empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41–46)
13. Sun B, Du J, Gao T (2009), November Study on the improvement of K-nearest-neighbor algorithm. In
    2009 International Conference on Artificial Intelligence and Computational Intelligence (Vol. 4,
    pp. 390–393). IEEE
14. Zhang Y (2012), September Support vector machine classification algorithm and its application. In
    International Conference on Information Computing and Applications (pp. 179–186). Springer, Berlin,
                                                   Page 8/11
Heidelberg
15. Sharma P, Bhardwaj U (2018) Machine Learning based Spam E-Mail Detection. Int J Intell Eng Syst
    11(3):1–10
16. Pandey P, Agrawal C, Ansari TN (2018) A Hybrid Algorithm for Malicious Spam Detection in Email
    through Machine Learning. Int J Appl Eng Res 13(24):16971–16979

Figures

Figure 1

Emails data preprocessing

                                              Page 9/11
Figure 2

Ratio of spam and non-spam (Ham) content in the used dataset

Figure 3

AUC results of the three models

                                             Page 10/11
Figure 4

Receiver Operating Characteristic (ROC)

                                          Page 11/11
You can also read