Comparison of Three Machine Learning Models for the Detection of Emails Spam

Page created by Ronnie Stanley

IT & Technique

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Comparison of Three Machine Learning Models for the Detection of Emails Spam

Comparison of Three Machine Learning Models for
the Detection of Emails Spam
Shadi Ahmed Khattab
 Al-Balqa' Applied University
Raed Naser Alkaied
 Al-Balqa' Applied University
Mohammad Khaleel Sallam Ma'aitah (  mohammad.maaitah@neu.edu.tr )
 Near East University https://orcid.org/0000-0001-5770-8825

Research Article

Keywords: Machine learning, Naïve bayes, K-Nearest neighbor (KNN), Support vector machine

Posted Date: January 23rd, 2023

DOI: https://doi.org/10.21203/rs.3.rs-1445507/v1

License:   This work is licensed under a Creative Commons Attribution 4.0 International License.
Read Full License

                                               Page 1/11

Abstract
Recently, machine learning has been applied into different major areas such as text classification,
machine translation, and spam detection. The great performance of machine learning algorithms into
several fields provided the humans with opportunities to tackle some of their hard jobs to be handled by
machine learning systems. These tasks seem effortless for machines, and need less time as the amount
of texts or spams need to be classified is huge. Hence, in his paper, we propose three different models for
the task of emails spam detection. The three models are trained and validated on a public spam dataset.
Experimentally, the models performed differently and it was seen that the Naïve Bayes outperformed the
other machine learning algorithms in terms of accuracy and other evaluation metrics.

1. Introduction
Emails exchange is a method of correspondence among the individuals, and billions of cell and laptop
phone users share various emails. In any case, such kind of correspondence is unsafe because of
absence of legitimate message sifting techniques. The spam is the major reason of this weakness, as it
threats the emails exchange between users [1]. Spams are unsolicited junk mails or messages, which are
undesirable for beneficiaries which are shipped off the users without their earlier consent.

The email users invest the vast majority of their significant energy and time in arranging these spam sent
emails [2]. Different duplicates of same message are sent ordinarily which not just influence an
association monetarily, yet in addition aggravates the accepting users. Spam messages are not just
encroaching the user's messages yet they are moreover delivering huge measure of undesirable
information and along these lines influencing the organization's ability and utilization [3].

Mostly, emails have a very similar structure which consists of body of the email and its corresponding
subject. A run of the mill spam mail can be ordered by filtering its substance. The cycle of spam mail
detection depends on the supposition that the substance of the spam mail is not the same as the real or
ham mail. For model words identified with the ad of any item, support of administrations, dating related
content and so on. The cycle of spam email detection can be extensively sorted into two methodologies:
machine learning and knowledge engineering approach [4]. The machine learning approach is to train
machine learning algorithms to classify emails into spam or ham, and it showed more effective results
than humans and knowledge engineering approaches [3, 4]. Moreover, machine learning showed
promising outcomes in different engineering areas such as images classification [5, 6], prediction [7], and
natural language processing [8].

In this study, we apply three different machine learning techniques for the task of classifying emails
messages into spam or ham. The three algorithms classify the emails based on their content and other
parameters. Naïve Bayes (NB), K-nearest neighbor (KNN), and support vector machine (SVM) are
employed to classify the emails into spam and ham. The three models are trained using the Spambase
Dataset [9] and their performance is compared and discussed.

Page 2/11

2. Related Works
Emails spam detection is an essential classification task in the cybersecurity field due to the importance
of preventing frauds. Several researches have been conducted for the detections of spams. Machine
learning have been extensively applied to solve such problem. Idris [10] proposed the detection of emails
spam using neural network and negative selection algorithm. In this work, the performance of neural
network for such task was compared with support vector machine and it was found the neural network
using backpropagation achieved a higher accuracy than that reached by the svm.

Another research for spam classification using neural networks was proposed by Edstrom [11]. In their
work, it was found that number of hidden layers doesn’t improve the accuracy of the network. A single
hidden layer neural network was seen to achieve the best performance of 94,6% accuracy.

Sharma and Bhardwaj [12] developed a hybrid bagged approach for the classification of emails spams.
Their system is a combination of Naïve Bayes and J48 algorithm trained to classify emails into spam or
non-spam. Their experiments showed that their proposed system achieved a classification accuracy of
87.5%, which is lower than the one our system has reached. Moreover, Pandey et al., [13] proposed a
system based on SVM and NB to classify emails. In their paper, SVM achieved an accuracy of 91% while
NB reached 92%.

3. Materials And Methods
This section discusses the dataset used for training and testing the three models in addition to the
research methods and materials of the paper.

Dataset: the SpamBase dataset [9] is considered to be used for training and testing the three different
employed models. This dataset consists of 4601 instances of both spam and non-spam emails. A
learning scheme of 50:50 is considered for training and testing such models in which 50% is used for
training and 50% for testing.

Table 1 shows a sample of some spam and non-spam instances of the dataset. Note that the two
classes are labeled as “1” for spam and “0” for non-spam.

Page 3/11

Table 1
a sample of spam and non-spam instances of the used dataset
Sample Class Content
number

1 Spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of
Australia? Text MQUIZ to 82277. B

2 Ham As I entered my cabin my PA said, '' Happy B'day Boss !!''. I felt special. She askd
me 4 lunch. After lunch she invited me to her apartment. We went there.

3 Spam Today’s Voda numbers ending with 7634 are selected to receive a å£350 reward.
If you have a match, please call 08712300220 quoting claim code 7684 standard
rates apply.

Naïve Bayes (NB): Naïve Bayes is a probabilistic machine learning algorithm for binary or multiclass
classification tasks. Such algorithm is based on Bayes’s theorem [14] and it works by assuming that the
occurrence of a certain feature is independent of the occurrence of other features. Baye’s theorem is used
to determine the probability of hypothesis with prior knowledge. Baye's theorem is utilized to determine
the likelihood of theory with earlier knowledge.

The working formula of Baye’s theorem is:

P (B|A) P (A)
P (A|B) =
P (B)

Where P ⃓
(AB) is probability of hypothesis A, give that B is true. P ⃓
(BA) is the likelihood hypothesis B,
given that A is true. P(A) and P(B) are the probabilities of hypotheses A and B, independently.

K-Nearest Neighbor (KNN)

K-nearest neighbors is a basic and simple algorithm that stores every accessible instance and predict
classes of the new cases depending on a distance measure (e.g., Euclidean distance measures) [15]. For
such algorithm, a new case is classified by the majority voting of its neighbors. This case is then
assigned to its k-nearest neighbors by measuring its corresponding distances to all its neighbors.
Different distance measure can be used to compute distance, however, in this work, the Euclidean
distance is used and it is as follows

Support Vector Machine (SVM)

Page 4/11

SVM is a machine learning algorithm that can be for both classification and regression problems. This
algorithm works mainly by finding a hyperplane in N-dimensional space to classify data points in
different classes [16]. The idea is to find the best plane that has the maximum margin. In other words, the
plane in which the distance between classes and data points is the maximum. It is then important to
maximize this margin. SVM algorithm can maximize the margin between data points and hyperplane by
computing the minimizing cost function, i.e., Hinge loss. Hinge loss, Exponential loss, Logit loss and
many other types of loss can be used to train the SVM. However, in this work, the hinge loss is used and it
is defined as following

Where θ is the angle between the two vectors, x and y.

Metrics for evaluation the models performance

several metrics are used in this work to evaluate the performance the three employed models for
classifying email messages. These metrics include the accuracy, ROC, specificity, sensitivity.

TP + TN
Accuracy (Acc) =
TP + FN + TN + FP

TP
Sensitivity (Sn) =
TP + FN

TN
Specificity (Sp) =
TN + FP

Where TP stands for true negative, and it indicates the number of correctly predicted positive classes. TN
stands for true negative, and it indicates the number of correctly predicted negative classes. FP is the
false positive, and it shows the incorrectly predicted positive data, while FN is the false-negative, and it
indicates the number of incorrectly predicted negative data. AUC is the area under the Receiver Operating
Characteristic curve (ROC), which is a graph shows the performance of the network at tresholds. ROC
plots the True positive rate versus the false positive rate

4. Results Analysis
In this section, the training and testing performance of the employed models are discussed.
Page 5/11

Preprocessing: In this work, a machine learning inspired approach is proposed for spam mail detection. In
spam mail identification framework, the first step is to collect the data which are unstructured in nature.
Thus, a preprocessing is required [1, 2]. Therefore, to decrease the calculations and to obtain good results,
email information should be pre-processed. Hence, data are first processed by deleting stop words.
Moreover, word tokenization is likewise performed in order to secure important data. Finally, data are fed
into the machine learning algorithms (NB, KNN, SVM) to be classified as spam or non-spam. Figure 1
shows the preprocessing stage of emails filtering.

Classification Results: In this work, the three employed models (BN, KNN, SVM) are trained and tested
using data from the SpamBase dataset. The learning scheme used to train and test the models is 50:50
i.e., 50% of images are used for training, while the remaining 50% are used as a held out test set for
evaluating the network’s performance. Note that the network is evaluated by calculating its training and
testing accuracy and loss using the formulas in Eqs. (3) and (4). Figure 1 shows the percentage of spam
and non-spam (Ham) content in the used dataset.

Table 2
shows the evaluation metrics of the models in classifying
emails into messages or non-spam.
NB KNN (K = 1) SVM

Accuracy 97% 89% 94%

Sensitivity 93% 80% 83%

Specificity 87% 71% 78%

AUC 93% 79% 81%

Processing time 0.494 (s) 0.630 (s) 0.86.032 (s)

Discussion
In this paper, Naive Bayes (NB), K nearest neighbors (KNN) and support vector machine (SVM) are the
classifiers used to detect emails spams. The models are validated with 50% of the data and the
experimental results are shown in Table 1. According to that able, the KNN achieved the highest accuracy
among them all. It achieved an accuracy of 97%, sensitivity of 93%, specificity of 87%, and area under
curve (AUC) 0f 93%. In terms of computation time, Table 1 shows that the NB required a shorter time to
achieve such accuracy than SVM and KNN.

On the hand, the other models achieved accuracies of 89% and 94% for KNN and SVM, respectively. The
AUC graph of the three models is shown in Fig. 3. Moreover, the receiving operating characteristic (ROC)
is shown in Fig. 4. This ROC curve represents the measure of severability of the two classes: spam and
non-spam. Hence, it is noted that the NB outperformed the other models in terms of AUC as it achieved a
better degree of separability of the two classes.

Page 6/11

Figure 3 shows the area under of curve of the Naïve Bayes (NB) which the model that outperformed all
other models.

Comparison with other related works: Researches have been extensively conducted to develop machine
learning based detection system for emails spam. Sharma and Bhardwaj [13] proposed a hybrid bagged
approach for the classification of emails spams. Their experimental results showed achieved a
classification accuracy of 87.5%, which is lower than the one our system has reached. Moreover, Pandey
et al., [14] applied SVM and NB to classify emails spam. In their paper, it was seen that SVM achieved an
accuracy of 91% while NB reached 92%. Table 3 shows the comparison of our work with the discussed
paper, and it is seen that our proposed NB outperformed the two other systems.

Table 3
Results comparison with earlier works
Accuracy

Hybrid bagged approach [13] 87.5%

SVM [14] 91%

NB [14] 92%

Our work 97%

Conclusion
In this paper, a comparisons of three machine learning algorithms for the detection of emails spam is
presented. This study shows that different models can behave differently over the same spam dataset, in
which the performance can differ. The three models are trained and validated on a public dataset
(SpamBase dataset) and different evaluation metrics are used to evaluate the performance of the
models. The performance of the models was compared and it was noticed that the Naive Bayes model
outperformed all other models in terms of accuracy, sensitivity, specificity, and area under curve.
Moreover, the processing time of all algorithms was computed and the support vector machine (SVM)
required the longest time for processing.

Finally, the NB that achieved highest accuracy was compared to related works and researches. It was
seen that our NB model was capable of reaching higher accuracy than other related models such as
hybrid bagged approach, SVM, and NB.

Declarations
Ethical approval
No animal or humen subject were used in this research

Page 7/11

Conflict of interest
There is no Conflict of interest

Authorship contributions
All authors contributed equally to this work

References
  1. GuangJun L, Nazir S, Khan HU, Haq AU (2020) Spam Detection Approach for Secure Mobile Message
    Communication Using Machine Learning Algorithms. Security and Communication Networks, 2020
  2. Crawford M, Khoshgoftaar TM, Prusa JD, Richter AN, Najada A, H (2015) Survey of review spam
     detection using machine learning techniques. J Big Data 2(1):23
  3. Guzella TS, Caminhas WM (2009) "A review of machine learning approaches to spam filtering".
     Expert Syst Appl 36(7):10206–10222
  4. Mohamad M, Selamat A (2015) "An evaluation on the efficiency of hybrid feature selection in spam
     email classification", In: Proc. of 2015 International Conference on Computer, Communications, and
    Control Technology (I4CT), Kuching, Sarawak, Malaysia, pp.227–231,
  5. Helwan A, El-Fakhri G, Sasani H, Uzun Ozsahin D (2018) Deep networks in identifying CT brain
     hemorrhage. J Intell Fuzzy Syst 35(2):2215–2228
  6. Abiyev RH, Ma’aitah MKS (2018) Deep convolutional neural networks for chest diseases detection.
     Journal of healthcare engineering, 2018
  7. Helwan A, Ozsahin DU, Abiyev R, Bush J (2017) One-year survival prediction of myocardial infarction.
     Int J Adv Comput Sci Appl 8(6):173–178
  8. Arik SO, Pfister T (2019) Tabnet: Attentive interpretable tabular learning. arXiv preprint
    arXiv:1908.07442
  9. https://archive.ics.uci.edu/ml/datasets/spambase
10. Idris I (2011) E-mail spam classification with artificial neural network and negative selection
    algorithm. Int J Comput Sci Communication Networks 1(3):227–231
11. Edstrom A (2016) Detecting Spam with Artificial Neural Networks
12. Rish I (2001), August An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on
    empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41–46)
13. Sun B, Du J, Gao T (2009), November Study on the improvement of K-nearest-neighbor algorithm. In
    2009 International Conference on Artificial Intelligence and Computational Intelligence (Vol. 4,
    pp. 390–393). IEEE
14. Zhang Y (2012), September Support vector machine classification algorithm and its application. In
    International Conference on Information Computing and Applications (pp. 179–186). Springer, Berlin,
                                                   Page 8/11

Heidelberg
15. Sharma P, Bhardwaj U (2018) Machine Learning based Spam E-Mail Detection. Int J Intell Eng Syst
    11(3):1–10
16. Pandey P, Agrawal C, Ansari TN (2018) A Hybrid Algorithm for Malicious Spam Detection in Email
    through Machine Learning. Int J Appl Eng Res 13(24):16971–16979

Figures

Figure 1

Emails data preprocessing

                                              Page 9/11

Figure 2

Ratio of spam and non-spam (Ham) content in the used dataset

Figure 3

AUC results of the three models

                                             Page 10/11

Figure 4

Receiver Operating Characteristic (ROC)

                                          Page 11/11

You can also read

Natural Language Processing with Deep Learning CS224N/Ling284 - Abigail See, Matthew Lamm

EUROXWAY SRL EUROPEAN DISTRIBUTION PROFESSIONAL SOLUTIONS FOR INDUSTRY - f-static.net

John Cronin Managing Director and Chairman - PostalVision 2020

State of Mobile Device Repair & Security - TREND REPORT: Q4 2017 - FreeBit.cz

CAN STATISTICAL MODELS BEAT BENCHMARK PREDICTIONS BASED ON RANKING IN TENNIS?

ROCKWOOD FREEDOM " PREMIER " HIGH WALL " HARD SIDE CAMPERS - Forest River, Inc.

THE NEW GMAIL INBOX TABS - SUCCESS GUIDE What It Is, What It Means for Marketers, and What to Do About It

2019 DIRTcar Nationals Frequently Asked Questions Ticket Hotline: (844) DIRTTIX DIRTcarNationals.com

Development of Dynamic Modulus Predictive Model Using Artificial Neural Network (ANN)

Fundraising Pack - 01794 324249 I I www.dentaid.org Registered Charity No. 1075826 116 Commercial Road, Totton, Hampshire SO40 3AD

Working together for a sustainable future - Allianz

Digital force and torque gauges Sales brochure - Aygenteks

Exosolar Planet Research in Italy 2011 - Raffaele Gratton INAF - Osservatorio Astronomico di Padova

The FUTURe BelONgs TO IO-lINk - FACTORY AUTOMATION

HES HANNOVER ECG SYSTEM - Corscience

Impact Factor: 7.569 Volume 11, Issue 1, January 2022 - Ijirset.com

SARS-COV-2 LAMP DIAGNOSTIC ASSAY - COLOR GENOMICS

FTSE Fixed Income Country Classification Announcement March 2022

RISE COHORT 15 2020 - 2021 Profiles

The FGVC Plant Pathology 2020 challenge dataset to classify foliar diseases of apples

Classification of new data in UK consumer price statistics

An Ensemble of Deep Learning Frameworks Applied For Predicting Respiratory Anomalies - arXiv

Weakly Supervised Active Learning with Cluster Annotation

2021 Policy Update and NHIA's Efforts to Promote Equitable Access for Home Infusion Services - February 24, 2021

Internet Regulation in the EU: Current Trends - Beihang University, Beijing, 27.9.2018 - jura.uni-frankfurt.de

READ ME IF YOU WANT TO CREATE AN EFFECTIVE SALES PROFILE ON LINKEDIN - #1 in a series of 6

Efficacy Data Updates from Novavax' Protein-based Vaccine Candidate - New York Academy of Sciences

The Perspective of China SOI Ecosystem - Jeffrey Wang CEO Simgui Technology World Semiconductor Conference Nanjing May 18, 2019

ICBF Beef Industry Meeting - 1st October 2009 - IRISH CATTLE BREEDING FEDERATION

Bluford-Peeler STEAM Academy - Science Fair Planner Teacher Name: _ - Guilford County Schools

MCIR/SIRS Training Barry-Eaton District Health Department 2020-2021 - Barry Eaton District Health Department

Digital twin technology as a modern approach to quality management

Strategic Plan Progress & Feedback Summaries - Fall 2019 - 2020 Progress Report - University of Arkansas

Chipotle's Super Bowl Ad: Burritos + Sustainability - PROPRIETARY RESEARCH CONSUMER INSIGHT - NORTH AMERICA - Maru/Blue

Machine Learning: Il motore dell'Intelligenza Artificiale e dell'investigazione scientifica guidata dai Big Data. Stefano Leonardi

Broadband Internet Affordability - United Nations ESCAP

Resources Allocation and Sharing in Wireless Virtual Networks

HIAG Immobilien Holding AG Business Year 2017 - 19 March 2018

TLP HOT Day-11 2020 - IASbaba

Semi-Supervised Neural Machine Translation with Language Models

39th ANNUAL J.P. MORGAN - HEALTHCARE CONFERENCE January 14, 2021 - NASDAQ: BTAI

Should All Blockchain-Based Digital Assets be Classified under the Same Asset Class?

INTELLIGENT REAL-TIME SCHEDULING, DISPATCHING AND MONITORING SYSTEM FOR UNMANNED VEHICLES

International Journal of Advance Engineering and Research

Ant Colony System Based Ant Adaptive For Search of the Fastest Route of Tourism Object Jember, East Java - IOPscience

New insights into Sulawesi's apex predator: the Sulawesi civet Macrogalidia musschenbroekii

VARIANTS OF CONCERN RESPONSIBLE FOR SARS-COV-2 VACCINE BREAKTHROUGH INFECTIONS FROM INDIA

Oracle is Leader in SPARK Matrix: Know Your Customer (KYC) Solution, 2021

Trainguard Zub 222c automatic train control system - State-of-the-art technology for cost-effective rail services