2018 9th International Conference on Information and Communication Systems (ICICS)

           A Data Mining based Model for Detection of
           Fraudulent Behaviour in Water Consumption

                    Qasem A. Al-Radaideh                                                      Mahmoud M. Al-Zoubi
          Department of Computer Information Systems,                          Department of Commutation and Information Technology,
                     Yarmouk University,                                                     Yarmouk Water Company,
                      Irbid 21163, Jordan                                                       Irbid 21110, Jordan
                       qasemr@yu.edu.jo                                                        mzoubi12@yahoo.com

    Abstract: Fraudulent behavior in drinking water consumption is a significant problem facing
water supply companies and agencies. This behavior results in a massive loss of income and forms
the highest percentage of non-technical loss. Finding efficient measures for detecting fraudulent
activities has been an active research area in recent years. Intelligent data mining techniques
can help water supply companies detect these fraudulent activities and reduce such losses. This
research explores the use of two classification techniques (SVM and KNN) to detect suspected
fraudulent water customers. The main motivation of this research is to assist Yarmouk Water
Company (YWC) in Irbid city, Jordan, to overcome its profit loss. The SVM-based approach uses
customer load profile attributes to expose abnormal behavior that is known to be correlated with
non-technical loss activities. The data was collected from the historical data of the company
billing system. The accuracy of the generated model reached over 74%, which is better than the
current manual prediction procedures used by YWC. To deploy the model, a decision tool has been
built using the generated model. The system will help the company to predict suspicious water
customers to be inspected on site.

    Keywords: Fraud Detection, Data Mining, SVM, KNN, Water Consumption.

I. INTRODUCTION
    Water is essential for households, industry, and agriculture. Jordan, like several other
countries in the world, suffers from water scarcity, which threatens all sectors that depend on
the availability of water to sustain their activities, development, and prosperity [1].

    According to the Jordanian Ministry of Water and Irrigation, this issue has always been one
of the biggest barriers to the economic growth and development of Jordan. The crisis has been
aggravated by a population that has doubled in the last two decades. The Ministry's efforts to
improve water and sanitation services are constrained by managerial, technical, and financial
factors and by the limited amount of renewable freshwater resources [2].

    To address these challenges, the Jordanian Ministry of Water and Irrigation, as in many other
countries, is striving through the adoption of a long-term plan to improve the services provided
to citizens by restructuring and rehabilitating networks, reducing non-revenue water rates,
providing new sources, and maximizing the efficient use of available sources. At the same time,
the Ministry continues its efforts to regulate water usage and to detect the loss of supplied
water [2].

    Water supply companies incur significant losses due to fraud in water consumption. A customer
who tampers with water meter readings to avoid or reduce the billed amount is called a fraud
customer. In practice, there are two types of water loss. The first is technical loss (TL), which
is related to problems in the production system, the transmission of water through the network
(i.e., leakage), and the network washout. The second is non-technical loss (NTL), which is the
amount of water delivered to customers but not billed, resulting in loss of revenue [3].

    The management of the Yarmouk Water Company (Jordan) is seriously concerned with reducing its
profit losses, especially those derived from NTLs, which were estimated at over 35% across the
whole service area in 2012. One major part of NTL is customers’ fraudulent activities. The
commercial department manages the detection process without an intelligent computerized system,
and the current process is costly and neither effective nor efficient.

    NTL is a serious problem facing Yarmouk Water Company (YWC). In 2012 the NTL reached over
35%, ranging from 31% to 61% depending on the district, which results in a loss of 13 million
dollars per year. Currently, YWC relies on random inspections of customers; the model proposed in
this paper provides a valuable tool to help YWC teams detect theft customers, which will reduce
the NTL and raise profit.

    The literature contains abundant research on Non-Technical Loss (NTL) in electricity fraud
detection, but little research has been conducted for the water consumption sector. This paper
focuses on customers’ historical data selected from the YWC billing system. The main objective of
this work is to use two well-known data mining techniques, Support Vector Machines (SVM) and
K-Nearest Neighbor (KNN), to build a suitable model that detects suspected fraudulent customers
from their historical metered water consumption.

         978-1-5386-4366-2/18/$31.00 ©2018 IEEE

II. RELATED WORK
    This section reviews some applications of data mining classification techniques to fraud
detection in different areas, such as fraudulent financial statements, mobile communication
networks, credit cards, and medical claims. For example, Kirkos et al. [4] proposed a model for
detecting fraud in financial statements using three data mining classifiers, namely Decision
Tree, Neural Network, and Bayesian Belief Network. Shahine et al. [5] introduced a model for
credit card fraud detection using decision trees and support vector machines (SVM). In addition,
Panigrahi et al. [6] proposed a model for credit card fraud detection using a rule-based filter,
a Bayesian classifier, and a Dempster-Shafer adder.
    Carneiro et al. [7] developed and deployed a fraud detection system in a large e-tail
merchant. They explored the combination of manual and automatic classification and compared
different machine learning methods. Ortega et al. [8] proposed a fraud detection system for
medical claims using data mining methods. The proposed system uses multilayer perceptron neural
networks (MLP). The researchers showed that the model was able to detect 75 fraud cases per
month.
    Kusaksizoglu et al. [9] introduced a model for detecting fraud in mobile communication
networks. The results showed that the Neural Networks methods MLP and SMO were found to give the
best results. In addition, Chen et al. [10] proposed and developed an integrated platform for
fraud analysis and detection based on real-time messaging communications in social media.
    Nagi et al. [11] [12] [13] introduced a technique for classifying fraudulent behavior in
electricity consumption. The proposed method combines two algorithms, Genetic Algorithm (GA) and
Support Vector Machine (SVM), which yields a hybrid model (named GA-SVM). The technique processed
past customer consumption profiles to reveal abnormal consumption among the customers of the
Tenaga Nasional Berhad (TNB) electricity utility in Malaysia. After an investigation, four
categories were found (change of tenant, replaced meter, faulty meter, and abandoned house). An
expert system was designed to remove such customers by considering characteristics that
distinguish these four customer categories from theft customers. The authors reported that the
system raised the detection hit rate from 3% under the company's existing procedures to 60% after
on-site inspection.
    Ramos et al. [3] presented an optimum-path forest (OPF) classifier to detect fraud customers
in electricity consumption. The classifier was compared with other robust classifiers: ANN,
SVM-Linear, and SVM-RBF. The results showed that OPF accuracy is similar to that of SVM-RBF but
that OPF is superior in training time, which enables real-time classification. The accuracy of
the other two classifiers was not comparable.
    León et al. [14] suggested a model that can reveal electricity fraud customers. The data was
obtained from the Spanish Endesa Company. The classification model is based on Generalized Rule
Induction (GRI) and Quest Decision Tree methods. Furthermore, they introduced two statistical
estimators, which are used to weigh the customers’ trend and the non-constant consumption. The
model helps to identify abnormal consumption, which may stem either from anomalies without fraud,
so that the customers can simply be re-billed, or from fraud, where adequate procedures can take
place. The accuracy of the model reached 22%.
    Filho et al. [15] applied a decision tree classification technique to the detection of
suspected fraud customers and corrupted measurement meters. They used five months of customers’
consumption data, in which customers were classified as fraud or non-fraud. The technique raised
the hit rate from 5% using current techniques to 40%.
    Jiang et al. [16] suggested an approach using wavelet techniques and a group of classifiers
to automatically detect fraud customers in electricity consumption. The wavelet technique was
used to express the properties of the meter readings. These readings were used to build models
using several classifiers, based on the assumption that abnormalities in consumption appear when
fraud occurs. Cabral et al. [17] introduced a fraud detection system using data mining techniques
for high-voltage electricity customers in Brazil. The techniques compare customers’ historical
data with their current consumption and present the possible fraud status. Customers flagged as
consuming below their regular level are then investigated by the company inspection team.
    De Faria et al. [18] presented a use case of forensic investigation procedures applied to
detect electricity theft based on tampered electronic devices. Viegas et al. [19] provided an
extended literature review with an analysis of a selection of scientific studies on the detection
of non-technical losses in the electric grid reported since 2000 in three well-known databases:
ScienceDirect, ACM Digital Library, and IEEE Xplore.
    Coma-Puig et al. [20] developed a system that detects anomalous meter readings on the basis
of models built from past data using machine learning techniques. The system detects meter
anomalies and fraudulent customer behaviour (meter tampering), and it was developed for a company
that provides electricity and gas. Richardson et al. [21] introduced a novel privacy-preserving
approach to detecting energy theft in smart grids. Malicious behaviour is detected by calculating
the Euclidean distance between energy output measurements from an installation over a day. These
distances are then clustered to identify outliers and potentially malicious behaviour.
    The available literature on detecting the fraudulent activities behind Non-Technical Loss in
water consumption is limited in comparison to other sectors such as electricity consumption and
finance. For example, Monedero et al. [22] developed a methodology consisting of a set of three
algorithms for the detection of meter tampering in the Emasesa Company (a water distribution
company in Seville).
    Humaid’s research [23] is the only research conducted in the Arab region related to
suspicious water consumption activities. Humaid used data mining techniques to discover
fraudulent water consumption by customers in Gaza city. The historical
data of water consumption was used as a training dataset to build the intelligent model. The
author focused on the support vector machine (SVM) classifier and compared it with the K-Nearest
Neighbour (KNN) and Neural Network (ANN) classifiers. As the monthly water consumption data were
scattered over 44 tables, Humaid [23] unified them in one table representing customers’
consumption for 144 months from 03/2000 to 02/2012. At the beginning of the period, consumption
was read every two months, so null values appeared in the data; to overcome this, he divided each
reading by 2 and assigned the result equally to the two months it covered. An attribute,
‘Fraud_Status’, was then added as the class label for all customers.

    The customers recorded by the municipality of Gaza as water thieves were manually marked with
the label ‘YES’ in the field Fraud_Status, and the rest were labeled ‘NO’. The data was filtered
to remove inconsistent and noisy cases, and it was normalized using z-score to fit the SVM model.
As is typical of fraud data, the classes are unbalanced. SVM has a parameter set that can be used
to weight and balance the two classes. The ratio of each class was calculated to compute the
parameter values, and the values were multiplied by 100 to achieve a suitable ratio weight for
SVM. Random sub-sampling was applied to the samples with class label ‘NO’ to balance the classes
for KNN and ANN. The results showed that the SVM classifier had the best accuracy of the three
classifiers for the balanced samples, either with the consumption feature alone or with all
selected features. The seasonal consumption dataset proved more suitable than the monthly and
yearly datasets because it takes seasonal consumption changes into consideration; with it, the
intelligent model raised the hit rate from 10% under random inspection to 80%. This research
showed that unbalanced samples gave the best accuracy for all classifiers.

A. The Support Vector Machines (SVM) Classifier
    Support vector machine (SVM) is a supervised classification method. It works well for linear
and non-linear data and is usable for numeric prediction in addition to classification. SVM has
been widely used in different applications (e.g., object recognition, speaker identification, and
hand-written digit recognition). While SVM training is relatively slow, it is highly accurate,
and it is less prone to overfitting than other classifiers.
    SVM works by separating the training data with a hyperplane. Because the data is often
difficult to separate in its original dimension, SVM uses a non-linear mapping of the data into a
higher dimension, which enables it to find hyperplanes that separate the data efficiently. SVM
then searches for the best separating hyperplane [24]. More details about this technique can be
found in the literature and in most data mining books [25].

B. The K-Nearest Neighbour Classifier
    K-nearest neighbor (KNN) is a lazy learner, in contrast to eager classifiers such as
Rule-Based, Decision Tree, and SVM, which construct the model as soon as they are given the
training set and so become ready (eager) to classify. KNN instead does nothing when given a
training set and waits until the test set becomes available. It does not build a generalization;
it merely stores the training tuples or instances [25]. KNN works by comparing a given test tuple
with the training tuples that are closest or most similar to it; KNN is therefore based on
analogy. The training tuples are stored as points in an n-dimensional pattern space. Given an
unknown test tuple, KNN finds the k training tuples that are closest to it; these tuples are the
k nearest neighbors, and the test tuple is classified by the majority vote of its k neighbors.

    The similarity can be measured using several distance metrics, such as the Euclidean
distance. Let $x_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $x_2 = (x_{21}, x_{22}, \ldots, x_{2n})$
be two tuples. The Euclidean distance (dist) between $x_1$ and $x_2$ is computed using
equation (1):

    $$\mathrm{dist}(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2} \qquad (1)$$

    The best value for K can be determined experimentally: we start with k=1 and increment it by
1 repeatedly until we obtain the minimum error rate [25].

III. THE METHODOLOGY
    The CRISP-DM (Cross Industry Standard Process for Data Mining) [26] was adopted to conduct
this research. CRISP-DM is an industry-standard data mining methodology developed by four
companies: NCR Systems Engineering, DaimlerChrysler AG, SPSS Inc., and OHRA. The CRISP-DM model
consists of business understanding, data understanding, data preparation, model building, model
evaluation, and model deployment.

A. Business Understanding
    Yarmouk Water Company was established in July 2010 and is entirely owned by the Water
Authority of Jordan (WAJ). The concession area of the company covers the northern governorates of
Irbid, Mafraq, Jerash, and Ajloun. To run its business efficiently, YWC is structured into six
directorates at headquarters: Commercial, Operational Support, Technical Affairs, Finance, Human
Resources, and Information Technology. The company established ten branches across the concession
region; each branch is responsible for managing the distribution of water and customer affairs in
its designated area. These branches are called Regional Organizational Units (ROUs).

    To understand the customers’ billing process well, we collected and analyzed information
related to billing and fraud-fighting through interviews with employees, document review, and
exploration of the billing system. Customers’ bills are issued every three months. The customers’
meter readings are divided into three groups, such that each group is read within one month. At
the time of a group reading, the commercial department prepares the routes to be read by
connecting to the company’s Geographic Information System (GIS). The teams responsible for meter
reading read the meters at the customers’ properties. Once the reader gets the meter reading, he
enters it into the billing system on his own Handheld Unit (HHU), issues the bill, and submits it
to the customer. In case of a problem (e.g., a high or low reading), the system does not print a
bill. The HHUs are connected to the server directly, and the billing data are transferred to the
COBOL billing system through an intermediary Oracle application and FTP tools. On the next day,
the billing auditor in the commercial department audits the bills and solves the problems (e.g.,
a high reading). When all problems
are solved the bills are approved. The legacy billing system was developed in the mid-1980s using
the COBOL language. It is still in service and is installed on a mainframe with the OPEN-VMS
platform. Yarmouk Water Company implemented an HHU billing system in Oct 2011. This system is
intended to issue the bills in the field. The HHU billing system is integrated with the
COBOL-based billing system, and all the bills are computed and issued from this system.
    The commercial departments in the ROUs try to fight theft of water. When a zone is supplied
with water, a random inspection is performed on the customers’ properties and water connections.
If a theft case is detected, the inspectors record the case on a dedicated form and return the
form to the department, and a penalty is imposed on that customer.

B. Data Understanding
    The essential part of the data mining process is the data itself. The following sections
characterize the structure and nature of the collected data. This paper is limited to the data
collected from the billing system, which is mainly used for issuing the customers’ water bills.
Suitable COBOL programs were developed to extract the most important customer billing data into
text-format data files. Oracle tables were created with a format similar to that of the COBOL
data files. The tables are the customers’ main information table, the customers’ water
consumptions table, and the customers’ payments table. The description of the customers’ water
consumptions table (relation) is presented in Table I.

          TABLE I.       CUSTOMERS’ CONSUMPTIONS TABLE

 Column                     Description
 DIST_NO (PK)               The district number
 TOWN_NO (PK)               The village/town number
 CONS_NO (PK)               Customer number
 BILL_NO                    The number of the bill
 BILL_STAT                  The status of the bill
 ISSUE_DATE                 Date of bill issue
 PRINT_FLAG                 A flag indicating whether the bill is printed
 OLD_MET_PREV_RDNG          The old meter previous reading
 FORWARD_BAL                The forward balance
 ADVANCE_PMNT               Advance payment
 OUTSTANDING_AMOUNT         Outstanding balance
 CALC_CONS_FLAG             A flag indicating that the consumption was calculated
 METER_NO                   Water meter metal number
 METER_STATUS               Water meter valid or corrupted
 ISSUE_DATE                 Bill issue date
 CONS_TYPE                  Customer service type (industrial, housing, ..)
 FROM_DATE                  Start of consumption period for the bill
 TO_DATE                    End of consumption period for the bill
 UNITS_CONS                 Units consumed in cubic meters
 UNITS_CONS_VAL             The price of the consumed water
 SEWER_VAL                  The price of the sewerage service
 METER_RENT                 The meter fees
 CYCLE_NO                   Consumption cycle number
 CYCLE_YY                   Consumption cycle year

• Collecting the Fraud Data
    The theft cases discovered by the fraud-fighting teams are processed manually on paper
documents and MS Excel sheets. The total theft data available are 4155 cases in Qasabat Irbid
ROU. The attributes of the fraud customers table are listed in Table II.

          TABLE II.      THEFT CUSTOMERS TABLE COLUMNS.

 Column                     Description
 DIST_NO                    The district number
 TOWN_NO                    The village/town number
 CONS_NO                    Customer number
 CONS_NAME                  Customer name
 INFRACTION_NO              Infraction number
 INFRACTION_DATE            Infraction date

C. Data Preparation
    This phase of the knowledge discovery process requires considerable effort to prepare
high-quality data in a format suitable for the modeling phase. The data preparation phase
includes the following tasks: Data Preprocessing, Customer Filtering and Selection, Feature
Extraction, Data Normalization, and Feature Adjustment.

• Data Preprocessing
    The main steps of this phase are illustrated in Figure 1. The consumption table of the
historical customers’ data contains around 16 million records for 109 thousand customers. It
includes the consumption for the interval from 1990 to the current time. The consumption records
related to Qasabat Irbid ROU amount to around 1.5 million records for around 90 thousand
customers. The consumption data for the customers are stored in a vertical format as shown in
Figure 2.

[Figure 1: flow diagram with the stages Historical Customers’ Data, Customers Filtering and
Selection, Feature Extraction, Data Normalization, Feature Adjustment]
Fig. 1. Data Preprocessing Framework.

    The customers’ profiles need to be formatted horizontally to train the model. A sample of
processed customers’ profiles is presented in Figure 2, in which each customer’s profile is
ordered by the periods of consumption and the water cycles represent the seasons (i.e., cycle
number one in 2000 represents the consumption of the last three months 1/10/2006-1/1/2007). This
is fair for classification because the consumption varies according to the seasons of the year.

• Preparing the Fraudulent Data
    To extract the fraud customers’ profiles, a new table is created containing the client’s
number, the water consumption, and a new attribute for the fraud class. This attribute is filled
with the value ‘YES’. Another table is created for the normal clients, and its fraud class
attribute is filled with the value ‘NO’. The two tables are then consolidated into one table
containing the customer ID, consumption profile, and fraud class attributes. To filter the data,
several preprocessing operations were performed: eliminating redundancy, eliminating customers
with zero consumption through the entire period, eliminating new clients who are not present
during the whole targeted period, and eliminating customers with null consumption values.
Filtering reduced the original dataset to 16114 records for non-fraud customers and 647 records
for fraud customers.
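
    The pivoting, labelling, and filtering steps described above can be sketched in a few lines
of Python. This is a minimal illustration under stated assumptions, not the procedure the authors
ran against the Oracle tables: the function name build_profiles, the demo_* variables, and the
toy records are hypothetical, and a single integer stands in for the composite customer key
(DIST_NO, TOWN_NO, CONS_NO) of Table I. Each input record carries the CYCLE_YY, CYCLE_NO, and
UNITS_CONS values of one bill.

```python
from collections import defaultdict


def build_profiles(records, periods, fraud_ids):
    """Pivot vertical consumption records into one labelled row per customer.

    records   : iterable of (cust_id, cycle_yy, cycle_no, units_cons) tuples
    periods   : ordered list of (cycle_yy, cycle_no) pairs defining the target window
    fraud_ids : set of customer ids taken from the theft table
    """
    by_customer = defaultdict(dict)
    for cust_id, cycle_yy, cycle_no, units in records:
        # A repeated (customer, period) key overwrites the earlier row,
        # which removes redundant records.
        by_customer[cust_id][(cycle_yy, cycle_no)] = units

    profiles = []
    for cust_id, cons in by_customer.items():
        # Drop new clients that are not present during the whole window.
        if any(p not in cons for p in periods):
            continue
        values = [cons[p] for p in periods]
        # Drop customers with null consumption values.
        if any(v is None for v in values):
            continue
        # Drop customers with zero consumption through the entire period.
        if all(v == 0 for v in values):
            continue
        label = "YES" if cust_id in fraud_ids else "NO"
        profiles.append((cust_id, values, label))
    return profiles


# Hypothetical toy data: two billing cycles, five customers.
demo_periods = [(2010, 1), (2010, 2)]
demo_records = [
    (1, 2010, 1, 30), (1, 2010, 2, 40), (1, 2010, 2, 40),  # duplicate row
    (2, 2010, 1, 0), (2, 2010, 2, 0),                      # always zero
    (3, 2010, 2, 25),                                      # new client
    (4, 2010, 1, None), (4, 2010, 2, 20),                  # null reading
    (5, 2010, 1, 12), (5, 2010, 2, 0),                     # recorded theft case
]
demo_fraud_ids = {5}

profiles = build_profiles(demo_records, demo_periods, demo_fraud_ids)
for row in profiles:
    print(row)
```

    Only customers 1 and 5 survive the filters in this toy input; customer 5 is kept despite one
zero reading because only profiles that are zero for the entire period are removed.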

Fig. 2. Customers’ consumptions ordered by year and Cycle_No.

• Data normalization and Feature Adjustment
    Because the attributes of the dataset have different scales, the dataset must be normalized
to be suitable for SVM training. Several normalization techniques exist; we used the min-max
technique [25]. The normalized dataset can be used by both the SVM and KNN algorithms, whereas
the non-normalized dataset can be used by KNN but not by SVM. Figure 3 shows a sample of the
normalized dataset per year season.

Fig. 3. Normalized Customers consumption per year season.

    The filtered dataset is unbalanced: 16114 customers’ profiles carry the “Non-Fraud” class
label and 647 carry the “Fraud” class label, which would result in poor classification. To
balance the dataset, we applied random sub-sampling to the larger class, yielding 647 customers’
profiles, and selected all the customers’ profiles with the “Fraud” class. The resulting training
dataset consists of 1294 labeled customers’ profiles.

D. Model Building using SVM and KNN Classifiers
    To build the models we used the dataset resulting from the data preprocessing phase, which
contains 1294 customers’ profiles.

    When choosing K=1, the nearest neighbour is x4 with the class label “Yes”; therefore the
model predicts that the class label for x1 is “Yes”, meaning that the customer is predicted as a
suspected fraud. When choosing K=3, the nearest neighbors are x3, x4, and x5 with class labels
“Yes”, “Yes”, and “No” respectively; therefore (using the majority voting technique) the
predicted class label for x1 is “Yes”, meaning that the customer is predicted as a suspected
fraud.

          TABLE III.     UNKNOWN CLASS LABEL TUPLE TO BE PREDICTED.

  i     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
  xq   20  30  30  40  20  30  30  30  20  20  30  40  40  40  30  40  30  30  30  30

          TABLE IV.      THE DISTANCE BETWEEN X1 AND THE TRAINING EXAMPLES.

                         x2         x3        x4        x5        x6
  Euclidean distance     193.641    71.225    65.207    68.586    77.679

• The Support Vector Machine (SVM)
    In practice, we can use the learned SVM model to evaluate whether a customer is a suspected
fraud or not. This is achieved by retrieving the customer’s profile from the database and
comparing it to the fraud model.

IV. EXPERIMENTS AND EVALUATION
profiles.                                                                    This section presents the experiments, results, and
• The K-Nearest Neighbours (KNN) model                                   evaluation. The goal of this research is detecting frauds using
                                                                         customers’ historical data. For this purpose, model building
    The models were built using SVM and KNN algorithms. In               phase is carried out using both the SVM and KNN classification
the dataset there are two classes of customers’ profiles (Fraud          techniques. For the experimentation purposes, the WEKA 3.7.10
and Not Fraud) the customers class labels are unbalanced where           data mining tool was used [27]. To achieve the process of model
the known fraud customers compared to normal are 2:10000 this            building, 1294 customers’ profile dataset was used for training
will result in a poor classification accuracy, balancing the dataset     the SVM and KNN models, k-fold cross-validation and holdout
is mandatory for the model to perform well, a random sub-                methods for training and testing.
sampling is used in the higher class (non-fraud) to balance the
dataset, and all filtered fraud customers are selected for use in            The model is trained using the default classifiers parameters.
the training of the models. Let us assume we have an unknown             The SVM and KNN training and testing were done using 10-fold
class label tuple as shown in Table 3. Table 4 shows the                 cross-validation and holdout methods. We analyzed the results,
Euclidian Distance measure, which presents the distance                  specifically the accuracy and a hit rate of classifying the
between x1 and some other tuples from the training data.                 customers into frauds or non-frauds classes. We recorded the
                                                                         accuracy results for each model and compared their performance
                                                                         in classifying new unseen tuples.
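As a rough illustration of how the accuracy and hit rate examined in this section follow from a confusion matrix, the Python sketch below recomputes them from the four counts reported for the SVM model under 10-fold cross-validation (Table V). The function and variable names are illustrative assumptions and are not part of WEKA, which reports these figures directly.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return accuracy, hit rate (recall) and precision for the Fraud class.

    tp: fraud records predicted as fraud
    fn: fraud records predicted as not fraud
    fp: non-fraud records predicted as fraud
    tn: non-fraud records predicted as not fraud
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    hit_rate = tp / (tp + fn)    # share of actual fraud records detected
    precision = tp / (tp + fp)   # share of predicted fraud that is fraud
    return accuracy, hit_rate, precision


# Counts taken from Table V (SVM, 10-fold cross-validation).
accuracy, hit_rate, precision = confusion_metrics(tp=394, fn=253, fp=121, tn=526)
print(f"accuracy  = {accuracy:.3f}")
print(f"hit rate  = {hit_rate:.3f}")
print(f"precision = {precision:.3f}")
```

Rounded to percentages, these values correspond to the 71.1% overall accuracy, the 61% fraud recall, and the 76.5% column accuracy listed in Table V.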

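The two evaluation protocols used in the experiments, 10-fold cross-validation and a 75%/25% holdout split, can be sketched as index partitions over the 1294 balanced profiles. This is a minimal Python sketch under stated assumptions: the helper names, the fixed random seed, and the rounding down of the training size are hypothetical choices and do not reproduce WEKA's internal splitting, which may, for instance, stratify folds by class.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and yield (train, test) index lists for k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def holdout_indices(n, train_percent, seed=0):
    """Shuffle indices 0..n-1 and split them once into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * train_percent // 100  # assumption: round the training size down
    return indices[:n_train], indices[n_train:]


n_profiles = 1294  # size of the balanced dataset described in the paper

folds = list(k_fold_indices(n_profiles, k=10))
train, test = holdout_indices(n_profiles, train_percent=75)
print(len(folds), len(folds[0][0]), len(folds[0][1]))
print(len(train), len(test))
```

Under cross-validation each profile appears in exactly one test fold, whereas the holdout split evaluates on a single 25% sample, which is one reason the two protocols can report different accuracies for the same classifier.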

• Building the SVM Model
    To build the SVM model, we used the LIBSVM 1.0.6 library [28], which is
embedded within the WEKA tool [29]. We used 10-fold cross-validation and the
holdout method with 75%-25% for training and testing respectively.

    In the first experiment, we used 10-fold cross-validation for training
and testing, which is the default parameter of the algorithm. Table 5 shows
the confusion matrix of this model. The confusion matrix shows that the SVM
model scored an accuracy of 71%: 920 records out of 1294 records were
correctly classified, and 374 records out of 1294 records were misclassified.

    In the second experiment, we changed the test option to Holdout, that is,
we split the input tuples into a separate training set and testing set. This
is intended to measure the classifier’s accuracy when changing the training
and testing parameters. The split percentage parameter was set to 75%; this
means that 75% of the original dataset is used for training and 25% for
testing. The resulting accuracy was 72.4%.

TABLE V. SVM CONFUSION MATRIX, USING CROSS VALIDATION

                             Predicted
                         Fraud    Not Fraud    Accuracy %
  Actual   Fraud          394        253           61
           Not Fraud      121        526           81
  Accuracy %              76.5       67.5          71.1

• Building the KNN Model
    To build the KNN model, we used the IBK algorithm, which is built into
the WEKA tool. We used the k-fold cross-validation and the holdout training
and testing methods. In the first experiment, the KNN model was built using
10-fold cross-validation, with the value of K iterated from 1 to 10. The
best accuracy was achieved when K is 8; the confusion matrix is shown in
Table 6.

    As we see from the resulting confusion matrix of this experiment, the KNN
accuracy is 70%. This shows that 906 (70%) records out of 1294 records are
correctly classified, while 388 (30%) records are misclassified. The
resulting confusion matrix also shows that out of the total 647 fraud records
445 (68%) are correctly classified, while 202 (32%) records are incorrectly
classified as non-fraud. In addition, out of 647 non-fraud customer records
461 (71%) are correctly classified, while 186 (29%) records are
misclassified. The result from this experiment shows that the model developed
with KNN is slightly less accurate than SVM in classifying customers’
profiles into the correct class.

    In the second experiment, we changed the test option to Holdout so that
we split the input tuples into a separate training set and testing set. This
is intended to measure the classifier’s accuracy when changing the training
and testing parameters. The split percentage parameter was set to 75%, which
means that 75% of the original dataset is used for training and 25% for
testing. The resulting accuracy was 74.3%.

TABLE VI. KNN CONFUSION MATRIX USING 8-FOLDS CV METHOD

                             Predicted
                         Fraud    Not Fraud    Accuracy %
  Actual   Fraud          445        202           69
           Not Fraud      186        461           71
  Accuracy %              70         69            70.01

    The summary of the results of the experiments for the two classification
techniques is shown in Table 7. The accuracies of SVM and KNN are 71% and 70%
respectively using the 10-fold cross-validation method for training and
testing. This shows that the total accuracy of the SVM model is slightly
higher than that of the KNN model. The fraud records correctly classified by
the SVM and KNN models are 61% and 68% respectively, which shows that the KNN
model is better at classifying fraud records than SVM.

    Using the holdout method for training and testing, the accuracies of the
SVM and KNN models are 72% and 74% respectively. This shows that the total
accuracy of the KNN model is slightly higher than that of the SVM model. The
fraud records correctly classified by the SVM and KNN models are 68% and 73%
respectively, which shows that the KNN model is better at classifying fraud
records than SVM. From the experiments, we can say that the SVM and KNN
classifiers have close performance in fraud detection.

TABLE VII. ACCURACY AND HIT RATE OF SVM AND KNN CLASSIFIERS

  Classification Method   Accuracy   Recall (Fraud)   Training & Test option
  SVM                      71.1%         61%          10-fold Cross Validation
  KNN                      70.0%         68%          10-fold Cross Validation
  SVM                      72.4%         68%          Holdout 75% training & 25% testing
  KNN                      74.3%         73%          Holdout 75% training & 25% testing

• Deployment of The Prediction System
    Figure 4 shows the main user interface for the system that implements the
model. The user is required to enter the customer number, and the system
evaluates the customer’s profile with the model to predict whether the
customer is a suspicious fraud.

V. CONCLUSION
    In this research, we applied data mining classification techniques for
the purpose of detecting customers with fraudulent behaviour in water
consumption. We used SVM and KNN classifiers to build classification models
for detecting suspicious fraud customers. The models were built using the
customers’ historical metered consumption data, following the Cross Industry
Standard Process for Data Mining (CRISP-DM).

    The data used in this research was collected from Yarmouk Water Company
(YWC) for Qasabat Irbid ROU customers; the data covers five years of
customers’ water consumption, with 1.5 million customer historical records
for 90 thousand customers. The data preparation phase took considerable
effort and time, as the data had to be pre-processed and formatted to fit the
SVM and KNN data mining classifiers.

    The conducted experiments showed that a good performance of Support
Vector Machines (SVM) and K-Nearest Neighbours (KNN) had been achieved, with
overall accuracy around 70% for both. The model hit rate is 60%-70%, which is
apparently better


than random manual inspections held by YWC teams, which have a hit rate of
around 1% in identifying fraud customers. This model introduces an
intelligent tool that can be used by YWC to detect fraud customers and reduce
their profit losses. The suggested model helps save the time and effort of
Yarmouk Water Company employees by identifying billing errors and corrupted
meters. With the use of the proposed model, water utilities can increase cost
recovery by reducing administrative Non-Technical Losses (NTLs) and
increasing the productivity of inspection staff through onsite inspections of
suspicious fraud customers.

Fig. 4. The Prediction System of the Fraud Detection.

ACKNOWLEDGMENT
    The authors would like to thank Yarmouk Water Company for providing the
data to be used for the purpose of this study.

REFERENCES
[1]  N/A, “Jordan Water Sector Facts & Figures, Ministry of Water and
     Irrigation of Jordan”. Technical Report. 2015.
[2]  N/A, “Water Reallocation Policy, Ministry of Water and Irrigation of
     Jordan”. Technical Report. 2016.
[3]  C. Ramos, A. Souza, J. Papa and A. Falcao, “Fast non-technical losses
     identification through optimum-path forest”. In Proc. of the 15th Int.
     Conf. Intelligent System Applications to Power Systems, 2009, pp. 1-5.
[4]  E. Kirkos, C. Spathis and Y. Manolopoulos, “Data mining techniques for
     the detection of fraudulent financial statements”, Expert Systems with
     Applications, 32(2007): 995–1003.
[5]  Y. Sahin and E. Duman, “Detecting credit card fraud by decision trees
     and support vector machines”, IMECS, 2011, Vol I, pp. 16 – 18.
[6]  S. Panigrahi, A. Kundu, S. Sural and A. Majumdar, “Credit card fraud
     detection: a fusion approach using Dempster–Shafer theory and Bayesian
     learning”, Information Fusion, 2009, 10(4): 354–363.
[7]  N. Carneiro, G. Figueira and M. Costa, “A data mining based system for
     credit-card fraud detection in e-tail decision support systems”,
     Decision Support Systems, 2017, 95(C): 91-101.
[8]  P. Ortega, C. Figueroa and G. Ruz, “A Medical Claim Fraud/Abuse
     Detection System based on Data Mining: A Case Study in Chile”, In Proc.
     of DMIN, 2006.
[9]  B. Kusaksizoglu, “Fraud detection in mobile communication networks
     using data mining”, Bahcesehir University, The Department of Computer
     Engineering, Master Thesis. 2006.
[10] C. Liang-Chun, H. Chien-Lung, L. Nai-Wei, Y. Kuo-Hui and L. Ping-Hsien,
     “Fraud analysis and detection for real-time messaging communications on
     social networks”, IEICE Trans. Inf. & Syst., 2017, Vol. E100–D, No.10,
     pp: 2267-2274.
[11] J. Nagi, K. Yap, S. Tiong, S. Ahmed and A. Mohammad, “Detection of
     abnormalities and electricity theft using genetic support vector
     machines”, In Proc. IEEE TENCON Region 10 Conf., 2008, pp. 1-6.
[12] J. Nagi, A. Mohammad, K. Yap, S. Tiong and S. Ahmed, “Non-Technical Loss
     Analysis For Detection Of Electricity Theft Using Support Vector
     Machines”, In Proc. IEEE 2nd International Power and Energy Conference,
     2008, pp. 907-912.
[13] J. Nagi, K. Yap, S. Tiong, S. Ahmed and M. Mohamad, “Nontechnical loss
     detection for metered customers”, IEEE Transactions on Power Delivery,
     2010, 25(2): 1162-1171.
[14] C. León, F. Biscarri, I. Monedero, J. Guerrero, J. Biscarri and
     R. Millán, “Variability and trend-based generalized rule induction model
     to NTL detection”, IEEE Transactions on Power Systems, 2011, 26(4):
     1798-1807.
[15] J. Filho, E. Gontijio, A. Delaiba, E. Mazina, J. Cabral and J. Pinto,
     “Fraud identification in electricity company customers using decision
     tree”, Systems, Man and Cybernetics, IEEE International Conference,
     2004, 4: 3730 – 3734.
[16] R. Jiang, H. Tagiris, A. Lachsz and M. Jeffrey, “Wavelet-based features
     extraction and multiple classifiers for electricity fraud detection”,
     In Proc. IEEE/PES Transmission and Distribution Conf. Exhibit. 2002.
[17] J. Cabral, J. Pinto, E. Martins and A. Pinto, “Fraud detection in high
     voltage electricity consumers”. 2008.
[18] R. De Faria, K. Ono Fonseca, B. Schneider and S. Nguang, “Collusion and
     fraud detection on electronic energy meters - a use case of forensics
     investigation procedures”, in 2014 IEEE Security and Privacy Workshops,
     pp. 65-68.
[19] J. Viegas, P. Esteves, R. Melicio, V. Mendes and S. Vieira, “Solutions
     for detection of non-technical losses in the electricity grid: a
     review”, Renewable and Sustainable Energy Reviews, 2017, 80: 1256-1268.
[20] B. Coma-Puig, J. Carmona, R. Gavaldà, S. Alcoverro and V. Martin, “Fraud
     detection in energy consumption: a supervised approach”. In Proc. IEEE
     Intl. Conf. on DSAA, 2016, pp. 120-129.
[21] C. Richardson, N. Race and P. Smith, “A privacy preserving approach to
     energy theft detection in smart grids”, 2016 IEEE International Smart
     Cities Conference (ISC2), Trento, pp. 1-4.
[22] I. Monedero, F. Biscarri, J. Guerrero, M. Roldán and C. León, “An
     Approach to Detection of Tampering in Water Meters”, In Procedia
     Computer Science, 2015, 60: pp 413-421.
[23] E. Humaid, “A data mining based fraud detection model for water
     consumption billing system in MOG”, Islamic University of Gaza, Deanery
     of higher Studies, Information Technology Program, Department of
     Computer Science, Master thesis. 2012.
[24] C. Cortes and V. Vapnik, “Support-Vector Networks”, Machine Learning,
     1995, 20(3): 273-297.
[25] J. Han, M. Kamber and J. Pei. Data mining: concepts and techniques, 3rd
     Ed, Morgan Kaufmann. 2012.
[26] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer
     and R. Wirth, “CRISP-DM 1.0: step-by-step data mining guide”, SPSS
     Inc., 2000, USA.
[27] I. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes and S. Cunningham,
     “WEKA: practical machine learning tools and techniques with java
     implementations”. In Proc. the ICONIP/ANZIIS/ANNES Workshop on Emerging
     Knowledge Engineering and Connectionist-Based Information Systems.
     1999, pp. 192–196.
[28] C. Chang and C. Lin, “LIBSVM: a library for support vector machines”.
     ACM Transactions on Intelligent Systems and Technology, 2011,
     2:27:1-27:27.
[29] Y. EL-Manzalawy and V. Honavar, “WLSVM: integrating LibSVM into WEKA
     environment”. 2005.
