Journal of Information and Computational Science                                                     ISSN: 1548-7741

              An Analysis of the Use of Machine Learning for
               Employee Attrition Prediction – A Literature
                                  Review
                                            Usha.P.M1,
                           Research Scholar, Department of CA, CS & IT,
                        Karpagam Academy of Higher Education, Coimbatore.
                                E mail Id – pmusha.72@gmail.com

                                         Dr. N.V.Balaji2,
                           Dean, Faculty of Arts, Science and Humanities,
                        Karpagam Academy of Higher Education, Coimbatore.
                               E mail Id – balajinv@karpagam.com

                                                    Abstract
               Machine learning is an application of artificial intelligence that enables
           systems to learn and act from experience without being explicitly programmed [1].
           Future events can be predicted by executing machine learning algorithms on the
           available past data [2]. Machine learning can produce labeled classifications of
           data, and it can also be used to uncover hidden structures in unlabeled data.
               The top-level management of organizations can leverage this predictive potential
           of machine learning algorithms to foretell the likelihood of an employee exiting
           the organization. This in turn helps in controlling the factors leading to
           attrition and preventing it from happening. Employee turnover is a grave challenge
           faced by employers, and retaining talent is crucial for every organization. Hence,
           if the management can obtain a predicted probability of separation of employees,
           as well as the factors influencing the separation, it can be instrumental in making
           decisions to mitigate the risk of attrition. This is where machine learning has a role
           to play. The predictions made by the machine learning algorithms will indicate the
           proactive steps the management should take to retain the employees.
               This paper reviews the studies conducted in this area to explore the
           machine learning algorithms that can be used for such predictions and the
           effectiveness of those predictions.

             Key words: Attrition, Machine learning, Predictive analytics, Classification

           1. Introduction
              “Take care of your employees and they will take care of your business”, opined Sir
           Richard Charles Nicholas Branson, a British business tycoon [3]. In the current competitive
           business environment, it is essential for an organization to invest time and effort to reduce
           employee attrition and retain its employees, who are the most valuable assets of the
           organization. High employee turnover results in considerable financial stress for the
           organization [4].
             Since the operations of all departments in current organizations are digitalized,
           abundant data are available. Human resource details such as job details,

Volume 10 Issue 3 - 2020                                1429                                             www.joics.org
           change in job, time in company and other demographic features can be extracted from the
           Human Resource Information System. This data resource can be utilized to identify the
           drivers of attrition and also to predict the chance of attrition. HR analytics is gaining
           importance in organizations. Descriptive analytics of data has long been used in
           organizations, but the current technology of predictive analytics utilizes
           machine learning techniques to predict the future from the available data [5].
              Machine learning is an application of artificial intelligence that enables systems
           to learn and act from experience without being explicitly programmed [1]. Future
           events can be predicted by executing machine learning algorithms on the available
           past data [2]. Machine learning can produce labeled classifications of data, or it
           can be used to uncover hidden structures in unlabeled data.
              The top-level management of organizations can leverage this predictive potential of
           machine learning algorithms to foretell the likelihood of an employee exiting the
           organization. This in turn helps in controlling the factors leading to attrition and
           preventing it from happening. Employee turnover is a grave challenge faced by employers,
           and retaining talent is crucial for every organization. Hence, if the management can obtain a
           predicted probability of separation of employees, as well as the factors influencing the
           separation, it can be instrumental in making decisions which mitigate the risk of attrition.
           This is where machine learning has a role to play. The predictions prepared by the machine
           learning algorithms can initiate proactive steps by top management to retain the employees.

           2. Literature review
              Rohit Punnoose and Pankaj Ajit (2016), in the article “Prediction of Employee Turnover in
           Organizations using Machine Learning Algorithms”, compare Extreme Gradient Boosting
           with other selected classification algorithms for predicting the turnover of employees.
           The study used a data set collected from the information system of the Human Resource
           department of a retailer with global operations, together with data from the Bureau
           of Labor Statistics [6]. Since the data set was labeled, supervised learning was carried out.
           The data set contained records of employees spreading over 18 years, with data
           pertaining to every quarter. 73,115 records with labels ‘active’ or ‘terminated’ formed the
           source data set. The data were preprocessed by replacing missing values with the
           mean, with the median where outliers were found, or with values chosen on the basis of
           domain expertise. Eighty percent of the data was used for training the model, and tenfold
           cross-validation was performed for each chosen algorithm. The model generated by
           training was then tested on the remaining twenty percent of the data.
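
              The preprocessing and evaluation protocol described above can be sketched as follows.
           This is only an illustration using the Python standard library, not the code of the
           reviewed study: the function names, the toy column and the random seed are assumptions
           introduced here.

```python
import random
import statistics

def impute(values, use_median=False):
    """Replace None entries with the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed) if use_median else statistics.mean(observed)
    return [fill if v is None else v for v in values]

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows reproducibly and split them into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n_rows, k=10):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    for fold in range(k):
        validation = indices[fold::k]
        held_out = set(validation)
        training = [i for i in indices if i not in held_out]
        yield training, validation

# Hypothetical toy column with one missing value and one outlier (9000).
tenure = [2, 4, None, 6, 9000]
print(impute(tenure))                   # the mean is pulled up by the outlier
print(impute(tenure, use_median=True))  # the median is robust to the outlier

rows = list(range(100))                 # stand-in for 100 employee records
train, test = train_test_split(rows)
folds = list(k_fold_indices(len(train), k=10))
print(len(train), len(test), len(folds))
```

              The toy column illustrates why the median is preferred where outliers are present:
           a single extreme value distorts the mean but leaves the median almost unchanged.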
              The classification algorithms implemented in this study are Extreme Gradient
           Boosting, Logistic Regression, Naïve Bayesian, Random Forest, K-Nearest Neighbor,
           Linear Discriminant Analysis and SVM. The researchers adopted the AUC-ROC curve as the
           tool for measuring the performance of the classification algorithms used in the
           study. A ROC curve is a graph that plots the True Positive Rate (TPR) on the
           y-axis against the False Positive Rate (FPR) on the x-axis. The AUC (area under the curve)
           gives a measure of how well the model is able to distinguish between the classes in
           the data set. Other measures of comparison are the memory consumed by the process and
           the run time of the algorithm. The results after running the algorithms on a
           MacBook are given in the table below:

                 Table I: Comparison of performance of classification algorithms

             Reprinted from “Prediction of Employee Turnover in Organizations using Machine
           Learning Algorithms” by Rohit Punnoose and Pankaj Ajit, 2016, (IJARAI) International
           Journal of Advanced Research in Artificial Intelligence, Vol. 5.
             From the table it is clear that XGBoost gives the best performance in terms of
           accuracy and memory utilization.
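
              The AUC-ROC measure used in this study can be illustrated with a small, self-contained
           sketch. It sweeps the decision threshold over the predicted scores, collects the
           (FPR, TPR) points and integrates them with the trapezoidal rule. The labels and scores
           below are invented for illustration and are not taken from the reviewed study.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    labels: 1 for the positive class (e.g. 'terminated'), 0 otherwise.
    scores: predicted probability of the positive class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit the examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples sharing the same score are processed together (ties).
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical data: three leavers (1) and three stayers (0).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))  # 8 of the 9 leaver/stayer pairs are ranked correctly
```

              An AUC of 1.0 means every positive example is scored above every negative one, while
           0.5 corresponds to random ranking.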
              Ibrahim Onuralp Yiğit and Hamed Shourabizadeh (2017), in their article “An Approach for
           Predicting Employee Churn by Using Data Mining”, compared the performance of various
           classification techniques in predicting the churn rate of employees [7]. The researchers
           worked on a fictional dataset prepared by IBM data scientists, which had 1470 records and
           35 features. Features that had no significance for attrition prediction were
           removed. An attribute such as employee ID could not be considered a variable influencing
           employee churn. Likewise, the dataset had a field Over18 for which all the employees
           had the value “Yes”. Such attributes were removed from the dataset.
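
              A minimal sketch of this kind of attribute removal is given below. It drops columns
           that hold a single value for every record, together with columns explicitly named as
           identifiers. The column names and records are hypothetical stand-ins, and the helper
           is not the procedure used by the authors.

```python
def drop_uninformative(records, identifier_columns=()):
    """Remove constant columns and the explicitly listed identifier columns."""
    constant = {
        name for name in records[0]
        if len({row[name] for row in records}) == 1
    }
    to_drop = constant | set(identifier_columns)
    return [
        {name: value for name, value in row.items() if name not in to_drop}
        for row in records
    ]

# Hypothetical records: Over18 is constant, EmployeeId is only an identifier.
records = [
    {"EmployeeId": 1, "Over18": "Yes", "Age": 41, "Attrition": "Yes"},
    {"EmployeeId": 2, "Over18": "Yes", "Age": 49, "Attrition": "No"},
    {"EmployeeId": 3, "Over18": "Yes", "Age": 37, "Attrition": "Yes"},
]
cleaned = drop_uninformative(records, identifier_columns=["EmployeeId"])
print(cleaned[0])
```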
             The comparison made by the researchers with respect to the accuracy metric is as follows:
             Table II: Comparison of accuracy of classification algorithms
               Algorithm                     Accuracy metric
               Decision tree                 0.813
               Naïve Bayes                   0.839
               Logistic Regression           0.855
               SVM                           0.887
               KNN                           0.867
               Random forest                 0.879

             Table III: The comparison with respect to precision metric
               Algorithm                    Precision
               Decision tree                0.29
               Naïve Bayes                  0.31
               Logistic Regression          0.35
               SVM                          0.51
               KNN                          0.38
               Random forest                0.45

             Table IV: The comparison with respect to recall
               Algorithm                    Recall
               Decision tree                0.43
               Naïve Bayes                  0.32
               Logistic Regression          0.31
               SVM                          0.31
               KNN                          0.23
               Random forest                0.22

              From the above tables it is evident that SVM achieved the highest accuracy and
           precision, although the decision tree achieved the highest recall. The authors also
           employed feature selection methods, including recursive feature elimination, to
           extract the relevant variables. The algorithms were applied to the dataset again after
           feature selection, and a slight improvement was shown in accuracy, precision, recall
           and F-measure. SVM remained the best performer after recursive feature elimination.
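
              The F-measure combines precision and recall into a single figure, their harmonic
           mean. As an illustration, the sketch below derives it from the precision and recall
           values listed in Tables III and IV. The resulting numbers are computed here for
           illustration and are not values reported by the reviewed article.

```python
# Precision (Table III) and recall (Table IV) as listed in this review.
precision = {
    "Decision tree": 0.29, "Naïve Bayes": 0.31, "Logistic Regression": 0.35,
    "SVM": 0.51, "KNN": 0.38, "Random forest": 0.45,
}
recall = {
    "Decision tree": 0.43, "Naïve Bayes": 0.32, "Logistic Regression": 0.31,
    "SVM": 0.31, "KNN": 0.23, "Random forest": 0.22,
}

def f_measure(p, r):
    """Harmonic mean of precision p and recall r (0 when both are 0)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

f_scores = {name: f_measure(precision[name], recall[name]) for name in precision}
for name, score in sorted(f_scores.items(), key=lambda item: -item[1]):
    print(f"{name:20s} {score:.3f}")
```

              On these derived figures SVM ranks first and the decision tree second, which shows
           how the ordering of algorithms depends on the metric chosen.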
              Rachna Jain and Anand Nayyar (2018), in their article “Predicting Employee Attrition
           using XGBoost Machine Learning Approach”, tried to predict attrition using XGBoost, a
           popular machine learning technique [8]. The OSEMN framework, which stands for
           obtaining, scrubbing, exploring, modeling and interpreting the data, was used in
           designing the project. The dataset used was prepared by IBM data scientists. Apart from
           selecting important variables from the dataset, the researchers created new
           attributes such as tenure per job, which was formed from the number of companies worked,
           and compa ratio, which was the ratio between monthly income and the midpoint of the salary
           range. A small compa ratio suggests that the employee is unlikely to be satisfied
           with the salary, which may lead to attrition. The researchers found that the attributes
           in the data set that most strongly influence attrition are age, gender, marital
           status, years at company, job satisfaction and distance from home. Glm boost and XGBoost
           techniques were used, giving an accuracy of 89% and an error rate of less than 30%.
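
              The two engineered attributes can be sketched as below. The compa ratio follows the
           definition given above. The exact formula for tenure per job is not stated in this
           review, so the version shown (total working years divided by the number of companies
           worked) is an assumption, as are the function and parameter names.

```python
def compa_ratio(monthly_income, range_minimum, range_maximum):
    """Ratio of monthly income to the midpoint of the salary range."""
    midpoint = (range_minimum + range_maximum) / 2
    return monthly_income / midpoint

def tenure_per_job(total_working_years, companies_worked):
    """Assumed definition: average number of years spent per employer."""
    # max(..., 1) avoids division by zero for a first-time employee.
    return total_working_years / max(companies_worked, 1)

# Hypothetical employee: income 4000 in a salary band of 3000 to 7000.
print(compa_ratio(4000, 3000, 7000))   # below 1.0: paid under the band midpoint
print(tenure_per_job(10, 4))
```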
              Jeel Sukhadiya, Harshal Kapadia and Prof. Mitchell D’silva (2018), in their article
           “Employee Attrition Prediction using Data Mining Techniques”, applied various algorithms
           to the same data set prepared by IBM data scientists that was used in the previous
           articles. The algorithms used in this study were Logistic Regression, Gradient Boosted
           Classifier, Support Vector Machine and Random Forest [5]. Using the Random Forest
           classifier, fifteen features were found to be the most influential in deciding the retention
           of employees, with overtime and monthly income the most prominent. One drawback is
           that employee number also figured among these fifteen features, although it is an
           identifier that should have been excluded from the list. The researchers also
           used Extreme Gradient Boosting for finding the most influential factors of attrition.
           Here monthly income was the most important. As with Random Forest, employee number again
           appeared in the list, although it should not have been considered at all. SVM,
           Extreme Gradient Boosting, Logistic Regression, Random Forest and an ensemble
           average were compared. Among these, Extreme Gradient Boosting was found to be
           the best performer. The performance was measured using Area Under Curve (0.845960).
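
              One common reading of an ensemble average is to average the attrition probabilities
           predicted by the individual models for each employee. The sketch below shows this
           interpretation. Whether the reviewed study combined its models in exactly this way is
           an assumption, and the probabilities are invented for illustration.

```python
def ensemble_average(probabilities_by_model):
    """Average the per-employee probabilities predicted by several models."""
    model_outputs = list(probabilities_by_model.values())
    n_models = len(model_outputs)
    # zip(*...) groups the predictions of all models for the same employee.
    return [sum(per_employee) / n_models for per_employee in zip(*model_outputs)]

# Hypothetical attrition probabilities for two employees from three models.
predictions = {
    "svm": [0.2, 0.8],
    "gradient_boosting": [0.4, 0.6],
    "random_forest": [0.3, 0.7],
}
averaged = ensemble_average(predictions)
flags = [p >= 0.5 for p in averaged]   # 0.5 is an assumed decision threshold
print(averaged, flags)
```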
              Dilip Singh Sisodia, Somdutta Vishwakarma and Abinash Pujahari (2017), in their study
           “Evaluation of Machine Learning Models for Employee Churn Prediction”, statistically
           analyzed data to find the factors that affected attrition of employees. The researchers
           also used machine learning algorithms to build models to predict attrition. A Kaggle
           data set having 10 attributes and 15000 records was used for the study. An analysis of the
           data revealed that promotion, workload and the time spent with the company were major
           factors affecting attrition.
              The study used KNN, Lagrangian Support Vector Machine (LSVM), Decision Tree, Naïve
           Bayes and Random Forest for building models. The models were compared for accuracy,
           precision, recall, F-measure, false positive rate, specificity and false negative rate.
             In terms of accuracy, Random Forest performs best, followed by Decision Tree, K-Nearest
           Neighbour, Naïve Bayes and finally LSVM.
             In terms of precision, the order of performance is Random Forest, Decision Tree, K-Nearest
           Neighbour, LSVM and then Naïve Bayes.
             In terms of recall, the order of performance is Decision Tree, Random Forest, K-Nearest
           Neighbour, Naïve Bayes and finally LSVM.
             In terms of F-measure, the order of performance is Random Forest, Decision Tree,
           K-Nearest Neighbour, LSVM and then Naïve Bayes.
              Heng Zhang, Lexi Xu, Xinzhou Cheng, Kun Chao and Xueqing Zhao, in their article
           “Analysis and prediction of employee turnover characteristics based on Machine
           Learning”, tried to find the factors leading to attrition of employees and
           employed a machine learning algorithm to predict attrition. This article also used the
           data set prepared by IBM data scientists. The correlations between data items were
           checked, and a high correlation was found between department and work
           role. The attributes “Standard Hours”, “Over 18”, “age”, “employee number” and
           “relationship satisfaction” made no significant contribution. Scaling was used to
           minimize the difference between the attribute values. Logistic regression was used for
           predicting the attrition, and 87.2% accuracy was achieved. The authors also observed
           that frequent business travel contributed to attrition. In the same way, employees
           with a technical degree had a high probability of attrition. The gender of employees was also
           an important factor in attrition. Model fusion employing a Regressor function in Python
           resulted in an accuracy of 89.32%, an improvement over the logistic regression result.
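
              The scaling and correlation checks mentioned above can be illustrated as follows.
           The review does not say which scaling method was used, so min-max scaling is an
           assumption here. Pearson correlation applies to numeric attributes only, and the two
           toy columns are hypothetical.

```python
import math

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]   # a constant column carries no spread
    return [(v - low) / (high - low) for v in values]

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical numeric attributes on very different scales.
monthly_income = [2000, 5000, 8000]
job_level = [1, 2, 3]
print(min_max_scale(monthly_income))
print(pearson(monthly_income, job_level))
```

              Two attributes with a correlation close to 1 carry nearly the same information, which
           is why one of a highly correlated pair is often dropped before modeling.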

              In the article “Early Prediction of Employee Attrition using Data Mining Techniques”,
           Sandeep Yadav, Aman Jain and Deepti Singh used various classification techniques for
           predicting the attrition of employees. The dataset used was extracted from the Kaggle
           website. It stored attributes about each employee, including name and other particulars
           such as details of project, department and promotion, and the majority of the fields
           were numeric. Only name, salary and department were stored as categorical variables.

              To prepare the attributes, the authors used a brute force approach, one hot
           encoding and feature selection. In the brute force approach, department was divided into
           technical and non-technical and assigned the values 0 and 1.
              In one hot encoding, salary, which was stored as a categorical variable with the
           values low, medium and high, was replaced by three attributes, salary-low, salary-
           medium and salary-high, each a numeric field with the values 0 and 1.
           Department was likewise represented as ten separate features, such as Department IT and
           Department R and D, which were numeric variables with the values 0 and 1.
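
              The two encodings described above can be sketched as below. The grouping of
           departments into technical and non-technical, and the choice of 1 for technical, are
           assumptions made for illustration; the records are hypothetical.

```python
def one_hot(records, column, categories):
    """Replace a categorical column by one 0/1 indicator column per category."""
    encoded = []
    for row in records:
        new_row = {name: value for name, value in row.items() if name != column}
        for category in categories:
            new_row[f"{column}-{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

# Assumed grouping for the brute force approach.
TECHNICAL_DEPARTMENTS = {"IT", "R and D"}

def department_flag(department):
    """Brute force encoding: 1 for technical departments, 0 otherwise."""
    return 1 if department in TECHNICAL_DEPARTMENTS else 0

records = [
    {"salary": "low", "department": "IT"},
    {"salary": "high", "department": "sales"},
]
encoded = one_hot(records, "salary", ["low", "medium", "high"])
print(encoded[0])
print([department_flag(row["department"]) for row in records])
```

              One hot encoding avoids imposing an artificial order on the categories, at the cost
           of adding one column per category.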
              Recursive Feature Elimination with cross validation was used to reduce the features.
           Redundant features were identified and eliminated by this method, so that only the minimum
           number of features having a considerable impact on the output was retained.
             A fourth approach was used to explore the causes of attrition among employees with more
           experience. Only the data of employees whose “time_spent_in_company” was above 4,
           “number_of_projects” more than 5 and “last_evaluation” higher than 0.74 were considered.
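
              The selection of experienced employees amounts to a simple filter on the three
           thresholds quoted above. The sketch below applies it to a few invented records; only
           the field names and thresholds come from the text.

```python
def experienced_employees(records):
    """Keep the records that satisfy all three experience thresholds."""
    return [
        row for row in records
        if row["time_spent_in_company"] > 4
        and row["number_of_projects"] > 5
        and row["last_evaluation"] > 0.74
    ]

# Hypothetical records; only the first one passes every threshold.
records = [
    {"time_spent_in_company": 5, "number_of_projects": 6, "last_evaluation": 0.80},
    {"time_spent_in_company": 5, "number_of_projects": 5, "last_evaluation": 0.90},
    {"time_spent_in_company": 3, "number_of_projects": 7, "last_evaluation": 0.95},
    {"time_spent_in_company": 6, "number_of_projects": 7, "last_evaluation": 0.74},
]
selected = experienced_employees(records)
print(len(selected))
```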
             The classification techniques SVM, Random Forest, Logistic Regression, Decision Tree
           and AdaBoost were compared on the attributes prepared by the brute force approach, one hot
           encoding and feature selection, and on the data of experienced employees.
             Classification of employees using the above-mentioned algorithms on features
           engineered with the brute force method showed that Random Forest gave the highest
           accuracy, 0.9863.
             Classification using the same algorithms on features engineered with the one hot
           encoding approach showed that Decision Tree had the highest accuracy, 0.9817.
              Random Forest and AdaBoost were applied to the set of features reduced using
           Recursive Feature Elimination with cross validation. Random Forest gave an accuracy
           of 0.9863 and AdaBoost 0.9583.
             When the algorithms were applied to the data of experienced employees,
           Decision Tree showed an accuracy of 0.9927.

           3. An analysis of the reviews
              Various studies on employing classification techniques to forecast the attrition of
           human resources have been reviewed. In each study, the accuracy of the attrition
           predictions was measured for all tested algorithms. It is observed that the accuracy
           attained by an algorithm varies across data sets, and that the methods adopted for
           preprocessing the data also change the accuracy level. The major constraint on
           research in this application of data analytics is the availability of data.
           Real-time data from organizations are not available, as they are highly
           confidential, so researchers are constrained to make use of the ready-made datasets
           available on Kaggle.
             The best performing classification method in each study is identified and shown in Table V.

             Table V: A comparison of classification algorithms

             Article and authors: Rohit Punnoose, Pankaj Ajit, “Prediction of Employee Turnover in
                                  Organizations using Machine Learning Algorithms”
             Data set used:       A data set collected from the Human Resource Information System of a
                                  retailer and also data from Bureau of Labor Statistics
             Algorithms tested:   Extreme Gradient Boosting, Logistic Regression, Naïve Bayesian, Random
                                  Forest, Linear Discriminant Analysis, Support Vector Machine and
                                  K-Nearest Neighbor
             Measuring tool:      AUC-ROC curve
             Result:              Best performance is put up by XGBoost (AUC 0.88) and maximum memory
                                  utilization is 12%.

             Article and authors: Ibrahim Onuralp Yiğit, Hamed Shourabizadeh, “An Approach for
                                  Predicting Employee Churn by Using Data Mining”
             Data set used:       A dataset prepared by IBM
             Algorithms tested:   Decision tree, Naïve Bayes, Logistic regression, SVM, KNN, Random forest
             Measuring tool:      Confusion matrix
             Result:              Best performance is put up by SVM (0.897) and precision (0.98)

             Article and authors: Rachna Jain and Anand Nayyar, “Predicting Employee Attrition using
                                  XGBoost Machine Learning Approach”
             Data set used:       A dataset prepared by IBM
             Algorithms tested:   XGBoost
             Measuring tool:      Confusion matrix
             Result:              Accuracy - 89.1%

             Article and authors: Jeel Sukhadiya, Harshal Kapadia, Prof. Mitchell D’silva, “Employee
                                  Attrition Prediction using Data Mining Techniques”
             Data set used:       A dataset prepared by IBM
             Algorithms tested:   Logistic regression, Support Vector Machine, Gradient boosted
                                  classifier and Random Forest
             Measuring tool:      AUC-ROC curve
             Result:              Best performance is put up by Extreme Gradient Boosting (AUC - 0.845960)

             Article and authors: Dilip Singh Sisodia, Somdutta Vishwakarma, Abinash Pujahari,
                                  “Evaluation of Machine Learning Models for Employee Churn Prediction”
             Data set used:       Data set from Kaggle
             Algorithms tested:   KNN, LSVM, Naïve Bayes, Decision Tree and Random Forest
             Measuring tool:      Confusion matrix
             Result:              Best performance is put up by Random Forest in terms of accuracy
                                  (0.9897), precision (0.9981) and F-measure (0.9933).

             Article and authors: Heng Zhang, Lexi Xu, Xinzhou Cheng, Kun Chao, Xueqing Zhao, “Analysis
                                  and prediction of employee turnover characteristics based on Machine
                                  Learning”
             Data set used:       A dataset prepared by IBM
             Algorithms tested:   Logistic Regression; model fusion using regressor function in Python
             Measuring tool:      Confusion matrix
             Result:              87.2% accuracy with Logistic Regression; 89.32% accuracy after
                                  applying model fusion

             Article and authors: Sandeep Yadav, Aman Jain, Deepti Singh, “Early Prediction of Employee
                                  Attrition using Data Mining Techniques”
             Data set used:       Data set from Kaggle
             Algorithms tested:   Logistic Regression, SVM, Random Forest, Decision Tree and AdaBoost
             Measuring tool:      Confusion matrix
             Result:              Random Forest (0.9863) when applied on data preprocessed by the brute
                                  force method; Decision Tree (0.9817) applied on data set preprocessed
                                  using the one hot encoding approach

              The studies reviewed above show that Extreme Gradient Boosting (XGBoost), SVM and
           Random Forest perform at a very high level. The selected literature also shows that
           algorithms applied to the dataset prepared by IBM Watson, which has 1470 records and
           35 attributes, seven of them categorical, report accuracy below 90%.
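
              One-hot encoding, the preprocessing approach named in the comparison above, turns
           each categorical attribute into a set of 0/1 indicator attributes so that classifiers
           expecting numeric input can use it. The following is only a minimal sketch of the idea
           in plain Python. The function name, the field names and the three toy records are
           hypothetical illustrations; they are not taken from the reviewed studies or from the
           IBM Watson dataset.

```python
def one_hot_encode(records, categorical_keys):
    """Replace each categorical field with 0/1 indicator fields."""
    # Collect the sorted distinct values observed for every categorical field.
    levels = {
        key: sorted({row[key] for row in records})
        for key in categorical_keys
    }
    encoded = []
    for row in records:
        # Fields that are not categorical are copied unchanged.
        new_row = {k: v for k, v in row.items() if k not in levels}
        for key, values in levels.items():
            for value in values:
                # One indicator field per (field, value) pair.
                new_row[f"{key}={value}"] = 1 if row[key] == value else 0
        encoded.append(new_row)
    return encoded


# Hypothetical toy records; field names and values are illustrative only.
employees = [
    {"Age": 41, "Department": "Sales", "OverTime": "Yes"},
    {"Age": 49, "Department": "Research", "OverTime": "No"},
    {"Age": 37, "Department": "Research", "OverTime": "Yes"},
]

encoded = one_hot_encode(employees, ["Department", "OverTime"])
print(encoded[0])
```

              Each categorical field with k distinct values becomes k indicator fields, so the
           encoded records have more attributes than the originals. A classifier such as a
           decision tree can then be trained on the encoded records.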

           4. Conclusion
              HR analytics using predictive techniques is an area that organizations do not yet
           utilize optimally. The studies reviewed in this article demonstrate the use of various
           classification techniques for predicting attrition. Similar techniques can be applied to
           analyze and predict the performance of human resources in an organization. More research
           is needed in HR analytics to explore its untapped areas. This can help organizations
           select human resources better and increase their productivity for the benefit of the
           organization.


           References

              1. https://expertsystem.com/machine-learning-definition/
              2. Tom Mitchell, 1997, Machine Learning, McGraw Hill.
              3. https://gordontredgold.com/take-care-of-your-staff-and-they-will-take-care-of-business/
              4. Angelo S. DeNisi, Ricky Griffin, 2005, Human Resource Management.
              5. https://competency.aicpa.org/media_resources/212508-using-predictive-analytics-in-employee-retention/detail
              6. Rohit Punnoose, Pankaj Ajit, “Prediction of Employee Turnover in Organizations
                  using Machine Learning Algorithms”, International Journal of Advanced
                  Research in Artificial Intelligence (IJARAI), Vol. 5, No. 9, 2016
              7. Ibrahim Onuralp Yiğit, Hamed Shourabizadeh, “An Approach for Predicting
                  Employee Churn by Using Data Mining”, IEEE, 2017 International Artificial
                  Intelligence and Data Processing Symposium (IDAP)
              8. Rachna Jain and Anand Nayyar, “Predicting Employee Attrition using XGBoost
                  Machine Learning Approach”, IEEE, Proceedings of the SMART–2018, IEEE
                  Conference ID: 44078, 2018 International Conference on System Modeling &
                  Advancement in Research Trends, 23rd–24th November, 2018
              9. Jeel Sukhadiya, Harshal Kapadia, Prof. Mitchell D’silva, “Employee Attrition
                  Prediction using Data Mining Techniques”, International Journal of Management,
                  Technology And Engineering, ISSN Online Number: 2249-7455
              10. Dilip Singh Sisodia, Somdutta Vishwakarma, Abinash Pujahari, “Evaluation of
                  Machine Learning Models for Employee Churn Prediction”, IEEE, Proceedings of
                  the International Conference on Inventive Computing and Informatics (ICICI
                  2017), IEEE Xplore Compliant - Part Number: CFP17L34-ART,
                  ISBN: 978-1-5386-4031-9
              11. Heng Zhang, Lexi Xu, Xinzhou Cheng, Kun Chao, Xueqing Zhao, “Analysis and
                  prediction of employee turnover characteristics based on Machine Learning”,
                  IEEE, Proceedings of The 18th International Symposium on Communications and
                  Information Technologies (ISCIT 2018)
              12. Sandeep Yadav, Aman Jain, Deepti Singh, “Early Prediction of Employee Attrition
                  using Data Mining Techniques”, IEEE, Proceedings of 2018 IEEE 8th
                  International Advance Computing Conference (IACC)
