2002 IEEE Systems and Information Design Symposium•University of Virginia

           MODELING AND DATA ANALYSIS IN THE CREDIT CARD INDUSTRY:
                    BANKRUPTCY, FRAUD, AND COLLECTIONS

           Student team: Christopher Allred, Kathryn Hite, Stephen Fonzone, Jennifer Greenspan, Josh Larew

                                          Faculty Advisor: William Scherer
                                  Department of Systems and Information Engineering

                                         Graduate Advisor: Thomas Pomroy
                                  Department of Systems and Information Engineering

                                             Client Advisor: Douglas Fuller
                                                  Providian Financial
                                                   San Francisco, CA
                                             Douglas_Fuller@Providian.com

KEYWORDS: CART, Clustering, Distressed debt, Fraudster, Identity theft, Regression, Probabilistic modeling

ABSTRACT

     Effective decisions in the modern credit card industry depend on knowledge gained through data analysis and modeling. Through the use of dynamic data-driven decision-making tools and procedures, information can be gathered to evaluate all aspects of credit card operations. Specifically, the areas of bankruptcy, fraud, and collections were examined to show the benefits that implementing such practices could provide. Methodologies ranging from Markov chains, to clustering, to rule-based decision theory were combined with tools such as CART, S+, Excel, and Access to yield such insights.

INTRODUCTION

     San Francisco-based Providian Financial prides itself on the effective use of data-driven decision-making throughout its business practices. In particular, its lending strategies tend to target the underserved market of high-risk borrowers. As with any risk-oriented venture, Providian’s business strategy requires the utmost degree of information quality and quantity. This necessitates the methodologies and tools discussed in the following sections.

CLASSIFYING FRAUDULENT TRANSACTIONS

Background

     Providian suffers significant losses every year from fraudulent transactions on its credit cards. Three main types of fraud cause the most significant losses, adding up to millions of dollars each year. The three types the accompanying analysis focused upon were lost/stolen, forged response, and non-receipt. Lost/stolen fraud occurs when a customer loses their card or has it stolen. Forged response occurs when a fraudster fills out an application pretending to be someone else with a better credit history. This typically happens after the fraudster steals someone’s personal information, which is known as identity theft. Non-receipt fraud occurs when the card is first sent to the good customer once the application is approved. The card is typically stolen in the mail, and the good customer never receives it. The following figure depicts the lifetime of a credit card and pinpoints where each instance of fraud occurs. Since forged response involves identity theft, the figure also shows when identity theft can take place.

Figure 1: Fraudulent Transaction Depiction: This graphic shows a few common ways of perpetrating credit card fraud.

     Though fraud causes significant loss, there are proportionally few cases of fraud each month compared to the total number of accounts. In the sample of data provided by Providian, only 0.34% of the accounts were fraudulent. The following table lists the number and percentage of accounts of each type in the data set.

     General Breakdown of Accounts

     Type of Account        Number of Accts    Percent of Accounts
     Non-fraudulent (N)     305,688            99.66%
     Fraudulent (L)         1,045              0.34%

Figure 2: Fraudulent Accounts: This chart shows the numeric and percentage values associated with fraudulent and non-fraudulent accounts in the data set.

Rule-Based Modeling Decisions

     There were two main phases to approaching the problem of modeling the fraud risk of individual credit transactions: (1) gathering data for a series of transaction characteristics and comparing the fraud and non-fraud account averages; (2) incorporating the characteristics which differed significantly into a risk scorecard capable of predicting the likelihood of fraud in a given transaction.

     The main achievement of this portion was the transformation of raw data into useful information. The first step in this process was to gain an understanding of the data set as a whole. General descriptive statistics are important because they provide the basic framework from which all other conclusions are derived. Moreover, such information acted as a benchmark for judging whether later conclusions made sense and fit the general data or should be reevaluated for errors. Descriptive statistics were also used to check whether smaller sub-samples of the data set were representative of the data as a whole.

     The first stage of the modeling process involved analyzing fraud and non-fraud transactions based on all available transaction data. Of the eleven variables analyzed, the following six were found to be significant: hours since transaction, number of declines, number of cash purchases, number of ATM purchases, merchant code, and transaction amount. A variable was considered significant whenever the percent difference between the average for the fraudulent population and the non-fraudulent one exceeded 30%. In the case of merchant code, significance was determined whenever the rate of fraudulent transactions for a merchant significantly exceeded the overall average rate of fraud. These characteristics form the foundation of the model, due to their potential for flagging transactions as fraudulent.

     The second phase of modeling developed a risk scorecard. This model used the significant characteristics established in phase one to generate a score for every transaction, based on that transaction’s data. This score was then used to assess the likelihood of that transaction being fraudulent. The scorecard was implemented using Visual Basic scripts in Microsoft Access. Accuracy and performance were then analyzed using Microsoft Excel.

     Each transaction was evaluated individually for each characteristic value. A point was awarded if a characteristic differed from the non-fraud average by more than 10% of the non-fraudulent standard deviation. However, points were only awarded for a deviation in the direction indicating fraud, since accounts that are statistically safer than average should not be penalized. Through iteration, a 10% deviation threshold was found to maximize the classification accuracy of the risk scorecard. In the case of merchant code, a point was simply awarded whenever the transaction occurred at a high-risk merchant code. Since there are six characteristics, any transaction could have a scorecard value from 0 to 6, depending on how many triggers that account satisfied. For example, an account
with very little time since the last transaction, making a $1 purchase at a high-risk merchant, but which had not recently made a cash purchase or ATM withdrawal or been declined, would have a score of 3.

     The following figure highlights the performance of the scorecard by breaking down the percent of fraudulent transactions which fell into each score category.

     Score     % Fraud Transactions
     0         15.38%
     1         30.47%
     2         47.25%
     3         54.05%
     4         71.90%
     5         71.36%
     6         76.92%

Figure 3: Score Fraud Frequencies: This table shows the percent of transactions that are predicted to be fraudulent at each risk scorecard value.

     If all transactions with a risk score of 3 or greater are predicted to be fraudulent, the accuracy in predicting fraudulent transactions is 60.1%. If only those transactions with scores of 4 or greater are labeled fraudulent, the accuracy level increases to 71.8%. The tradeoff is that the higher the score cutoff used, the better the accuracy for that account segment, yet the smaller the number of fraudulent accounts actually captured. The cutoff with 71.8% accuracy misclassifies fewer non-fraudulent accounts as fraudulent, but also misclassifies more fraudulent accounts as non-fraudulent. Providian would rather contact an account holder erroneously to check on suspicious purchases than let fraudulent transactions slip through. Since this second error, the false negative, is the more serious one for Providian, the cutoff with 60.1% accuracy was used to take advantage of its lower false negative rate. Therefore, any transaction with a scorecard value of 3 or greater is considered to be fraudulent, and this method identifies fraudulent transactions at 60% accuracy.

Clustering

     A clustering technique to detect the fraudulent accounts was also applied to the database of credit card accounts. The clustering procedure groups accounts according to similar characteristics using rules. These rules use the values of account characteristics to determine to which cluster an account belongs. This system was applied to the database of fraudulent accounts in an effort to classify accounts according to how likely the account is to be fraudulent. For example, one large cluster of non-fraudulent accounts consists of accounts that make a few low charges and make a payment in the first month.

     Five clusters of non-fraudulent accounts were identified with a high degree of accuracy, 99.97% or better, and the five clusters contained 28% of the total number of accounts. This result reduced the list of suspected accounts by over a quarter. Providian can not only save significantly through lower operating costs but also focus detection efforts on the remaining accounts, which have a higher probability of being fraudulent.

     Though the clustering technique could not clearly establish which accounts are fraudulent, it did quickly split the accounts into suspicious and unsuspicious groups, allowing Providian to better concentrate resources, time, and money.

COLLECTIONS

Background

     Providian’s subsidiary, First Select Corporation (FSC), is the largest credit card debt collector in the United States, purchasing billions of dollars’ worth of defaulted credit card debt each year for approximately six cents on the dollar. Accounts are collected through calls, letters, and in some cases, legal action. Throughout the payment process, FSC continually needs to decide what to do with an account: continue to attempt collections or sell the account. This makes knowing whether an account will continue to pay of the utmost importance.

Value Analysis for Distressed Credit Card Debt

     Isolating key account attributes proved the most effective way to value Providian’s distressed credit card debt portfolio. Initially, potential variables were examined graphically against the desired metrics to reveal relationships between predictor and target variables. This analysis showed that many account attributes were either positively or negatively correlated with the account’s cash flow. The most important predictor variables identified were recency and frequency of past payments. If an account made a payment in any particular month, it was determined that the account had a 90% chance of making a payment in the next two months. Correspondingly, a positive relationship between the number of past payments and
probability of future payments was also discovered. A larger number of past payments increased the probability of future payments.

     After this step was completed, a regression model identified important characteristics that have a predictive nature. Using a software regression program, S+, the p-values of many predictor variables were generated. In predicting whether an account will pay again, three predictor variables were highly significant: recency of past payments, frequency of payments over the last four months, and percentage of initial balance paid. Other variables showed significance at the 0.05 level: initial balance, balance remaining, frequency of calls made, frequency of right party contacts, status, and rollout.

     Once characteristics of accounts with predictive nature were identified, both target and predictor variables were entered into CART. CART is “the most advanced decision-tree technology for data analysis, preprocessing and predictive modeling. CART is a robust data-analysis tool that automatically searches for important patterns and relationships and quickly uncovers hidden structure even in highly complex data” [Steinberg].

Figure 4: Classification CART Tree: This figure depicts the CART tree used to develop the classification rules for the model. Each splitting node shows the criteria for that splitter and the percentage of paying and non-paying accounts that made it to that path. Each terminal node shows the number of accounts of each type that were classified in that node and the percentage of paying and non-paying accounts that make up that node.

     CART identified rules that would separate the data depending on different attributes. For example, in a model attempting to predict if an account would pay again, CART determined that most accounts that have made no more than one payment in the last five months would not pay again. This rule is an all-encompassing rule; however, the rules changed with each month that Providian owned the accounts. Accordingly, monthly rules classifying the accounts were established. This effectively formulated a methodology that could be performed on a monthly basis to separate non-paying accounts from accounts that continued to pay.

     The final methodology incorporated the rules given by CART for months 1-15. These rules, if applied every month, increase Providian’s ability to distinguish accounts that will pay again (have worth) from accounts that have stopped paying (no worth).

ANALYZING BANKRUPT ACCOUNTS

     The Providian bankruptcy data was grouped into 20 discrete states, allowing for a different form of analysis. In analyzing the bankruptcy data, the flow of an account from state to state facilitated a glimpse at the actual state transition process account holders went through. By tracing these paths, along with the expected income at each state, we are able to generate an estimate of the future value of each account.

     [Diagram: state 1 -> state 2, with transition probability p]

Figure 5: One-Step Transition: This diagram depicts the probability, p, for going from state 1 to state 2, or rather, given that the model was in state 1 in the first time period, p is the probability that the model is in state 2 in the next time period.

     The first step to tracing these paths is to create a matrix of one-step transition probabilities. Following these states over the lifetime of Providian’s bankruptcy process allows us to determine some underlying characteristics of their customers.

     The transition matrix for the bankruptcy model showed high recurrence probabilities: the tendency of an account to stay in the same state after a transition period. This is expected due to the slow nature of many
of the bankruptcy stages. Figure 6 shows the probability of staying in each of the 20 states, as well as the corresponding expected stay in each state. This can be calculated by using the formula:

     Σ n p^(n-1) (1-p) = 1/(1-p) + p

                 Recurrence Probabilities

     State     Transition Prob     Length of Stay
     1         40.1%               2.0
     2         65.3%               3.5
     3         78.0%               5.3
     4         64.3%               3.4
     5         40.0%               2.0
     6         71.5%               4.2
     7         72.2%               4.3
     8         10.7%               1.2
     9         75.8%               4.9
     10         6.1%               1.1
     11        79.4%               5.6
     12        81.9%               6.3
     13        68.6%               3.8
     14         0.0%               1.0
     15        74.2%               4.6
     16        93.3%              15.9
     17        56.7%               2.8
     18        89.6%              10.5
     19        51.9%               2.6
     20        80.0%               5.8

Figure 6: Transition Matrix Statistics: This chart quantifies the recurrence probabilities associated with the one step probability matrix, including the estimated length of stay in each state.

     Ultimately, this analysis shows us the important characteristics of the bankruptcy lifecycle. As one can see, the average consumer that enters state 16 (Bankruptcy) stays for 16 months, while states such as 10 and 1 do not have strong recurrent properties.

CONCLUSION

     Providian is constantly modifying and updating its data-driven decision network to formulate strategies which best capitalize on the opportunities of this dynamic market. By effectively using various modeling and data analysis methods, much knowledge was gained about the various aspects of Providian’s credit card operations. The insight gained on basic account operations is appreciable, because having accurate information influences everything from policy implementation to the bottom line. From bankruptcy, to fraud, to collections, our analysis proved highly beneficial to Providian.

REFERENCES

Breiman, Friedman, Olshen, Stone. Classification and Regression Trees. St. Louis: Wadsworth, 1984.

Dwyer, Robert. “Customer Lifetime Valuation to Support Marketing Decision Making.” Journal of Direct Marketing. Volume 11, Number 4 (1997): 6-13.

Lucas, Peter. “Why Recoveries are on the Rise; Scoring Models and Databases are Helping Collectors Boost Recovery Rates.” Collections & Recovery. Vol 13, No 7. October 2000. 14 October 2001. http://web.lexis-nexis.com/universe.

Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1998.

BIOGRAPHIES

Josh Larew is a fourth year Systems Engineer from Morgantown, West Virginia. When Josh is not cranking out SQL queries in Access, he can be found at the Birdwood Golf Course scrambling to make par. Next year Josh will either be working on a submarine (no joke) or be unemployed and waiting to go to law school.

Stephen Fonzone is a fourth year Systems Engineer from Allentown, Pennsylvania. When not using RTPs and PTPs to predict customer lifetime value, Steve can be found singing Springsteen and playing Super Tecmo Bowl (although not necessarily at the same time). Next year Steve will live in a van down by the river.

Kathryn Hite is a fourth year Systems Engineer from Houston, Texas. When not clustering transactions to catch fraudsters, Kathryn can be found extolling the virtues of her native state of Texas. Next year she will follow Josh wherever he may go.

Jennifer Greenspan is a fourth year Systems Engineer
from Chicago, Illinois. She spends the majority of her
time establishing and analyzing fraud triggers but can
also be seen watching Office Space and running (but
she usually watches Office Space while sitting). The
only group member to actually get a real job prior to
graduation, Jen will be working in DC for Capital One.

Christopher Allred is a fourth year Systems Engineer
from Avon, Connecticut. He can usually be found
taking any kind of data and turning it into a Markov
Chain. He has also been known to drink a lot of cider
and to be surly about staying in Charlottesville for
another year, where he will be completing his master’s
degree.
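
As a supplementary illustration of the risk scorecard described under Rule-Based Modeling Decisions, the following is a minimal Python sketch, not the team's Visual Basic implementation in Access. It awards one point per characteristic that deviates from the non-fraud average by more than 10% of the non-fraud standard deviation in the fraud-indicating direction, one point for a high-risk merchant code, and flags scores of 3 or greater. All field names, averages, standard deviations, merchant codes, and the assumed fraud direction of each characteristic are hypothetical placeholders chosen only so that the worked example from the text (little time since the last transaction, a $1 purchase, a high-risk merchant) scores 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Characteristic:
    mean: float            # non-fraud average (hypothetical value)
    std: float             # non-fraud standard deviation (hypothetical value)
    fraud_is_higher: bool  # assumed direction that indicates fraud


# Hypothetical non-fraud statistics for the five numeric characteristics.
NON_FRAUD_STATS = {
    "hours_since_transaction": Characteristic(48.0, 30.0, False),
    "num_declines": Characteristic(0.2, 0.6, True),
    "num_cash_purchases": Characteristic(0.5, 1.0, True),
    "num_atm_purchases": Characteristic(0.3, 0.8, True),
    "transaction_amount": Characteristic(60.0, 80.0, False),
}

HIGH_RISK_MERCHANT_CODES = {"9991", "9992"}  # hypothetical codes
DEVIATION_FRACTION = 0.10  # 10% of the non-fraud standard deviation
FRAUD_CUTOFF = 3           # scores of 3 or greater are flagged


def risk_score(txn: dict) -> int:
    """Return a scorecard value from 0 to 6 for one transaction."""
    score = 0
    for name, stats in NON_FRAUD_STATS.items():
        deviation = txn[name] - stats.mean
        if not stats.fraud_is_higher:
            deviation = -deviation  # measure deviation toward fraud
        if deviation > DEVIATION_FRACTION * stats.std:
            score += 1
    if txn["merchant_code"] in HIGH_RISK_MERCHANT_CODES:
        score += 1
    return score


def is_flagged(txn: dict) -> bool:
    """Apply the cutoff chosen in the text: 3 or greater is treated as fraud."""
    return risk_score(txn) >= FRAUD_CUTOFF


# Worked example from the text, with made-up numeric values.
example = {
    "hours_since_transaction": 0.5,
    "num_declines": 0,
    "num_cash_purchases": 0,
    "num_atm_purchases": 0,
    "transaction_amount": 1.00,
    "merchant_code": "9991",
}

typical = {
    "hours_since_transaction": 48.0,
    "num_declines": 0,
    "num_cash_purchases": 0,
    "num_atm_purchases": 0,
    "transaction_amount": 60.0,
    "merchant_code": "0000",
}

if __name__ == "__main__":
    print(risk_score(example), is_flagged(example))
    print(risk_score(typical), is_flagged(typical))
```

Raising FRAUD_CUTOFF to 4 corresponds to the stricter alternative discussed in the text, which flags fewer transactions and therefore lets more fraudulent ones through.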

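As a hedged illustration of the length-of-stay calculation in the bankruptcy analysis: for a state with recurrence probability p, the sum of n p^(n-1) (1-p) over n = 1, 2, ... is the mean of a geometric distribution, whose closed form is 1/(1-p). The printed right-hand side, 1/(1-p) + p, is the expression that reproduces entries of the Recurrence Probabilities table such as 15.9 for state 16 and 5.8 for state 20. The sketch below evaluates both quantities; the function names are illustrative and do not come from the original work.

```python
def geometric_mean_stay(p: float) -> float:
    """Closed form of sum over n >= 1 of n * p**(n-1) * (1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return 1.0 / (1.0 - p)


def printed_length_of_stay(p: float) -> float:
    """Right-hand side as printed with the table: 1/(1-p) + p."""
    return geometric_mean_stay(p) + p


def series_sum(p: float, terms: int = 10_000) -> float:
    """Truncated numerical evaluation of the series, for comparison."""
    return sum(n * p ** (n - 1) * (1.0 - p) for n in range(1, terms + 1))


if __name__ == "__main__":
    # Recurrence probabilities taken from the table for states 16, 20, and 14.
    for state, p in [(16, 0.933), (20, 0.800), (14, 0.000)]:
        print(
            state,
            round(series_sum(p), 3),
            round(geometric_mean_stay(p), 3),
            round(printed_length_of_stay(p), 1),
        )
```

Comparing the second and third printed columns for any state shows how far the tabulated length of stay sits above the geometric mean, a gap equal to p itself.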