Unsupervised Fraud Detection in Medicare Australia

Proceedings of the 9-th Australasian Data Mining Conference (AusDM'11), Ballarat, Australia

                Unsupervised Fraud Detection in Medicare Australia
MingJian Tang*, B. Sumudu.U. Mendis, D. Wayne Murray, Yingsong Hu and Alison Sutinen
                                                  Strategic Data Mining Section
                                                  Department of Human Services
                                             134 Reed Street, Tuggeranong 2900, ACT
                                        *
                                            ming.jian.tang@humanservices.gov.au

Copyright © 2011, Commonwealth of Australia. This paper appeared at the 9th Australasian Data Mining Conference (AusDM 2011), Ballarat, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 121. Peter Vamplew, Andrew Stranieri, Kok-Leong Ong, Peter Christen and Paul Kennedy, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

CRPIT Volume 121 - Data Mining and Analytics 2011

Abstract
Fraud detection is a fundamental data mining task with a wide range of practical applications. Finding rare and evolving fraudulent claimant behaviour in millions of electronic Medicare records poses unique challenges due to the unsupervised nature of the problem. In this paper, we investigate the problem of efficiently and effectively identifying potential non-compliant Medicare claimants in Australia. We propose an unsupervised and data-driven fraud detection system called UNISIM. It integrates various techniques, such as feature selection, clustering, pattern recognition and outlier detection. By utilising the beneficial properties of these techniques, we are able to automate the detection process. Additionally, useful temporal patterns are extracted from the existing data for future analysis. Through extensive empirical studies, UNISIM is shown to effectively identify suspects with highly irregular patterns. Additionally, it is capable of detecting groups of outliers.

Keywords: unsupervised fraud detection; health data; Hidden Markov Models; temporal pattern recognition.

1     Introduction
As a major service delivery program of the Department of Human Services, Medicare Australia (MA) looks after the health of Australians through efficient services and payments, such as the Medicare Benefit Schedule (MBS), the Pharmaceutical Benefits Scheme (PBS), the Australian Childhood Immunisation Register and the Australian Organ Donor Register. According to MA’s annual report (Medicare Australia, 2011), the PBS subsidises the cost of listed prescription medicine and the Repatriation PBS (RPBS) provides eligible veterans and war widows and widowers some additional medicines and dressings at concession rates. In 2009-2010, MA processed approximately 198 million services or $8.3 billion in benefits under the PBS and RPBS, indicating an increase of 7.8% over the previous year. As part of the Human Services portfolio, MA bears the responsibility for ensuring that public funds are used appropriately by maintaining the integrity of the programs it administers. In 2009-2010 alone, MA recovered more than $10.2 million from compliance activities. With the unprecedented growth of services and payments, MA faces new challenges with respect to efficiently and accurately detecting non-compliant patients. Patients are considered as consumers in MA since they consume certain medical resources.

   As part of MA’s integrity program, the Prescription Shopping Program aims at protecting the integrity of the PBS by identifying and reducing the number of patients obtaining medicine subsidised under the scheme in excess of medical need (Medicare Australia, 2011). Automating the process of detecting possible prescription shoppers is very challenging in nature, due to:
        Large amount of real-life medical data coupled with complex and implicit correlations.
        Noise is prevalent in real-life data, hampering the direct application of many state-of-the-art data mining techniques (garbage in, garbage out).
        Absence of holistic and standardised domain knowledge (from data miners’ point of view).
        Prescription behaviours are constantly evolving (existing predictive models or past knowledge can become obsolete).
        Minimising the number of false positives (i.e. identifying consumers as prescription shoppers when they have a legitimate medical reason for their PBS load).

   Two analytical systems have been developed in MA for facilitating efficient detection of prescription shoppers by utilising the PBS data. The work in (Ng et al. 2010) focuses on capturing the temporal (explicit) and spatial (postcode-based) aspects of consumers’ prescription behaviours, whereas the paper (Mendis et al. 2011) examines sequential prescription patterns from either a global or a localised view. Due to the complex nature of human behaviours coupled with their implicit health conditions, there can be many different fraudulent cases with peculiar behavioural patterns. With some global knowledge (e.g. ranks about consumers’ prescription history either cost-wise or quantity-wise), some cases can be easily detected, yet most of them are disguised deeply amongst genuine consumers. Considering the labour-intensive manual approach and the complex nature of these cases, it is highly unlikely that all cases can be enumerated. Therefore, an automated and adaptive detection system is needed to complement the existing systems.

1.1     Contributions and Paper Organisation
In this paper, we investigate an important real-life fraud detection problem in the domain of health care data. Due to the complex nature of the underlying data, we divide the problem into a set of sub-problems and conquer them

accordingly. We propose an unsupervised fraud detection system called UNISIM targeting in particular prescription shoppers. UNISIM comprises a number of functional components, namely a feature extractor, a cluster builder, a model constructor and an outlier detector. These components perform data mining tasks including feature selection, clustering, pattern recognition and outlier detection, and are integrated into a cohesive system. Completing such an unsupervised system is not only technically more challenging but also more desirable in practical data mining applications. We conduct extensive experiments using real-life medical claim data. The system can efficiently extract the hidden consumer patterns with respect to their temporal prescription behaviours. We also demonstrate its effectiveness in identifying potential prescription shoppers.

   The remainder of the paper proceeds as follows. In section 2, we describe the available data and formally define the problem. Section 3 presents the design of UNISIM. Technical details of UNISIM are provided in section 4. UNISIM is extensively evaluated by using real-life datasets in section 5, and section 6 concludes the paper.

2     Preliminaries

2.1    Available Data
All Australian permanent residents and certain categories of overseas visitors have access to the Medicare Program. MA pays benefits to any eligible person to cover a set proportion of their incurred medical expenses. A consumer needs to lodge his/her medical bills in terms of claims through MA in order to get the relevant benefits. MA stores these claims in transactional databases. Each transaction record holds rich information including consumer details and medical provider details (e.g. name, the date and type of service provided). Additionally, there are reference databases which contain information about various MA services. Since we are mainly interested in identifying prescription shopping consumers (i.e. fraudulently gaining access to PBS medications in excess of legitimate medical need), we focus on data pertaining to consumers and their respective PBS drug prescription details.

   In general, MA stores consumer data in three different databases, namely the consumer directory for general information, MBS claims for medical services and consultations, and PBS claims for drug prescriptions. Linking both MBS and PBS databases would be beneficial for substantiating the genuine drug needs of a consumer. Unfortunately, we are only allowed to derive data from one of them due to the existing privacy legislation. Therefore, the PBS data is chosen for the purpose of countering prescription shoppers.

   Each consumer who obtains subsidised PBS drugs is represented by at least one transaction in the PBS database. For example, each transaction may take the following form:

           (PhID, PrID, {(Item, Cost)}, Dos, Dop)

where PhID is the identifier of the pharmacy at which the drugs are supplied and PrID uniquely identifies the prescribing doctor or prescriber. The set of PBS items charged along with their respective costs to Medicare is encapsulated in {(Item, Cost)}. Dos and Dop are two time stamps recording the date of supply and the date of prescribing. Additional consumer information is available from the consumer directory including consumer ID, name, age, gender and address.

2.2   Problem Statement
In Ng et al. (2010), three classes of drugs were identified as being susceptible to abuse by prescription shoppers, namely opioids, benzodiazepines and psychostimulants. A list of drug names and their respective classes is given in table 1.

           Name                  Class
           Alprazolam            Benzodiazepine
           Clonazepam            Benzodiazepine
           Diazepam              Benzodiazepine
           Nitrazepam            Benzodiazepine
           Olanzapine            Benzodiazepine
           Oxazepam              Benzodiazepine
           Quetiapine            Benzodiazepine
           Temazepam             Benzodiazepine
           Buprenorphine         Opioids
           Codeine               Opioids
           Fentanyl              Opioids
           Hydromorphone         Opioids
           Methadone             Opioids
           Morphine              Opioids
           Oxycodone             Opioids
           Tramadol              Opioids
           Dexamphenidate        Psychostimulant
           Methylphenidate       Psychostimulant
                    Table 1: target drugs

   Besides the above highly specialised knowledge, some fragmental and intuitive indicators about typical prescription shopping can be:
        Contradicting drug prescription (e.g. sleeping tablets versus stimulative tablets).
        Visiting a diversity of doctors for similar types of drugs.
        Excessive drug quantities over a set period.
        Sudden changes in prescription behaviours.
        Recurrent large temporal gaps after getting lots of drugs.

   The main objective of this paper is to propose a workable fraud detection system. It is required to identify consumers with irregular prescription behaviours over a certain period of time (e.g. 1 to 4 years). Defining an accurate notion of irregularity, to some extent, requires inputs from domain experts, which adds extra overheads. Instead, the system needs to autonomously derive and identify such patterns. Since the eventual users of the system are mainly non-technical and business-oriented, rendering interpretable results also plays a vital part. Overall, the resultant system needs to be unsupervised and flexible due to the absence of labelled data and practicality issues.
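
As an illustration only, the transaction form of section 2.1 and one of the intuitive indicators listed in section 2.2 (visiting a diversity of doctors for similar types of drugs) might be sketched in Python as below. This is not part of UNISIM, which derives irregular patterns without hand-set rules. The `consumer_id` field, the excerpted `TARGET_CLASS` mapping, the function names and the `max_prescribers` threshold are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

# Excerpt of table 1; a full mapping would list every target drug.
TARGET_CLASS = {
    "Diazepam": "Benzodiazepine",
    "Temazepam": "Benzodiazepine",
    "Codeine": "Opioids",
    "Oxycodone": "Opioids",
    "Methylphenidate": "Psychostimulant",
}


@dataclass(frozen=True)
class Transaction:
    consumer_id: str  # assumed join key to the consumer directory
    ph_id: str        # PhID: pharmacy supplying the drugs
    pr_id: str        # PrID: prescribing doctor
    items: tuple      # {(Item, Cost)} stored as a tuple of (item, cost) pairs
    dos: date         # date of supply
    dop: date         # date of prescribing


def prescribers_per_class(transactions):
    """Map (consumer_id, drug class) to the set of distinct prescribers."""
    seen = defaultdict(set)
    for t in transactions:
        for item, _cost in t.items:
            drug_class = TARGET_CLASS.get(item)
            if drug_class is not None:  # ignore non-targeted drugs
                seen[(t.consumer_id, drug_class)].add(t.pr_id)
    return dict(seen)


def doctor_diversity_flags(transactions, max_prescribers=3):
    """List (consumer_id, class) pairs seen with more than max_prescribers doctors.

    The threshold is a hypothetical value chosen for illustration.
    """
    counts = prescribers_per_class(transactions)
    return sorted(key for key, prs in counts.items() if len(prs) > max_prescribers)
```

A fixed threshold of this kind is exactly the sort of expert-supplied rule that the problem statement tries to avoid, which is why the remainder of the paper derives such patterns from the data instead.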


                                         Figure 1: A high-level view of UNISIM
3     UNISIM - A Holistic View
The proposed system consists of several components as follows:
        Feature extractor. It harvests the PBS database and the general consumer directory for constructing and preparing featured consumer prescription data.
        Cluster builder. It examines the constructed consumer data and labels them on certain criteria (e.g. frequency of drug prescriptions or similarity of temporal prescription sequences).
        Model constructor. It learns hidden temporal patterns from the identified clusters of consumers and builds respective Hidden Markov Models (HMMs) (Rabiner 1989) for capturing consumers’ implicit prescription patterns according to their cohorts.
        Outlier detector. It generates an n-dimensional score vector for each consumer, then compares each consumer against his/her reachable peers to derive a final outlier-ness score.

   Figure 1 depicts how these components logically fit into a common framework. Potentially we can utilise different methods or techniques for each component. Such a semi-open and modularised design maximises the flexibility of the system, and changes to each component can easily be localised.

4     Technical Design
In this section, the technical details of each component are covered and discussed.

4.1    Feature Extractor
Feature extraction is a process attempting to filter out components of a data record which are irrelevant to the task at hand. Although the stored PBS data can be problematic (e.g. noise in terms of highly irregular patterns), each record is rich in features in terms of various types of attributes. Feature-rich datasets have benefits and disadvantages. On the one hand, they are beneficial for representing the underlying data with various characteristics and granularities. On the other hand, the curse of dimensionality can hamper the performance of existing data mining techniques (e.g. clustering and outlier detection). This is mainly because the projected data points become sparser as the feature dimensionality (the number of database attributes) increases. The earlier work (Ng et al. 2010 and Mendis et al. 2011) was conducted mainly on grouped consumer prescription data based on postcodes. This grouping, to a limited extent, captures the spatial correlation of consumers’ behaviours with respect to their drug prescriptions. The benefit can drown in the pool of noisy data introduced by people with different demographics.

   The overall aim of the feature extractor is to judiciously select a subset of feature attributes for representing the consumer prescription activities in a more compact way. These more compressed features can then facilitate more efficient data mining practices and lead to better results. Intuitively, consumers of similar ages may suffer similar types of illness, resulting in demands for similar drugs. Likewise, the clinical functions of certain drugs may be specific to a particular gender. Therefore, both age and gender can serve as good discriminative features.

   The temporal nature of prescription data carries essential information. As mentioned earlier, the transactional PBS data is in the form of (PhID, PrID, {(Item, Cost)}, Dos, Dop). Collectively, each consumer can have a sequence of drugs prescribed over a certain period of time (e.g. quarterly or yearly) by simply appending them together chronologically. Though such a flat structure alleviates some workload for the later processing steps, it can consequently cause the loss of temporal information. Therefore, we decide to concatenate each consumer’s temporal drug prescriptions


into a multi-set. Formally, let $I = \{i_n \mid n > 0\}$ be a set of all the available prescription drug items. A subset of these items can be organised into an itemset $X = \{j_m \mid j_m \in I, 0 \le m \le n\}$. The itemset symbolises a transaction of prescribed drugs. A consumer can accumulate a sequence of transactions over a specified period, which is denoted as $S = \langle X_n \mid n > 0 \rangle$. Eventually, a sequence is constructed and attached to each consumer.

   Feature extraction and dimensionality reduction form an important research field. Our approach relies on simple yet effective intuitive knowledge. There exist various more sophisticated methods such as Principal Component Analysis (Kirby and Sirovich 1990), Linear Discriminant Analysis (Swets and Weng 1996) and eigenvalue-based analysis (Nguten and Gopalkrishnan 2010). Each of them has benefits and disadvantages. We are planning to investigate their applicability in the future.

4.2    Cluster Builder
Clustering techniques can be used to efficiently explore the data and reduce noise. Such an explorative approach can effectively group similar consumers based on their sequential prescription activities. Mining sequential prescription patterns is challenging owing to the combinatorial nature of the problem. A density-based clustering algorithm called ApproxMAP (Kum et al. 2003) is adopted for accomplishing the task. It favours discovering approximate and long patterns over short and trivial ones. The hierarchical edit-distance is utilised for calculating the logical distances between prescription sequences of different consumers. An edit can be of type insertion, deletion or replacement. The cost is defined as the minimum cost of the editing operations required to change one sequence to the other. For example, changing (p, e, t) into (p, e, t, e, r) incurs two units of cost (i.e. two insertion operations). The cost associated with a replacement is assumed to be less than or equal to the aggregated cost of an insertion and a deletion. Formally, we denote IND() as the cost for either an insertion or a deletion and REPL() as a replacement cost. The eventual edit-distance $D(S_1, S_2)$ between two sequences $S_1 = \langle X_n \mid n > 0 \rangle$ and $S_2 = \langle Y_m \mid m > 0 \rangle$ can be computed by dynamic programming using a set of recurrence relations (Kum et al. 2003). We can then derive a normalised distance $dist(S_1, S_2)$ by dividing $D(S_1, S_2)$ by $\max(\|S_1\|, \|S_2\|)$ (i.e. the length of the longer sequence). The calculation of REPL() is based on the Sørensen coefficient (Sørensen 1957), reflecting the normalised set difference.

   Given a database of sequences $S$, the density of each sequence $S_i$ is calculated based on its k-nearest neighbours (KNN) as follows:

\[ \mathrm{density}(S_i) = \frac{l}{\|S\| \cdot d} \]

where $d = \max\{d_1, \dots, d_k\}$ is the largest distance amongst $S_i$’s KNN and $l = \|\{S_j \in S \mid dist(S_i, S_j) < d\}\|$ is the number of reachable neighbours. The clustering algorithm, called Uniform kernel KNN clustering, consists of three steps as follows (Kum et al. 2003):
        Step 1. Initialise every sequence as a cluster.
        Step 2. Merge nearest neighbours based on the density of sequences.
        Step 3. Merge based on the density of clusters.
The logical output is a set of cohorts so that consumers having similar prescription patterns over the specified period are organised into the same cluster.

4.3    Model Constructor
The main idea behind the model constructor is to model clustered prescription sequences by the stochastic process of an HMM (Rabiner 1989). The HMM is a doubly embedded stochastic process with a finite set of states governed by a set of transition probabilities. It is widely used in various applications including bioinformatics, speech recognition, and genomics (Smyth 1994 and Srivastava et al. 2008). In contrast to typical classification methods, the HMM requires no labelled data and is relatively robust in the presence of noisy data.

   A typical HMM has the following characteristics (Rabiner 1989):
        $N$ is the number of states in the model. A set of $N$ states is denoted as $H = \{h_i \mid i = 1, 2, 3, \dots, N\}$. $q_t$ represents the state at time instant $t$.
        $M$ represents the number of unique observation symbols per state, which corresponds to the physical output of the system being modelled. The set of symbols is denoted as $V = \{v_k \mid k = 1, 2, 3, \dots, M\}$.
        The state transition probability matrix $A = [a_{ij}]$, where $a_{ij} = P(q_{t+1} = h_j \mid q_t = h_i)$, $1 \le i, j \le N$ and $t > 0$. For all $i$ and $j$, we have $a_{ij} > 0$, indicating that any state can be reached from any other state in a single step.
        The observation symbol probability matrix $B = [b_j(k)]$, where $b_j(k) = P(v_k \mid h_j)$, $1 \le j \le N$, $1 \le k \le M$ and $\sum_{1 \le k \le M} b_j(k) = 1$, $1 \le j \le N$.
        The initial state probability distribution $\pi = [\pi_i]$, where $\pi_i = P(q_1 = h_i)$, $1 \le i \le N$.
        A sequence of observations $O = \{o_l \mid l = 1, 2, \dots, R\}$, where each observation $o_l$ is one of the symbols from $V$. $R$ is the number of observations in sequence $O$.

   A complete specification of an HMM requires two model parameters ($N$ and $M$) and three probability measures ($A$, $B$ and $\pi$), which can be denoted as $\lambda = (A, B, \pi)$.
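
To make the specification concrete, the following minimal Python sketch evaluates the likelihood P(O | λ) of an observation sequence under a model λ = (A, B, π) using the standard forward recursion described by Rabiner (1989). The function name, the list-of-lists representation and the toy numbers are assumptions for illustration; they are not taken from UNISIM.

```python
def forward_likelihood(A, B, pi, observations):
    """Return P(O | lambda) for lambda = (A, B, pi) via the forward recursion.

    A[i][j] : probability of moving from state i to state j
    B[j][k] : probability of emitting symbol k while in state j
    pi[i]   : probability of starting in state i
    observations : non-empty list of symbol indices in range(M)
    """
    n_states = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)


# Toy model with N = 2 states and M = 2 symbols (illustrative numbers only).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
```

For long sequences the products underflow, so a practical implementation would typically rescale alpha at each step or work with log-probabilities.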

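
Returning to the cluster builder of section 4.2, the normalised edit distance and the KNN density can be sketched as follows. This is a simplified illustration, not the ApproxMAP implementation: the unit insertion/deletion cost `INDEL`, the function names and the representation of a sequence as a list of non-empty Python sets are assumptions. The replacement cost is taken to be the normalised set difference, i.e. one minus the Sørensen coefficient.

```python
INDEL = 1.0  # assumed unit cost for one insertion or one deletion


def repl(x, y):
    """Replacement cost between two non-empty itemsets: normalised set difference."""
    x, y = set(x), set(y)
    return len(x ^ y) / (len(x) + len(y))


def edit_distance(s1, s2):
    """D(S1, S2) computed by dynamic programming over sequences of itemsets."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * INDEL
    for j in range(1, cols):
        d[0][j] = j * INDEL
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + INDEL,                           # deletion
                d[i][j - 1] + INDEL,                           # insertion
                d[i - 1][j - 1] + repl(s1[i - 1], s2[j - 1]),  # replacement
            )
    return d[-1][-1]


def dist(s1, s2):
    """Edit distance normalised by the length of the longer sequence."""
    return edit_distance(s1, s2) / max(len(s1), len(s2))


def density(sequences, i, k):
    """KNN density of sequences[i]: l / (||S|| * d).

    Assumes k < len(sequences) and that the k-th neighbour distance is non-zero.
    """
    others = sorted(dist(sequences[i], s) for j, s in enumerate(sequences) if j != i)
    d = others[k - 1]                    # largest distance among the k nearest neighbours
    l = sum(1 for x in others if x < d)  # neighbours strictly closer than d
    return l / (len(sequences) * d)
```

With single-item itemsets, changing (p, e, t) into (p, e, t, e, r) costs two insertions under these assumptions, matching the example given in section 4.2.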

   To build HMMs for capturing the common prescription behaviours from consumer cohorts, we incorporate the consumer-visiting-prescriber pattern into various states denoted as $H = \{h_1, \dots, h_N\}$, $N > 0$. Each state indicates a consumer’s visit to a unique prescriber (e.g. a first-time visit); a repeat visit is not considered unique and is mapped to the existing state based on the PrID. Therefore, a temporal visiting pattern can be organised into a sequence of various states, for instance, $(h_1, h_1, h_2, h_3, h_4, h_3)$. Considering the large number of registered prescribers, we decided to focus on the intra consumer and prescriber relations (visits) aspects of the problem. Each prescriber visit can incur certain drug prescriptions, and we can model them as physical outputs from the set $V$ of all observable outputs (e.g. all PBS drugs stored in the PBS database). Modelling all PBS drugs can make the model cumbersome and even infeasible due to two factors. Firstly, it can increase the training time dramatically. Secondly, derived probabilities associated with rare drugs can be extremely small, thus hampering the performance of the HMM. Therefore, a compressed list of drugs (observations) is utilised, which covers all targeted drugs in table 1. Additionally, we establish a generic drug observation for capturing all non-targeted ones. Based on these two sets, a sample HMM is shown in figure 2.

   The state $h_0$ is a dummy state representing the start and the end of a consumer’s temporal pattern. Likewise, $v_0$ is an artificial observation symbol for the sake of model integrity and consistency. The inference of HMMs also requires tuning parameters for three probability distributions (i.e. $A$, $B$ and $\pi$), which is essentially an optimisation problem. Given an observation sequence $O = \{o_1, o_2, \dots, o_t\}$, the objective is to estimate $\lambda = (A, B, \pi)$ so that $P(O \mid \lambda)$ is maximised. We adopt the well-known Baum-Welch algorithm (Rabiner 1989), which can be described as follows:
        Input: $O = \{o_1, o_2, \dots, o_t\}$ and a convergence threshold $\epsilon$.
        Output: $\lambda = (A, B, \pi)$
        Step 1: Let the initial model be $\lambda_0$.
        Step 2: Compute a new model $\lambda$ based on $\lambda_0$ and observation sequence $O$.
        Step 3: If $\log(P(O \mid \lambda)) - \log(P(O \mid \lambda_0)) < \epsilon$, go to Step 5.
        Step 4: Else set $\lambda_0 \leftarrow \lambda$ and go to Step 2.
        Step 5: Stop.
where we consider a uniform distribution model for initialising $\lambda_0$.

   For each consumer cluster, a profile HMM is inferred. These profiles can then be used to evaluate new consumer prescription patterns, namely given a model $\lambda_1 = (A_1, B_1, \pi_1)$ and a sequence of observations $O_{new}$, we can compute the probability (Pr) that the sequence is produced by the model. For each new consumer, we can examine his/her prescription patterns against each HMM, from which an n-dimensional scoring vector $(Pr_1, Pr_2, \dots, Pr_n)$ can be derived. The forward and backward algorithm (Rabiner 1989) is adopted to efficiently compute each probability based on the given HMM. The number n is tuneable based on the results from the cluster builder.

                          Figure 2: An auxiliary HMM

4.4     Outlier Detector
The model constructor provides a common ground for comparing prescription behaviours of different consumers. Medicare consumers, based on their temporal activities, are projected into an n-dimensional space (e.g. n consumer cohorts and their respective HMMs), and each dimension implies a featured pattern encoded as an HMM. The spatial distance between any two consumers can then be computed. Various distance metrics (Cha 2007) are available for facilitating the task. Since each HMM score implicitly conveys the likelihood of a consumer being a member of the respective cohort, we adopt the City Block distance (Cha 2007) to augment the difference between two score vectors along all dimensions. Based on these spatial distances, we can adopt outlier detection techniques to automate the process of identifying potential fraudulent consumers. The LOCI (Local Correlation Integral) (Papadimitrious et al. 2003) is adopted as the underlying outlier detector, which produces outlier-ness scores rather than binary YES or NO answers. It is a density-based approach, which is effective at discovering micro-clusters (e.g. groups of outliers). Additionally, it uses statistical reasoning (such as standard deviation) to determine the outlier-ness.

   We briefly describe some terms used in the LOCI; a detailed description of the algorithm can be found in (Papadimitrious et al. 2003).
        r-neighbourhood of an object $p_i$: the set of objects within distance $r$ of $p_i$.
        $n(p_i, \alpha r)$: the number of objects in the $\alpha r$-neighbourhood of $p_i$.
        $\tilde{n}(p_i, r, \alpha)$: the average of $n(p, \alpha r)$ over all objects $p$ in the r-neighbourhood of $p_i$.
        Multi-granularity deviation factor (MDEF) for $p_i$ at radius $r$:

\[ MDEF(p_i, r, \alpha) = 1 - \frac{n(p_i, \alpha r)}{\tilde{n}(p_i, r, \alpha)} \]


             Standard deviation of n(p, αr) over the                              experiments. Instead, we briefly discuss how suitable
             r-neighbours of pi:                                                  values can be selected.
             $\sigma_{\tilde{n}}(p_i, r, \alpha) = \sqrt{\frac{\sum_{p \in N(p_i, r)} (n(p, \alpha r) - \tilde{n}(p_i, r, \alpha))^2}{n(p_i, r)}}$
                                                                                     There is a trade-off in choosing the number of
                                                                                  clusters. On the one hand, more clusters can reflect a
                                                                                  finer granularity of the underlying cohorts and HMMs.
                                                                                  On the other hand, data is inevitably projected into
             Normalised deviation:                                                higher-dimensional spaces, which can hamper the
             $\sigma_{MDEF}(p_i, r, \alpha) = \frac{\sigma_{\tilde{n}}(p_i, r, \alpha)}{\tilde{n}(p_i, r, \alpha)}$
                                                                                  performance of the outlier detector. Additionally,
                                                                                  building more clusters incurs more overhead in terms of
                                                                                  time taken to train HMMs, data projection and spatial
                                                                                  distance calculations. We find that a value between 3 and
                                                                                  5 is experimentally sufficient to produce promising
   The above terms essentially reflect the local correlation                      results. The number of HMM states implies the number
integral with respect to each projected data point (e.g.                          of unique prescribers that a consumer visits over a year.
each consumer). Given a distance r within [rmin, rmax], we                        A value of 100 proves to be large enough, and it is almost
can compare the value of MDEF(pi, r, α) with σMDEF(pi, r,                         impossible for a consumer to visit that many doctors over
α) and evaluate how far they deviate from each other.                             a year. It can be used as an upper bound for designing the
Alternatively, both rmin and rmax can be replaced by the                          number of HMM states. Throughout various experiments,
number of neighbours for comparison so that they can be                           a few consumers are identified as visiting a large number
dynamically identified.                                                           of unique doctors (e.g. 45). Though these consumers
   The automated outlier detection process can efficiently                        represent an absolute minority, it is necessary to make the
narrow down the number of potential prescription                                  total number of allowable states large enough. As
shoppers, which enables more effective and targeted                               mentioned before, we utilise a compressed list of
manual investigation by medical experts.                                          observation symbols (e.g. prescription drugs). All
                                                                                  targeted drugs are uniquely denoted, whereas a generic
               Parameter                                   Value Range            observation symbol is defined to cover the rest of the
              No. of clusters                                  3 to 5             drugs covered by the PBS. The list can be easily extended
           No. of HMM states                                 50 to 100            to target more drugs. The number of LOCI neighbours for
    No. of HMM observation symbols                           42 to 100            comparison can be set to a range of values between 10
         No. of LOCI neighbours                               10 to 20            and 20, which indicates coverage of 100 to 200 data
           Times of deviation                                  2 to 4             objects. Finally, the value for times of deviation
                                                                                  represents the tolerance towards considering an outlying
                    Table 2: parameter setting                                    data object. It can be tuned accordingly to facilitate
                                                                                  varying investigation scope.
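
The LOCI quantities defined above, together with the "times of deviation" parameter of Table 2, can be sketched as follows. This is a simplified illustration and not the UNISIM code: it evaluates a single radius r with a naive pairwise distance scan, whereas LOCI examines a range of radii (or neighbour counts). The value alpha = 0.5, the function name loci_scores and the toy data are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def loci_scores(points: Sequence[Point], r: float, alpha: float = 0.5,
                k_sigma: float = 3.0) -> List[Tuple[float, float, float]]:
    """Return (MDEF, sigma_MDEF, outlier-ness) for each point at one radius r."""
    # n(p, alpha*r): size of each point's alpha*r-neighbourhood (includes itself).
    n_alpha = [
        sum(1 for q in points if math.dist(p, q) <= alpha * r) for p in points
    ]
    results = []
    for i, p in enumerate(points):
        # N(p_i, r): indices of objects within distance r of p_i.
        neighbours = [j for j, q in enumerate(points) if math.dist(p, q) <= r]
        # n~(p_i, r, alpha): average of n(p, alpha*r) over the r-neighbourhood.
        n_avg = sum(n_alpha[j] for j in neighbours) / len(neighbours)
        # sigma_n~(p_i, r, alpha): standard deviation over the r-neighbourhood.
        sigma_n = math.sqrt(
            sum((n_alpha[j] - n_avg) ** 2 for j in neighbours) / len(neighbours)
        )
        mdef = 1.0 - n_alpha[i] / n_avg
        sigma_mdef = sigma_n / n_avg
        # Outlier-ness: MDEF minus k_sigma times sigma_MDEF (k_sigma is the
        # "times of deviation" parameter); a positive value flags the point.
        results.append((mdef, sigma_mdef, mdef - k_sigma * sigma_mdef))
    return results


# Toy data: a 5 x 5 grid of "regular" points plus one isolated point.
points = [(float(x), float(y)) for x in range(5) for y in range(5)] + [(10.0, 10.0)]
results = loci_scores(points, r=12.0, alpha=0.5, k_sigma=3.0)
flagged = [i for i, (_, _, score) in enumerate(results) if score > 0]
```

In this toy example only the isolated point (index 25) receives a positive outlier-ness score. Raising k_sigma makes the test stricter and so narrows the set of flagged points, which corresponds to tuning the investigation scope.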
5     Experimental Results
We have conducted extensive experimental studies to                               5.1   Detecting Known Outlying Consumers
evaluate the performance of UNISIM. The system is                                 Due to the unsupervised nature of UNISIM, we first
implemented in C++ and OpenMP directives are inserted                             examine its validity through a mock-up dataset. The
wherever possible to allow the parallelising of functional                        dataset contains a sample of 1938 MA consumers along
computations. A standalone workstation hosts the system,                          with a year's worth of their prescription activities. These
which has 8 CPU cores @ 2.66GHz and 32 GB of main                                 individuals are randomly picked from the studied
memory. The underlying operating system is Ubuntu                                 demographics. Through previous studies (Ng et al. 2010
version 10.04. A year's worth of consumer data is extracted                       and Mendis et al. 2011), we obtain 20 identified outlying
from the PBS and consumer directory databases for all                             consumers resembling similar temporal behavioural
eligible Australian residents. For the purpose of this paper,                     patterns. They are deliberately injected into the sample
we further select consumers with certain demographics                             dataset. Based on the HMM scores (e.g. the likelihoods of
(e.g. male aged between 30 and 39), which has a                                   each consumer belonging to respective HMMs), we can
population of more than 300,000. We randomly choose                               treat each consumer as a data point in a 3-dimensional
1% of the sampled consumers for building cluster cohorts                          space. We plot these points in figure 3. For the sake
and training the HMMs. The training dataset size is not                           of presentation, each HMM score is multiplied by
only manageable but also sufficient for building                                  100,000. As can be seen, the majority of data points are
consumer behavioural models. This is largely because we                           crammed together, posing challenges for visual analysis.
focus on extracting common prescription patterns.                                 Furthermore, outlying consumers are generally good at
Additionally, HMMs are intrinsically robust to noisy                              disguising themselves by emulating patterns of genuine
data. Experimentally, the size of 1% is reasonably                                consumers.
well-balanced between under-training and over-fitting the                              As can be observed from table 3, all 20 outlying
HMMs. The training dataset is excluded from the                                   consumers are successfully detected by UNISIM. The
subsequent testing.                                                               outlier-ness score is calculated as the difference between
   Table 2 presents the set of parameters along with their                        MDEF and 3 times σMDEF. The bigger the score, the
value ranges required for UNISIM. Due to                                          more likely a data point can be classified as an outlier. A
departmental policy, we are restricted from revealing the exact                   value of 0.57414 is big enough to indicate further
parameter values that have been used during our                                   investigation is warranted. Experimentally, we have
                                                                                  found that the score is monotonically increasing. It is


worth noting that UNISIM is also capable of identifying a       comprised of temporal prescription data of 10,253
group of outliers. In this particular case, the group of        random consumers selected from the studied
known outlying consumers all have the same outlier-ness         demographics sample. Considering the sheer quantity of
score, which is a very appealing feature. As table 3            all involved transactions, manual investigation is deemed
shows, all 20 pre-identified fraudulent consumers belong        to be infeasible.
to one group, which can be regarded as an outlier group.            Before delving into individual results comparison
In terms of HMM scores, all these consumers are                 against each consumer, we focus on the group results
characterised by very small scores (e.g. close to               generated by UNISIM so that we can observe some
0), implying that their behavioural patterns are highly         of its attractive features. Overall, 5,489 consumers
irregular compared with common consumers (e.g.                  are identified by the system as having an outlier-ness
patterns captured during HMM training).                         score greater than 0. Such a large number shows that the
                                                                real-life data is indeed very complex (e.g. irregular
                                                                patterns from genuine consumers). Interestingly, some
                                                                groups of outliers are amongst them (e.g. same score with
                                                                similar prescription patterns). Table 4 shows 5 such
                                                                groups along with their respective outlier-ness scores and
                                                                number of members. Together they represent a population
                                                                of 3,579 consumers or around 65% of 5,489 consumers
                                                                (outlier-ness score > 0). By closely analysing the patterns
                                                                in group 1, we can observe that all its member consumers
                                                                have a one-off prescription over the chosen year. They can
                                                                be easily filtered before further investigation. The four
                                                                other groups have similar properties. The capability of
                                                                detecting micro-clusters of outliers allows us to quickly
                                                                examine a group of consumers with similar temporal
                                                                behaviours. Accordingly, we are able to effectively filter
                                                                them out before further, more costly investigation. For
                                                                example, if we set the cut-off value above 0.44901,
                                                                there is an instant reduction of 87%, leaving 693 potential
                                                                suspects. This allows the eventual business users to scope
                                                                the investigation flexibly.
               Figure 3: HMM score plot
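
The grouping of consumers with identical outlier-ness scores and the cut-off example can be checked with a short sketch. The figures in table_4, the total of 5,489 flagged consumers and the 693 remaining suspects are those reported in the text and in table 4. The helper names group_by_score and apply_cutoff, the rounding to six decimals used to decide that two scores are "the same", and the toy consumer IDs are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_score(scores: Dict[str, float], min_size: int = 2,
                   ndigits: int = 6) -> Dict[float, List[str]]:
    """Group consumer IDs whose outlier-ness scores agree after rounding."""
    groups: Dict[float, List[str]] = defaultdict(list)
    for consumer_id, score in scores.items():
        groups[round(score, ndigits)].append(consumer_id)
    # Keep only scores shared by at least min_size consumers.
    return {s: ids for s, ids in groups.items() if len(ids) >= min_size}


def apply_cutoff(scores: Dict[str, float], cutoff: float) -> Dict[str, float]:
    """Keep only consumers whose score is strictly above the cut-off."""
    return {cid: s for cid, s in scores.items() if s > cutoff}


# Reported figures: group id -> (outlier-ness score, number of consumers).
table_4 = {
    1: (0.44901, 1974),
    2: (0.449009, 656),
    3: (0.159298, 545),
    4: (0.335087, 270),
    5: (0.155503, 134),
}
flagged_total = 5489           # consumers with outlier-ness score > 0
remaining_after_cutoff = 693   # consumers left with score above 0.44901

grouped_total = sum(members for _, members in table_4.values())
grouped_share = grouped_total / flagged_total
reduction = 1 - remaining_after_cutoff / flagged_total

# Toy scores (hypothetical IDs) showing the two helpers in use.
toy_scores = {"c1": 0.57414, "c2": 0.57414, "c3": 0.12, "c4": 6.68674}
toy_groups = group_by_score(toy_scores)
toy_suspects = apply_cutoff(toy_scores, cutoff=0.44901)
```

With the reported figures, the five groups account for 3,579 of the 5,489 flagged consumers (about 65%), and retaining 693 suspects corresponds to a reduction of about 87%, in agreement with the text.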
                                                                       Group ID        Outlier-ness        Number of
                                                                                         Score             Consumers
      De-identified Consumer ID     Outlier-ness Score                     1             0.44901              1974
                   1                      0.57414                          2             0.449009              656
                   2                      0.57414                          3             0.159298              545
                 3                        0.57414                          4             0.335087              270
                 4                        0.57414                          5             0.155503              134
                 5                        0.57414
                 6                        0.57414                         Table 4: representative outlier groups
                 7                        0.57414
                 8                        0.57414                    De-identified Consumer ID          Outlier-ness Score
                 9                        0.57414                                 1                           6.68674
                 10                       0.57414                                 2                           6.65758
                 11                       0.57414                                 3                           6.53518
                 12                       0.57414                                 4                           6.46845
                 13                       0.57414                                 5                           6.35029
                 14                       0.57414                                 6                           5.97089
                 15                       0.57414                                 7                           5.97089
                 16                       0.57414                                 8                           5.71688
                 17                       0.57414                                 9                           5.32089
                 18                       0.57414                                 10                          5.27353
                 19                       0.57414
                                                                     Table 5: top 10 individual outlying consumers
                 20                       0.57414
                                                                   We further examine the top 10 individual consumers
 Table 3: known outlying consumers and their scores             and their prescription patterns, which are included in
                                                                table 5. On average, 4 different doctors have been visited
5.2     Detecting Unknown Outlying Consumers                    by these consumers. The transaction records reveal that
In this section, we study the generalised performance of        some extreme cases have multiple visits to different
UNISIM over unlabelled data. The dataset is                     doctors on one day. By looking at their prescription


drugs, we can notice that the majority of them are                Conference on Data Mining, San Francisco, California,
targeted ones (i.e. listed in table 1 as suggested by subject     USA, pp311-315.
matter experts). With such a combination, we can                Medicare Australia (2011): Medicare Australia Annual
confidently classify the consumer as suspicious and pass          Report                                     2009-2010.
the information onto the compliance division for further          http://www.humanservices.gov.au/spw/corporate/publi
investigation.                                                    cations-and-resources/annual-report/medicare/index.ht
                                                                  ml. Accessed 29 July 2011.
6     Conclusion and Future Work
                                                                Mendis, B.Sumudu.U., Murray, D.W., Sutinen, A., Tang,
The main focus of the paper is unsupervised fraud                 M.J. and Hu, Y.S. (2011): Enhancing the Identification
detection particularly targeting MA claimants with                of Anomalous Events in Medicare Consumer Data
potential prescription shopping behaviours. We propose a          Through       Classifier   Combination.    Proc.    6th
data-driven system, called UNISIM, for tackling the               International Workshop on Chance Discovery,
problem. UNISIM is comprised of comprehensive data                Barcelona, Spain, pp39-44, Springer Press.
mining components including feature extractor, cluster
builder, model constructor and outlier detector for             Ng, K.S., Shan, Y., Murray, D.W., Sutinen, A., Schwarz,
effective and efficient analysis of MA consumer data.             B., Jeacocke, D. and Farrugia, J. (2010): Detecting
Importantly, we use HMMs to encode                                Non-compliant Consumers in Spatial-Temporal Health
essential knowledge into UNISIM, enabling it to automate          Data: A Case Study from Medicare Australia. Proc.
the fraud detection process. We have demonstrated the             IEEE International Conference on Data Mining
effectiveness of UNISIM on detecting potential                    Workshops, Sydney, Australia, pp613-622, IEEE Press.
non-compliant consumers using real-life health care data.       Nguyen, H.V. and Gopalkrishnan, V. (2010): Feature
We need to emphasise that UNISIM itself serves as a               Extraction for Outlier Detection in High-Dimensional
complementary tool to assist with the subject matter              Spaces. Proc. 4th Workshop on Feature Selection in
experts. For consumers identified as obtaining large              Data Mining, Hyderabad, India, pp64-73.
quantities of PBS medications, we are still reliant on the      Papadimitriou, S., Kitagawa, H., Gibbons, P.B. and
subject matter experts to decide if they have behaved             Faloutsos, C. (2003): LOCI: Fast Outlier Detection
fraudulently.                                                     Using the Local Correlation Integral. Proc. 19th
   In the future, we are planning to experiment with              International Conference on Data Engineering
different techniques or algorithms other than the ones that       (ICDE'03), Bangalore, India, pp315-326, 2003.
have been implemented. Currently, complex real-life             Rabiner, L.R. (1989): A Tutorial on Hidden Markov
interactions, either explicit or implicit, are not the focus      Models and Selected Applications in Speech Recognition.
of UNISIM. Capturing these coupled and intriguing                 Proc. IEEE, vol. 77, no. 2, pp257-286, 1989.
relations is technically challenging yet can be beneficial
especially for identifying more professional and                Smyth, P. (1994): Markov Monitoring with Unknown
organised fraud. The HMM can be built differently (e.g.           States. IEEE Journal on Selected Areas in
to introduce contradictions). We expect to design and             Communications, vol. 12, no. 9, pp1600-1612, 1994.
implement other stochastic models for evaluating                Sørensen, T. (1957): A method of establishing groups of
consumer patterns.                                                equal amplitude in plant sociology based on similarity
                                                                  of species and its application to analyses of the
7     Acknowledgements                                            vegetation on Danish commons. Biologiske
The authors wish to thank Dr. David Jeacocke for his              Skrifter/Kongelige Danske Videnskabernes Selskab,
helpful clinical insights. We would also like to thank            vol. 5, no. 4, pp1-34, 1957.
Leonie Greenwood, Thach Van, Alex Dolan, Rory King,             Srivastava, A., Kundu, A., Sural, S. and Majumdar, A.K.
and Paul Cowan for providing timely management                    (2008): Credit Card Fraud Detection Using Hidden
support for this paper. Last but not least, we are grateful       Markov Model. IEEE Transactions on Dependable and
for invaluable comments from both reviewers for making            Secure Computing, vol. 5, no. 1, pp37-48, 2008.
any improvement on the paper possible.                          Swets, D.L. and Weng, J.Y. (1996): Using Discriminant
                                                                  eigenfeatures for image retrieval. IEEE Transactions
References                                                        on Pattern Analysis and Machine Intelligence, vol. 18,
Cha, S.H. (2007): Comprehensive Survey on                         no. 8, pp831-836, 1996.
  Distance/Similarity Measures between Probability
  Density Functions. International Journal of
  Mathematical Models and Methods in Applied
  Sciences, vol. 1, no. 4, pp300-307, 2007.
Kirby, M. and Sirovich, L. (1990): Application of the
  Karhunen-Loeve procedure for the characterization of
  human faces. IEEE Transactions on Pattern Analysis
  and Machine Intelligence, vol. 12, no. 1, pp103-108,
  1990.
Kum, H.C., Pei, J., Wang, W. and Duncan, D. (2003):
  ApproxMAP: Approximate Mining of Consensus
  Sequential Patterns. Proc. 3rd SIAM International
