Using Market Basket Analysis to Integrate and Motivate Topics in Discrete Structures

                                              Michael R. Wick and Paul J. Wagner
                                                     Department of Computer Science
                                                     University of Wisconsin-Eau Claire
                                                           Eau Claire, WI 54701
                                                   {wickmr, wagnerpj}@uwec.edu

ABSTRACT
Nearly every computer science curriculum includes a course called
“Discrete Structures” or “Discrete Mathematics”. Over the past few years,
considerable attention has been paid to this course in an attempt to
overcome the misperception by students that the material is mathematics
and not related to computer science. Most of these efforts deal with
attempting to explicitly show students the application of discrete
mathematics within computer science. We present an application that adds
to the efforts of this community by giving instructors a modern, powerful,
and elegant example to motivate student engagement in discrete structures.

Categories and Subject Descriptors
K.3.2 [Computers & Education]: Computer & Information Science Education –
Computer Science Education.

General Terms
Computer Science Education

Keywords
Market Basket Analysis, Sets, Dynamic Programming, Discrete Structures.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SIGCSE’06, March 1-5, 2006, Houston, TX, USA.
Copyright 2006 ACM 1-59593-259-3/06/0003...$5.00

1    INTRODUCTION
Recent literature in computer science education has highlighted a problem
that has plagued most instructors of discrete mathematics courses within a
computer science curriculum [4, 10]. In particular, educators report that
students perceive a significant disconnect between the topics of a
discrete structures course and the topics of the other courses within the
computer science curriculum. While not all institutions have this problem
(particularly those that emphasize a more theoretical or mathematical
approach to computer science), we have experienced it at our institution
and have attempted to modify the discrete structures course to more
explicitly connect with other courses in our curriculum.

Following the lead of other educators [4], we have modified our course by
moving it from being taught in the Mathematics department to being taught
in the Computer Science department. Further, we have infused into the
course some topics typically delegated to a theory of algorithms course
(for example, divide and conquer, and dynamic programming). Likewise, as
some have done [4], we have infused into the course some topics typically
delegated to a data structures course (for example, implementation of a
Set abstract data type). However, we have of course maintained other core
topics in the course such as formal logic, counting, and proof techniques.

Overall, we have found that the students appreciate the course more after
these changes and are more readily accepting of the potential importance
of discrete structures in computer science. However, our overall
curriculum is highly applied and as such our students tend to reserve
their most favorable impressions for those courses that solve problems
they see as directly applicable to the “real world”. Therefore, we have
worked to find a “real-world” application for inclusion in our discrete
structures course that integrates several of the topics of the course and
does so in a way that convinces the students of the value-added of each of
these topics to their overall computer science education. In particular,
we have developed lecture materials based on an approach to market basket
analysis that has significantly improved the students’ perception of the
discrete structures course as relevant to and important in their knowledge
arsenal.

2    MARKET BASKET ANALYSIS
Binary market basket analysis is a form of data mining [5] in which an
automated system attempts to find and use previously unknown associations
between items purchased from a store (binary reflects the fact that the
number of each item purchased is not recorded – just 0 for none and 1 for
at least 1). The classic example of market basket analysis is online
retail suggestive sell (like that used by major online retailers such as
amazon.com and bestbuy.com). Here, a computer program analyzes a large set
of purchase records (transactions) to find sets of items that are
frequently purchased together. Such sets are called frequent item sets.
The definition of “frequent” is based on a user-provided frequency and is
called the necessary support. Once the frequent item sets are known, a
separate process can use these associations to help suggest additional
items that a current customer might purchase. This is the suggestive sell.
Each suggestive sell rule indicates that when certain items are held in a
customer’s basket, based on past experience, there is a reasonable chance
that the customer would also be interested in purchasing some other
item(s). This definition of “reasonable chance” is based on a
user-provided threshold and is called the necessary confidence.

Let’s take a simple example. Assume that we have a log of 1000 previous
customer purchases from our online hardware store. Assume further that we
establish a minimum necessary support of 250. This means that in order for
any set of items to be considered frequent, that set of items must have
been purchased (perhaps in addition to other items at the same time) in at
least 250 of the 1000 transactions. Suppose that our transaction log
contains 300 purchases that included a hammer, a screwdriver, and a tape
measure. The set {hammer, screwdriver, tape measure} is therefore a
frequent item set as it occurs in at least the minimum number of our
transactions.

From the frequent item set {hammer, screwdriver, tape measure}, the
following rules are possible:
          Rule 1: If the customer has placed a hammer in the
          basket, suggest a screwdriver and a tape measure.
          Rule 2: If the customer has placed a screwdriver in the
          basket, suggest a hammer and a tape measure.
          Rule 3: If the customer has placed a tape measure in the
          basket, suggest a hammer and a screwdriver.
          Rule 4: If the customer has placed a hammer and a
          screwdriver in the basket, suggest a tape measure.
          Rule 5: If the customer has placed a hammer and a tape
          measure in the basket, suggest a screwdriver.
          Rule 6: If the customer has placed a screwdriver and a
          tape measure in the basket, suggest a hammer.
The decision as to which of these rules to learn is based on the
user-specified minimum necessary confidence. Let’s assume that the user
has set the confidence at 75%. This means that in order to be learned as
an association rule, the number of transactions that include both the
item(s) in the condition and the item(s) in the suggestion of the rule
must be at least 75% of the number of transactions that include at least
the item(s) in the condition of the rule. So, for example, if we had 600
transactions that include a hammer, then for Rule 1 to be learned there
must be at least 450 (75% of 600) of the 600 transactions that also
include a screwdriver and a tape measure.

Once we have all the learned rules, these rules can be used to suggest
items that are likely to be purchased with the items in a customer’s
basket.

3    THE APRIORI ALGORITHM
As you can see from above, there are several steps involved in market
basket analysis. Before a new customer ever enters the store, we must
perform the following transaction-analysis tasks.
     1.   Analyze the transactions for frequent item sets that meet
          the minimum necessary support.
     2.   Construct the candidate rules from the frequent item
          sets.
     3.   Prune the candidate rules that fail to meet the minimum
          necessary confidence.
Once a new customer enters the store and places an item into the basket,
we must perform the following basket-analysis tasks.
     1.   Find all learned rules that match the current content of
          the customer’s basket.
     2.   Select one or more rules to apply in suggesting
          additional items to the customer.
A classic algorithm for finding frequent item sets (Step 1 of the
transaction-analysis tasks) is called the Apriori algorithm [1]. In the
remainder of this paper, we primarily focus on the discrete structure
aspects of this algorithm. However, we also present examples of how the
other tasks involved in market basket analysis can be used to help connect
computer science students to topics and concepts in a discrete structures
course.

Before we dive into the details of the Apriori algorithm, it is important
to remember our goal – to show the students a “real-world” application of
the topics from discrete structures in order to help them connect those
topics to their own experiences and the content of other seemingly more
applied courses. What follows is a description of the way in which we
present market basket analysis within the discrete structures course. We
typically introduce this application as a case-study in applied discrete
structures and it is typically given in the course after the students have
already studied logic, proof, sets, and dynamic programming.

3.1 A SET THEORETIC DEFINITION OF THE APRIORI ALGORITHM
The Apriori algorithm is a set-based algorithm that is exceptionally
efficient at finding all possible frequent item sets from a set of
transactions. We present our students with the following definitions that
summarize the algorithm.

Let I = {a, b, c, …} be the set of all items available at our store
Let R ⊆ I be a transaction record
Let T = {S | S ⊆ I} be the set of all transactions
Let support(S) = the cardinality of {B | B ∈ T ∧ S ⊆ B}.
Let L1 = { {i} | i ∈ I ∧ support({i}) ≥ min_support }.
and
∀k [ (k > 1) ∧ (Lk-1 ≠ ∅) →
  Lk = {Si ∪ Sj |
          (Si ∈ Lk-1) ∧ (Sj ∈ Lk-1) ∧
          (|Si–Sj| = 1 ∧ |Sj–Si| = 1) ∧
          (∀S [((S ⊆ Si ∪ Sj) ∧ (|S| = k-1)) → S ∈ Lk-1]) ∧
          (support(Si ∪ Sj) ≥ min_support) } ]
then
L = ∪Lk is the set of all frequent item sets.

Figure 1: A Set Theoretic Definition of the Apriori Algorithm

Notice how this definition uses sets, set operations, and formal logic
together in one application. For the vast majority of our students, this
is the first time they have ever considered using
anything other than pseudo-code to describe the essence of an              2 elements common to the sets plus 1 element from the first set
algorithm. For most of them, the above definition is both obscure          and 1 element from the second set). This is important as we are
and intriguing (most of our students are excited about algorithms          attempting to find frequent item sets of size k exactly. Subsequent
and how to represent them). To remove the obscurity of the                 iterations will find larger frequent item sets if they exist. The
definition, we walk through each element.                                  second constraint, again defined as a logical implication, acts as a
The set I simply represents the set of all items we sell at our store      filter on the resulting unions. This constraint eliminates any sets
and is easily understood by the students.                                  of size k that contain even one subset of size k – 1 that is not
                                                                           contained in Lk-1. Again, after some reflection, this makes sense.
The set R represents a single transaction record and therefore             There is no way that a set of say four elements can be purchased
contains a subset of the items that we carry in our store (i.e., the       at least min_support times if a subset of three of those four
set I).                                                                    elements exists that wasn’t purchased min_support times. Notice
The set T represents the transaction log of our store. Clearly, a          that neither of these constraints requires inspecting the original set
transaction log is simply a collection of single transactions. So as       of transactions T. This is important as that set is typically
to enable this collection to form a set (and thus have no                  extremely large and often held in secondary storage, making
duplicates), we tell the students that each element of T has a             access to that set an efficiency bottleneck for the entire system.
unique id representing a transaction number. Alternatively, we             Essentially, the constraints on the members of Lk have allowed us
could introduce the concept of a bag of items.                             to generate a superset of all possible members of Lk, some of
                                                                           which can be filtered by finding infrequent subsets.
The definition of support(S) represents the number of times that
                                                                           Unfortunately, the final constraint does require inspection of the
the items in S were purchased as part of one of the records in the
                                                                           transaction set T to ensure that the remaining sets are in fact
transaction set. Therefore, support(S) is simply the cardinality of
                                                                           frequent.
the set of all sets from our transaction log (T) that include S as a
(possibly proper) subset.
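
The definitions in Figure 1, including support(S) and the three constraints on the members of Lk, can be sketched in Python roughly as follows. This is a minimal illustration under our own assumptions, not the classroom implementation: the transaction log is a plain list of frozensets (the list plays the role of the bag of records), and the names `support` and `apriori` are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transaction records that contain every item in itemset."""
    return sum(1 for record in transactions if itemset <= record)


def apriori(transactions, min_support):
    """Return the set of all frequent item sets, built one level at a time."""
    items = set().union(*transactions)
    # L1: single items purchased at least min_support times.
    level = {frozenset([item]) for item in items
             if support(frozenset([item]), transactions) >= min_support}
    frequent = set()
    k = 1
    while level:  # antecedent of the implication: L(k-1) is non-empty
        frequent |= level
        k += 1
        candidates = set()
        for s_i, s_j in combinations(level, 2):
            # First constraint: the two sets differ in exactly one element each.
            if len(s_i - s_j) != 1 or len(s_j - s_i) != 1:
                continue
            joined = s_i | s_j
            # Second constraint: every (k-1)-element subset is already in L(k-1).
            if all(frozenset(sub) in level
                   for sub in combinations(joined, k - 1)):
                candidates.add(joined)
        # Final constraint: only now is the transaction log consulted.
        level = {c for c in candidates
                 if support(c, transactions) >= min_support}
    return frequent
```

Calling `apriori(log, 250)` on a log of 1000 records would correspond to the hardware-store example with a minimum necessary support of 250. Note that the first two constraints are checked without touching the log at all; only the surviving candidates pay for a pass over the transactions.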
L1 is the set of all frequent item sets of cardinality 1. That is, it is   4    APRIORI AS DYNAMIC
the set of all single items that have been purchased at least
min_support times (keeping in mind that some of the purchases
                                                                                PROGRAMMING
may have included other items as well). As the algorithm                   In the previous section we illustrated how the Apriori algorithm
executes, each Li will hold the set of all frequent item sets of           from market basket analysis can be succinctly and correctly
cardinality i (i.e., all sets of i items that have been purchased alone    defined using sets and formal logic. In this section, we use a
or with other items at least min_support times). Therefore, the            dynamic programming approach to implement this set theoretic
entire collection of frequent item sets of any size is given by the        definition.
union of Li for all i.                                                     Dynamic programming is a wonderfully efficient approach to
That leaves the implication to discuss. Up to this point, the              solving optimization problems where problems are solved by
students readily see how the formal set definitions represent the          caching sub-problem solutions rather than re-computing them [6].
given concepts from the application. The implication, however,             At the heart of dynamic programming is the principle of
usually takes a bit more explanation than the other aspects of the         optimality which states “components of a globally optimal
overall definition. The basic idea of the implication is that the          solution are themselves globally optimal” [7]. But how can we
existence of frequent item sets of cardinality k can be determined         use dynamic programming when we are not solving an
from the existence of frequent item sets of cardinality k – 1. This        optimization problem? Or are we? An optimization problem is
makes sense if you think about it. For example, a four-element             defined as “a computational problem in which the object is to find
frequent item set must contain as a subset a three-element                 a solution in the feasible region which has the minimum (or
frequent item set since for all four items to have been purchased          maximum) value of the objective function.” [8]. If we define our
more than min_support times, certainly three of the four items             objective function as the cardinality of each ∪Lk, then our goal is
must have been purchased min_support times. Students are quick
to see that this definition is ripe for implementation using               to find the set ∪Lk that has the maximum cardinality. In this
dynamic programming (recall that we introduce this application             light, finding the frequent item sets for market basket analysis is
after they have already studied dynamic programming).                      an optimization problem.
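
The claim that a frequent item set can only be built from smaller frequent item sets follows in one step from the definition of support in Figure 1. As a short worked step (the symbol X, ranging over the transaction records in T, is our own notation):

```latex
A \subseteq B
\;\Longrightarrow\;
\{X \in T \mid B \subseteq X\} \subseteq \{X \in T \mid A \subseteq X\}
\;\Longrightarrow\;
\operatorname{support}(A) \ge \operatorname{support}(B)
```

So whenever support(B) meets min_support, every subset A of B meets it as well, which is why membership in Lk can be screened against Lk-1 before the transaction log is consulted.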

With this “big picture” as a backdrop, we then ask the students to         To be able to effectively apply dynamic programming to its
consider each part of the implication in turn. The antecedent of           solution, we must prove two properties of the problem.
the implication (k > 1) ∧ (Lk-1 ≠ ∅) indicates that the                         1)   An optimal solution to the problem of finding ∪Lk-1 is
implication holds whenever our most recently produced set of                         a subset of the optimal solution to the problem of finding
frequent item sets is non-empty. This is intuitive since you can’t
have any frequent item sets of k elements when you don’t have                        ∪Lk. This is the principle of optimality.
any frequent item sets of k-1 elements.
                                                                                2)   Every element of  ∪Lk - ∪Lk-1 can be constructed as
The consequent of the implication gives the rule for using Lk-1 to
establish Lk. In particular, the implication defines Lk as being                     the union of elements from ∪Lk-1. This is the property
constructed from the union of sets in Lk-1 subject to three                          that problems have overlapping subproblems.
constraints. The first constraint ((|Si–Sj| = 1 ∧ |Sj-Si| =                The proof of (1) is a simple proof-by-contradiction. Assume that
1)) indicates that the two sets chosen from Lk-1 must differ from          ∪Lk is optimal and there exists a set A ∈ ∪Lk-1 such that A ∉
each other in exactly one element each. This constraint ensures
that the union of the two sets produces a set with cardinality k (k –      ∪Lk. Since membership in ∪Lk-1 implies that A is a frequent
item set with k-1 elements or fewer, then we could build a larger          with a set theoretic definition of the association rules as shown in
set ∪Lk-1 ∪ A contradicting the assumption that ∪Lk is optimal.            Figure 2 (which assumes the definitions shown in Figure 1).
                                                                           Let R be the set of all association rules built
The proof of (2) uses the fact that every subset of a frequent item
                                                                           from the frequent item sets of L.
set must itself be frequent. Therefore, for every A ∈ Lk+1 we can
                                                                           Let ⟨A, C⟩ be an ordered pair representing an
find two sets B,C ∈ Lk such that | B – C | = | C – B| = 1 and A = B
                                                                           association rule with antecedent A (a set) and
∪ C.                                                                       consequent C (a set).
Given these two properties hold for the problem of producing               Let F ∈ L (a frequent item set)
∪Lk,  a dynamic programming approach to the problem is                     Let 2^F be the powerset of F.
appropriate.                                                               Then,
At this point, we engage our students in an investigation of                 R = { ⟨A, C⟩ |
appropriate data structures for our implementation. We ask our
students to inspect our set theoretic definition for operations that                   A ∈ 2^F ∧
will be required of our data structures. Almost immediately, the                       (C = F – A) ∧
students suggest using some form of a hash table for storing each
                                                                                       (A ≠ ∅) ∧
Lk. They justify this decision based on the fact that the constraint
(∀S [((S ⊆ Si ∪ Sj) ∧ (|S| = k-1)) → S ∈ Lk-1])                                        (C ≠ ∅) ∧
requires that we be able to quickly determine membership in Lk-1.                      (support(F)/support(A) ≥ confidence) }
Next, the students typically turn their collective attention to the
required set operations. The constraint (|Si–Sj| = 1 ∧ |Sj-                   Figure 2: A Set Theoretic Definition of Association Rules
Si| = 1) requires that our algorithm must be able to find all              Notice that the brute force implementation of this definition
elements of a set that overlap in all but one element each. Further,       implies that we must consider all possible subsets of each item set
the previous constraint mentioned above also requires that we be           - resulting in an exponential algorithm for association rule
able to find all subsets of size k – 1 for a set of size k. Finally, the   learning. We have found this analysis to be an excellent way to
constraint support(Si ∪ Sj) ≥ min_support requires that                    reinforce to students the value of algorithm analysis, order of
we be able to effectively determine if a given set is a subset of          magnitude functions, and big-theta estimations.
another set (i.e., is Si ∪ Sj a subset of each transaction record
in T). Developing an optimally efficient data structure for these
requirements turns out to be quite a challenge and far beyond the          5.2 AN IMPROVED APPROACH
backgrounds of our students in discrete structures (for more               Clearly, the above approach is unacceptable for even modestly
information on efficient data structures for the Apriori algorithm         sized problems. To help motivate a more effective approach,
see [2, 9]). For our purpose, we introduce the students to a bit           consider the following double-consequent association rule with
representation in which each set in our system is represented as an        two items in the antecedent and the consequent.
n-bit binary string in which n is the cardinality of I (our set of all          Rule1: If a and b then c and d.
items in our store) and a “1” in the binary string indicates that the
corresponding item is a member of this item set. While this is             Next, consider the two single-consequent rules formed from the
certainly not an optimal choice, this simple approach is intuitive         above rule.
for the students and allows them to apply logical operations in the             Rule2: If a and b then c.
implementation of our set operations.
                                                                                Rule3: If a and b then d.
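
The n-bit representation described at the end of Section 4 can be sketched in Python using plain integers as bit strings. This is only an illustration under our own assumptions: the four-item catalogue `ITEMS` and all function names are hypothetical, and each set operation from the set theoretic definition is mapped onto a single logical operation.

```python
# Hypothetical catalogue I; item at position p owns bit p of every mask.
ITEMS = ["hammer", "nails", "screwdriver", "tape measure"]
BIT = {item: 1 << position for position, item in enumerate(ITEMS)}


def to_mask(items):
    """Encode a collection of item names as an n-bit integer."""
    mask = 0
    for item in items:
        mask |= BIT[item]
    return mask


def union(a, b):
    return a | b            # bitwise OR


def difference(a, b):
    return a & ~b           # bits set in a but not in b


def is_subset(a, b):
    return (a & b) == a     # every bit of a is also set in b


def cardinality(a):
    return bin(a).count("1")


def differ_by_one_each(a, b):
    """The (|Si - Sj| = 1 and |Sj - Si| = 1) constraint on two item sets."""
    return cardinality(difference(a, b)) == 1 and cardinality(difference(b, a)) == 1
```

With this encoding, checking whether an item set is contained in a transaction record is one AND and one comparison, which is what makes the representation attractive in class even though it wastes space when the catalogue is large.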
                                                                           Clearly, Rule1 cannot meet the minimum necessary confidence
5    ASSOCIATION RULE LEARNING                                             unless both Rule2 and Rule3 meet this confidence. After all, if c
Thus far we have discussed how we use the “find frequent item              doesn’t follow sufficiently frequently from a and b, then
sets” step of market basket analysis to integrate propositional            certainly c and d will not. Notice we can therefore build
logic, sets, optimality proof, data structures, and dynamic                candidate double-consequent rules from single-consequent
programming within a discrete structures course. This section              rules, triple-consequent rules from double-consequent
explores how we use the second phase of market basket analysis             rules, and so on. This seems familiar. The generation of
(association rule learning) to reinforce these same concepts.              association rules is just another application of our dynamic
                                                                           programming approach. In fact, we use the development of
5.1 THE BRUTE FORCE APPROACH                                               the set theoretic definition of this process as a follow-up
                                                                           exercise to the discussion of the Apriori algorithm. This
Recall from Section 2 that a given frequent item set can lead to a
                                                                           built-in follow-up activity is just another of the interesting
large number of possible association rules. Also recall that not all
of these possible rules will be effective and thus we must use the         features of using market basket analysis as an integrating
confidence threshold to filter out ineffective rules. The brute            application. The resulting definition of the association rule
force approach, therefore, would be to generate all possible rules         learning phase is shown in Figure 3.
from each frequent item set and test each such rule against the
transaction set T by dividing the frequency of each rule’s antecedent and
consequent taken together by the frequency of its antecedent. This
sounds like a lot of work, but how much is it really? Let’s start
Let L = ∪k Lk
Let T = {S | S ⊆ I} be the set of all transactions.
Let ⟨A, C⟩ be an association rule with antecedent A and consequent C.
Let confid(⟨A, C⟩) = |{B | B ∈ T ∧ (A ∪ C) ⊆ B}| /
                     |{B | B ∈ T ∧ A ⊆ B}|
Let R1 = { ⟨…⟩ | F ∈ L ∧ a ∈ F ∧
           confid(F, a) ≥ min_confid }
and
∀k [ (k > 1) ∧ (Rk-1 ≠ ∅) →
  Rk = { ⟨…⟩ |
          (⟨…⟩ ∈ Rk-1) ∧
          (⟨…⟩ ∈ Rk-1) ∧
          (|Ci – Cj| = 1 ∧ |Cj – Ci| = 1) ∧
          (∀S [((S ⊆ Ci ∪ Cj) ∧ (|S| = k-1)) → ⟨…⟩ ∈ Rk-1]) ∧
          (confid(⟨…⟩) ≥ min_confid) } ]
then
R = ∪Rk is the set of all confident association rules.

Figure 3: Set Theoretic Definition of Association Rule Finding

Given space constraints, we will forgo a detailed analysis of this
definition. However, the analysis of this definition is directly analogous
to the analysis given for the set theoretic definition of the Apriori
algorithm.

6     SUMMARY AND CONCLUSION
Market-basket analysis and suggestive sell strategies are commonplace in
today’s world, especially in connection with online retail stores. As
such, students are well-motivated and engaged by the application. In this
paper we have presented ways in which the market basket analysis
application can be used to illustrate and integrate many of the typical
topics in the discrete structures course of an undergraduate computer
science curriculum. The topics involved in definition and implementation
of market-basket analysis include formal logic, sets, set operations,
power sets, proof methods, dynamic programming, algorithm analysis, and
data structures. Further, market-basket analysis can be divided into
phases that allow lecture material and classroom experiences to focus on
the application of these topics to the construction of frequent item sets
and to allow subsequent theoretical and programming assignments to apply
these same concepts to the second phase of association rule learning.

While our discussion of market-basket analysis has been focused on the use
of this application in a discrete structures course, the richness of
market-basket analysis provides a plethora of other possible uses in an
undergraduate curriculum including (but not limited to) databases for
storing learned associations, artificial intelligence for applying
association rules, and advanced data structures for efficient
implementations of sets using P-Trees and T-Trees [3].

REFERENCES
  1.   Agrawal R., Mannila H., Srikant R., Toivonen H. and Verkamo, A.I.,
       “Fast Discovery of Association Rules”, from “Advances in Knowledge
       Discovery and Data Mining”, AAAI/MIT Press, 1996, pp. 307-328.
  2.   Cerin, C., Gay, J-S., Mahec, G. and Koskas M, “Efficient Data
       Structures and Parallel Algorithms for Association Rules
       Discovery”,
       http://www-lipn.univ-paris13.fr/~cerin/documents/cerin_c_enc04.pdf
  3.   Coenen, F., Leng, P. and Ahmed, S., “Data Structure for Association
       Rule Mining: T-Trees and P-Trees”, IEEE Transactions on Knowledge
       and Data Engineering, Vol. 16, No. 6, June 2004, pp. 774-778.
  4.   Decker, A., and Ventura, P., “We Claim this Class for Computer
       Science: A Non-Mathematician’s Discrete Structures Course”; ACM
       SIGCSE Bulletin, Proceedings of the 35th SIGCSE Technical Symposium
       on Computer Science Education, Vol. 36, No. 1, March 2004,
       pp. 442-446.
  5.   Han, J. and Kamber, M, “Data Mining, Concepts and Techniques”,
       Academic Press, 2001.
  6.   National Institute of Standards and Technology,
       http://www.nist.gov/dads/HTML/dynamicprog.html
  7.   National Institute of Standards and Technology,
       http://www.nist.gov/dads/HTML/principle.html
  8.   National Institute of Standards and Technology,
       http://www.nist.gov/dads/HTML/optimization.html
  9.   Park J., Chen, M, and Yu P., “An Effective Hash-Based Algorithm for
       Mining Association Rules”, Proceedings of the 1995 ACM-SIGMOD
       International Conference on the Management of Data (SIGMOD ’95),
       San Jose, CA, May 1995, pp. 175-186.
  10.  Tomer, D. S., Baldwin, D., Smith, C. H., Henderson, P. B., &
       Vadisigi, V. (2000). CS1 and CS2: Foundations of Computer Science
       and Discrete Mathematics. Panel presented at the 31st SIGCSE
       technical symposium on Computer Science Education, Austin, Texas;
       Vol. 32, No. 1, March 2000, pp. 397-398.