Semi-Automatic Index Tuning: Keeping DBAs in the Loop


                            Karl Schnaitter                                             Neoklis Polyzotis
                              Teradata Aster                                               UC Santa Cruz
                 karl.schnaitter@teradata.com                                            alkis@ucsc.edu

ABSTRACT

   To obtain good system performance, a DBA must choose a set of indices that is
appropriate for the workload. The system can aid in this challenging task by providing
recommendations for the index configuration. We propose a new index recommendation
technique, termed semi-automatic tuning, that keeps the DBA “in the loop” by generating
recommendations that use feedback about the DBA’s preferences. The technique also works
online, which avoids the limitations of commercial tools that require the workload to be
known in advance. The foundation of our approach is the Work Function Algorithm, which
can solve a wide variety of online optimization problems with strong competitive
guarantees. We present an experimental analysis that validates the benefits of
semi-automatic tuning in a wide variety of conditions.

1    Introduction
   Index tuning, i.e., selecting indices that are appropriate for the workload, is a
crucial task for database administrators (DBAs). However, selecting the right indices is
a very difficult optimization problem: there exists a very large number of candidate
indices for a given schema, indices may benefit some parts of the workload and also incur
maintenance overhead when the data is updated, and the benefit or update cost of an index
may depend on the existence of other indices. Due to this complexity, an administrator
often resorts to automated tools that can recommend possible index configurations after
performing some type of workload analysis.
   In this paper, we introduce a novel paradigm for index tuning tools that we term
semi-automatic index tuning. A semi-automatic index tuning tool generates index
recommendations by analyzing the workload online, i.e., in parallel with query
processing, which allows the recommendations to adapt to shifts in the running workload.
The DBA may request a recommendation at any time and is responsible for selecting the
indices to create or drop. The most important and novel feature of semi-automatic tuning
is that the DBA can provide feedback on the recommendation, which is taken into account
for subsequent recommendations. In this fashion, the DBA can refine the automated
recommendations by passing indirect domain knowledge to the tuning algorithm. Overall,
the semi-automatic paradigm offers a unique combination of very desirable features: the
tuner analyzes the running workload online and thus relieves the DBA from the difficult
task of selecting a representative workload; the DBA retains total control over the
performance-critical decisions to create or drop indices; and, the feedback mechanism
couples human expertise with the computational power of an automated tuner to enable an
iterative approach to index tuning.
   We illustrate the main features of semi-automatic tuning with a simple example.
Suppose that the semi-automatic tuner recommends materializing three indices, denoted a,
b, and c. The DBA may materialize a, knowing that it has negligible overhead for the
current workload. We interpret this as implicit positive feedback for a. The DBA might
also provide explicit negative feedback on c because past experience has shown that it
interacts poorly with the locking subsystem. In addition, the DBA may provide positive
feedback for another index d that can benefit the same queries as c without the
performance problems. Based on this feedback, the tuning method can bias its
recommendations in favor of indices a, d and against index c. For instance, a subsequent
recommendation could be {a, d, e}, where e is an index that performs well with d. At the
same time, the tuning method may eventually override the DBA’s feedback and recommend
dropping some of these indices if the workload provides evidence that they do not perform
well.
Previous Work. Existing approaches to index selection fall into two paradigms, namely
offline and online. Offline techniques [3, 7] generate a recommendation by analyzing a
representative workload provided by the DBA, and let the DBA make the final selection of
indices. However, the DBA is faced with the non-trivial task of selecting a good
representative workload. This task becomes even more challenging in dynamic environments
(e.g., ad-hoc data analytics) where workload patterns can evolve over time.
   Online techniques [6, 11, 14, 15] monitor the workload and automatically create or
drop indices. Online monitoring is essential to handle dynamic workloads, and there is
less of a burden on the DBA since a representative workload is not required. On the other
hand, the DBA is now completely out of the picture. DBAs are typically very careful with
changes to a running system, so they are unlikely to favor completely automated methods.
   None of the existing index tuning techniques achieves the same combination of features
as semi-automatic tuning. Semi-automatic tuning starts with the best features from the
two paradigms (online workload analysis with decisions delegated to the DBA) and augments
them with a novel feedback mechanism that enables the DBA to interactively refine the
recommendations. We note that interactive index tuning has been explored in the
literature [8], but previous studies have focused on offline workload analysis. Our
study is the first to propose an online feedback mechanism that is     of evaluating q assuming that X is the set of materialized indices.
tightly coupled with the index recommendation engine.                  This function can be evaluated through the what-if interface
   A closer look at existing techniques also reveals that they can-    of modern optimizers. Given disjoint sets X, Y ⊆ I, we define
not easily be modified to be semi-automatic. For instance, a naive     benefit q (Y, X) = cost(q, X) − cost(q, Y ∪ X) as the differ-
approach to semi-automatic tuning would simply execute an online       ence in query cost if Y is materialized in addition to X. Note that
tuning algorithm in the background and generate recommendations        benefit q (Y, X) may be negative, if q is an update statement and Y
based on the current state of the algorithm, but this approach ig-     contains indices that need to be updated as a consequence of q.
nores the fact that the DBA may select indices that contradict the        Another source of cost comes from adding and removing ma-
recommendation. A key challenge of semi-automatic tuning is to         terialized indices. We let δ(X, Y ) denote the cost to change the
adapt the recommendations in a flexible way that balances the in-      materialized set from X to Y . This comprises the cost to create the
fluence of the workload and feedback from the DBA.                     indices in Y − X and to drop the indices in X − Y . The δ function
                                                                       satisfies the triangle inequality: δ(X, Y ) ≤ δ(X, Z) + δ(Z, Y ).
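
The transition cost δ can be sketched in a few lines. The per-index create and drop costs below are hypothetical numbers chosen only for illustration, and the function names are not from the paper; the sketch simply sums creation costs over Y − X and drop costs over X − Y, then checks the triangle inequality by brute force over all subsets.

```python
from itertools import combinations

# Hypothetical per-index costs (illustrative values, not from the paper).
CREATE_COST = {"a": 50.0, "b": 30.0, "c": 80.0}
DROP_COST = {"a": 1.0, "b": 1.0, "c": 1.0}


def delta(x, y):
    """Cost to change the materialized set from x to y."""
    x, y = frozenset(x), frozenset(y)
    create = sum(CREATE_COST[i] for i in y - x)  # indices in Y - X
    drop = sum(DROP_COST[i] for i in x - y)      # indices in X - Y
    return create + drop


def all_subsets(items):
    """Every subset of items, as frozensets."""
    items = sorted(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def triangle_inequality_holds(items):
    """Brute-force check of delta(X, Y) <= delta(X, Z) + delta(Z, Y)."""
    subsets = all_subsets(items)
    return all(delta(x, y) <= delta(x, z) + delta(z, y) + 1e-9
               for x in subsets for y in subsets for z in subsets)
```

With these numbers, delta({"a"}, {"b"}) and delta({"b"}, {"a"}) differ because a is more expensive to create than b, while the brute-force check confirms the triangle inequality for this toy cost table.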
Our Contributions. We propose the WFIT index-tuning algorithm
                                                                       However, δ is not a metric because indices are often far more ex-
that realizes the new paradigm of semi-automatic tuning. WFIT uses
                                                                       pensive to create than to drop, and hence symmetry does not hold:
a principled framework to generate recommendations that take the
                                                                       δ(X, Y) ≠ δ(Y, X) for some X, Y.
workload and user feedback into account. We can summarize the
technical contributions of this paper as follows:                      Index Interactions. A key concern for index selection is the issue
                                                                       of index interactions. Two indices a and b interact if the benefit
• We introduce the new paradigm of semi-automatic index tuning
                                                                       of a depends on the presence of b. As a typical example, a and
in Section 3. We identify the relevant design choices, provide a
                                                                       b can interact if they are intersected in a physical plan, since the
formal problem statement, and outline the requirements for an ef-
                                                                       benefit of each index may be boosted by the other. Note, however,
fective semi-automatic index advisor.
                                                                       that indices can be used together in the same query plan without
• We show that recommendations can be generated in a princi-           interacting. This scenario commonly occurs when indices are used
pled manner by an adaptation of the Work Function Algorithm [4]        to handle selection predicates on different tables.
(WFA) from the study of metrical task systems (Section 4.1). We           We employ a formal model of index interactions that is based
prove that WFA selects recommendations with a guaranteed bound         on our previous work on this topic [17]. Due to the complexity
on worst-case performance, which allows the DBA to put some            of index interactions, the model restricts its scope to some subset
faith in the recommended indices. The proof is interesting in the      J ⊆ I of interesting indices. (In our context, J is usually a set of
broader context of online optimization, since the index tuning prob-   indices that are relevant for the current workload.) The degree of
lem does not satisfy the assumptions of the original Work Function     interaction between a and b with respect to a query q is
Algorithm for metrical task systems.
                                                                       $\mathrm{doi}_q(a, b) = \max_{X \subseteq J} \left| \mathrm{benefit}_q(\{a\}, X) - \mathrm{benefit}_q(\{a\}, X \cup \{b\}) \right|.$
• We develop the WFA+ algorithm (Section 4.2) which uses a divide-
and-conquer strategy with several instances of WFA on separate in-     It is straightforward to verify the symmetry doi q (a, b) = doi q (b, a)
dex sets. We show that WFA+ leads to improved running time and         by expanding the expression of benefit q in the metric definition.
better guarantees on recommendation quality, compared to analyz-       Overall, this degree of interaction captures the amount that the ben-
ing all indices with a single instance of WFA. The guarantees of       efits of a and b affect each other. Given a workload Q, we say a, b
WFA+ are significantly stronger compared to previous works for         interact if ∃q ∈ Q : doi q (a, b) > 0, and otherwise a, b are
online database tuning [6, 11], and are thus of interest beyond the    independent.
scope of semi-automatic index selection.                                   Let {P1 , . . . , PK } denote a partition of indices in J . Each Pk
• We introduce the WFIT index-tuning algorithm that provides an        is referred to as a part. The partition is called stable if the cost
end-to-end implementation of the semi-automatic paradigm (Section      function obeys the following identity for any X ⊆ J :
5). The approach builds upon the framework of WFA+, and
                                                                       $\mathrm{cost}(q, X) = \mathrm{cost}(q, \emptyset) - \sum_{k=1}^{K} \mathrm{benefit}_q(X \cap P_k, \emptyset) \qquad (2.1)$
couples it with two additional components: a principled feedback
mechanism that is tightly integrated with the logic of WFA+ , and an   Essentially, a stable partition decomposes the benefit of a large set
online algorithm to extract candidate indices from the workload.       X into benefits of smaller sets X ∩Pk . The upshot for index tuning
                                                                       is that indices can be selected independently within each Pk , since
• We evaluate WFIT’s empirical performance using a prototype im-       indices from different parts have independent benefits. As shown
plementation over IBM DB2 (Section 6). Our results with dynamic        in [17], the stable partition with the smallest parts is given by the
workloads demonstrate that WFIT generates online index recom-          connected components of the binary relation {(a, b) | a, b interact}.
mendations of high quality, even when compared to the best in-         The same study also provides an efficient algorithm to compute the
dices that could be chosen with advance knowledge of the complete      binary relation and hence the minimum stable partition.
workload. We also show that WFIT can benefit from good feedback           In the worst case, the connected components can be quite large
in order to improve further the quality of its recommendations, but    if there are many complex index interactions. In practice, the parts
is also able to recover gracefully from bad advice.                    can be made smaller by ignoring weak interactions, i.e., index-pairs
                                                                       (a, b) where doi q (a, b) is small. Equation (2.1) might not strictly
2    Preliminaries                                                     hold in this case, but we can ensure that it provides a good approx-
General Concepts. We model the workload of a database as a             imation of the true query cost (that is still useful for index tuning)
stream of queries and updates Q. We let qn denote the n-th state-      as long as the partition accounts for the most significant index in-
ment and QN denote the prefix of length N .                            teractions. We discuss this point in more detail in Section 5.
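
The degree of interaction and the minimum stable partition can be sketched by brute force. Everything below is an illustrative assumption rather than the efficient algorithm of [17]: `cost` stands in for a what-if optimizer call, `toy_cost` is a made-up cost model, X ranges over subsets of J that exclude a and b (taken here to be the intent, since benefit is defined for disjoint sets), and `threshold` models the option of ignoring weak interactions.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def benefit(cost, q, y, x):
    """benefit_q(Y, X) = cost(q, X) - cost(q, Y ∪ X)."""
    x, y = frozenset(x), frozenset(y)
    return cost(q, x) - cost(q, x | y)


def doi(cost, q, a, b, interesting):
    """Degree of interaction of a and b for q (exponential in |J|)."""
    rest = set(interesting) - {a, b}
    return max(abs(benefit(cost, q, {a}, x) - benefit(cost, q, {a}, x | {b}))
               for x in subsets(rest))


def stable_partition(cost, workload, interesting, threshold=0.0):
    """Connected components of the relation {(a, b) | a, b interact}."""
    parent = {i: i for i in interesting}

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(sorted(interesting), 2):
        if any(doi(cost, q, a, b, interesting) > threshold for q in workload):
            parent[find(a)] = find(b)
    parts = {}
    for i in interesting:
        parts.setdefault(find(i), set()).add(i)
    return sorted((frozenset(p) for p in parts.values()), key=sorted)


def toy_cost(q, x):
    """Hypothetical what-if cost: a and b reinforce each other, c is independent."""
    c = 100.0
    if "a" in x:
        c -= 10
    if "b" in x:
        c -= 10
    if "a" in x and "b" in x:
        c -= 20
    if "c" in x:
        c -= 5
    return c
```

For the toy model, a and b land in one part and c in its own part, and identity (2.1) can be checked directly: cost(q, {a, b, c}) equals cost(q, ∅) minus the benefits of {a, b} and {c} computed separately.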
   Define I as the set of secondary indices that may be created on
the database schema. The physical database design comprises a          3    Semi-Automatic Index Tuning
subset of I that may change over time. Given a statement q ∈ Q           At a high level, a semi-automatic tuning algorithm takes as input
and set of indices X ⊆ I, we use cost(q, X) to denote the cost         the current workload and feedback from the DBA, and computes
a recommendation for the set of materialized indices. (Both in-         of indices which have received a vote after the most recent query,
puts are continuous and revealed one “element” at a time.) The          where the most recent vote was positive. Define Fc− analogously
DBA may inspect the recommendation at any time, and is solely           for negative votes. The consistency constraint requires S to contain
responsible for scheduling changes to the materialized set. The         all indices in Fc+ and no indices in Fc−, i.e., Fc+ ⊆ S ∧ S ∩ Fc− = ∅.
online analysis allows the algorithm to adapt its recommendations to       Consistency forces recommendations to agree with the DBA’s
changes in the workload or in the DBA’s preferences. Moreover,          cumulative feedback so long as the algorithm has not analyzed a
the feedback mechanism enables the DBA to pass to the algorithm         new query in the input. This property is aligned with the assump-
domain knowledge that is difficult to obtain automatically. We de-      tion that the DBA is a trusted expert. Moreover, consistency en-
velop formal definitions for these notions and for the overall prob-    ables an intuitive interface in the case of implicit feedback that
lem statement in the following subsection.                              is derived from the DBA’s actions: without the consistency con-
   We note that our focus is on the core problem of generating index    straint, it would be possible for the DBA to create an index a and
recommendations, which forms the basic component of any index           immediately receive a recommendation to drop a (an inconsistent
advisor tool. An index advisor typically includes other peripheral      recommendation) even though the workload has not changed.
components, such as a user interface to visually inspect the current       At the same time, our definition implies that Fc+ = Fc− = ∅
recommendation [10, 17] or methods to determine a materialization       when a new query arrives. This says that votes can only force
schedule for selected indices [17]. These components are mostly         changes to the recommended configuration until the next query is
orthogonal to the index-recommendation component and hence we           processed, at which time the algorithm is given the option to over-
can reuse existing implementations. Developing components that          ride the DBA’s previous feedback. Of course, the algorithm needs
are specialized for semi-automatic index tuning may be an interest-     to analyze the workload carefully before taking this option, and
ing direction for future work.                                          determine whether the recent queries provide enough evidence to
                                                                        override past feedback. Otherwise, it could appear to the DBA
3.1    Problem Formulation                                              that the system is ignoring the feedback and changing its recom-
Feedback Model. We use a simple and intuitive feedback model            mendation without proper justification. Too many changes to the
that allows the DBA to submit positive and negative votes accord-       recommendation can also hurt the theoretical performance of an
ing to current preferences. At a high level, a positive vote on index   algorithm, as we describe later.
a implies that we should favor recommendations that contain a, un-
til the workload provides sufficient evidence that a decreases per-       The Semi-Automatic Tuning Problem: Given a workload Q
formance. The converse interpretation is given for a negative vote      and a feedback stream V of pairs (F + , F − ), generate a recom-
on a. Our feedback model allows the DBA to cast several of these        mended index set S ⊆ I after each element in Q ∪ V such that S
votes simultaneously. Formally speaking, the DBA expresses new          obeys the consistency constraint.
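
The feedback model and the consistency constraint can be sketched as a small bookkeeping class. The class and method names are hypothetical; the sketch only tracks Fc+ and Fc− (the most recent vote per index since the last query), clears both when a new query arrives, and tests whether a candidate recommendation S satisfies Fc+ ⊆ S and S ∩ Fc− = ∅.

```python
class FeedbackTracker:
    """Tracks votes cast since the most recent query (hypothetical helper)."""

    def __init__(self):
        self.positive = set()  # Fc+: latest vote since last query was positive
        self.negative = set()  # Fc-: latest vote since last query was negative

    def vote(self, f_plus, f_minus):
        """Record one feedback element (F+, F-); the two sets must be disjoint."""
        f_plus, f_minus = set(f_plus), set(f_minus)
        if f_plus & f_minus:
            raise ValueError("F+ and F- must be disjoint")
        # A newer vote on an index replaces any older vote on that index.
        self.positive = (self.positive - f_minus) | f_plus
        self.negative = (self.negative - f_plus) | f_minus

    def new_query(self):
        """A new workload statement resets Fc+ and Fc- to the empty set."""
        self.positive.clear()
        self.negative.clear()

    def is_consistent(self, s):
        """Check Fc+ ⊆ S and S ∩ Fc- = ∅ for a recommendation S."""
        s = set(s)
        return self.positive <= s and not (s & self.negative)
```

Implicit feedback fits the same interface under this sketch: creating an index would be recorded as vote({index}, set()) and dropping one as vote(set(), {index}).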
preferences by providing two disjoint sets of indices F + , F − ⊆ I,       Note that user-specified storage constraints are not part of the
where indices in F + receive positive votes and indices in F − re-      problem statement. Although storage can be a concern in practice,
ceive negative votes.                                                   the recommendation size is unconstrained because it is difficult to
    We say that the DBA provides explicit feedback when they di-        answer the question “How much disk space is enough?” before see-
rectly cast votes on indices. We also allow for implicit feedback       ing the size of recommended indices. Instead, we allow the DBA
that can be derived from the manual changes that the DBA makes          to control disk usage when selecting indices from the recommenda-
to the index configuration. More concretely, we can infer a pos-        tion.1 To validate our choice, we conducted a small survey among
itive vote when an index is created and a negative vote when an         DBAs of real-world installations. The DBAs were asked whether
index is dropped. The use of implicit feedback realizes an unobtru-     they would prefer to specify a space budget for materialized in-
sive mechanism for automated tuning, where the tuning algorithm         dices, or to hand-pick indices from a recommendation of arbitrary
tailors its recommendations to the DBA’s actions even if the DBA        size. The answers were overwhelmingly in favor of the second op-
operates “out-of-band”, i.e., without explicit communication with       tion. One characteristic response said “Prefer hand-pick from DBA
the tuning algorithm.                                                   perspective, as storage is not so expensive as compared to overall
Problem Formulation. A semi-automatic tuning algorithm re-              objective of building a highly scalable system.” This does not im-
ceives as input the workload stream Q and a stream V that rep-          ply we should recommend all possible indices. On the contrary, as
resents the feedback provided by the DBA. Stream V has elements         we see below, the recommendation must account for the overhead
of the form F = (F + , F − ) per our feedback model. Its contents       of materializing and maintaining the indices it recommends.
are not synchronized with Q, since the DBA can provide arbitrary        Performance Metrics. Intuitively, a good semi-automatic tun-
feedback at any point in time. We only assume that Q and V are          ing algorithm should recommend indices that minimize the overall
ordered in time, and we may refer to Q ∪ V as a totally ordered         work done by the system, including the cost to process the work-
sequence. The output of the algorithm is a stream of recommended        load as well as the cost to implement changes to the materialized in-
index sets S ⊆ I, generated after each query or feedback element        dices. The first component is typical for index tuning problems and
in Q ∪ V . We focus on online algorithms, and hence the computa-        it reflects the quality of the recommendations. The second compo-
tion of S can use information solely from past queries and votes—       nent stems from the online nature of the problem: the recommen-
the algorithm has absolutely no information about the future.           dations apply to the running state of the system, and it is clearly
   In order to complete the problem statement, we must tie the al-      desirable to change the materialized set at a low cost. Low mate-
gorithm’s output to the feedback in V . Intuitively, we consider the    rialization cost is important even if new indices are built during a
DBA to be an expert and hence the algorithm should trust the pro-       maintenance period, since these periods have limited duration and
vided feedback. At the same time, the algorithm should be able          typically involve several other maintenance tasks (e.g., generation
to recover from feedback that is not useful for the subsequent state-   of usage reports, or backups).
ments in the workload. We bridge these somewhat conflicting goals
by requiring each recommendation S to be consistent with recent
feedback in V. To formally define consistency, let Fc+ be the set
                                                                        1 Previous work [10, 17] and commercial systems provide tools to inspect index configurations, which may be adapted to our setting.
[Figure 1: Components of the WFIT Algorithm. The workload feeds candidate selection, which
produces partitioned index candidates; the recommendation logic runs one WFA instance per
part, takes DBA feedback (F+, F−), and outputs the recommended index set to the DBA.]

queries. During candidate selection, WFIT also analyzes the interactions between candidate
indices and uses these interactions to determine a stable partition of the candidates (see
Section 2). Then the output of candidate selection is a partitioned set of indices, as
shown in Figure 1. Once these candidates are chosen, WFIT analyzes the benefit of the
indices with respect to the workload in order to generate the final recommendation. The
logic that WFIT uses to generate recommendations is based on the Work Function Algorithm
(WFA) of Borodin and El-Yaniv [4]. The original version of WFA was proposed for metrical
task systems [5] but we extend its functionality to apply to semi-automatic index
selection. A separate instance of WFA analyzes each part of the candidate set and only
recommends indices within that part. As we discuss later, this divide-and-conquer approach
of WFIT improves the algorithm’s performance and theoretical guarantees. Finally, the DBA
may request the current recommendation at any time and provide feedback
   Formally, let A be a semi-automatic tuning algorithm, and define       to WFIT. The feedback is incorporated back into each instance of
Sn as the recommendation that A generates after analyzing qn and          WFA and considered for the next recommendation.
all feedback up to qn+1 . Also denote the initial set of indices as
S0 . We define the following total work metric that captures the             The following two sections present the full details of each com-
performance of A’s recommendations:                                       ponent of WFIT shown in Figure 1. Section 4 defines WFA and
                                                                          describes how WFIT leverages the array of WFA instances for its
                                                                          recommendation logic. Section 5 completes the picture, with the
    $\mathrm{totWork}(A, Q_N, V) = \sum_{1 \le n \le N} \bigl( \mathrm{cost}(q_n, S_n) + \delta(S_{n-1}, S_n) \bigr)$
                                                                          additional mechanisms that WFIT uses to generate candidates and
                                                                          account for DBA feedback.
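
The total work metric can be computed with a single pass over the workload. This is a minimal sketch: `toy_cost` and `toy_delta` are hypothetical stand-ins for the what-if cost and the transition cost, with made-up numbers, and `recommendations[n]` plays the role of the set adopted for the n-th statement.

```python
def tot_work(cost, delta, workload, recommendations, initial):
    """Sum of cost(q_n, S_n) + delta(S_{n-1}, S_n) over the workload prefix."""
    if len(workload) != len(recommendations):
        raise ValueError("need exactly one recommendation per statement")
    total = 0.0
    prev = frozenset(initial)  # S_0
    for q, s in zip(workload, recommendations):
        s = frozenset(s)
        total += cost(q, s) + delta(prev, s)
        prev = s
    return total


def toy_cost(q, s):
    """Hypothetical costs: index 'a' speeds up selects and slows down updates."""
    if q == "select":
        return 20.0 if "a" in s else 100.0
    return 40.0 if "a" in s else 10.0


def toy_delta(x, y):
    """Hypothetical transition cost: 50 per created index, 1 per dropped index."""
    x, y = frozenset(x), frozenset(y)
    return 50.0 * len(y - x) + 1.0 * len(x - y)
```

For the workload ["select", "select", "update"] starting from an empty configuration, recommending {a}, {a}, ∅ gives (20 + 50) + (20 + 0) + (10 + 1) = 101, whereas never materializing a gives 100 + 100 + 10 = 210.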
The value of totWork (A, QN , V ) models the performance of a
system where each recommendation Sn is adopted by the DBA
for the processing of query qn . This convention follows common           4     A Work Function Algorithm
practice in the field of online algorithms [4] and is convenient for            for Index Tuning
the theoretical analysis that we present later. In addition, this model
captures the effect of the feedback in V , as each Sn is required to         The index tuning problem closely follows the study of task sys-
be consistent (see above). Overall, total work forms an intuitive         tems from online computation [5]. This allows us to base our rec-
objective function, as it captures the primary sources of cost, while     ommendation algorithm on existing principled approaches. In par-
incorporating the effect of feedback on the choices of the algorithm.     ticular, we apply the Work Function Algorithm [4] (WFA for short),
The adoption of this metric does not change the application of semi-      which is a powerful approach to task systems with an optimal com-
automatic tuning in practice: the tuning algorithm will still generate    petitive ratio.
a recommendation after each element in Q ∪ V , and the DBA will              In order to fit the assumptions of WFA, we do not consider the
be responsible for any changes to the materialized set.                   effect of feedback and we fix a set of candidate indices C ⊆ I from
   It is clearly impossible for an online algorithm A to yield the        which all recommendations will be drawn. In the next section, we
optimal total work for all values of QN and V . Consequently,             will present the WFIT algorithm, which builds on WFA with support
we adopt the common practice of competitive analysis: we mea-             for feedback and automatic maintenance of candidate indices.
sure the effectiveness of A by comparing it against an idealized
offline algorithm OPT that has advance knowledge of QN and V              4.1    Applying the Work Function Algorithm
and can thus generate optimal recommendations. Specifically, we
say that A has competitive ratio c if totWork (A, QN , V ) ≤ c ·             We introduce the approach of WFA with a conceptual tool that vi-
totWork (OPT, QN , V ) + α for any QN and V , where α is con-             sualizes the index tuning problem in the form of a graph. The graph
stant with respect to QN and V , and A and OPT choose recommen-           has a source vertex S0 to represent the initial state of the system, as
dations from the same finite set of configurations. The competitive       well as vertices (qn , X) for each statement qn and possible index
ratio c captures the performance of A compared to the optimal rec-        configuration X ⊆ C. The graph has an edge from S0 to (q1 , X)
ommendations in the worst case, i.e., under some adversarial input        for each X, and edges from (qn−1 , X) to (qn , Y ) for all X, Y and
QN and V . In this work, we assume that V = ∅ for the purpose             1 < n ≤ N . The weight of an edge is given by the transition cost
of competitive analysis, since V comes from a trusted expert and          between the corresponding index sets. The nodes (q, X) are also
hence the notion of adversarial feedback is unclear in practice. Our      annotated with a weight of cost(q, X). We call this the index
theoretical results demonstrate that the derivation of c remains          transition graph. The key property of the graph is that the totWork
non-trivial even under this assumption. Applying competitive analysis     metric is equivalent to the sum of node and edge weights along the
to the general case of V ≠ ∅ is a challenging problem that we leave       path that follows the recommendations. Figure 2 illustrates this
for future work.                                                          calculation on a small sample graph. A previous study [3] has used this
                                                                          graph formulation for index tuning when the workload sequence is
3.2      Overview of Our Solution                                         known a priori. Here, we are dealing with an online setting where
   The remainder of the paper describes the WFIT algorithm for            the workload is observed one statement at a time.
semi-automatic index tuning. Figure 1 illustrates WFIT’s approach            The internal state of WFA records information about shortest paths
to generating recommendations based on the workload and DBA               in the index transition graph, where the possible index configura-
feedback. The approach starts with a candidate selection com-             tions comprise the subsets of the candidate set C. More formally,
ponent, which generates indices that are relevant to the incoming         after observing n workload statements, the internal state of WFA
tracks a value denoted w_n(S) for each index set S ⊆ C, as defined in the following recurrence:

    w_n(S) = \min_{X \subseteq C} \{ w_{n-1}(X) + cost(q_n, X) + \delta(X, S) \}        (4.1)
    w_0(S) = \delta(S_0, S)

We henceforth refer to w_n(S) as the work function value for S after n statements. As mentioned above, the work function can be interpreted in terms of paths in the index transition graph. In the case where n is positive, w_n(S) represents the sum of (i) the cost of the shortest path from S_0 to some graph node (q_n, X), and (ii) the transition cost from X to S. The actual value of w_n(S) uses the X ⊆ C which minimizes this cost. We can think of w_0(S) in a similar way, where the "path" is an empty path, starting and ending at S_0. Then the definition w_0(S) = δ(S_0, S) has a natural analogy to the recursive case.
   Note that the total work of the theoretically optimal recommendations is equivalent to totWork(OPT, Q_n, ∅) = min_{S⊆C} {w_n(S)}.

    Data: Set C ⊆ I of candidate indices; array w of work function values;
          configuration currRec.
    Initialization: Candidates C and initial state S_0 ⊆ C given as input;
          w[S] = δ(S_0, S) for each S ⊆ C; currRec = S_0.

    Procedure WFA.analyzeQuery(q)
    Input: The next statement q in the workload
    1  Initialize arrays w′ and p;
    2  foreach S ⊆ C do
    3      w′[S] = min_{X⊆C} {w[X] + cost(q, X) + δ(X, S)};
    4      p[S] = {X ⊆ C | w′[S] = w[X] + cost(q, X) + δ(X, S)};
    5  Copy w′ to w;
    6  foreach S ⊆ C do score(S) ← w[S] + δ(S, currRec);
    7  currRec ← arg min_{S ∈ p[S]} {score(S)};

    Function WFA.recommend()
    1  return currRec;

                 Figure 3: Pseudocode for WFA.
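
The recurrence (4.1) and the pseudocode of Figure 3 translate almost line by line into code. The following Python sketch is illustrative and is not the authors' implementation: the class name `WFA`, the `cost` and `delta` callbacks, and the helper `subsets` are assumptions made for this example. The numbers are taken from the two-configuration scenario of Figure 2, where index a costs 20 to create and nothing to drop; in a real system `cost` would be answered by the what-if optimizer.

```python
from itertools import combinations

def subsets(candidates):
    """All subsets of the candidate set, as frozensets."""
    items = sorted(candidates)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

class WFA:
    """Sketch of Figure 3; cost(q, X) and delta(X, Y) are supplied by the caller."""

    def __init__(self, candidates, initial, cost, delta):
        self.configs = subsets(candidates)
        self.cost, self.delta = cost, delta
        self.curr_rec = frozenset(initial)
        # w_0(S) = delta(S_0, S)
        self.w = {S: delta(self.curr_rec, S) for S in self.configs}

    def analyze_query(self, q):
        new_w, p = {}, {}
        for S in self.configs:                      # lines 2-4
            totals = {X: self.w[X] + self.cost(q, X) + self.delta(X, S)
                      for X in self.configs}
            new_w[S] = min(totals.values())
            p[S] = {X for X, t in totals.items() if t == new_w[S]}
        self.w = new_w                              # line 5
        score = {S: self.w[S] + self.delta(S, self.curr_rec)
                 for S in self.configs}             # line 6
        # line 7: minimum score among configurations with S in p[S]
        eligible = [S for S in self.configs if S in p[S]]
        self.curr_rec = min(eligible, key=score.get)

    def recommend(self):
        return self.curr_rec

# Scenario of Figure 2 (integer costs keep the equality test in p[S] exact).
EMPTY, A = frozenset(), frozenset({"a"})
QUERY_COST = {"q1": {EMPTY: 15, A: 5},
              "q2": {EMPTY: 15, A: 2},
              "q3": {EMPTY: 15, A: 20}}
cost = lambda q, X: QUERY_COST[q][X]
delta = lambda X, Y: 20 * len(Y - X)   # creating a costs 20, dropping is free

wfa = WFA({"a"}, EMPTY, cost, delta)
history = []
for q in ("q1", "q2", "q3"):
    wfa.analyze_query(q)
    history.append((wfa.w[EMPTY], wfa.w[A], set(wfa.recommend())))
print(history)
```

Each entry of `history` holds w_n(∅), w_n({a}) and the recommendation after the n-th query. With floating-point costs the exact equality used to build p[S] would need a tolerance, which this sketch leaves out.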
Hence, the intuition is that WFA can generate good recommendations online by maintaining information about the possible paths of optimal recommendations.
   Figure 3 shows the pseudocode for applying WFA to index tuning. All of the bookkeeping in WFA is based on the fixed set C of candidate indices. The algorithm records an array w that is indexed by the possible configurations (subsets of C). After analyzing the n-th statement of the workload, w[S] records the work function value w_n(S). The internal state also includes a variable currRec to record the current recommendation of the algorithm.
   The core of the algorithm is the analyzeQuery method. There are two stages to the method. The first stage updates the array w using the recurrence expression defined previously. The algorithm also creates an auxiliary array p. Each p[S] contains index sets X such that a path from S_0 to (q_n, X) minimizes w_n(S). The second stage computes the next recommendation to be stored in currRec. WFA assigns a numerical score to each configuration S as score(S) = w[S] + δ(S, currRec) and the next state must minimize this score. To see the intuition of this criterion, consider a configuration X with a higher score than currRec, meaning that X cannot become the next recommendation. Then

        score(currRec) < score(X)
      ⇒ w_n(currRec) − w_n(X) < δ(X, currRec).

The left-hand side of the final inequality can be viewed as the benefit of choosing a new recommendation X over currRec in terms of the total work function, whereas the right side represents the cost for WFA to "change its mind" and transition from X back to currRec. When the benefit is less than the transition cost, WFA will not choose X over the current recommendation. This cost-benefit analysis helps WFA make robust decisions (see Theorem 4.1).

 [Figure 2: graph not reproduced. Node weights cost(q, X) as drawn; edges that keep the index set unchanged have weight 0.]

                       query q1    query q2    query q3
    index set {a}          5           2          20
    index set {}          15          15          15

 This small graph visualizes total work for a workload of three queries q_1, q_2, q_3, where recommendations are chosen between ∅ and {a}. The index a has cost 20 to create and cost 0 to drop. The highlighted path in the graph corresponds to an algorithm that recommends ∅ for q_1 and {a} for q_2, q_3. The combined cost of edges and nodes in the path is δ(∅, ∅) + cost(q_1, ∅) + δ(∅, {a}) + cost(q_2, {a}) + δ({a}, {a}) + cost(q_3, {a}) = 57.

                         Figure 2: Index transition graph

   The recommendation S chosen by WFA must also appear in p[S]. Recall that p[S] records states X such that there exists a path from S_0 to (q_n, X) that minimizes w_n(S). The condition specifies that X = S for one such path, and hence w_n(S) = w_{n−1}(S) + cost(q, S). An important result from Borodin et al. ([4], Lemma 9.2) shows that this condition is always satisfied by a state with minimum score. In other words, the criterion S ∈ p[S] is merely a tie-breaker for recommendations with the minimum score, to favor configurations whose work function does not include a transition after the last query is processed. This is crucial for the theoretical guarantees of WFA that we discuss later.

   EXAMPLE 4.1. The basic approach of WFA can be illustrated using the scenario in Figure 2. The actual recommendations of WFA will be the same as the highlighted nodes. Before the first query is seen, the work function values are initialized as
        w_0(∅) = 0,    w_0({a}) = 20
based on the transition cost from the initial configuration S_0 ≡ ∅. After the first query, the work function is updated using (4.1):
        w_1(∅) = 15,   w_1({a}) = 25.
These values are based on the paths ∅ → ∅ and ∅ → {a} respectively.² The scores are the same as the respective work function values (δ(∅, ∅) = δ({a}, ∅) = 0 at line 6 of WFA.analyzeQuery), hence ∅ remains as WFA's recommendation due to its lower score. After q_2, the work function values are both
        w_2(∅) = w_2({a}) = 27.
Both values use the path ∅ → {a} → {a}. The calculation of w_2(∅) also includes the transition δ({a}, ∅), which has zero cost. The corresponding scores are again equal to the work function, but here the tie-breaker comes into play: {a} is preferred because it is used to evaluate q_2 in both paths, hence WFA switches its recommendation to {a}. Finally, after q_3, the work function values are
        w_3(∅) = 42,   w_3({a}) = 47,
based on paths ∅ → {a} → {a} → ∅ and ∅ → {a} → {a} → {a} respectively. The actual scores must also account for the current recommendation {a}. Following line 6 of WFA.analyzeQuery,
        score(∅) = 62,    score({a}) = 47.
The recommendation of WFA remains {a}, since it has a lower score. This last query illustrates an interesting property of WFA.

² For Example 4.1, we abuse notation and use index sets X in place of the graph nodes (q_n, X).
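
The path cost quoted in the caption of Figure 2, and the earlier statement that the optimal total work equals min_S w_n(S), can both be checked by brute force on this small graph. The sketch below is only an illustration under stated assumptions: the function names `tot_work` and `work_function` are invented here, the node weights are the ones read off Figure 2, and enumerating every path is feasible only because there are two configurations and three queries.

```python
from itertools import product

EMPTY, A = frozenset(), frozenset({"a"})
CONFIGS = (EMPTY, A)
# Node weights cost(q, X) for q1, q2, q3, read off Figure 2.
QUERY_COST = [{EMPTY: 15, A: 5}, {EMPTY: 15, A: 2}, {EMPTY: 15, A: 20}]

def delta(x, y):
    """Edge weight: 20 to create index a, 0 to drop it or to stay put."""
    return 20 * len(y - x)

def tot_work(path, start=EMPTY):
    """Sum of edge and node weights along one path through the graph."""
    total, prev = 0, start
    for costs, config in zip(QUERY_COST, path):
        total += delta(prev, config) + costs[config]
        prev = config
    return total

def work_function(n, start=EMPTY):
    """w_n(S) for every S, computed with recurrence (4.1)."""
    w = {s: delta(start, s) for s in CONFIGS}
    for costs in QUERY_COST[:n]:
        w = {s: min(w[x] + costs[x] + delta(x, s) for x in CONFIGS)
             for s in CONFIGS}
    return w

highlighted = tot_work((EMPTY, A, A))                      # path in the caption
best = min(tot_work(p) for p in product(CONFIGS, repeat=3))  # offline optimum
w3 = work_function(3)
print(highlighted, best, min(w3.values()))
```

The offline optimum found by enumeration coincides with the smallest work function value after q_3, which is the property WFA relies on when it tracks w_n instead of individual paths.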
Although the most recent query has favored dropping a, the recommendation does not change because the difference in work function values is too small to outweigh the cost to materialize a again.
   As a side note, observe that the computation of w_n(S) requires computing cost(q, X) for multiple configurations X. This is feasible using the what-if optimizer of the database system. Moreover, recent studies [13, 9] have proposed techniques to speed up successive what-if optimizations of a query. These techniques can readily be applied to make the computation of w_n very efficient.

WFA's Advantage: Competitive Analysis. WFA is a seemingly simple algorithm, but its key advantage is that we can prove strong guarantees on the performance of its recommendations.
   Borodin and El-Yaniv [4] showed that WFA has a competitive ratio of 2σ − 1 for any metrical task system with σ possible configurations, meaning that its worst-case performance can be bounded. Moreover, WFA is an optimal online algorithm, as this is the best competitive ratio that can be achieved. These are very powerful properties that we would like to transfer to the problem of index recommendations. However, the original analysis does not apply in our setting, since it requires δ to be a metric, and our definition of δ is not symmetric. One of the technical contributions of this paper is to show how to overcome the fact that δ is not a metric, and extend the analysis to the problem of index recommendations.

   THEOREM 4.1. The WFA algorithm, as shown in Figure 3, has a competitive ratio of 2^{|C|+1} − 1. (Proof in the extended paper [1])

   This theoretical guarantee bolsters our use of WFA to generate recommendations. The competitive ratio ensures that the recommendations do not have an arbitrary effect on performance in the worst case. We show empirically in Section 6 that the average-case performance of the recommendations can be close to optimal. This behavior is appealing to DBAs, since they would not want to make changes that can have unpredictably bad performance.

4.2     Partitioning the Candidates
    In the study of general task systems, the competitive ratio of WFA is theoretically optimal [5]. However, the algorithm has some drawbacks for the index recommendation problem, since it becomes infeasible to maintain statistics for every subset of candidates in C as the size of C increases. The competitive ratio 2^{|C|+1} − 1 also becomes nearly meaningless for moderately large sets C. Motivated by these observations, we present an enhanced algorithm WFA+, which exploits knowledge of index interactions to reduce the computational complexity of WFA, while enabling stronger theoretical guarantees.
    The strategy of WFA+ employs a stable partition {C_1, ..., C_K} of C, as defined in Section 2. The stable partition guarantees that indices in C_k do not interact with indices in any other part C_l ≠ C_k. This is formalized by (2.1), which shows that each part C_i makes an independent contribution to the benefit. Moreover, it is straightforward to show that δ(X, Y) = Σ_k δ(X ∩ C_k, Y ∩ C_k), i.e., we can localize the transition cost within each subset C_k. These observations allow WFA+ to decompose the objective function totWork into K components, one for each C_k, and then select indices within each subset using separate instances of WFA.
    We define WFA+ as follows. The algorithm is initialized with a stable partition {C_1, ..., C_K} of C, and initial configuration S_0. For k = 1, ..., K, WFA+ maintains a separate instance of WFA, denoted WFA^(k). We initialize WFA^(k) with candidates C_k and initial configuration S_0 ∩ C_k. The interface of WFA+ follows WFA:
 •   WFA+.analyzeQuery(q) calls WFA^(k).analyzeQuery(q) for each k = 1, ..., K.
 •   WFA+.recommend() returns ∪_k WFA^(k).recommend().
    On the surface, WFA+ is merely a wrapper around multiple instances of WFA, but the partitioned approach of WFA+ provides several concrete advantages. The division of indices into a stable partition implies that WFA+ must maintain statistics on only Σ_k 2^{|C_k|} configurations, compared to the 2^{|C|} states that would be required to monitor all the indices in WFA. This can simplify the bookkeeping massively: a back-of-the-envelope calculation shows that if WFA+ is given 32 indices partitioned into subsets of size 4, then only 128 configurations need to be tracked, whereas WFA would require more than four billion states.
    In the extended version of this paper [1], we prove that the simplification employed by WFA+ is lossless: in other words, WFA+ selects the same indices as WFA. It follows that WFA+ inherits the competitive ratio of WFA. However, the power of WFA+ is that it enables a much smaller competitive ratio by taking advantage of the stable partition.

   THEOREM 4.2. WFA+ has a competitive ratio of 2^{c_max+1} − 1, where c_max = max_k {|C_k|}. (Proof in the extended paper [1])

   Hence the divide-and-conquer strategy of WFA+ is a win-win, as it improves the computational complexity of WFA as well as the guarantees on performance. Observe that WFA+ matches the competitive ratio of 3 that the online tuning algorithm of Bruno and Chaudhuri [6] achieves for the special case |C| = 1 (the competitive analysis in [6] does not extend to a more general case). The competitive ratio is also superior to the ratio ≥ 8(2^{|C|} − 1) for the OnlinePD algorithm of Malik et al. [11] for a related problem in online tuning.

5     The WFIT Algorithm
   We introduced WFA+ in the previous section, as a solution to the index recommendation problem with strong theoretical guarantees. The two limitations of WFA+ are (i) it does not accept feedback, and (ii) it requires a fixed set of candidate indices and stable partition. In this section, we define the WFIT algorithm, which extends WFA+ with mechanisms to incorporate feedback and automatically maintain the candidate indices.
   Figure 4 shows the interface of WFIT in pseudocode. The methods analyzeQuery and recommend perform the same steps as the corresponding methods of WFA+. In analyzeQuery, WFIT takes additional steps to maintain the stable partition {C_1, ..., C_K}. This work is handled by two auxiliary methods: chooseCands determines what the next partition should be, and repartition reorganizes the data structures of WFIT for the new partition. Finally, WFIT adds a new method feedback, which incorporates explicit or implicit feedback from the DBA.
   In the next subsection, we discuss the feedback method. We then provide the details of the chooseCands and repartition methods used by analyzeQuery.

5.1     Incorporating Feedback
   As discussed in Section 3, the DBA provides feedback by casting positive votes for indices in some set F^+ and negative votes for a disjoint set F^−. The votes may be cast at any point in time, and the sets F^+, F^− may involve any index in C (even indices that are not part of the current recommendation). This mechanism is captured by a new method feedback(F^+, F^−). The DBA can call feedback explicitly to express preferences about the index configuration, and we also use feedback to account for the implicit feedback from manual changes to the index configuration.
   Recall from Section 3 that the recommendations must be consistent with recent feedback, but should also be able to recover from
poor feedback. Our approach to guaranteeing consistency is simple: Assuming that currRec is the current recommendation, the new recommendation becomes (currRec − F^−) ∪ F^+. Since WFIT forms its recommendation as ∪_k currRec_k, where currRec_k is the recommendation from WFA running on part C_k, we need to modify each currRec_k accordingly. Concretely, the new recommendation for C_k becomes (currRec_k − F^−) ∪ (F^+ ∩ C_k).
   The recoverability property is trickier to implement properly. Our solution is to adjust the scores in order to appear as if the workload (rather than the feedback) had led WFIT to recommend creating F^+ and dropping F^−. With this approach, WFIT can naturally recover from bad feedback if the future workload favors a different configuration. To enforce the property in a principled manner, we need to characterize the internal state of each instance of WFA after it generates a recommendation. Recall that WFA selects its next recommendation as the configuration that minimizes the score function. Let us assume that the selected configuration is Y, which differs from the previous configuration by adding indices Y^+ and dropping indices Y^−. If we recompute score after Y becomes the current recommendation, then we can assert the following bound for each configuration S:

    score(S) - score(Y) \ge \delta(S, (S - Y^-) \cup Y^+) + \delta((S - Y^-) \cup Y^+, S)        (5.1)

Essentially, this quantity represents the minimum threshold that score(S) must overcome in order to replace the recommendation Y. Hence, in order for the internal state of WFA^(k) to be consistent with switching to the new recommendation currRec_k, we must ensure that score(S) − score(currRec_k), or the equivalent expression w^(k)[S] + δ(S, currRec_k) − w^(k)[currRec_k], respects (5.1). This can be achieved by increasing w^(k)[S] accordingly.

    Data: Current set C of candidate indices;
          Stable partition {C_1, ..., C_K} of C;
          WFA instances WFA^(1), ..., WFA^(K);
    Initialization: Initial index set S_0 is provided as input;
          C = S_0, K = |S_0| and C_i = {a_i} where 1 ≤ i ≤ |S_0| and
          a_1, ..., a_|S_0| are the indices in S_0;
          for k ← 1 to K do
              WFA^(k) ← instance of WFA with candidates C_k
                        and initial configuration C_k ∩ S_0

    Procedure WFIT.analyzeQuery(q)
    Input: The next statement q in the workload.
    1  {D_1, ..., D_M} ← chooseCands(q);   // Figure 6
    2  if {D_1, ..., D_M} ≠ {C_1, ..., C_K} then
           // Replace {C_1, ..., C_K} with {D_1, ..., D_M}.
    3      repartition({D_1, ..., D_M});   // Figure 5
    4  for k ← 1 to K do WFA^(k).analyzeQuery(q);

    Function WFIT.recommend()
    1  return ∪_k WFA^(k).recommend();

    Procedure WFIT.feedback(F^+, F^−)
    Input: Index sets F^+, F^− ⊆ C with positive/negative votes.
    1   for k ← 1 to K do
    2       Let w^(k) denote the work function of WFA^(k);
    3       Let currRec_k denote the current recommendation of WFA^(k);
    4       currRec_k ← (currRec_k − F^−) ∪ (F^+ ∩ C_k);
    5       for S ⊆ C_k do
    6           S_cons ← (S − F^−) ∪ (F^+ ∩ C_k);
    7           minDiff ← δ(S, S_cons) + δ(S_cons, S);
    8           diff ← w^(k)[S] + δ(S, currRec_k) − w^(k)[currRec_k];
    9           if diff < minDiff then
    10              Increase w^(k)[S] by minDiff − diff;
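
The adjustment that the feedback procedure of Figure 4 performs on one part C_k can be sketched in a few lines. This is a hedged illustration, not the paper's code: the function name `apply_feedback`, the dictionary representation of the work function, and the creation costs in `CREATE_COST` are assumptions, and the transition cost is taken to be the sum of creation costs with free drops. The starting state reuses the values w_1(∅) = 15 and w_1({a}) = 25 from Example 4.1, followed by a hypothetical positive vote for index a.

```python
from itertools import combinations

CREATE_COST = {"a": 20}   # assumed creation costs; dropping is free

def delta(x, y):
    """Transition cost from configuration x to configuration y."""
    return sum(CREATE_COST[i] for i in y - x)

def subsets(part):
    items = sorted(part)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def apply_feedback(w, curr_rec, part, f_plus, f_minus):
    """Loop body of WFIT.feedback for a single part C_k (lines 2-10).

    w is the work function of that part (dict keyed by frozenset) and is
    updated in place; the consistent recommendation is returned.
    """
    new_rec = (curr_rec - f_minus) | (f_plus & part)            # line 4
    for s in subsets(part):                                     # line 5
        s_cons = (s - f_minus) | (f_plus & part)                # line 6
        min_diff = delta(s, s_cons) + delta(s_cons, s)          # line 7
        diff = w[s] + delta(s, new_rec) - w[new_rec]            # line 8
        if diff < min_diff:                                     # line 9
            w[s] += min_diff - diff                             # line 10
    return new_rec

EMPTY, A = frozenset(), frozenset({"a"})
w = {EMPTY: 15, A: 25}          # state after q1 in Example 4.1
new_rec = apply_feedback(w, EMPTY, part=A, f_plus=A, f_minus=EMPTY)
print(set(new_rec), w[EMPTY], w[A])
```

In this run the vote forces the recommendation to {a}, and w[∅] is raised so that score(∅) − score({a}) reaches the threshold δ(∅, {a}) + δ({a}, ∅) = 20 required by (5.1). A later workload that favors ∅ strongly enough can still overturn the vote, which is the recoverability property.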
   Figure 4 shows the pseudocode for feedback based on the previous discussion. For each part C_k of the stable partition, feedback first switches the current recommendation to be consistent with the feedback (line 4). Subsequently, it adjusts the value of w^(k)[S] for each S ⊆ C_k to enforce the bound (5.1) on score(S).

                        Figure 4: Interface of WFIT.

5.2     Maintaining Candidates Automatically
    The analyzeQuery method of WFIT extends the approach of WFA+ to automatically change the stable partition as appropriate for the current workload. We present these extensions in the remainder of this section. We first discuss the repartition method, which updates WFIT's internal state according to a new stable partition. Finally, we present chooseCands, which determines what that stable partition should be.

5.2.1    Handling Changes to the Partition
   Suppose that the repartition method is given a stable partition {D_1, ..., D_M} for WFIT to adopt for the next queries. We require each of the indices materialized by WFA to appear in one of the sets D_1, ..., D_M, in order to avoid inconsistencies between the internal state of WFIT and the physical configuration. In this discussion, we do not make assumptions about how {D_1, ..., D_M} is chosen. Later in this section, we describe how chooseCands automatically chooses the stable partition that is given to repartition.
Unmodified Candidate Set. We initially consider the case where the new partition is over the same set of candidate indices, i.e., ∪_{k=1}^{K} C_k = ∪_{m=1}^{M} D_m. The original internal state of WFIT corresponds to a copy of WFA for each stable subset C_k. The new partition requires a new copy of WFA to be initialized for each new stable subset D_m. The challenge is to initialize the work function values corresponding to D_m in a meaningful way. We develop a general initialization method that maintains an equivalence between the work function values of {D_1, ..., D_M} and {C_1, ..., C_K}, assuming that both partitions are stable.
   We describe the reinitialization of the work function with an example. Assume the old stable partition is C_1 = {a}, C_2 = {b}, and the new stable partition has a single member D_1 = {a, b}. Let w^(1), w^(2) be the work function values maintained by WFIT for the subsets C_1, C_2. Let w_n be the work function that considers paths in the index transition graph with both indices a, b, which represents the information that would be maintained if a, b were in the same stable subset. In order to initialize work function values for D_1, we observe that the following identity follows from the assumption that {C_1, C_2} is a stable partition:

    w_n(S) = w^{(1)}(S \cap \{a\}) + w^{(2)}(S \cap \{b\}) - \sum_{i=1}^{n} cost(q_i, \emptyset)

This is a special case of an equality that we prove in the extended paper [1]:

    w_n(S) = \sum_{k} w^{(k)}[S \cap C_k] - (K - 1) \sum_{i=1}^{n} cost(q_i, \emptyset).

The bottom line is that it is possible to reconstruct the values of the work function w_n using the work functions within the smaller partitions. For the purpose of initializing the state of WFA, the final sum may be ignored: the omission of this sum increases the scores
Procedure repartition({D1 , . . . , DM })                                   Data: Index set U ⊇ C from which to choose candidate indices;
     Input: The new stable partition.                                                  Array idxStats of benefit statistics for indices in U ;
     // Note: D1 , . . . , DM must cover materialized indices                          Array intStats of interaction statistics for pairs of indices in U .
1    Let w(k) denote the work function of WFA(k) ;                               Procedure chooseCands(q)
2    Let currRec denote the current recommendation of WFIT;                      Input: The next statement q in the workload.
3    for m ← 1 to M do                                                           Output: D1 , . . . , DM , a new partitioned set of candidate indices.
4        Initialize array x(m) and configuration variable newRec m ;             Knobs: Upper bound idxCnt on number of indices in output;
5        foreach X ∈ 2Dm do                                                          Upper bound stateCnt on number of states m 2|Dm | .
                                                                                                                                    P
             x(m) [X] ← K           (k) [C ∩ X];
                            P
6                             k=1 w       k                                          Upper bound histSize on number of queries to track in statistics
7            x(m) [X] ← x(m) [X] + δ(S0 ∩ Dm − C, X − C);                    1   U ← U ∪ extractIndices(q);
                                                                             2   IBG q ← computeIBG(q); // Based on [17]
8       newRec m ← Dm ∩ currRec;                                             3   updateStats(IBG q );
9 Set {D1 , . . . , DM } as the stable partition, where Dm is tracked by a   4   M ← {a ∈ C | a is materialized};
     new instance WFA(m) with work function x(m) and state newRec m ;        5   D ← M ∪ topIndices(U − M, idxCnt − |M|);
                                                                             6   {D1 , . . . , DM } ← choosePartition(D, stateCnt);
                                                                             7   return {D1 , . . . , DM };
              Figure 5: The repartition method of WFIT.

                                                                                         Figure 6: The chooseCands Method of WFIT.
of each state $S$ by the same value, which does not affect the decisions of WFA. Based on this reasoning, our repartitioning algorithm would initialize $D_1$ using the array $x$ defined as follows:

$$
\begin{aligned}
x[\emptyset] &\leftarrow w^{(1)}[\emptyset] + w^{(2)}[\emptyset] & x[\{a\}] &\leftarrow w^{(1)}[\{a\}] + w^{(2)}[\emptyset] \\
x[\{b\}] &\leftarrow w^{(1)}[\emptyset] + w^{(2)}[\{b\}] & x[\{a, b\}] &\leftarrow w^{(1)}[\{a\}] + w^{(2)}[\{b\}]
\end{aligned}
$$

We use an analogous strategy to initialize the work function when repartitioning from $D_1$ to $C_1, C_2$:

$$
\begin{aligned}
w^{(1)}[\emptyset] &\leftarrow x[\emptyset] & w^{(2)}[\emptyset] &\leftarrow x[\emptyset] \\
w^{(1)}[\{a\}] &\leftarrow x[\{a\}] & w^{(2)}[\{b\}] &\leftarrow x[\{b\}]
\end{aligned}
$$

Again, note that these assignments result in work function values that would be different if $C_1, C_2$ were used as the stable partition for the entire workload. The crucial point is that each work function value is distorted by the same quantity (the omitted sum), so the difference between the scores of any two states is preserved.

The pseudocode for repartition is shown in Figure 5. For each new stable subset $D_m$, the goal is to initialize a copy of WFA with candidates $D_m$. The copy is associated with an array $x^{(m)}$ that stores the work function values for the configurations in $2^{D_m}$. For a state $X \subseteq D_m$, the value $x^{(m)}[X]$ is initialized as the sum of $w^{(k)}[X \cap C_k]$, i.e., the work function values of the configurations in the original partition that are maximal subsets of $X$ (line 6). This initialization follows the intuition of the example that we described previously, since the stable partition $\{C_1, \ldots, C_K\}$ implies that $X \cap C_k$ is independent from $X \cap C_l$ for $k \neq l$. Line 7 makes a final adjustment for new indices in $X$, but this is irrelevant if the candidate set does not change (we will explain this step shortly). Finally, the current state corresponding to $D_m$ is initialized by taking the intersection of currRec with $D_m$.

Overall, repartition is designed in order for the updated internal state to select the same indices as the original state, provided that both partitions are stable. This property was illustrated in the example shown earlier. It is also an intuitive property, as two stable partitions record a subset of the same independencies, and hence both allow WFIT to track accurate benefits of different configurations. A more formal analysis of repartition would be worthwhile to explore in future work.

Modified Candidate Set. We now extend our discussion to the case where the new partition is over a different set of candidate indices, i.e., $\bigcup_{k=1}^{K} C_k \neq \bigcup_{m=1}^{M} D_m$. The repartition method (Figure 5) can handle this case without modifications. The only difference is that line 7 becomes relevant, and it may increase the work function value of certain configurations. It is instructive to consider the computation of $x^{(m)}[X]$ when $X$ contains an index $a$ which did not previously appear in any $C_k$ or the initial state $S_0$. Since $a$ is a new index, it does not belong to any of the original subsets $C_k$, and hence the cost to materialize $a$ will not be reflected in the sum $\sum_k w^{(k)}[X \cap C_k]$. Since $x^{(m)}[X]$ includes a transition to an index set with $a$ materialized, we must add the cost to materialize $a$ as a separate step. This idea is generalized by adding the transition cost on line 7. The expression is a bit complex, but we can explain it in an alternative form $\delta(S_0 \cap D_m - C,\ X \cap D_m - C)$, which is equivalent because $X \subseteq D_m$. In this form, we can make an analogy to the initialization used for the work function before the first query, for which we use $w_0(X) = \delta(S_0, X)$. The expression used in line 7 computes the same quantity restricted to the indices $(D_m - C)$ that are new within $D_m$.

5.2.2 Choosing a New Partition

As the final piece of WFIT, we present the method chooseCands, which automatically decides the set of candidate indices $C$ to be monitored by WFA, as well as the partition $\{C_1, \ldots, C_K\}$ of $C$.

At a high level, our implementation of chooseCands analyzes the workload one statement at a time, identifying interesting indices and computing statistics on benefit interactions. These statistics are subsequently used to compute a new stable partition, which may reflect the addition or removal of candidate indices or changes in the interactions among indices. As we will see shortly, several of these steps rely on simple, yet intuitive heuristics that we have found to work well in practice. Certainly, other implementations of chooseCands are possible, and can be plugged in with the remaining components of WFIT.

The chooseCands method exposes three configuration variables that may be used to regulate its analysis. Variable idxCnt specifies an upper bound on the number of indices that are monitored by an instance of WFA, i.e., $\mathit{idxCnt} \geq |C| = \sum_k |C_k|$. Variable stateCnt specifies an upper bound on the number of configurations tracked by WFIT, i.e., $\mathit{stateCnt} \geq \sum_k 2^{|C_k|}$. If the minimal stable partition does not satisfy these bounds, chooseCands will ignore some candidate indices or some interactions between indices, which in turn affects the accuracy of WFIT’s internal statistics. Variable histSize controls the size of the statistics recorded for past queries. Any of these variables may be set to $\infty$ in order to make the statistics as exhaustive as possible, but this may result in high computational overhead. Overall, these variables allow a trade-off between the overhead of workload analysis and the effectiveness of the selected indices.

Figure 6 shows the pseudocode of chooseCands. The algorithm maintains a large set of indices $U$, which grows as more queries
are seen. The goal of chooseCands is to select a stable partition over some subset $D \subseteq U$. To help choose the stable partition, the algorithm also maintains statistics for $U$ in two arrays: idxStats stores benefit information for individual indices and intStats stores information about interactions between pairs of indices within $U$.

Given a new statement $q$ in the workload, the algorithm first augments $U$ with interesting indices identified by extractIndices (line 1). This function may be already provided by the database system (e.g., as with IBM DB2), or it can be implemented externally [2, 6]. Next, the algorithm computes the index benefit graph [17] (IBG for short) of the query (line 2). The IBG compactly encodes the costs of optimized query plans for all relevant subsets of $U$. As we discuss later, updateStats uses the IBG to efficiently update the benefit and interaction statistics (line 3). The next step of the algorithm determines the new set of candidate indices $D$ that should be monitored by WFIT for the upcoming workload, with an invocation of topIndices on line 5. We ensure that $D$ includes the currently materialized indices (denoted $M$), in order to avoid overriding the materializations chosen by WFA. Finally, chooseCands invokes choosePartition to determine the partition $D_1, \ldots, D_M$ of $D$, and returns the result.

To complete the picture, we must describe the methodology that topIndices and choosePartition use to decide the new partition of indices, and the specific bookkeeping that updateStats does to enable this decision.

The topIndices(X, u) Method. The job of topIndices($X$, $u$) is to choose at most $u$ candidate indices from the set $X$ that have the highest potential benefit.

We first describe the statistics used to evaluate the potential benefit of a candidate index. For each index $a$, the idxStats array stores entries of the form $(n, \beta_n)$, where $n$ is a position in the workload and $\beta_n$ is the maximum benefit of $a$ for query $q_n$. The maximum benefit is computed as $\beta_n = \max_{X \subseteq U} \mathit{benefit}_{q_n}(\{a\}, X)$. The cell idxStats[$a$] records the histSize most recent entries such that $\beta_n > 0$. These statistics are updated when chooseCands invokes updateStats on line 3. The function considers every index $a$ that is relevant to $q$, and employs the IBG of query $q$ in order to compute $\beta_n$ efficiently. If $\beta_n > 0$ then $(n, \beta_n)$ is appended to idxStats[$a$] and the oldest entry is possibly expired in order to keep histSize entries in total.

Based on these statistics, topIndices($X$, $u$) returns a subset $Y \subseteq X$ with size at most $u$, which becomes the new set of indices monitored by WFIT. The first step of topIndices computes a “current benefit” for each index in $X$, which captures the benefit of the index for recent queries. We use $\mathit{benefit}^*_N(a)$ to denote the current benefit of $a$ after observing $N$ workload statements, and compute this value as follows. If idxStats[$a$] $= \emptyset$ after $N$ statements, then $\mathit{benefit}^*_N(a)$ is zero. Otherwise, let idxStats[$a$] $= (n_1, b_1), \ldots, (n_L, b_L)$ such that $n_1 > \cdots > n_L$. Then

$$
\mathit{benefit}^*_N(a) = \max_{1 \leq \ell \leq L} \frac{b_1 + \cdots + b_\ell}{N - n_\ell + 1}.
$$

For each $\ell = 1, \ldots, L$, this expression computes an average benefit over the most recent $N - n_\ell + 1$ queries, and we take the maximum over all $\ell$. Note that a large value of $n_\ell$ results in a small denominator, which gives an advantage to indices with recent benefit. This approach is inspired by the LRU-K replacement policy for disk buffering [12].

The second step of topIndices($X$, $u$) uses the current benefit to compute a score for each index in $X$, and returns the $u$ indices with the highest scores. If $a \in X \cap C$ (i.e., $a$ is currently monitored by WFA), the score of $a$ is simply $\mathit{benefit}^*(a)$. The score of other indices $b \in X - C$ is $\mathit{benefit}^*(b)$ minus the cost to materialize $b$. This means that $b$ requires extra evidence to evict an index in $C$, which helps $C$ be more stable.

The choosePartition(D, stateCnt) method. Conceptually, the stable partition models the strongest index interactions for recent queries. We first describe the statistics used to estimate the strength of interactions, and then the selection of the partition.

The statistics for choosePartition are based on the degree of interaction $\mathit{doi}_q(a, b)$ between indices $a, b \in U$ for a workload statement $q$ (Section 2). Specifically, we maintain an array intStats that is updated in the call to updateStats (which also updates idxStats as described earlier). The idea is to iterate over every pair $(a, b)$ of indices in the IBG, and use the technique of [17] to compute $d \equiv \mathit{doi}_{q_n}(a, b)$. The pair $(n, d)$ is added to intStats[$a, b$] if $d > 0$, and only the histSize most recent pairs are retained.

We use intStats[$a, b$] to compute a “current degree of interaction” for $a, b$ after $N$ observed workload statements, denoted as $\mathit{doi}^*_N(a, b)$, which is similar to the “current benefit” described earlier. If intStats[$a, b$] $= \emptyset$ then we set $\mathit{doi}^*_N(a, b) = 0$. Otherwise, let intStats[$a, b$] $= (n_1, d_1), \ldots, (n_L, d_L)$ for $n_1 > \cdots > n_L$, and

$$
\mathit{doi}^*_N(a, b) = \max_{1 \leq \ell \leq L} \frac{d_1 + \cdots + d_\ell}{N - n_\ell + 1}.
$$

To compute the stable partition, we conceptually build a graph where vertices correspond to indices and edges correspond to pairs of interacting indices. Then a stable partition is a clustering of the nodes so that no edges exist between clusters. In the context of chooseCands, we are interested in partitions $\{P_1, \ldots, P_M\}$ such that $\sum_m 2^{|P_m|} \leq \mathit{stateCnt}$. Since there may exist no stable partition that obeys this bound, our approach is to ignore interactions until a feasible partition is possible. This corresponds to dropping edges from the conceptual graph, until the connected components yield a suitable clustering of the nodes. The chooseCands algorithm uses a randomized approach to select which edges to drop, favoring the elimination of edges that represent weak interactions. The details are presented in the extended version of this paper [1].

6 Experimental Study

In this section, we present an empirical evaluation of WFIT using a prototype implementation that works as middleware on top of an existing DBMS. The prototype, written in Java, intercepts the SQL queries and analyzes them to generate index recommendations. The prototype requires two services from the DBMS: access to the what-if optimizer, and an implementation of the extractIndices($q$) method (line 1 in Figure 6). This design makes the prototype easily portable, as these services are common primitives found in index advisors [18, 2].

We conducted experiments using a port of the prototype to the IBM DB2 Express-C DBMS. The port uses DB2’s design advisor [18] to provide what-if optimization and extractIndices($q$). Unless otherwise noted, we set the parameters of WFIT as follows: idxCnt = 40, stateCnt = 500, and histSize = 100. All experiments were run on a machine with two dual-core 2GHz Opteron processors and 8GB of RAM.

6.1 Methodology

Competitor Techniques. We compare WFIT empirically against two competitor algorithms. The first algorithm, termed BC, is an adaptation³ of the state-of-the-art online tuning algorithm of Bruno

³ The original algorithm was developed in the context of MS SQL Server. Some of its components do not have counterparts in DB2.
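Returning to the topIndices method of Section 5.2.2, the "current benefit" formula and the two-step scoring rule can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the prototype's code: the function names `current_stat` and `top_indices`, the list-of-pairs layout of the history, and the `create_cost` mapping are hypothetical. The same `current_stat` helper would also serve for the current degree of interaction, since both statistics share the same max-of-averages form.

```python
def current_stat(history, n_seen):
    """Max over prefixes of (v_1 + ... + v_l) / (N - n_l + 1).

    history -- list of (n, value) pairs, where n is the workload position
    n_seen  -- N, the number of workload statements observed so far
    Returns 0.0 for an empty history, as in the text.
    """
    best, running = 0.0, 0.0
    # Visit entries from most recent (largest n) to oldest.
    for n, value in sorted(history, reverse=True):
        running += value
        best = max(best, running / (n_seen - n + 1))
    return best


def top_indices(candidates, monitored, idx_stats, create_cost, n_seen, u):
    """Return at most u indices from `candidates` with the highest scores.

    Indices already monitored are scored by their current benefit alone.
    Other indices are penalized by their (assumed known) creation cost.
    """
    scores = {}
    for a in candidates:
        score = current_stat(idx_stats.get(a, []), n_seen)
        if a not in monitored:
            score -= create_cost[a]
        scores[a] = score
    # Sort by descending score; break ties by name for determinism.
    ranked = sorted(candidates, key=lambda a: (-scores[a], a))
    return set(ranked[:max(u, 0)])
```

The `max(u, 0)` guard reflects the call on line 5 of Figure 6, where the budget is idxCnt − |M| and could in principle be negative if many indices are already materialized.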