COMPARE: Accelerating Groupwise Comparison in Relational Databases for Data Analytics (Extended Version)


                        Tarique Siddiqui                  Surajit Chaudhuri                       Vivek Narasayya
                                                    Microsoft Research
                                        {tasidd, surajitc, viveknar}@microsoft.com

ABSTRACT
Data analysis often involves comparing subsets of data across many dimensions
to find unusual trends and patterns. While comparisons between subsets of data
can be expressed in SQL, such queries tend to be complex to write and suffer
from poor performance over large and high-dimensional datasets. In this paper,
we propose a new logical operator, COMPARE, for relational databases that
concisely captures the enumeration and comparison of subsets of data and
greatly simplifies the expression of a large class of comparative queries. We
extend the database engine with optimization techniques that exploit the
semantics of COMPARE to significantly improve the performance of such queries.
We have implemented these extensions inside Microsoft SQL Server, a commercial
DBMS engine. Our extensive evaluation on synthetic and real-world datasets
shows that COMPARE results in a significant speedup over existing approaches,
including physical plans generated by today's database systems, user-defined
functions (UDFs), as well as middleware solutions that compare subsets outside
the databases.
1.   INTRODUCTION
   Comparing subsets of data is an important part of data exploration
[8, 30, 43, 19, 47], routinely performed by data scientists to find unusual
patterns and gain actionable insights. For instance, market analysts often
compare products over different attribute combinations (e.g., revenue over
week, profit over week, profit over country, quantity sold over week, etc.) to
find the ones with similar or dissimilar sales. However, as the size and
complexity of the dataset increase, this manual enumeration and comparison of
subsets becomes challenging. To address this, a number of visualization tools
[47, 49, 19, 43] have been proposed that automatically compare subsets of data
to find the ones that are relevant. Figure 1a depicts an example from SeeDB
[47] where the user specifies the subsets of population (e.g., based on marital
status, race) and the tool automatically finds a socio-economic indicator
(e.g., education, income, capital gains) on which the subsets differ the most.
Similarly, Figure 1b depicts an example from Zenvisage [43] for finding states
with similar house pricing trends. Unfortunately, most of these tools perform
the comparison of subsets in a middleware and, as depicted in Figure 2, as the
size and number of attributes in the dataset increase, these tools incur large
data movement as well as serialization and deserialization overheads. This
results in poor latency and scalability. Additionally, these tools require data
scientists to learn and switch to their domain-specific querying interfaces,
thereby limiting their broad adoption.

Figure 1: Examples of comparative queries from visual analytic tools: (a)
Finding socio-economic indicators that differentiate married and unmarried
couples in SeeDB [47]. The user specifies the subsets (A), after which the tool
outputs a pair of attributes (B) along with corresponding visualizations (C)
that differentiate the subsets the most. (b) A comparative query in Zenvisage
[43] for finding states with comparable housing price trends.

   The question we pose in this work is: can we efficiently perform comparison
between subsets of data within relational databases to improve the performance
and scalability of comparative queries? Supporting such queries within
relational databases also makes them broadly accessible via general-purpose
data analysis tools such as PowerBI [3], Tableau [5], and Jupyter notebooks
[27]. All of these tools let users directly write SQL queries and execute them
in the DBMS to reduce the amount of data that is shipped to the client.

   One option for in-database execution is to extend the DBMS with custom
user-defined functions (UDFs) for comparing subsets of data. However, UDFs
incur invocation overhead and are typically executed as a batch of statements
where each statement is run sequentially, one after the other, with limited
resources (e.g., parallelism, memory). Consequently, the performance of UDFs
does not scale with the increase in the number of tuples (see Figure 2).
Furthermore, UDFs have limited interoperability with other operators and are
less amenable to logical optimizations, e.g., PK-FK join optimizations over
multiple tables.

   While comparative queries can be expressed using regular SQL, such queries
require a complex combination of multiple subqueries.

[Figure 2 plot omitted. X-axis: Number of Tuples (10^3 to 10^8); y-axis:
Improvement (%) w.r.t. SQL Server; series: Middleware (Outside DB), UDF-based
Comparison, COMPARE (our approach).]
Figure 2: Relative performance of different execution approaches for a
comparative query w.r.t. unmodified SQL Server execution time (higher is
better). The query finds a pair of origin airports that have the most similar
departure delays over week trends in the flight dataset [1].

The complexity makes it hard for relational databases to find efficient
physical plans, resulting in poor performance. While prior work has proposed
extensions [14, 13, 22, 18, 44] such as grouping variables, GROUPING SETS, and
CUBE, as we discuss in later sections, expressing and optimizing grouping and
comparison simultaneously remains a challenge. To illustrate the complexity of
using regular SQL, we use the following example.

Example. Consider a market analyst exploring sales trends across different
cities. The analyst generates a sample of visualizations depicting different
trends, e.g., average revenue over week, average profit over week, average
revenue over country, etc., for a few cities. She notices that trends for
cities in Europe look different from those in Asia. To verify whether this
observation generalizes, she looks for a counterexample by searching for pairs
of attributes over which two cities in Asia and Europe have the most similar
trends. Often, an $L_p$ norm-based distance measure (e.g., Euclidean distance,
Manhattan distance) that measures deviation between trends and distributions is
used for such comparisons [43, 47, 19].

Figure 3: A SQL query for comparing subsets of data over different attribute
combinations, depicting the complexity of specification using existing SQL
expressions.

   Figure 3 depicts a SQL query template for the above example. The query
involves multiple subqueries, one for each attribute pair. Within each
subquery, subsets of data (one for each city) are aggregated and compared via a
sequence of self-join and aggregation functions that compute the similarity
(i.e., sum of squared differences). Finally, a join and a filter are performed
to output the tuples of subsets with minimum scores. Clearly, the query is
quite verbose and complex, with redundant expressions across subqueries. While
comparative queries often explore and compare a large number of attribute pairs
[47, 28], we observe that even with only a few attribute pairs, the SQL
specification can become extremely long.

   Furthermore, the number of groups to compare can often be large. It is
determined by the number of possible constraints (e.g., cities), pairs of
attributes, and aggregation functions, which grow significantly with the
increase in dataset size or number of attributes. This results in many
subqueries, with each subquery taking a substantially long time to execute. In
particular, while there are large opportunities for sharing computations (e.g.,
aggregations) across subqueries, relational engines execute the subqueries for
each attribute pair separately, resulting in substantial overhead in both
runtime and storage. Furthermore, as depicted in subquery 1 in Figure 3, while
each pair of groups (e.g., the sets of tuples corresponding to each city) can
be compared independently, relational engines perform an expensive self-join
over a large relation consisting of all groups. The cost of doing this
increases super-linearly as the number and size of subsets increase (discussed
in more detail in Section 4.1). Finally, in many cases, we only need the
aggregated result for each comparison; however, the join results in large
intermediate data (one tuple for each pair of matching tuples between the two
sets), resulting in substantial overheads.

1.1   Overview of Our Approach

   In this paper, we take an important step towards making the specification of
comparative queries easier and ensuring their efficient processing. To do so,
we introduce a logical operator and extensions to the SQL language, as well as
optimizations in relational databases, described below.

Groupwise comparison as a first class construct (Sections 2 and 3). We
introduce a new logical operation, COMPARE (Φ), as a first class relational
construct, and formalize its semantics, which help capture a large class of
frequently used comparative queries. We propose extensions to SQL syntax that
allow intuitive and more concise specification of comparative queries. For
instance, the comparison between two sets of cities $C_1$ and $C_2$ over $n$
pairs of attributes $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ using a
comparison function $F$ can be succinctly expressed as COMPARE [$C_1$ $C_2$]
[$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$] USING $F$. As illustrated
earlier, expressing the same query using existing SQL clauses requires a UNION
over $n$ subqueries, one for each $(x_i, y_i)$, where each subquery itself
tends to be quite complex. Overall, while COMPARE does not give additional
expressive power to the relational algebra, it reduces the complexity of
specifying comparative queries and facilitates optimizations via the query
optimizer and the execution engine.

Efficient processing via optimizations (Sections 4 and 5). We exploit the
semantics of COMPARE to share aggregate computations across multiple attribute
combinations, as well as to partition and compare subsets in a manner that
significantly reduces the processing time. While these optimizations work for
any comparison function, we also introduce specific optimizations (via a new
physical operator) that exploit properties of frequently used comparison
functions (e.g., $L_p$ norms). These optimizations help prune many subset
comparisons without affecting correctness.
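
To make the running example from the introduction concrete, the following Python sketch (an illustration only, not the paper's implementation or the query in Figure 3) enumerates attribute pairs and city pairs and scores each pair with the sum of squared differences. The row layout, the sample values, and the helper names `avg_trend`, `sum_sq_diff`, and `most_similar` are assumptions made for this sketch. Note that it recomputes each city's aggregate for every comparison, which mirrors the redundancy across subqueries described above.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample rows; column names and values are made up for illustration.
ROWS = [
    {"city": "Tokyo", "week": 1, "revenue": 10.0, "profit": 1.0},
    {"city": "Tokyo", "week": 2, "revenue": 20.0, "profit": 2.0},
    {"city": "Seoul", "week": 1, "revenue": 30.0, "profit": 5.0},
    {"city": "Seoul", "week": 2, "revenue": 10.0, "profit": 5.0},
    {"city": "Paris", "week": 1, "revenue": 10.5, "profit": 4.0},
    {"city": "Paris", "week": 2, "revenue": 19.5, "profit": 4.0},
    {"city": "London", "week": 1, "revenue": 50.0, "profit": 9.0},
    {"city": "London", "week": 2, "revenue": 60.0, "profit": 9.0},
]


def avg_trend(rows, city, x, y):
    """Average of measure y per value of attribute x, restricted to one city."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        if r["city"] == city:
            sums[r[x]] += r[y]
            counts[r[x]] += 1
    return {k: sums[k] / counts[k] for k in sums}


def sum_sq_diff(t1, t2):
    """Sum of squared differences over x values present in both trends."""
    return sum((t1[k] - t2[k]) ** 2 for k in t1.keys() & t2.keys())


def most_similar(rows, cities1, cities2, attr_pairs):
    """Return (score, city1, city2, x, y) with the smallest score."""
    best = None
    for x, y in attr_pairs:                      # one "subquery" per attribute pair
        for c1, c2 in product(cities1, cities2):  # every pair of groups
            # Aggregates are recomputed for every pair: the redundancy noted in the text.
            score = sum_sq_diff(avg_trend(rows, c1, x, y), avg_trend(rows, c2, x, y))
            if best is None or score < best[0]:
                best = (score, c1, c2, x, y)
    return best


best = most_similar(ROWS, ["Tokyo", "Seoul"], ["Paris", "London"],
                    [("week", "revenue"), ("week", "profit")])
print(best)
```

The call returns the tuple with the lowest score across all enumerated city pairs and attribute pairs; the nested loops make the growth in the number of comparisons explicit.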

[Figure 4 visualization panels omitted. Panels: 1a: One to many comparisons
over fixed X and Y attributes; 1b: One to one comparisons over varying X and Y
attributes; 2a: Many to many comparisons over fixed X and Y attributes; 2b:
Many to many comparisons over varying X and Y attributes.]
Figure 4: Illustrating comparative queries described in Section 2.1
Inter-operator optimizations (Section 6). We introduce new transformation rules
that transform the logical tree containing the COMPARE operator along with
other relational operators into equivalent logical trees that are more
efficient. For instance, the attributes referred to in COMPARE may be spread
across multiple tables, involving PK-FK joins between fact and dimension
tables. To optimize such cases, we show how we can push COMPARE below the join,
which reduces the number of tuples to join. Similarly, we describe how
aggregates can be pushed below COMPARE, how multiple COMPARE operators can be
reordered, and how we can detect and translate an equivalent sub-plan expressed
using existing relational operators to COMPARE.

Implementation inside commercial database engine (Section 7). We have
prototyped our techniques in the Microsoft SQL Server engine, including the
physical optimizations. Our experiments show that even over moderately-sized
datasets (e.g., 10–20 GB), COMPARE results in up to 4× improvement in
performance relative to alternative approaches, including physical plans
generated by SQL Server, UDFs, and middleware (e.g., Zenvisage, SeeDB). With
the increase in the number of tuples and attributes, the performance difference
grows quickly, with COMPARE giving more than an order of magnitude better
performance.

2.   CHARACTERIZING COMPARATIVE QUERIES

   In this section, we first characterize comparative queries with the help of
additional examples drawn from visualization tools [19, 43, 47] and data mining
[35, 10, 8, 30]. Then, we give a formal definition that concisely captures the
semantics of comparative queries.

2.1   Examples

   We return to the example scenario discussed in the introduction: a market
analyst is exploring sales trends of products with the help of visualizations
to find unusual patterns. The analyst first looks at a small sample of
visualizations, e.g., average revenue over week trends for a few regions (e.g.,
Asia, Europe) and for a subset of cities and products within each region. She
observes some unusual patterns and wants to quickly find additional
visualizations that either support or disprove those patterns (without
examining all possible visualizations). Note that we use the term "trend" to
refer to a set of tuples in a more general sense where both categorical (e.g.,
country) and ordinal attributes (e.g., week) can be used for alignment for
comparison. We consider several examples below in increasing order of
complexity. Figure 4 illustrates each of these examples, depicting the
differences in how the comparison is performed.

Example 1a. The analyst notes that the average revenue over week trends for
Asia as well as for a subset of products in that region look similar. As a
counterexample, she wants to find a product whose revenue over week trend in
Asia is very dissimilar (typically measured using $L_p$ norms) to Asia's
overall trend. There are visualization systems [48, 12, 43, 28] that support
similar queries.

Example 1b. In the above example, the analyst finds that the trend for product
‘Inspiron’ is different from the overall trend for the region ‘Asia’. She finds
it surprising and wants to see the attributes for which trends or distributions
of Inspiron and Asia deviate the most. More precisely, she wants to compare
‘Inspiron’ and ‘Asia’ over multiple pairs of attributes (e.g., average profit
over country, average quantity sold over week, ..., average profit over week)
and select the one where they deviate the most. Such comparisons have inspired
features such as Explain Data [4] in Tableau and tools such as SeeDB [47],
Zenvisage [43], and Voyager [49].

Example 2a. Consider another scenario: the analyst visualizes the revenue
trends of a few cities in Asia and in Europe, and finds that while most cities
in Asia have increasing revenue trends, those in Europe have decreasing trends.
Again, as a counterexample to this, she wants to find a pair of cities in these
regions where this pattern does not hold, i.e., they have the most similar
trends. Such tasks involving a search for similar pairs of items are ubiquitous
in data mining [36] and time series [35, 10, 8, 30].

Example 2b. In the above example, the analyst finds that the output pair of
visualizations look different, supporting her intuition that perhaps no two
cities in Europe and Asia have similar revenue over week trends. To verify
whether this observation generalizes when compared over other attributes, she
searches for pairs of attributes (similar to the ones mentioned in Example 1b)
for which two cities in Asia and Europe have the most similar trends or
distributions. Such queries are common in tools such as Zenvisage [43] and
QuickInsights [19] that support finding outlier visualizations over a large set
of attributes.

   In summary, the comparative queries in the above examples fast-forward the
analyst to a few visualizations that depict a pattern she wants to verify,
thereby allowing her to skip the tedious and time-consuming process of manual
comparison of all possible visualizations. As illustrated in Figure 4, each
query involves comparisons between two sets of visualizations (henceforth
referred to as Set1 and Set2) to find the ones which are similar or dissimilar.
Each visualization depicting a trend is represented via two attributes (an X
attribute, e.g., week, and a Y attribute, e.g., average revenue) and a set of
tuples (specified via a constraint, e.g., product = ‘Inspiron’). We now present
a succinct representation to capture these semantics.

2.2   Formalization

   We formalize our notion of comparative queries and propose a concise
representation for specifying such queries.

2.2.1   Trend

A trend is a set of tuples that are compared together as one unit. Formally,

Definition 1 [Trend]. Given a relation $R$, a trend $t$ is a set of tuples
derived from $R$ via the triplet: constraint $c$, grouping $g$, measure $m$,
and represented as $(c)(g, m)$.

Definition 2 [Constraint]. Given a relation $R$, a constraint is a conjunctive
filter of the form $(p_1 = \alpha_1, p_2 = \alpha_2, \ldots, p_n = \alpha_n)$
that selects a subset of tuples from $R$. Here, $p_1, p_2, \ldots, p_n$ are
attributes in $R$ and $\alpha_i$ is a value of $p_i$ in $R$. One can use ‘ALL’
to select all values of $p_i$, similar to [22].

Definition 3 [(Grouping, Measure)]. Given a set of tuples selected via a
constraint, all tuples with the same value of grouping are aggregated using
measure. A tuple in one trend is only compared with the tuple in another trend
with the same value of grouping.

   In Example 1a, (R.region = ‘Asia’)(R.week, AVG(R.revenue)) is a trend in
Set1, where (region = ‘Asia’) is the constraint for the trend and all tuples
with the same value of the grouping ‘week’ are aggregated using the measure
‘AVG(revenue)’. We currently do not support range filters for constraints.

2.2.2   Trendset

   A comparative query involves two sets of trends. We formalize this via a
trendset.

Definition 4 [Trendset]. A trendset is a set of trends. A trend in one trendset
is compared with a trend in another trendset.

   In Example 1a, the first trendset consists of a single trend:
{(R.region = ‘Asia’)(R.week, AVG(R.revenue))}, while the second trendset
consists of as many trends as there are unique products in R:
{(R.region = ‘Asia’, R.product = ‘Inspiron’)(R.week, AVG(R.revenue)),
(R.region = ‘Asia’, R.product = ‘XPS’)(R.week, AVG(R.revenue)),
..., (R.region = ‘Asia’, R.product = ‘G7’)(R.week, AVG(R.revenue))}.
   As is the case in the above example, often a trendset contains one trend for
each unique value of an attribute (say $p$) as a constraint, all sharing the
same (grouping, measure). Such a trendset can be succinctly represented using
only the attribute name as constraint,

Definition 6 [Scorer]. Given two trends $t_1$ and $t_2$, a scorer is any
function that returns a single scalar value called ‘score’ measuring how $t_1$
compares with $t_2$.

   While we can accept any function that satisfies the above definition as a
scorer, as mentioned earlier, two trends are often compared using distance
measures such as Euclidean distance and Manhattan distance [31, 47, 43]. Such
functions are also called aggregated distance functions [34]. All aggregated
distance functions use a function DIFF(.) as defined below.

Definition 7 [DIFF($m_1$, $m_2$, $p$)].¹ Given a tuple with measure value
$m_1$ and grouping value $g_i$ in trend $t_1$ and another tuple with measure
value $m_2$ and the same grouping value $g_i$ in trend $t_2$,
$\mathrm{DIFF}(m_1, m_2, p) = |m_1 - m_2|^p$ where $p \in \mathbb{Z}^+$. Tuples
with non-matching grouping values are ignored.

   Since $m_1$ and $m_2$ are clear from the definition of $t_1$ and $t_2$, and
tuples across trends are compared only when they have the same grouping and
measure expressions, we abbreviate DIFF($m_1$, $m_2$, $p$) as DIFF($p$).

Definition 8 [Aggregated Distance Function]. An aggregated distance function
compares trends $t_1: (c_i)(g_i, m_i)$ and $t_2: (c_j)(g_i, m_i)$ in two steps:
(i) first, DIFF($p$) is computed between every pair of tuples in $t_1$ and
$t_2$ with the same value of $g_i$, and (ii) all values of DIFF($p$) are
aggregated using an aggregate function AGG such as SUM, AVG, MIN, or MAX to
return a score. An aggregated distance function is represented as AGG OVER
DIFF($p$).

   For example, $L_p$ norms² such as Euclidean distance can be specified using
SUM OVER DIFF(2), Manhattan distance using SUM OVER DIFF(1), Mean Absolute
Deviation as AVG OVER DIFF(1), and Mean Square Deviation as AVG OVER DIFF(2).

2.2.4   Comparison between Trendsets

   We extend Definition 5 to the following observation over trendsets.

Observation 1 [Comparability between two trendsets]. Given two trendsets $T_1$
and $T_2$, a trend $(c_i)(g_i, m_i)$ in $T_1$ is compared with only those
trends $(c_j)(g_j, m_j)$ in $T_2$ where $g_i = g_j$ and $m_i = m_j$.
i.e., [p][(g1 , m1 )]. If α1 , α2 , ..αn represent all unique values of p,         Thus, given two trendsets, we can automatically infer which trends
then,                                                                           between the two trendsets need to be compared. We use T 1T 2
    [p][(g1 , m1 )] ⇒ {(p = α1 )(g1 , m1 ), (p = α2 )(g1 , m1 ), ...,           to denote the comparison between two trendsets T1 and T2 . For
(p = αn )(g1 , m1 )} (⇒ denotes equivalence)                                    example, the comparison in example 1a can be represented as:
    Similarly, [p1 , p2 = β][(g1 , m1 )] ⇒ {(p1 = α1 , p2 = β)(g1 ,             [region = ‘Inspiron’][(week, AVG(revenue))]  [region = ‘Asia’,
m1 ), (p1 = α2 , p2 = β)(g1 , m1 ), ..., (p1 = αn , p2 = β)(g1 , m1 )}          product][ (week, AVG (revenue))]
    Alternatively, a trendset consisting of different (grouping, mea-              If both T1 and T2 consist of the same set of grouping and mea-
sure) combinations but the same constraint (e.g., p = α1 ) can be               sure expressions say {(g1 , m1 ), ..., (gn , mn )} and differ only in
succinctly written as:                                                          constraint, we can succinctly represent T1  T2 as follows:
    [(p = α1 )][(g1 , m1 ), ..., (gn , mn )] ⇒ {(p = α1 )(g1 , m1 ), ...,       [c1 ][(g1 , m1 ), ..., (gn , mn )]  [c2 ][(g1 , m1 ), ..., (gn , mn )] ⇒
(p = α1 )(gn , mn )}                                                            [c1  c2 ][(g1 , m1 ), ..., (gn , mn )]
                                                                                   Thus, the comparison between trendsets in example 1a can be
2.2.3     Scoring                                                               succinctly expressed as:
                                                                                [(region = ‘Asia’)  (region = ‘Asia’, product) ][(week, AVG(revenue))]
   We first define our notion of ‘Comparability’ that tells when two
trends can be compared.                                                         Similarly, the following expression represents the comparison in
                                                                                example 1b.
Definition 5 [Comparability of two trends]. Two trends t1 : (c1 )(g1 ,          [(region = ‘Asia’)  (region = ‘Asia’, product = ‘Inspiron’)][(week,
m1 ) and t2 : (c1 )(g2 , m2 ) can be compared if g1 = g2 and m1 =               AVG(revenue)), (country, AVG(profit)), ... , (month, AVG(revenue))]
m2 , i.e., they have the same grouping and measure.                                We can now define a comparative expression using the notions
  For example, a trend (R.product = ‘Inspiron’) (R.week, AVG(                   introduced so far.
R.revenue)) and a trend (R.product = ‘XPS’)( R.month, AVG(R.profit))            1
                                                                                  Note that DIFF is distinct from another operator [6] with similar
cannot be compared since they differ on grouping and measure.                   name.
                                                                                2
  Next, we define a function scorer for comparing two trends.                     We ignore the pth root as it does not affect the ranking of subsets.
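To make Definitions 7 and 8 concrete, the following is a minimal Python sketch of an aggregated distance function AGG OVER DIFF(p). It is an illustration only, not the implementation used in the paper: the helper names (`diff`, `aggregated_distance`), the choice to model a trend as a dictionary from grouping value to measure value, the made-up revenue numbers, and the decision to return `None` when two trends share no grouping value are all assumptions.

```python
from statistics import mean

# Hypothetical names; a trend is modelled as {grouping value: measure value}.
AGGREGATES = {"SUM": sum, "AVG": mean, "MIN": min, "MAX": max}


def diff(m1, m2, p):
    """DIFF(m1, m2, p) = |m1 - m2|^p for a positive integer p."""
    return abs(m1 - m2) ** p


def aggregated_distance(t1, t2, agg, p):
    """AGG OVER DIFF(p): score two trends over their shared grouping values."""
    shared = sorted(set(t1) & set(t2))  # non-matching grouping values are ignored
    if not shared:
        return None  # assumption: no overlapping grouping value means no score
    return AGGREGATES[agg](diff(t1[g], t2[g], p) for g in shared)


# week -> AVG(revenue); the numbers are made up for illustration.
asia = {1: 10.0, 2: 12.0, 3: 9.0}
xps = {1: 13.0, 2: 8.0, 4: 20.0}  # weeks 3 and 4 have no counterpart

sum_over_diff2 = aggregated_distance(asia, xps, "SUM", 2)  # Euclidean, no root
sum_over_diff1 = aggregated_distance(asia, xps, "SUM", 1)  # Manhattan
avg_over_diff1 = aggregated_distance(asia, xps, "AVG", 1)  # mean abs. deviation
```

Only weeks 1 and 2 contribute to the scores here, since weeks 3 and 4 appear in just one of the two trends.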

Definition 9 [Comparative expression]. Given two trendsets T1 <-> T2
over a relation R, and a scorer F, a comparative expression
computes the scores between trends (ci)(gi, mi) in T1 and (cj)(gj, mj)
in T2 where gi = gj and mi = mj.

3.    THE COMPARE OPERATOR
   In this section, we introduce a new operator COMPARE that
makes it easier for data analysts and application developers to express
comparative queries. We first explain the syntax and semantics
of COMPARE and then show how COMPARE inter-operates
with other relational operators to express top-k comparative queries
as discussed in Section 2.1.

3.1    Syntax and Semantics

   COMPARE, denoted by Φ, is a logical operator that takes as input
a comparative expression specifying two trendsets T1 <-> T2 over
relation R along with a scorer F and returns a relation R'.
                      Φ(R, T1 <-> T2, F) → R'
   R' consists of scores for each pair of compared trends between
the two trendsets. For instance, the table below depicts the output
schema (with an example tuple) for the COMPARE expression
[c1 <-> c2][(g1, m1), (g2, m2)]. The values in the tuple indicate
that the trend (c1 = α1)(g1, m1) is compared with the trend
(c2 = α2)(g1, m1) and the score is 10.

         c1    c2       g1    m1      g2       m2      score
         α1    α2      True   True   False    False     10
         ...   ...      ...    ...    ...      ...       ...

   We express the COMPARE operator in SQL using two extensions:
COMPARE and USING:

COMPARE T1 <-> T2
USING F

For instance, for example 1a, the comparison between the AVG(revenue)
over week trends for the region ‘Asia’ and each of the products in
region ‘Asia’ can be succinctly expressed as follows:

                     Listing 1: COMPAREXPR1A
SELECT R1, P, W, V, score
FROM sales R
COMPARE [((R.region = Asia) AS R1) <-> (R1, R.product AS P)]
           [R.week AS W, AVG (R.revenue) AS V]
USING SUM OVER DIFF(2) AS score

   Here T1 = [((R.region = Asia) AS R1)][R.week AS W, AVG
(R.revenue) AS V] and T2 = [((R.region = Asia) AS R1, R.product
AS P)][R.week AS W, AVG (R.revenue) AS V]. Observe that T1
and T2 share the same set of (grouping, measure) and the filter
predicate (R.region = Asia) in their constraints; thus the comparison
is concisely expressed as [((R.region = Asia) AS R1) <-> (R1, R.product
AS P)][R.week AS W, AVG (R.revenue) AS V].

Table 1: Output of COMPARE in Example 1a

  R1      P          W      V      score
  Asia    XPS        True   True   30
  Asia    Inspiron   True   True   24
  ...     ...        ...    ...    ...
  Asia    G8         True   True   45

   Table 1 illustrates the output of this query. The first two columns
R1 and P identify the values of constraint for compared trends in
T1 and T2. The columns W and V are Boolean valued, denoting
whether R.week and AVG(R.revenue) were used for the compared
trends. Thus, the values of (R1, P, W, V) together identify the pairs
of trends that are compared. Since R.week and AVG(R.revenue)
are grouping and measure for all trends in this example, their values
are always True. Finally, the column score specifies the scores
computed using Euclidean distance, expressed as SUM OVER DIFF(2).
   Now, consider below the query for example 1b that compares
tuples where (R.region = Asia) with tuples where (R.region = Asia)
and (R.product = ‘Inspiron’) over a set of (grouping, measure):

                     Listing 2: COMPAREXPR1B
SELECT R1, P, W, C, V, ..., M, score
FROM sales R
COMPARE [((R.region = Asia) AS R1) <-> (R1, (R.product = 'Inspiron') AS P)]
           [(R.week AS W, AVG(R.revenue) AS V),
            (R.country AS C, AVG(R.profit) AS O), ..., (R.month AS M, V)]
USING SUM OVER DIFF(2) AS score

Table 2: Output of COMPARE in Example 1b

  R1     P          W       C       M       V      O       score
  Asia   Inspiron   True    False   False   True   False   40
  Asia   Inspiron   False   True    False   True   False   20
  ...    ...        ...     ...     ...     ...    ...     ...
  Asia   Inspiron   False   False   True    True   False   10

   Table 2 depicts the output for this query. The columns R1 and P
are always set to "Asia" and "Inspiron" since the constraints for all
trends in T1 and T2 are fixed. W, C, M, V, and O consist of Boolean
values telling which columns among R.week, R.country, R.month,
AVG(R.revenue), and AVG(R.profit) were used as (grouping, measure)
for the pair of compared trends.
   From the above examples, it is easy to see that we can write queries
with COMPARE expressions for examples 2a and 2b as follows:

                     Listing 3: COMPAREXPR2A
SELECT R1, C1, R2, C2, W, V, score
FROM sales R
COMPARE [((R.Region = Asia) AS R1, (R.city) AS C1) <->
            ((R.Region = Europe) AS R2, (R.city) AS C2)]
           [R.week AS W, AVG(R.revenue) AS V]
USING SUM OVER DIFF(2) AS score

                     Listing 4: COMPAREXPR2B
SELECT R1, C1, R2, C2, W, C, V, ..., M, score
FROM sales R
COMPARE [((R.Region = Asia) AS R1, (R.city) AS C1) <->
            ((R.Region = Europe) AS R2, (R.city) AS C2)]
           [(R.week AS W, AVG(R.revenue) AS V),
            (R.country AS C, AVG(R.profit) AS O), ..., (R.month AS M, V)]
USING SUM OVER DIFF(2) AS score

   Note that COMPARE is semantically equivalent to a standard
relational expression consisting of multiple sub-queries involving
union, group-by, and join operators, as illustrated in the introduction.
As such, COMPARE does not add to the expressiveness of relational
algebra or the SQL language. The purpose of COMPARE is to provide a
succinct and more intuitive mechanism to express a large class of
frequently used comparative queries as shown above. For example,
expressing the query in Listing 2 using existing SQL clauses (see
Figure 3) is much more verbose, requiring a complex sub-query for
each (grouping, measure). Prior work has proposed similar succinct
abstractions such as GROUPING SETs [17] and CUBE [22]
(both widely adopted by most of the databases) and more recently
DIFF [6], which share our overall goal that with an extended syntax,
complex analytic queries are easier to write and optimize.
   Furthermore, the input to COMPARE is a relation, which can either
be a base table or an output from another logical operator (e.g.,
join over multiple tables); similarly, the output relation from COMPARE
can be an input to another logical operator or the final output.
Thus, COMPARE can inter-operate with other operators. To illustrate
this, we discuss how COMPARE inter-operates with
other operators such as join and filter to select top-k trends. While
COMPARE outputs the scores for each pair of compared trends,
comparative queries often involve selection of top-k trends based
on their scores (Section 2.1).

3.2    Expressing Top-k Comparative Queries
   In this section, we show how we can use the above-listed COMPARE
sub-expressions (referred to as COMPAREXPR1A, COMPAREXPR1B,
COMPAREXPR2A, and COMPAREXPR2B) with LIMIT
and join to select tuples for trends belonging to the top-k.
Example 1a. The following query selects the tuples of a product
in region ‘Asia’ that has the most different AVG(revenue) over
week trends compared to that of region ‘Asia’ overall. COMPAREXPR1A
refers to the sub-expression in Listing 1.
SELECT T.product, T.week, T.revenue, S.score
FROM sales T JOIN
(SELECT * FROM COMPAREXPR1A
ORDER BY score DESC
LIMIT 1) AS S
WHERE T.product = S.P

   The ORDER BY and LIMIT clauses select the top-1 row in Table 1
with the highest score, with P identifying the most different
product. Next, a join is performed with the base table to select all
tuples of that product along with its score.
Example 2a. The query for example 2a differs from example 1a
in that both trendsets consist of multiple trends. Here, one may be
interested in selecting tuples of both cities that are similar, thus we
use the WHERE condition (T.city = S.C1 AND T.Region = S.R1)
OR (T.city = S.C2 AND T.Region = S.R2). (S.R1, S.R2, S.C1,
S.C2) in the SELECT clause identifies the pair of compared trends.

SELECT T.Region, T.city, T.week, T.revenue, S.R1, S.C1, S.R2, S.C2,
 S.score
FROM sales T JOIN
(SELECT * FROM COMPAREXPR2A
ORDER BY score
LIMIT 1) AS S
WHERE (T.city = S.C1 AND T.Region = S.R1) OR (T.city = S.C2 AND
T.Region = S.R2)

Examples 1b and 2b. These examples extend the first two examples
to multiple attributes. We show the query for example 2b; it is
a more complex version of example 1b, where trends in each trendset
are created by varying all three: constraint, grouping, measure
(example 1b has a fixed constraint for each trendset).
SELECT T.city, S.R1, S.R2, S.C1, S.C2,
 CASE WHEN S.W THEN T.week ELSE NULL END,
 ...
 CASE WHEN S.V THEN T.revenue ELSE NULL END,
 S.score
FROM sales T JOIN
(SELECT * FROM COMPAREXPR2B
ORDER BY score
LIMIT 1) AS S
WHERE (T.city = S.C1 AND T.Region = S.R1) OR (T.city = S.C2 AND
T.Region = S.R2)

The SELECT clause only outputs the values of columns for which
corresponding trends have the highest score, setting NULL for other
columns to indicate that those columns were not part of the top-1 pair
of trends. This idea of setting NULL is borrowed from prior work
by Gray et al. on computing CUBE [22]. Nevertheless, an alternative
is to output values of all columns, and add (S.W, S.M, S.C, S.P,
S.V) (as in the previous example) to the output to indicate which
columns were part of the comparison between the top-1 pair of trends.

4.    OPTIMIZING COMPARATIVE QUERIES
   In this section, we discuss how we optimize a logical query plan
consisting of a COMPARE operation. Microsoft SQL Server uses
a cost-based optimizer for generating the physical plan for an input
logical plan. We extend the optimizer to replace COMPARE
with a sub-plan of existing physical operators such that the cost of the
sub-plan is lowest. This is done in two steps: we first transform
COMPARE into a sub-plan of existing logical operators. These logical
operators are then transformed into physical operators using
existing rules to compute the cost of COMPARE. The cost of the
sub-plan for COMPARE is combined with costs of other physical
operators to estimate the total cost of the query. We state our problem
formally:

Problem 4.1. Given a logical query plan consisting of a COMPARE
operation: Φ(R, [c1 <-> c2][(d1, m1), ..., (dn, mn)], F) → R',
replace COMPARE with a sub-plan of physical operators with the
lowest cost.

For ease of exposition, we assume that both trendsets contain the
same set of trends, one for each unique value of c, i.e., c1 = c2 = c.

4.1     Basic Execution
   We start with a simple approach that transforms COMPARE into
a sub-plan of logical operators. The sub-plan is similar to the one
generated by database engines when comparative queries are expressed
using existing SQL clauses (discussed in Section 1). We
perform the transformation using the following steps, with each
step using the output from the previous step as input.
(1) ∀(di, mi): Ri ← Group-by_{c,di} Agg_{mi}(R)
(2) ∀ Ri: Rij ← ⋈_{Ri.c != Ri.c, Ri.di = Ri.di}(Ri)
(3) ∀ Rij: Rijk ← Group-by_{ci,cj} Agg_{UDAF}(Rij) // ci, cj are
aliases of column c
(4) R' ← Union All_{i,j,k}(Rijk)
   First, for each (grouping, measure) combination, we add a group-by
aggregate (e.g., GROUP BY product, week, AGG on AVG(revenue))
to create trendsets. Then, on the output of the group-by aggregate, we
join tuples between two trends that have the same value of grouping
(e.g., ⋈_{R'.product != R'.product, R'.week = R'.week}). Finally, the score
between each pair of trends is computed by applying a user-defined
aggregate (UDA) F. This is done by first partitioning the join output
to group together each pair of trends. Each partition consisting of a
pair of trends is then aggregated using F. Finally, the scores for all
pairs of trends are aggregated via Union All.
   Unfortunately, this approach does not scale well as the size of
the input dataset and the number of (grouping, measure) combinations
become large. There are two issues. First, aggregations
across (grouping, measure) are performed separately, even when
there are overlaps in the subset of tuples being aggregated. Second,
while each trend is compared with other trends independently, the
second step typically involves a join at the granularity of trendsets,
i.e., a relation consisting of one trendset is joined with a relation
consisting of another trendset. The cost of join increases rapidly
as the number of trends being compared and the size of each trend
Figure 5: Improvement in performance due to merging group-by
aggregates and trendwise comparison (via partitioning).
   (a) Variation in performance as we merge group-by aggregates to
   share computations. Axes: latency (s) vs. number of group-by
   expressions after merging.
   (b) Improvements due to trendwise join after partitioning trendset
   into trends (the size of each trend is fixed to 1000 tuples). Axes:
   latency (s) vs. number of trends per trendset (10^1 to 10^5).
   Series: Trendsetwise Join; Partitioning + Trendwise Join; Only
   Partitioning.

Figure 6: Optimized query plan generated after applying merging and
partitioning on basic query plan in Figure 7. Stages shown in the plan:
1. Merging aggregates; 2. Partitioning trendset into trends (vertical
and horizontal partitioning); 3. Trend-wise comparison, followed by
UNION ALL.

impact of merging on the cost of subsequent comparison between
trends; ignoring this can lead to sub-optimal plans, as we describe
shortly. We first introduce the second optimization for comparison.
Trendwise Comparison via Partitioning. The second optimization
is based on the observation that pairwise joins of multiple
smaller relations are much faster than a single join between two
large relations. This is because the cost of join increases
super-linearly with the increase in the size of the trendsets. In addition
to the improvement in complexity, trend-wise joins are also more
amenable to parallelization than a single join between two trendsets.
Figure 5b depicts the difference in latency for these two approaches
as we increase the number of trends from 10 to 10^5 (each
of size 1000). The black dotted line shows the partitioning overhead
incurred while creating partitions for each trend, showing that
the overhead is small (linear in n) compared to the gains due to
increases (see Figure 5b). We next discuss how we address these             trendwise join. Moreover, this is much smaller than the overhead
issues via merging and partitioning optimizations.                          incurred when partitioning is performed on the join output (∝ n2 )
                                                                            in the basic plan (see step 3 in Figure 5).
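The trendwise comparison idea can be sketched in a few lines of Python. This is an illustrative toy, not the physical plan generated inside SQL Server: the function names, the in-memory dictionaries standing in for partitioned relations, and the sample rows are all assumptions. The sketch first partitions the output of a group-by aggregate into one small relation per trend in a single linear pass, then joins and scores each pair of trends independently with SUM OVER DIFF(p).

```python
from collections import defaultdict
from itertools import combinations


def partition_by_constraint(rows):
    """One linear pass over (c, d, m) rows: one small relation per trend."""
    trends = defaultdict(dict)
    for c, d, m in rows:
        trends[c][d] = m
    return trends


def trendwise_scores(trends, p=2):
    """Join and score each pair of trends on its own (SUM OVER DIFF(p))."""
    scores = {}
    for ci, cj in combinations(sorted(trends), 2):
        ti, tj = trends[ci], trends[cj]
        shared = ti.keys() & tj.keys()  # per-pair join on the grouping value
        scores[(ci, cj)] = sum(abs(ti[d] - tj[d]) ** p for d in shared)
    return scores


# Stand-in for a group-by aggregate output: (product, week, AVG(revenue)).
# The values are made up for illustration.
rows = [
    ("XPS", 1, 10.0), ("XPS", 2, 12.0),
    ("Inspiron", 1, 13.0), ("Inspiron", 2, 8.0),
    ("G7", 1, 10.0), ("G7", 2, 11.0),
]
scores = trendwise_scores(partition_by_constraint(rows))
```

Because each pair is scored from two small per-trend relations, the pairs could be handed to separate workers; the sketch runs them sequentially and enumerates unordered pairs only, whereas the join predicate c != c in the basic plan produces both orderings.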
4.2      Merging and Partitioning Optimization                                 Figure 7 depicts the query plan after applying the above two op-
    To generate a more efficient plan, we adapt the sub-plan gener-         timizations on the basic query plan. First, we merge multiple group-
ated above using two steps. First, we merge group-by aggregates             by aggregates to share computations (using the approach discussed
into a fewer number of group-by aggregates. Next, we partition              below). Then, we partition the output of merged group-by aggre-
the output of group-by aggregates to create one relation for each           gates into smaller relations, one for each trend. This is followed
trend, which allows parallel join and comparison for scoring. We            by joining and scoring between each pair of trends independently
describe each of these optimizations and then present an algorithm          and in parallel. Observe that the merging of group-by aggregates
that jointly optimizes them to find an overall efficient plan.              results in multiple trends with overlapping (grouping, measure) in
Merging group-by aggregates. The first optimization shares the              the output relation. Hence, we apply the partitioning in two phases.
computations across a set of group-by aggregates, one for each              In the first phase, we partition it vertically, creating one relation
(grouping, measure), by merging them into fewer group-by aggre-             for each (grouping, measure). In the second phase, we partition
gates. Many (grouping, measure) often share a common group-                 horizontally, creating one relation for each trend.
ing column, e.g., [(day, AVG(revenue)), (day, AVG(profit))] or have         Joint Optimization of Merging and Partitioning. As depicted in
correlated grouping columns (e.g., [(day, AVG(revenue)), (month,            Figure 6b, the cost of partitioning increases with the increase in the
AVG(revenue))]) with a high degree of overlapping tuples across trends.     size of its input. The input size is proportional to the number of
To illustrate, we considered a set of 20 group-by aggregates in the         unique group-by values, which increases with the increase in the
flights [1] dataset, computing avg(ArrivalDelay), avg(DepDelay),            number of merging of group-by aggregates. Thus, when the input
..., avg(Duration) grouped by day, week, ..., airport. As depicted in       becomes large, the cost of partitioning dominates the gains due to
Figure 5a, by merging them (using an approach discussed shortly)            merging. It is therefore important to merge group-by aggregates
into 12 aggregates, the latency improves by 2×. Furthermore, we             such that the overall cost of computing group-by aggregates, parti-
can see that after a certain number of merges, the performance de-          tioning and trendwise comparison together is minimal.
grades. This is because the sharing of tuples decreases as we merge            In order to find the optimal merging and partitioning, we follow
additional grouping columns, which in turn results in the larger in-        a greedy approach as outlined in Algorithm 1. Our key idea is to
crease in the output size of group-by aggregates.                           merge at the granularity of sub-plans instead of the group-by ag-
    Finding the optimal merging of group-by aggregates is NP-Com-           gregates. We start with a set of sub-plans, one for each (grouping,
plete [7]. Prior work on optimizing GROUPING SETS compu-                    measure) as generated by the basic execution strategy discussed
tation [17] has proposed best-first greedy approaches that first merge      earlier and merge two sub-plans at a time, choosing the pair that leads to the maximum
those group-by aggregates that lead to the maximum decrease in              decrease in cost.
the cost. Unfortunately, in our setting, we also need to consider the          Formally, if the two sub-plans operate over (d1 , m1 ) and (d2 ,

m2) respectively, we merge them using the following steps (illus-               the two trends as we discuss below. Given the upper and lower
trated in Figure 6):                                                            bound scores of each pair, we find a pruning threshold T on the
                                                                                lowest possible top-k score, as the kth largest lower bound score
                                                                                across all pairs. Any pair with its upper bound score smaller than
                                                                                T can thus be pruned. 3. Finally, we perform a join between the
                                                                                tuples of trends that are not pruned.
(1) $R_1 \leftarrow \text{Group-by}_{c,\,d_1,\,d_2}\ \text{Agg}_{m_1,\,m_2}[R]$ // merge group-by aggregates
(2) $\forall (d_i, m_i)$: $R_i \leftarrow \Pi_{(d_i, m_i)}(R_1)$ // vertical partitioning
(3) $\forall i$: $R_{ij} \leftarrow$ Partition $R_i$ ON $c$ // horizontal partitioning, one partition for each value of $c$
(4) $\forall i, j$: $R_{i'j'} \leftarrow \text{Group-by}_{c_j,\,d_i}\ \text{Agg}_{m_i}[R_{ij}]$ // aggregate again
(5) $\forall i', j', k$: $R_{i'j'k} \leftarrow R_{i'j'} \bowtie_{d_i} R_{i'k}$ // partition-wise join
(6) $\forall i', j', k$: $R'_{i'j'k} \leftarrow \text{Agg}_{UDAF}(R_{i'j'k})$ // compute scores
(7) $R' \leftarrow \text{UnionAll}_{i',j',k}(R'_{i'j'k})$
                                                                                [Figure 7 diagram: partitions P1, P2, P3; "Bitmap + Summary Aggregates" per partition (step 1); bound intervals [28, 30], [15, 19], [26, 29] (step 2); joins over [28, 30] and [26, 29] (step 3)]
    We first merge group-by aggregates, followed by creating one
partition for each trend using both vertical and horizontal par-                       Figure 7: Illustrating pruning for DIFF-based comparisons
titioning. Then, we join pairs of trends and compute the score.                    We empirically show in Section 8 that while the pruning incurs
For computing the cost of the merged sub-plan, we use the opti-                 an overhead of first computing the summary aggregates and bitmap
mizer cost model. The cost is computed as a function of avail-                  for each candidate trend, the gains from skipping tuple comparisons
able database statistics (e.g., histograms, distinct value estimates),          for pruned trends offset the overhead. Moreover, the summary
which also captures the effects of the current physical design, e.g.,           aggregates of each trend can be computed independently in parallel.
indexes as well as the degree of parallelism (DOP). We merge two sub-           Computing Bounds. The simplest approach is to create a sin-
plans at a time until there is no improvement in cost.                          gle set of summary aggregates for each trend as depicted in Fig-
                                                                                ure 8b. The gray and yellow blocks depict the summary aggregates
                                                                                for two trends respectively, consisting of COUNT (computed using
Algorithm 1 Merge-Partition Algorithm
                                                                                bitmaps), SUM, MIN, and MAX in order.
1: Let B be a basic sub-plan computed from Φ as described in Section 4.1           First, for deriving the lower bound, we prove the following useful
2: while true do
3:     C ← OptimizerCost(B)
                                                                                property based on the convexity property of DIFF functions (see
4:     Let si ∈ S be a sub-plan in B consisting of a sequence of group-by       [2] for the proof).
    aggregate, join and partition operations over (di , mi )
5:      Let MP = set of all sub-plans obtained by merging a pair of sub-
                                                                                Theorem 1. ∀ DIFF(m1, m2, p),
    plans in S as described in Section 4.2                                      AVG(DIFF(m1, m2, p)) ≥ DIFF(AVG(m1), AVG(m2), p)
 6:     Let Bnew be the sub-plan in MP with the lowest cost (Cnew) after           This essentially allows us to apply DIFF on the average values
    merging two sub-plans si, sj                                                of each trend to get sufficiently tight lower bounds on scores. For
 7:     if Cnew > C then
                                                                                example, the two trends in Figure 8 have an exact score of 1717
 8:         break;
 9:     end if                                                                  (Figure 8a) and a lower bound of 1700 (Figure 8b).
10:      C ← Cnew                                                                  For the upper bound, it is easy to see that the largest gap
11:      B ← Bnew                                                               |m1 − m2| between any pair of tuples in R and S is given by
12: end while                                                                   MAX(|MAX(m1) − MIN(m2)|, |MAX(m2) − MIN(m1)|). Given that
13: Return B                                                                    DIFF(m1, m2, 2) is Non-negative and Monotonic, it is largest at this
                                                                                gap, so we can compute the upper bound on SUM by multiplying
                                                                                MAX(DIFF(m1, m2, 2)) by COUNT. For example, in Figure 8b the
5.    OPTIMIZING DIFF-BASED COMPARISON                                          largest gap is |30 − 10| = 20, which gives an upper bound of 16 × 20^2 = 6400.
                                                                                Multiple Piecewise Summaries. Given that the value of a measure
   While the approach discussed in the previous section works for               can vary over a wide range in each trend, using a single sum-
any arbitrary scorer, we note that for top-k comparative queries in-            mary aggregate often does not result in tight upper bound. Thus,
volving aggregated distance functions (defined in Section 2.2) such             to tighten the upper bounds, we create multiple summary aggre-
as Euclidean distance, we can substantially reduce the cost of com-             gates for each trend, by logically dividing each trend into a se-
parison between pairs of trends. We first outline the three properties          quence of l segments, where segment i represents tuples from in-
of the DIFF(.) function that we leverage for optimizations.                     dex (i − 1) × n/l + 1 to i × n/l, where n is the number of tuples
1. Non-negativity: DIFF(m1, m2, p) ≥ 0                                          in the trend. Instead of creating a single summary, we compute
                                                                                the same set of summary aggregates over each segment, called seg-
2. Monotonicity: DIFF(m1, m2, p) varies monotonically with the
                                                                                ment aggregates. For example, Figure 8c depicts two segment ag-
increase or decrease in |m1 − m2|.
                                                                                gregates for each trend, with each segment representing a range
3. Convexity: DIFF(m1, m2, p) is convex for all p.                              of 8 tuples. The bounds between a pair of matching segments are
                                                                                computed in the same way as we described above for a single sum-
5.1       Summarize → Bound → Prune                                             mary aggregate. Then, we sum over the bounds across all pairs
Overview. We introduce a new physical operator that minimizes                   of matching segments to get the overall bound (see [2] for formal
the number of trends that are compared using the following three                description). As we can see in Figure 8c, increasing the number of
steps (illustrated in Figure 8). 1. We summarize each trend inde-               segments from 1 to 2 improves the bounds from [1700, 6400] to
pendently using a set of three aggregates: SUM, MIN and MAX                     [1702, 4624], substantially reducing the upper bound.
and a bitmap corresponding to the grouping column. 2. Next, we                     To estimate the number of summary aggregates for each trend,
intersect the bitmaps between trends to compute the COUNT of                    we use Sturges formula, i.e., ⌊1 + log2(n)⌋ [42], which assumes
matching tuples between them. We use the COUNT along with                       the normal distribution of measure values for each trend. Because
three aggregates to compute the upper and lower bounds between                  of its low computation overhead and effectiveness in capturing the

(a) Exact score on comparing two trends
    Trend 1: 18 18 14 18 18 16 14 14 10 14 12 10 13 13 14 14
    Trend 2: 26 23 23 29 30 28 24 25 27 24 24 20 21 25 20 22
    Score = 1717
(b) Bounds on score using a single summary (COUNT, SUM, MIN, MAX)
    Trend 1: 16, 229, 10, 18
    Trend 2: 16, 394, 20, 30
    Bounds = [1700, 6400]
(c) Bounds on score using two-segment summaries (COUNT, SUM, MIN, MAX per segment)
    Trend 1: 8, 129, 13, 18 | 8, 100, 10, 14
    Trend 2: 8, 211, 23, 30 | 8, 183, 20, 27
    Bounds = [1702, 4624]

            Figure 8: Using summaries to bound scores. F = SUM OVER DIFF(2). Each value in (a) corresponds to a single tuple in a trend.
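
The bound computation described in Section 5.1 can be sketched in a few lines. The code below is a minimal illustration, assuming DIFF(m1, m2, 2) is the squared gap, that F is SUM, and that the COUNT of matching tuples is already known (in the paper it comes from intersecting bitmaps). The function names are hypothetical. It is fed the (COUNT, SUM, MIN, MAX) values printed in Figure 8b and 8c.

```python
def bounds(summary_a, summary_b, p=2):
    """Bound SUM OVER DIFF(p) for one segment from (COUNT, SUM, MIN, MAX)."""
    count, sum_a, min_a, max_a = summary_a
    _, sum_b, min_b, max_b = summary_b
    # Lower bound: DIFF applied to the two averages, scaled by COUNT (Theorem 1).
    lower = count * abs(sum_a / count - sum_b / count) ** p
    # Upper bound: largest possible gap between any two values, scaled by COUNT.
    max_gap = max(abs(max_a - min_b), abs(max_b - min_a))
    upper = count * max_gap ** p
    return lower, upper


def segmented_bounds(segments_a, segments_b, p=2):
    """Sum the per-segment bounds over matching segments."""
    pairs = [bounds(a, b, p) for a, b in zip(segments_a, segments_b)]
    return sum(lo for lo, _ in pairs), sum(hi for _, hi in pairs)


# Summary values as printed in Figure 8b (one segment) and 8c (two segments).
single = segmented_bounds([(16, 229, 10, 18)], [(16, 394, 20, 30)])
double = segmented_bounds(
    [(8, 129, 13, 18), (8, 100, 10, 14)],
    [(8, 211, 23, 30), (8, 183, 20, 27)],
)
```

Under these assumptions the upper bounds come out as 6400 and 4624, matching the figure. The lower bounds come out at roughly 1701.6 in both cases, close to the printed 1700 and 1702; the small difference is presumably rounding in the figure.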

distribution or trends of values, Sturges formula is widely used in           Algorithm 2 Pruning Algorithm for DIFF-based Comparison
the statistical packages for automatically segmenting or binning               1: Compute SegAgg and bitmaps for each candidate trend ci
data points into fewer groups. We empirically show in Section 8                2: for each pair of trends ci , cj do
(see Figure 11a) that ⌊1 + log2(n)⌋ segments per trend                         3:     Compute bounds on scores (Section 5.1)
results in sufficiently tight bounds for effective pruning, without            4:     Update PQS
                                                                               5: end for
adding significant computational overhead.                                     6: for each candidate trend ci do
                                                                               7:     If ((ci , cj ) upper bound < PQS .Top()) Continue;
5.2    Early Termination                                                       8:     Initialize (ci , cj ) TState and push to PQP
   For early termination of the top-k trends search among the un-              9: end for
                                                                              10: while size of PQP > k do
pruned trends, we assign a utility to each of the unpruned trends             11:      (ci, cj) = PQP.Top()
that indicates how likely it is to be in the top-k.                           12:      Compare a segment of ci with that of cj
   For estimating the utility of trends, we use the bounds computed           13:      Update bounds and PQS
using segment aggregates in the previous step. Specifically, for se-          14:      If ((ci , cj ) upper bound < PQS .Top()) Continue;
lecting top-k trends in descending order of their scores, a trend with        15:      Push (ci , cj ) to PQP
higher upper bound score has a higher utility. On the other hand,             16: end while
                                                                              17: Return Top k trend pairs from PQP
for top-k trends in ascending order of their scores, a trend with the
smallest lower bound has a higher utility. The processing of higher
utility trends leads to the faster improvement in the pruning thresh-
                                                                              fetch the trend with the highest upper bound score (line 11), and
old T, thereby minimizing wastage of tuple comparisons over low
                                                                              following the process outlined in Section 5.2, compare one of its
utility trends.
                                                                              segments with a segment from the reference trend (line 12). After
   Moreover, the utility of a trend can increase or decrease after
                                                                              processing a segment, we check if the current trend pair is pruned or
comparing a few tuples in a candidate trend. Hence, instead of
                                                                              if there is another trend pair with higher upper bound (lines 14–15).
processing the entire trend in one go, we process one segment of
                                                                              This process is continued until we are left with k pairs of trends.
a trend at a time, and then update the bounds to check (i) if the
                                                                              Finally, we output values of k pairs of trends with highest scores
trend can be pruned, or (ii) if there is another trend with better
                                                                              (line 17).
upper bounds that it should switch to. As we show in Section 8,
incrementally comparing high utility trends first leads to pruning            Memory Overhead. Given a relation of n tuples consisting of
of many trends without processing all of their tuples.                        p trends, Φp creates p × log(n/p) segment aggregates, each seg-
                                                                              ment aggregate consisting of 4 aggregate values. In addition, the
5.3    Putting It All Together                                                operator maintains a TState consisting of a fixed number of ele-
   In order to support the above optimizations, we implemented a              ments per trend, including the largest k lower bounds on scores.
new physical operator, Φp, that takes as input the partitions (one for        Thus, the overall space overhead is O(p × log(n/p) + 10 × p
each trend), and replaces the join and F in query plans discussed             + k) ≈ O(p × log(n/p)). For example, for an input relation consist-
in Section 4. It outputs a relation consisting of tuples that identify        ing of 100 million tuples with 100 trends and 3 attributes, of total
the top-k pairs of trends as discussed in Section 3. The algorithm            size about 2.5 GB, Φp incurs an overhead of approximately 64K
used by the operator makes use of four data structures: (1) SegAgg            floating point numbers (< 1 MB).
: An array where index i stores summary aggregates for segment
i. There is one SegAgg per trend. (2) TState : The state of a trend,
                                                                              6.    ADDITIONAL ALGEBRAIC RULES
consisting of the current upper and lower bounds on the score,                   The query optimizer in Microsoft SQL Server relies on alge-
as well as the next segment within the trend to be compared.                  braic equivalence rules for enumerating query plans to find the plan
There is one TState for each trend, and it is updated after comparing         with the least cost. When COMPARE occurs with other logical oper-
each segment. (3) PQP: a MAX priority queue that keeps track                  ators, we present five transformation rules (see Table 3) that reorder
of the trend pairs with the highest upper bound. It is updated after          Φ with other operators to generate more efficient plans.
comparing each segment. (4) PQS: a MIN priority queue that                    R1. Pushing Φ below join. Data warehouses often have a snowflake
keeps track of the trend pairs with the smallest lower bound. It is           or star schema, where the input to the COMPARE operation may involve
updated after comparing each segment.                                         a PK-FK join between fact and dimension tables. If one or more
    Algorithm 2 depicts the pseudo-code for a single-threaded im-             columns in Φ are the PK columns or have functional dependen-
plementation. We first compute the segment aggregates for trends              cies on the PK columns in the dimension tables, Φ can be pushed
(line 1). For each pair of trends, we compute the bounds on scores            down below the join on the fact table by replacing the dimension table
as discussed in Section 5.1, and update PQS to keep track of top              columns with the corresponding FK columns in the fact table (see
k lower bounds (lines 2–5). The upper bound for each trend pair               Rule R1 in Table 3). For instance, consider example 1a in Section
is compared with PQS.Top() to check if it can be pruned (line 7).             2.1 that finds a product with a similar average revenue over week
If not pruned, the TState is initialized and pushed to PQP (line              trend to ‘Asia’. Here, the revenue column would typically be in a fact
8). Once the TStates of all unpruned trends are pushed to PQP, we             table along with foreign key columns for region, product and year.
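
The flow of Algorithm 2 can be approximated with a short single-threaded sketch. It makes several simplifying assumptions that are not from the paper: there is one reference trend, all trends have the same length with aligned tuples (so no bitmaps are needed), the score is the sum of squared gaps with larger scores ranked first, a single heap ordered by upper bound stands in for PQP, and the threshold T is recomputed with `heapq.nlargest` rather than maintained in PQS. All names are hypothetical.

```python
import heapq


def seg_bounds(ref_seg, cand_seg):
    """Lower and upper bound on the sum of squared gaps for one segment."""
    n = len(ref_seg)
    lower = n * (sum(ref_seg) / n - sum(cand_seg) / n) ** 2
    gap = max(abs(max(ref_seg) - min(cand_seg)),
              abs(max(cand_seg) - min(ref_seg)))
    return lower, n * gap ** 2


def split(values, seg_len):
    return [values[i:i + seg_len] for i in range(0, len(values), seg_len)]


def top_k_pruned(reference, candidates, k, seg_len):
    """Return (top-k candidate names, number of segments compared exactly)."""
    ref_segs = split(reference, seg_len)
    cand_segs = {name: split(v, seg_len) for name, v in candidates.items()}
    # state[name] = [per-segment lower bounds, per-segment upper bounds, next]
    state = {}
    for name, segs in cand_segs.items():
        pairs = [seg_bounds(r, c) for r, c in zip(ref_segs, segs)]
        state[name] = [[lo for lo, _ in pairs], [hi for _, hi in pairs], 0]

    def threshold():
        # T: the k-th largest lower bound among candidates still in play.
        return heapq.nlargest(k, (sum(s[0]) for s in state.values()))[-1]

    # Initial pruning pass (Algorithm 2, lines 6-9).
    t = threshold()
    for name in [n for n, s in state.items() if sum(s[1]) < t]:
        del state[name]

    heap = [(-sum(s[1]), name) for name, s in state.items()]
    heapq.heapify(heap)
    compared = 0
    while heap and len(state) > k:
        _, name = heapq.heappop(heap)  # highest upper bound first
        lows, highs, nxt = state[name]
        if sum(highs) < threshold():
            del state[name]  # can no longer reach the top k
            continue
        if nxt == len(ref_segs):
            continue  # fully compared; its score is already exact
        # Replace the bounds of one segment with its exact contribution.
        exact = sum((r - c) ** 2
                    for r, c in zip(ref_segs[nxt], cand_segs[name][nxt]))
        lows[nxt] = highs[nxt] = exact
        state[name][2] = nxt + 1
        compared += 1
        heapq.heappush(heap, (-sum(highs), name))

    ranked = sorted(state, key=lambda name: -sum(state[name][0]))
    return ranked[:k], compared


reference = [10, 10, 10, 10]
easy = {"near": [11, 10, 9, 10], "mid": [14, 15, 14, 15], "far": [30, 31, 29, 30]}
hard = {"a": [10, 10, 30, 30], "b": [0, 40, 10, 10]}
easy_result = top_k_pruned(reference, easy, k=1, seg_len=2)
hard_result = top_k_pruned(reference, hard, k=1, seg_len=2)
```

In the first call the summaries alone separate the winner, so no segment is compared tuple by tuple. In the second call the bounds overlap, so segments of the most promising candidate are refined until the other candidate's upper bound falls below T.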

Table 3: Algebraic equivalence rules for Φ: COMPARE [P1 = α1 | P2 = α2][(M1, M2)] USING A OVER F
 Rule    Equivalence rule                                                 Pre-condition
 R1      Φ(R ⋈_T S) ≡ (Φ_k(R)) ⋈_T S                                      T: PK in S; T ⊆ {P1, P2, M1, M2} ⊆ R; Φ_k: Φ with PK columns replaced with FK columns
 R2      Υ_{G,A}(Φ(R)) ≡ Φ(Υ_{G,A}(R))                                    {P1, P2, M1, M2} ⊆ G and A ∈ {MAX, MIN}
 R3      σ_C(Φ(R)) ≡ Φ(σ_C(R))                                            C ⊆ {P1, P2}
 R4      Φ2(Φ1(R)) ≡ Φ1(Φ2(R))                                            Φ1: P1, P2 = Φ2: P1, P2
 R5      (LIMIT_K Υ(Π_{DIFF(R.X, L.X, p)}(σ_{P=α} R) ⋈ R)) ⋈ R ≡ Φ(R)     None

In such cases, we can push Φ below the join by replacing dimen-                 DIFF-based comparisons can be further optimized by adding a new
sion table columns (e.g., product, week) values with corresponding              physical operator that first computes the upper and lower bounds on
PK column values.                                                               the scores of each trend, which can then be used for pruning parti-
R2. Pushing Group-by Aggregate (Υ) below Φ to remove du-                        tions without performing a costly join.
plicates. When an aggregate operation occurs above a COMPARE                    Robustness to physical design changes. A large part of COM-
operation, in some cases we can push the aggregate operation be-                PARE execution involves operators such as group-by, joins and par-
low the COMPARE to reduce the size of each partition. In particular,            tition (see Figure 6). Hence, the effect of physical design changes
consider an aggregate operation ΥG,A with group-by attributes G                 on COMPARE is similar to their effect on these operators. For in-
and aggregate function A such that all columns used in Φ are in G.              stance, since column-stores tend to improve the performance of
Then, if all aggregation functions in Φ are in {MAX, MIN}, we can               group-by operations, they will likely improve the performance of
push Υ below Φ as per Rule R2 in Table 3. Pushing the aggregation               COMPARE. Similarly, if indexes are ordered on the columns used
operation below Φ reduces the size of each partition by removing                in constraints or grouping, the optimizer will pick merge join over
the duplicate values.                                                           hash-join for joining tuples from two trends. Finally, if there is a
R3. Predicate pushdown. A filter operation (σ) on partition col-                materialized view for a part of the COMPARE expression, modern
umn (e.g., product) can be pushed down below Φ, to reduce the                   day optimizers can match and replace the part of the sub-plan with
number of partitions to be compared. While predicate pushdown is                a scan over the materialized view. We empirically evaluate the im-
a standard optimization, we notice that optimizers are unable to ap-            pact of indexes on the COMPARE implementation in Section 8.
ply such optimizations when COMPARE is expressed via a com-                     Applications of COMPARE. COMPARE is meant to be used by
plex combination of operations as described in Section 1. Adding                data analysts as well as applications to issue comparative queries
an explicit logical COMPARE with a predicate pushdown rule makes                over large datasets stored in relational databases. It has two advan-
it easier for the optimizer to apply this optimization. Note that if σ          tages over regular SQL and middleware approaches (e.g., Zenvis-
involves any attribute other than the partitioning column, then we              age, Seedb). First, it allows succinct specification of comparative
cannot push it below Φ. This is because the number of tuples for                queries which can be invoked from data analytic tools supporting
partitions compared in Φ can vary depending on its location.                    SQL clients. Second, it helps avoid data movement and serializa-
R4. Commutativity. A single query can consist of a chain                        tion and deserialization overheads, and is thus more efficient and
of multiple COMPARE operations for performing comparison based                  scalable. We classify the applications into three categories:
on different metrics (e.g., comparing products first on revenue, and            BI Tools. BI applications such as Tableau and Power BI do not pro-
then on profit). When multiple Φ operations occur on the same parti-            vide an easy mechanism for analysts to compare visualizations.
tioning attribute, we can swap the order such that the more selective           However, for supporting complex analytics involving multiple joins
COMPARE operation is executed first.                                            and sub-queries, these tools support SQL querying interfaces. For
R5. Reducing comparative sub-plans to Φ. Finally, we extend                     comparative queries, users currently have to either write complex
the optimizer to check for an occurrence of the comparative sub-                SQL queries as discussed in the Introduction, or generate all possible
expression specified using existing relational operators to create an           visualizations and compare them manually. With COMPARE, users
alternative candidate plan by replacing the sub-expression with Φ.              can now succinctly express such queries (as illustrated in Section
In order to do so, we add the equivalence rule R5 where the expres-             3) for in-database comparison.
sion on the left side represents the sub-expression using existing              Notebooks. For large datasets stored in relational databases, it is
relational operators. This rule allows us to leverage physical op-              inefficient to pull the data into a notebook and use dataframe APIs
timizations for comparative queries expressed without using SQL                 for processing. Hence, analysts often use a SQL interface to access
extensions.                                                                     and manipulate data within databases. While one can also expose
                                                                                Python APIs for comparative queries and automatically translate
                                                                                them to SQL, such features are limited to the users of the Python
7.      DISCUSSION                                                              library. SQL extensions, on the other hand, can be invoked from
   We discuss the generalizability and robustness of our proposed               multiple applications and languages that support SQL clients. Fur-
optimizations as well as potential applications of COMPARE.                     thermore, in the same query, one can use COMPARE along with
Generalizability of optimizations. Our proposed optimizations in                other relational operators such as join and group-by that are fre-
Section 4 deal with replacing COMPARE with a sub-plan of logical                quently used in data analytics (see Section 3.2).
and physical operators within existing database engines. These op-              Visual analytic tools. Finally, there are visual analytic tools such as
timizations can be incorporated in other database engines support-              Zenvisage and Seedb that perform comparison between subsets
ing cost-based optimizations and addition of new transformation                 of data in middleware. With COMPARE, such tools can scale to
rules. Concretely, given a COMPARE expression, one can generate                 large datasets and decrease the latency of queries as we show in
a sub-plan using Algorithm 1 and transformation rules implement-                Section 8.
ing steps outlined in Section 4.1 and Section 4.2. Furthermore, we
discuss additional transformation rules (see Table 3) in Section 6
that optimize the query when COMPARE occurs along with other
logical operators such as join, group-by, and filter. We show that              8.    PERFORMANCE EVALUATION
