Hierarchical Bitmap Indexing for Range and Membership Queries on Multidimensional Arrays - arXiv

Hierarchical Bitmap Indexing for Range and Membership Queries on Multidimensional Arrays

Luboš Krčál, Czech Technical University in Prague, Czech Republic, lubos.krcal@fit.cvut.cz
Shen-Shyang Ho, Rowan University, Glassboro, NJ, USA, hos@rowan.edu
Jan Holub, Czech Technical University in Prague, Czech Republic, jan.holub@fit.cvut.cz

arXiv:2108.13735v1 [cs.DB] 31 Aug 2021

ABSTRACT
   Traditional indexing techniques commonly employed in database systems perform poorly on
multidimensional array scientific data. Bitmap indices are widely used in commercial databases
for processing complex queries, due to their effective use of bit-wise operations and their
space-efficiency. However, bitmap indices apply natively to relational or linearized datasets,
which is especially notable in binned or compressed indices.
   We propose a new method for multidimensional array indexing that overcomes the
dimensionality-induced inefficiencies. The hierarchical indexing method is based on
n-dimensional sparse trees for dimension partitioning, with a bounded number of individual,
adaptively binned indices for attribute partitioning. This indexing performs well on range
queries involving both dimensions and attributes, as it prunes the search space early, avoids
reading entire index data, and does at most a single index traversal. Moreover, the indexing is
easily extensible to membership queries.
   The indexing method was implemented on top of the state-of-the-art bitmap indexing library
Fastbit. We show that the hierarchical bitmap index outperforms conventional bitmap indexing
built on an auxiliary attribute for each dimension. Furthermore, the adaptive binning
significantly reduces the number of bins and therefore the memory requirements.

Categories and Subject Descriptors
H.2.8 [Information Systems]: Database Management - Database Applications, Scientific databases

General Terms

Keywords
bitmap indexing, multidimensional arrays, range queries, scientific datasets, Fastbit

1.   INTRODUCTION
   Research in many areas, such as geoscience or model simulations, produces large scientific
datasets, which are stored in multidimensional arrays of arbitrary size, dimensionality and
cardinality, such as QuikSCAT satellite data [16]. Efficient processing of such data is
challenging because of their multidimensional nature. However, most of the analysis techniques
apply to relational datasets or require a strict linearization of the data.
   To query multidimensional array data, one needs an effective system to index and
subsequently query the data. The majority of current systems rely on linearization of the array
data, i.e., mapping the data into one dimension, which enables many one-dimensional access
methods to be used. Others, such as array databases [27, 2], work natively with
multidimensional arrays.
   A popular and very effective method of indexing arbitrary data is bitmap indexing, where the
index consists of a set of bitmaps (bitvectors) with associated metadata. Bitmap indices
leverage hardware support for fast bit-wise operations (AND, OR, NOT, XOR), and are very
space-efficient, especially for low-cardinality attributes, although this restriction has been
partially overcome by sophisticated multi-level and multi-component indices. Bitmap indices are
used in the majority of commercial relational databases [9, 22, 23, 8].
   The major disadvantage of bitmap indices for multidimensional array data indexing is their
linear nature. Run-length compression, of which the best-known variant is WAH, only partially
mitigates the issue.
   Our major contribution is a new method of bitmap indexing for multidimensional arrays that
overcomes the dimensionality-induced inefficiencies. The method is based on n-dimensional
sparse trees for dimension partitioning, and on attribute partitioning using adaptively binned
indices. We demonstrate the performance on range queries involving both dimensions and
attributes. We also show the effectiveness of our hierarchical indexing method as it prunes the
search space early, avoids reading entire index data, and does at most a single index
traversal.
   The paper is organized as follows. In Section 2, we briefly describe previous work on bitmap
indexing, scientific applications and multidimensional arrays. In Section 3, we describe the
preliminaries to our work, including bitmap indexing, array data model and array queries. In
Section 4, we introduce our hierarchical bitmap array index, discuss its concepts, and explain
its construction. In Section 5, we describe the query evaluation process for mixed attribute
and dimension range queries. In Section 6, we demonstrate the
effectiveness on multiple queries and compare our index to other solutions. In Section 7, we
conclude with several notes on future research and development directions.

2.    RELATED WORK
   Traditional indexing methods like B-trees and hashing cannot effectively index multiple
attributes in a single index and have been replaced by multidimensional indexing methods, such
as R-trees [10], R*-trees [3], KD-trees, and n-dimensional trees (quadtrees, octrees, etc.)
[19, 20]. These methods are not very effective for high-dimensional arrays. Indexing algorithms
are covered in [21], though the majority of the focus is on traditional spatial data instead of
multidimensional arrays.
   The drawbacks of traditional indexing algorithms led to the introduction of bitmap indices
[6] and their applications for scientific data [25]. Bitmap indices are naturally based on
linear data, which makes them ideal for relational databases. Space-filling curves, such as the
Z-order curve and Hilbert curves [14, 13], were used for linearization and subsequent querying
of multidimensional data. Hilbert curves were used in [13], while Z-order curves were used in
[17], which is a system for querying spatial data (not arrays) using compressed hierarchical
bitmap indices. Hierarchically organized bitmap indices were also used for star queries on data
with hierarchically organized dimensions [7]. Bitmap indices have also been used for
approximating aggregations [29], contrast set mining [36], subgroup discovery [30], and
correlation analysis [28]. All of these use bitmap indices on auxiliary attributes made from
dimensions (see Section 3.3). Other works utilize bitmap indexing for spatial applications, but
do not model the data as multidimensional arrays [15, 24, 26].
   The boom of multidimensional, scientific array data gave birth to open-source
multidimensional array-based data management and analytics systems, namely RasDaMan [2] and
SciDB [27]. These databases work natively with multidimensional arrays, but lack some of the
effective query processing methods implemented in other databases. On the other hand, SciDB has
been established as a foundation for many multidimensional array processing tasks. Searchlight
[12] is a SciDB-based system for range queries with aggregation constraints, using constraint
programming on top of array synopses, which are lossy representations of small array chunks.

3.    PRELIMINARIES
   We first introduce the multidimensional array data model, then describe types of commonly
used queries on arrays, with some examples. Next, we introduce bitmap indexing on linear data,
binning types, encoding types and compression schemes.

3.1     Array Data Model
   An array A consists of cells with dimensions indexed by d1, ..., dn. Each cell is a tuple of
several attributes a1, ..., am. We assume the structure of the attributes is the same for all
cells in the array. The array is denoted as A<a1, ..., am>[d1, ..., dn]. For example, satellite
data may have latitude, longitude, altitude and time as dimensions, and precipitation,
temperature, wind speed, etc. as attributes.
   We form a query on arrays based on constraints. A dimension or attribute constraint is a
constraint on a dimension or an attribute in one of the following formats: a one-sided range
query, y ≤ 45; a two-sided range query, 23.4 ≤ y < 73.2; an equality query, y = 89; or a
membership query, y ∈ {2, 4, 6, 8, 10}, where y is either a dimension or an attribute of the
array. Figure 1 shows a query that has a two-sided constraint on an attribute a and a one-sided
constraint on dimension d2 on a 2-dimensional array, together with the (shaded) query outcome.
Note that an equality query is a special case of a membership query, and that all queries can
be rewritten to a set of range queries. Mixed queries are queries that pose constraints on at
least one dimension and one attribute.
   An example query on array SatelliteArray [latitude, longitude, altitude, time] may look like
this:
SELECT * FROM SatelliteArray WHERE 50.68 ≤ latitude ≤ 50.88 AND 14.37 ≤ longitude ≤ 14.57 AND 30.0 ≤ snowfall.
   The result would then be a possibly empty subarray of the same format as SatelliteArray.

              A[d1,d2]                                        A'[d1,d2]
        3 |  3  2  ~  ~        SELECT * FROM A          3 |  3  2  ~  ~
        2 |  4  2  1  5        WHERE 2 ≤ a ≤ 4          2 |  4  2  1  5
    d1  1 |  4  7  3  2        AND d2 ≤ 2;          d1  1 |  4  7  3  2
        0 |  ~  5  4  1                                 0 |  ~  5  4  1
             0  1  2  3                                      0  1  2  3
                 d2                                              d2
Figure 1: An example of a range query on a two-dimensional array.

3.2    Distributed Arrays
   Due to the large size of scientific data, it is often necessary to split the data into
subarrays called chunks.
   There are two commonly used strategies. The first is regularly gridded chunking, where all
chunks are of equal shape and do not overlap. This array data model is known in SciDB as MAC
(Multidimensional Array Clustering) [27]. It works well for coarse dimension-based queries, but
requires either additional indexes or filtering for fine dimension-based queries and for any
attribute-based queries. This array data model is the foundation (the lowest level) of our
hierarchical bitmap array index. The second strategy is irregularly gridded chunking, which is
one of the chunking options in RasDaMan [2].

3.3    Bitmap Indexing
   Bitmap indices, originally introduced in [6], were shown to be very effective for read-only
or append-only data, and are used in many relational databases and for scientific data
management [9, 22, 23, 8].
   Bitmaps can either be created for a single attribute value, called low-level bitmaps, or for
multiple values, called high-level bitmaps, where a bit is set to 1 for each cell of the array
whose indexed value falls in the value range of that bitmap.
   The structure of high-level bitmaps is determined by a binning strategy. For
high-cardinality attributes, binning is the essential minimum to keep the size of the index
reasonable [35, 34]. Binning effectively reduces the overall number of bitmaps required to
index the data, but increases the number of cells that have to be verified later. This
verification is called a candidate check. The two most common binning strategies are equi-width
binning, which divides the attribute domain into equal intervals, and equi-depth binning, which
divides
the attribute domain into intervals covering an equal (or near equal) number of cells.
Equi-width binning is highly prone to excessive candidate checks, especially on skewed data.

d1 d2 a  EBM   E[1] E[2] E[3] E[4] E[5] E[6] E[7]   R[1,1] R[1,2] R[1,3] R[1,4] R[1,5] R[1,6] R[1,7]   I[1,4] I[2,5] I[3,6] I[4,7]
0  0  3  0     0    0    1    0    0    0    0      0      0      1      1      1      1      1        1      1      1      0
0  1  2  0     0    1    0    0    0    0    0      0      1      1      1      1      1      1        1      1      0      0
0  2  ~  1     0    0    0    0    0    0    0      0      0      0      0      0      0      0        0      0      0      0
0  3  ~  1     0    0    0    0    0    0    0      0      0      0      0      0      0      0        0      0      0      0
1  0  4  0     0    0    0    1    0    0    0      0      0      0      1      1      1      1        1      1      1      1
1  1  2  0     0    1    0    0    0    0    0      0      1      1      1      1      1      1        1      1      0      0
1  2  1  0     1    0    0    0    0    0    0      1      1      1      1      1      1      1        1      0      0      0
1  3  5  0     0    0    0    0    1    0    0      0      0      0      0      1      1      1        0      1      1      1
2  0  4  0     0    0    0    1    0    0    0      0      0      0      1      1      1      1        1      1      1      1
2  1  7  0     0    0    0    0    0    0    1      0      0      0      0      0      0      1        0      0      0      1
2  2  3  0     0    0    1    0    0    0    0      0      0      1      1      1      1      1        1      1      1      0
2  3  2  0     0    1    0    0    0    0    0      0      1      1      1      1      1      1        1      1      0      0
3  0  ~  1     0    0    0    0    0    0    0      0      0      0      0      0      0      0        0      0      0      0
3  1  5  0     0    0    0    0    1    0    0      0      0      0      0      1      1      1        0      1      1      1
3  2  4  0     0    0    0    1    0    0    0      0      0      0      1      1      1      1        1      1      1      1
3  3  1  0     1    0    0    0    0    0    0      1      1      1      1      1      1      1        1      0      0      0
Figure 2: Bitmap index for attribute a of the array A from Figure 1: empty bitmask EBM,
equality encoded index E, range encoded index R and interval encoded index I.

   Another crucial aspect of bitmap indexing is encoding [6], which determines how a set of
bins, B, of the attribute domain is encoded in each bitmap and, in turn, in the bitmap index.
The simplest encoding, called equality encoding, encodes each bin with one bitmap for a total
of |B| bitmaps. Processing of equality queries reads a single bitmap, but processing of range
queries may have to read up to half of all the bitmaps. Range encoding uses |B| − 1 bitmaps;
each bitmap Ri encodes a range of bins [B1, Bi]. Processing a range query on a range encoded
bitmap index reads at most two bitmaps. Interval encoding [5] uses |B|/2 bitmaps; each bitmap
Ii is based on the range encoded bitmaps Ri ⊕ R_{i+|B|/2}. Interval encoding uses at most two
bitmaps to process range queries. Compared to range encoding, it uses only half the space.
Figure 2 shows an example of equality, range and interval bitmaps for the array in Figure 1.
   Bitmap indices, based on the number of bins, may take up to |B| · C, where C is the
cardinality of the indexed attribute, so that even a very small number of bins suffices to
exceed the size of the raw data. Binary run-length compression algorithms are usually applied
on bitmap indices

4.    HIERARCHICAL BITMAP ARRAY INDEX
   We now briefly discuss a common way of indexing multidimensional arrays using additional
bitmap indexes for each dimension. Then we describe the structure of our hierarchical bitmap
array index.
   Arrays A<a1, ..., am>[d1, ..., dn] are usually stored in a linearized representation, most
commonly the C-style row-major array representation. One index I_{di=k}(d1, ..., dn) is created
for each dimension di; it is set to 1 for the cells of array A where di is equal to a value k.
This allows filtering out results based on dimensions using binary AND.
   Note that the dimension index I_{di=k}(d1, ..., dn) does not necessarily have to use
equality encoding; based on the expected queries, we may choose a better combination of
binning, encoding and compression. This approach is used in [30, 36] with equi-depth binning,
in [29] with v-optimized binning based on v-optimal histograms [11], and with C-style row-major
linearization in [28].
   Unfortunately, a dimension bitmap index is not effectively compressible. Consider an example
of row-major ordering on a 5x5 array. The dimension index for column = 1 is then
01000 01000 01000 01000 01000, which cannot be effectively compressed using either BCC or WAH,
since the compression context of both is a single bit. This can be partially mitigated by
stretching dimensions to multiples of bytes or words, and extending the run-length compression
to use a byte or word as its compression context instead of single bits. Another option is to
use either Z-order or Hilbert space-filling curves to further increase the locality of the
dimensions. Neither, however, solves the problem entirely.

4.1     Partitioning of Arrays
   Non-partitioned data require much finer binning: the domain of each dimension is larger than
in the partitioned counterpart, so a higher number of bins is required. By partitioning the
array A<a1, ..., am>[d1, ..., dn] into a set of regularly gridded chunks C in the
Multidimensional Array Clustering fashion described in Section 3.2, such that:

       Ci[o1, o2, ..., on, e1, e2, ..., en] =
       A<a1, ..., am>[o1 ≤ d1 < e1, ..., on ≤ dn < en]
to reduce the overall size. However, another requirement is
posed to these compression algorithms, such that it must be                                                                             All chunks in our data model are of the same shape, i.e.,
possible to run bit-wise operations effectively on the com-                                                                           for all chunks Ci , Cj of array A, it holds that
pressed bitmaps. There are two representative compression                                                                                              Ci [ek ] − Ci [ok ] = Cj [ek ] − Cj [ok ]
algorithms, namely Byle-aligned Bitmap Code – BCC [1]
and Word-Aligned Hybrid (WAH) compression [32].                                                                                       for all dimensions k, and chunks are not overlapping and
   In order to facilitate effectively high cardinality attributes                                                                     completely cover the whole array A. In the chunk notation,
with space efficient indices and fast querying, two compos-                                                                           ok stands for offset and ek stands for end of the chunk along
ite methods were introduced. The first method is multi-                                                                               that dimension (exclusive boundary).
component, where the attribute value is decomposed into                                                                                  By chunking the array, we limit the domain of both at-
multiple components, which are then indexed independently.                                                                            tributes and dimensions in a given partition. In our adaptive
An example of multi-component index is a bit-sliced index                                                                             binning indices, we use the fact that the domain of the at-
[18], where each component corresponds to a bit of the value.                                                                         tribute varies based on the location.
Second composite method is called multi-level indexing [23],                                                                             The first problem arising from the equal size chunking
where the binning of the attribute becomes progressively                                                                              model is that within a single chunk, we are still required to
more precise with increasing levels.                                                                                                  use either indexing or at least aggregate information on the
   Thorough performance analysis of bitmap indexing, espe-                                                                            attributes, such as min and max for precise queries or his-
cially multi-level and multi-component both uncompressed                                                                              tograms for probabilistic queries, or data exploration. We
and compressed is presented in [33]. An open-source bitmap                                                                            choose to use bitmap indexing on both attributes and dimen-
indexing framework Fastbit [31] implements most of cur-                                                                               sions within the chunk. Note that the dimension indices are
rently existing indexing schemes, mainly two-level indices.                                                                           the same for all chunks in the array, since for each chunk,
we can simply subtract its offset from the dimensions query constraints.

The second problem lies in the overall structure of the chunks. There is no direct, high level index of the attributes for the chunks. It is necessary to scan through the synopsis of all the individual chunks, or generate a hierarchical synopsis. The latter has been utilized in [12] in the form of a graph generated over merging sub-arrays.

We propose a unified solution that solves both the problem with dimension attributes and with the synopsis of array chunks. Our solution is in the form of a hierarchical bitmap index on top of an n-dimensional tree (such as an octree for 3 dimensions) with variable binning for each node in the tree.

4.2 Structure of the Array Chunk Index

The index is built separately for each attribute of the array $A$. Let's fix an attribute $\alpha$. All the following functions refer to this attribute.

Each chunk $C(o_1, o_2, \ldots, o_n)$ of array $A\langle a_1, \ldots, a_m\rangle[d_1, \ldots, d_n]$ is associated with exactly one leaf $N_\ell(o_1, o_2, \ldots, o_n)$. Independently, each leaf uses an equi-depth binning index with a total of at most BINS bins, where the bin boundaries $bins(N_\ell)$ of the index are based on an exact histogram of the chunk values. Note that this assumes a uniform distribution of queries. If we had any prior knowledge of the queries based on the attribute, we would instead opt for a weighted histogram to construct the binning. The leaf's dimension boundaries correspond to its associated chunk's boundaries, clipped by the global shape of the array $A$.

Accounting for empty values (missing cells in $A$) is done using a special bitmask, known as the empty bitmask, for a total of BINS + 1 indices. Only leaves with at least $E \cdot$ BINS non-empty cells are indexed, where the constant $E$ is dependent on the data structure used for the leaf representation, i.e., we do not use bitmap indexing if listing the values is more space efficient.

Encoding of the leaf indices is left as a parameter to the user, as the bitmap indexing performance heavily depends on the cardinality of the array attribute, the desired number of bins, and the query types. For generality, we assume high cardinality attributes, such as integers and doubles, and a small number of bins, such as BINS ≤ 16.

Except for very narrow dimension range queries, a dimension query will either cover the whole span of a leaf node, or result in a one-sided dimension range query once the query processing reaches a single chunk. Thus, the ideal encodings for chunks are range and interval encodings [5]. Our default encoding is interval encoding since it uses half the memory range encoding does. Encoding of inner nodes is more complicated and we describe it in Section 4.5.

4.3 Structure and Construction of the Hierarchical Bitmap Array Index

To deal with the higher level index, we create a special composite index on a tree similar to an n-dimensional tree. Each internal node of the index has at most $F$ children, where $F$ is called a fanout. Note that, unlike in quadtrees, octrees or n-dimensional trees, $F$ is not necessarily $2^n$, where $n$ is the number of dimensions. Our bitmap indices are based on the fanout and we want to utilize binary operations as much as possible. For this reason, the fanout $F$ should be a multiple of the processor word size $W$, or as close to it as possible.

The overall internal node fanout $F$ can be expressed in terms of a fanout $F_{d_k}$ for a single dimension $k$ as

\[ F = \prod_{k=1}^{n} F_{d_k} \le \left( \max_{1 \le k \le n} F_{d_k} \right)^{n} \]

Assuming that the dimension fanout $F_{d_k}$ is the same for all dimensions, we can get

\[ F_{d_k} = \left\lfloor F^{\frac{1}{n}} \right\rfloor \]

As we will see in Section 5.2, in order to facilitate efficient dimension range queries, $F$ cannot be too large, since the size of the precomputed dimension clipping bitmaps depends on $F$.

The index tree construction works in a bottom-up fashion, where the leaf nodes are indexed first. This allows both data appending and modification (see Section 4.7). Each internal node is constructed from at most $F$ direct children and with at most BINS attribute bins, with one additional index for the empty bitmask. Each child node $N_i$ of internal node $N$ provides its attribute's $min(N_i)$ and $max(N_i)$ values. These values are used for the construction of the bitmap index of $N$.

Let $B = (min(N_1), max(N_1)), \ldots, (min(N_F), max(N_F))$ be the set of all intervals ranging from the minimum to the maximum value of the indexed attribute $\alpha$ among all the child nodes $N_i$. The set $B$ is the set of bins: the individual interval boundaries are delimiters, where the value $a$ of attribute $\alpha$ is in the attribute domain of different child nodes. Formally, let $nodesin(a) \subset N_i$ be a function of a value $a \in \alpha$ of attribute $\alpha$, which returns a subset of child nodes.

\[ N_i \in nodesin(a) \iff min(N_i) \le a \le max(N_i) \]

The set $nodesin(a)$ is used to construct the binning for the index of this internal node. We describe the encoding of this bitmap index in Section 4.5.

The index bins are aligned with the bins from $B$. This guarantees that no two indices for different bins will be identical, i.e., represent the same set of children. It also directly implies that adding more boundaries to $B$ would be pointless.

4.4 Bin Boundaries Merging in Parent Nodes

The number of bins from all $F$ child nodes is higher than BINS for the majority of the internal nodes $N$, therefore it is necessary to reduce the size of the set of bins, $B$. There are several strategies to choose $B \subset D$ such that $|B| =$ BINS. An example of such binning reduction is in Figure 3.

The first strategy is to use an equi-width distribution of the bins. This is the ideal choice assuming the attribute part of the query is uniformly distributed or when there is no prior knowledge about the attribute query, and assuming the data distribution is not skewed.

The second strategy is to use equi-depth binning. This is ideal if the attribute distribution of the child nodes is skewed. It is possible to maintain the weights of the bins for leaf nodes, since those have direct access to the data. However, internal nodes can only make estimates about the weight of merged bins. In each internal node and leaf, we store the weight estimate $w(b)$, where $b \in B$. The weighted square error of a bin $b$ is

\[ wse(b) = \left( w(b) - \frac{w(D)}{\mathrm{BINS}} \right)^{2} \]
and the weighted sum square error is

\[ wsse(B) = \sum_{b \in B} wse(b). \]

To estimate the weight of a merged bin $r \in R \subset B$, we assume a uniform distribution of values over the intervals of bins $b \in B$. Then the estimated weight of $r$ is

\[ w(r) = \sum_{b \in B} w(b) \cdot sizeof(b \cap r) \]

where $sizeof(b \cap r)$ is the size of the intersection of $r$ and $b$.

We cannot use the trivial algorithm for equi-depth binning, because we can only iterate by bins of variable weight, instead of iterating by single data points. This is why we need to approximate the equi-depth binning using a simple iterative algorithm. Details on selecting $R \subset B$ approximately equi-depth bins are shown in Algorithm 1. We first start with equi-width binning (line 1). Then, we generate sets of all possible bin splits and merges (lines 2-3), set up two priority queues and evaluate all possible splits and merges in terms of weighted sum square error (lines 4-11). After that, we perform one valid split and one merge on the binning as long as this leads to an improvement of the overall binning (lines 14-18). This preserves the total number of bins.

    Input: set of bins B, set of weights w(b), b ∈ B, number of output bins BINS
    Result: approx equi-depth bins R ⊂ B, |R| = BINS
     1  R ← eq-width bins from B, |R| = BINS;
     2  B_S ← all possible split bins of R;
     3  B_M ← all possible merged bins of R;
     4  Q_SPLIT ← priority queue();
     5  Q_MERGE ← priority queue();
     6  for s ∈ B_S do                          // bins to split
     7      add (s, Δwse(s)) to Q_SPLIT;
     8  end
     9  for (m, m′) ∈ B_M do                    // bins to merge
    10      add ((m, m′), Δwse((m, m′))) to Q_MERGE;
    11  end
        // split that decreases wsse the most
    12  (s, Δwse(s)) ← min(Q_SPLIT);
        // merge that increases wsse the least
    13  ((m, m′), Δwse((m, m′))) ← min(Q_MERGE);
    14  while Δwse((m, m′)) > Δwse(b) do
    15      split b;
    16      merge (b, b′);
    17      update R, B_S, B_M, Q_MERGE, Q_SPLIT;
    18  end

Algorithm 1: Iterative equi-depth binning approximation

In case a node has a low cardinality attribute throughout all the child nodes, we create bins mapped to single values of the attribute and their corresponding bitmaps.

Note that v-optimal binning does not work in our case, since we don't have the individual data values available during construction of the internal nodes, although we could approximate this using uniformly or normally distributed estimates within the bins of child nodes, or by propagating at least a basic data synopsis.

[Figure 3 (image not reproduced): child nodes N1 to N4 drawn as attribute ranges over the axis 1 to 8, with the original boundaries B and the approximate bins R for the attribute index; false positive attribute ranges are highlighted. Bitmap index of nodes that have started / ended: R+(−∞,1) = 0000, R+[1,3) = 0101, R+[3,+∞) = 1111; R−(−∞,6] = 1111, R−(6,8] = 0011, R−(8,+∞) = 0000.]

Figure 3: Example of merging |B| = 8 bin boundaries to |R| = 4 bin boundaries for 4 child nodes. False positive ranges are marked in red. Two sided range encoded bitmaps are generated for R.

4.5 Double Range Encoding of Bitmap Indices in Internal Nodes

Unlike in bitmap indexing in leaves, where one encodes positions of individual values, we encode sets of child nodes $nodesin(a)$ for attribute values $a$ in the internal nodes. Our binning $B$ has the property that for all attribute values $a_b, a'_b \in b \in B$ it holds that $nodesin(a_b) = nodesin(a'_b)$. Note that this does not hold for intervals $r \in R$ (see Figure 3 for an example).

We will now describe an effective bitmap encoding of $nodesin(a)$, $a \in r \in R$. Let's have two adjacent intervals $r \in R$ and $r' \in R$, such that $r_h = r'_\ell$. Note that since $R \subset B$, we have $nodesin(r) \ne nodesin(r')$. If $nodesin(r') \supset nodesin(r)$, then $r'$ corresponds to a bin where nodes are added, and we add $r'$ to a set $R^+$. Else, if $nodesin(r') \subset nodesin(r)$, then nodes are removed in set $nodesin(r')$, and we add $r'$ to set $R^-$. Otherwise, some nodes are added and some are removed and we add $r'$ to both $R^+$ and $R^-$. In our example in Figure 3, $R^+ = \{[1, 3), [3, 6)\}$ and $R^- = \{(3, 6], (6, 8]\}$.

There is no guarantee that $|R^+| = |R^-|$. If we wanted, we could run Algorithm 1 separately on boundaries $B^+$ and $B^-$ (likewise defined) and with $\frac{\mathrm{BINS}}{2}$ bins, but then we'd lose the equi-width approximation.

Now, we encode $|R^+| + 1$ bitmaps using range encoding, so that the index for bin $r^+ \in R^+$ corresponds to children whose attribute range minimum $min(N_i)$ is less than or equal to the upper boundary of interval $r^+$. In our example, the bitmap corresponding to $r = [1, 3) \in R^+$ is 0101, indicating that $N_1$ and $N_3$ have started in or before this interval. Similarly, we encode $|R^-| + 1$ bitmaps for values $r^-$ using inverse range encoding, i.e., children whose attribute range maximum $max(N_i)$ is greater than $r^-$ are encoded by 0 in the bitmap, representing children that have already ended before or in the interval $r^-$.

These two bitmaps easily allow evaluation of partial and complete matches (see Section 5.1) using only two bitmap reads and one logical operation for both partial and complete queries.

4.6 Locality of the Hierarchical Index

In order to preserve locality of the data during queries, we store the whole index in a locality preserving linearization of an n-dimensional tree. For each query, blocks of the index are loaded sequentially and sparsely, based on the parameters in the query. Thus, only one traversal, possibly incomplete, of the index data is needed. The index data consist of bin boundaries, weight estimates and bitmap indices.

We use space filling curves, namely the Z-order curve, to linearize the multidimensional array index. We choose not
to use recursive multi-level Z-order curves, as this would                 dimensions, in which case we fill all q with remaining di-
force the query processing to be based on pre-order traver-                mensions, to a complete query. Dimensions, that were not
sal of the index tree. We also choose not to use row major                 specified, are filled with (dj , min(dj ), max(dj )) triples. One-
ordering, since it has poor locality and it would slow down                sided range constraints are also extended in similar manner.
retrieving locations child nodes and partitions. Hilbert curve                The core of the query algorithm is a breadth-first descent
has perfect locality, but it does not preserve dimensions or-              through the index tree. At each level, the search space is
dering. This means we would need to precompute bitmaps                     pruned according to both dimension and attribute values.
for dimension constraints for each block of Hilbert curve                     Let N be the currently searched node, Ni be its child
separately. Z-order curve allows for fast child and parent                 nodes, where 0 ≤ i < F ; multidimensional range DN be the
node index computations, preserves dimensionality between                  set of dimension boundaries in the format [DN [d]` , DN [d]h ],
different level and has a good locality.                                   where d is dimension, ` designates lower bound, h upper
   The order Z` of the Z-order curve of level ` is determined              bound, associated with node N .
by the maximal fanout Fmax = max1≤k≤n Fdk , where Fdk                         Throughout the query processing, we maintain a queue of
is a fanout of dimension k.                                                partially matched nodes P and a set of completely matched
                                                                           nodes C. We start at a root node Nr , setting P = {Nr },
                       Z` = ` · dlog2 Fmax e                               assuming that both: node N ’s boundaries and query dimen-
Assuming Fdk is the same for all dimensions, the order of                  sions are not disjoint:             DN ∩ QD          6=    ∅ and
Z-order curve is then                                                      (min(N ), max(N )) ∩ QA 6= ∅, otherwise node N ∈          / P and
                           l    j 1 km                                     N∈  / C.
                   Z` = ` · log2 F n                                          Let p, p0 , p∗ and c, c0 , c∗ be zero bitmaps of size F ;
                                                                           the bitmaps p indicates partial attribute matches among the
and such a Z-order curve has length of (Z` )n .                            children of node N , p0 indicated partial dimensions matches,
   Several of the higher levels are stored in a dense vector, as           p∗ indicates partial matches, similarly the vectors c, c0 , c∗
specified by a user parameter. These vectors are expected                  indicate complete matches. We will now set these vectors
to be densely filled. The remaining levels are stored as non-              according to the query Q for the first node in queue P . The
overlapping intervals on a Z-order dimension (1D) in con-                  partial and complete matches bitmap computation is also
tinuous blocks, indexed by a binary search tree. This is a                 described in Algorithm 2 and in Figure 4.
compromise between sparse single node map and full vec-
tor used for higher levels. Note that the blocks may not be                    Input: query q = {(a` , ah ), (d1 , d` , dh ), . . .} with DIMS
sequential in memory, but at most a single transition is guar-                 dimension constraints; node N ; node children
anteed, i.e., no blocks are read twice during the processing                   N1 , . . . , NF ; boundaries [DN [d]` , DN [n]h ] for N and
of a single query.                                                             all Ni and dimensions d;
                                                                               Result: partial matches p∗; complete matches c∗;
4.7     Appending and Modifying Data
                                                                           1  PN,S , CN,S ← load index for node N ;
  Scientific data is often considered either fixed or append               2  PS,d0       0
                                                                                     , CS,d   ; // precomputed;
only, our indexing approach allows for both appending and                                   F     0         F
                                                                            3 p ← {0} , p ← {0} , p∗;
data modification, although the latter is not convenient.                                  F     0          F
                                                                            4 c ← {1} , c ← {1} , c∗;
  To append data along any dimension, we apply the same
                                                                            5 if ah < min(N ) or a` > max(N ) then
bottom-up procedure to update the index. It is necessary to
update the dimension bounds of internal nodes (that were                    6       return p∗ ← {0}F , c∗ ← {0}F
possibly previously clipped by the global shape of the array) and bitmap indices (to include the new child nodes). Note that we do not have to update the weight estimates and bin boundaries (except min and max) in order to assure index correctness. However, in order to assure the equi-depth optimal binning, we need to run the bin merge algorithm again on affected nodes.

     7 c = c & C_{N,S}(a_ℓ, a_h);
     8 p = p | P_{N,S}(a_ℓ, a_h) & ∼c;
     9 for dimensions d, 1 ≤ d ≤ DIMS do
    10     if d_h < D_{N_i}[d]_ℓ or d_ℓ > D_{N_i}[d]_h then
    11         return p∗ ← {0}^F, c∗ ← {0}^F
    12     if d_ℓ > D_N[d]_ℓ then
    13         p' = p' | P'_{S,d}(d_ℓ);
    14     if d_h < D_N[d]_h then
    15         p' = p' | P'_{S,d}(d_h);
    16     c' = c' & C'_{S,d}(d_ℓ, d_h);
    17 end
    18 p' ← p' & c';
    19 c' ← c' & ∼p';
    20 c∗ ← c & c';
    21 p∗ ← (p | c) & (p' | c') & ∼c∗;
    22 return p∗, c∗
  Algorithm 2: Evaluation of partial and complete match bitmaps for a single node.

5.    QUERYING DIMENSIONS AND ATTRIBUTES

In this work, we focus on selection queries over dimensions and attributes of an array. Such a query consists of a set of dimension constraints and attribute constraints. Let us specify a query $q$ over an array $A\langle a_1, \ldots, a_m \rangle[d_1, \ldots, d_n]$ as a set of ranges over dimensions $q_D$ and attributes $q_A$.

$$q = q_A \cup q_D = \{(a, a_\ell, a_h)\} \cup \{(d_j, j_\ell, j_h), \ldots\}$$

where $(a, a_\ell, a_h)$ is a triple specifying an attribute constraint: the attribute, its lower bound and its (exclusive) upper bound; the same goes for dimensions. In this work, we focus on single-attribute queries. Therefore, we simplify $q_A$ to $(a_\ell, a_h)$. It is possible for a query to not specify constraints for some

5.1    Attribute based Matches

In this subsection, we explain how the attribute bitmask is set. This subsection further describes lines 5–8 in Algorithm 2.
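As a minimal sketch of the query form defined at the start of Section 5, the snippet below represents a single-attribute query together with optional per-dimension ranges. The names `RangeQuery`, `attr`, `dims` and `prunes_node` are hypothetical and chosen only for illustration. Dimensions without a constraint are simply absent from the mapping, and the helper encodes the node-level test that a node whose attribute range lies entirely outside the query range yields no candidates.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class RangeQuery:
    """Hypothetical container for q = q_A ∪ q_D with a single attribute."""
    attr: Tuple[float, float]                                   # (a_lo, a_hi)
    dims: Dict[int, Tuple[float, float]] = field(default_factory=dict)

    def prunes_node(self, node_min: float, node_max: float) -> bool:
        """True when the node's [min, max] cannot intersect the attribute range."""
        a_lo, a_hi = self.attr
        return a_hi < node_min or a_lo > node_max


# Attribute range 2..4, a lower-bounded constraint on dimension 0 only.
q = RangeQuery(attr=(2.0, 4.0), dims={0: (1.3, float("inf"))})
```

Leaving unconstrained dimensions out of `dims` keeps the per-dimension loop of a node evaluation proportional to the number of constraints actually present.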
If $a_h < \min(N)$ or $a_\ell > \max(N)$, there are neither partial nor complete attribute matches and we terminate processing the current node.

Let $P_{N,S}(a_\ell, a_h)$ be a partial attribute match bitmask specific to node $N$ for an array of shape $S$, with bits set to one corresponding to children $N_i$ so that the intersection $[a_\ell, a_h] \cap [\min(N_i), \max(N_i)] \neq \emptyset$.

$$P_{N,S}(a_\ell, a_h)[i] = 1 \iff P_{B|N,S}(a_h)[i] \wedge \neg P_{E|N,S}(a_\ell)[i]$$
$$P_{B|N,S}(a)[i] = 1 \iff \min(N_i) \leq a$$
$$P_{E|N,S}(a)[i] = 1 \iff \max(N_i) \geq a$$

The second expression describes a bitmap set to 1 for children that have started before or at value $a$; the third one describes children that have ended at or after $a$. The first expression then combines both.

To evaluate $P_{N,S}(a_\ell, a_h)$, we first use binary search on $R^+$ and $R^-$ to find two bins $L \in R^+$ and $H \in R^-$ such that $a_\ell \in L$ and $a_h \in H$. These bins $L$ and $H$ mark the attribute boundary bins. Then, $P_{B|N,S}(a_h)$ is identical to $R^+[H]$ and $\neg P_{E|N,S}(a)$ is identical to $R^-[L]$, where $R^+$ and $R^-$ are the bitmap indices described in Section 4.3, each queried for a single bin. Then we add $P_{N,S}(a_\ell, a_h)$ to $p$ using bitwise OR.

Now, we process complete candidates in a similar fashion. Let $C_{N,S}(a_\ell, a_h)$ be a complete attribute match bitmask specific to node $N$ for an array of shape $S$, so that the intersection $[a_\ell, a_h] \cap [\min(N_i), \max(N_i)] = [a_\ell, a_h]$.

$$C_{N,S}(a_\ell, a_h)[i] = 1 \iff P_{B|N,S}(a_\ell)[i] \wedge \neg P_{E|N,S}(a_h)[i]$$

This expression is very similar to $P_{N,S}(a_\ell, a_h)$, describing children that have started at or before $a_\ell$ and have not ended at or before $a_h$. To evaluate $C_{N,S}(a_\ell, a_h)$, we query $R^+[L]$ and $R^-[H]$. Then, we add the result to $c$ using bitwise OR and remove those from $p$, i.e., $p = p \wedge \neg c$.

Note that both partial and complete attribute candidates use a total of 4 index queries. An example of an attribute query is displayed in the bottom row in Figure 4.

5.2    Dimension based Matches

Next, we explain how the dimension masks are set. This subsection further describes lines 9–17 in Algorithm 2.

If for any dimension $d$ it holds that $d_h < D_{N_i}[d]_\ell$ or $d_\ell > D_{N_i}[d]_h$, there are neither partial nor complete dimension matches and we terminate processing the current node.

Unlike the attribute query, the evaluation of the dimension query is the same for all nodes $N$, so all the bitmaps for processing dimension queries are precomputed.

Let $P'_{S,d}(d_\ell, d_h)$ be a partial dimension match, where $d$ is a dimension in the query constraint $(d, d_\ell, d_h)$, for an array of shape $S$, indicating child nodes $N_i$ such that the intersection $[D_{N_i}[d]_\ell, D_{N_i}[d]_h] \cap [d_\ell, d_h] \neq \emptyset$.

Let us fix a dimension $d$ for which we evaluate partial matches $P'_{S,d}(d_\ell, d_h)$:

$$P'_{S,d}(d_\ell)[i] = 1 \iff d_\ell \in D_{N_i}[d] \wedge d_\ell \neq D_{N_i}[d]_\ell$$
$$P'_{S,d}(d_h)[i] = 1 \iff d_h \in D_{N_i}[d] \wedge d_h \neq D_{N_i}[d]_h$$
$$P'_{S,d}(d_\ell, d_h)[i] = 1 \iff P'_{S,d}(d_\ell)[i] \vee P'_{S,d}(d_h)[i]$$
$$P'_{S}(d_\ell, d_h)[i] = \bigcup_{1 \leq d \leq DIMS} P'_{S,d}[i]$$

The first expression describes which children $N_i$ have a dimension $d$ range such that the query limit $d_\ell$ falls inside the range, but is not equal to the lower limit of that range. The second expression is similar, but for $d_h$. The third and fourth expressions combine the partial matches over both query limits and all dimensions. Note that this results in excessive partial candidates, since all child nodes that intersect the query constraints along at least one dimension qualify as partial candidates.

Partial dimension matches are evaluated using one precomputed bitmap index corresponding to

$$P'_{S,d}(b)[i] = 1 \iff b = D_{N_i}[d]$$

where $b$ is a bucket corresponding to the chunking of the array $A$. There are a total of $F_d$ such buckets along dimension $d$, resulting in a total of $F_d \cdot d$ bitmaps of size $F$. We query these bitmaps for all dimensions and combine them using OR into $p'$.

There is a special case of a false negative dimension result. If $d_\ell$ or $d_h$ is equal to the $d$'th dimension range border of a child node $N_i$, and at the same time the other end of $d_\ell$ or $d_h$ causes the dimension to be fully covered in $N_i$, i.e. $d_\ell = D_{N_i}[d]_\ell$ and $d_h \geq D_{N_i}[d]_h$, or $d_h = D_{N_i}[d]_h$ and $d_\ell \leq D_{N_i}[d]_\ell$, the query is evaluated as a partial match for $N_i$ and dimension $d$, while in fact dimension $d$ contributes to complete matches. A check for this scenario requires comparing the dimension ranges of child nodes to the query range, and was ignored on purpose, as it complicates and slows down the query process.

For complete candidates, we will slightly modify the definition of $C$ used for attributes. Let $C'_{S,d}(d_\ell, d_h)$ be a complete dimension match for an array of shape $S$, indicating which child nodes $N_i$ are partially or fully covered by the interval $[d_\ell, d_h]$. Although the semantics indicate that partial matches should not be included, we later trim the complete dimension match bitmap accordingly.

$$C'_{S,d}(d_\ell, d_h)[i] = 1 \iff [d_\ell, d_h] \cap D_{N_i}[d] \neq \emptyset$$
$$C'_{S}(d_\ell, d_h)[i] = \bigcap_{1 \leq d \leq DIMS} C'_{S,d}[i]$$

Complete dimension matches are evaluated using two precomputed bitmap indices corresponding to

$$C'_{B|S,d}(b)[i] = 1 \iff b \leq D_{N_i}[d]$$
$$C'_{E|S,d}(b)[i] = 1 \iff b \geq D_{N_i}[d]$$

similarly to the bitmaps used for partial matches. There is a total of $2 \cdot F_d \cdot d$ bitmaps of size $F$ for complete matches. We query these bitmaps for all dimensions and combine them using AND into $c'$.

We now combine the dimension match bitmaps $c'$ and $p'$, such that $p' = p' \wedge c'$. Then, we clip the complete dimension bitmap by the partial bitmap as $c' = c' \wedge \neg p'$. During the evaluation of dimension matches, we used a total of $3 \cdot d$ index queries. An example of a dimension query is displayed in the top row in Figure 4.

5.3    Partial and Complete Matches

Now that we have both attribute and dimension, and both partial and complete candidates, we may proceed to merging the candidates and generating a bitmap representing the set of result node children $C^*_{N,S}$ and a bitmap representing the set of potential node children $P^*_{N,S}$ that will be recursively explored. This subsection further describes lines 18–22 in Algorithm 2.
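Lines 18–22 of Algorithm 2 are plain bitwise operations, so they can be sketched compactly. The snippet below is an illustration only: it assumes each bitmap is held in a Python `int` whose bit `i` stands for child `N_i`, and the function name `merge_candidates` and the parameter `fanout` (the number of children, F) are hypothetical. Python integers are unbounded, so every NOT is masked to F bits.

```python
def merge_candidates(p, c, p_dim, c_dim, fanout):
    """Merge attribute bitmaps (p, c) with dimension bitmaps (p_dim, c_dim).

    Returns (p_star, c_star): children to explore further and children
    that are complete matches, following lines 18-22 of Algorithm 2.
    """
    full = (1 << fanout) - 1              # F one-bits, used to bound NOT
    p_dim &= c_dim                        # line 18: p' <- p' & c'
    c_dim &= ~p_dim & full                # line 19: c' <- c' & ~p'
    c_star = c & c_dim                    # line 20: c* <- c & c'
    p_star = (p | c) & (p_dim | c_dim) & ~c_star & full   # line 21
    return p_star, c_star                 # line 22


# Four children (bit 0 = child 0). Child 0 is complete for both the
# attribute and the dimensions, child 1 is partial for both, child 2
# matches the attribute only, child 3 matches the dimensions only.
p_star, c_star = merge_candidates(0b0110, 0b0001, 0b1010, 0b1011, 4)
print(bin(p_star), bin(c_star))  # 0b10 0b1
```

In this example only child 0 is reported as a result and only child 1 is queued for further exploration; children 2 and 3 are dropped because they satisfy just one of the two kinds of constraint.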
The $C^*_{N,S}$ bitmap is easier to obtain, as it is the intersection of both complete bitmaps without the partial candidate bitmaps.

$$C^*_{N,S} = C_{N,S} \wedge C'_{S}$$

We obtain the set of partial candidates $P^*_{N,S}$ by joining the dimension-based partial candidates with the attribute-based candidates and clipping both by the complete candidates:

$$P^*_{N,S} = (P_{N,S} \vee C_{N,S}) \wedge (P'_{S} \vee C'_{S}) \wedge \neg C^*_{N,S}$$

We then iterate through the results, adding child nodes from $C^*_{N,S}$ to the result set $C$ and the partial candidates $P^*_{N,S}$ into the queue $P$ to be processed subsequently. This process is done on top of Z-order indices, as it is trivial to generate Z-order indices corresponding to nodes in the lower levels. The Z-order ordering of the inner nodes and breadth-first traversal also ensure a single traversal through the index.

[Figure 4 panels. Example query: SELECT * FROM A WHERE 2 ≤ a ≤ 4 AND 1.3 ≤ d1 AND d2 ≤ 2.5; Every panel shows the same 4 × 4 grid of child nodes labelled with their attribute ranges (~ marks an empty node):

    3-5  2-3   ~    ~
    4-5  2-4  5-7  1-3
    4-6  7-8  2-2  3-5
     ~   5-6  4-6  1-1

Top row: partial dimension matches P'; complete dimension matches C'; P' = P' AND C' and C' = C' AND NOT P'. Bottom row: partial attribute matches P; complete attribute matches C; P = P AND NOT C; node query output, with C AND C' and (P OR C) AND (P' OR C') AND NOT (C AND C').]

Figure 4: Processing of a query in a single node of the hierarchical index. Top row represents dimension constraints, bottom row represents attribute constraints. Bottom right is the final product. Blue nodes represent partial matches and green nodes represent complete matches.

Running the algorithm for multiple queries or multiple attribute constraints in a single query can be implemented using iteration through the constraints in the worst case.

5.4    Estimating Cardinality of Results; Membership Queries

It is fairly straightforward to output estimates on the minimal and maximal number of matching cells by iterating some bounded number of levels of the index. The minimal number outputs the size of nodes in $C$, while the maximum outputs the size of nodes in $C \cup P$. Using the $w(b)$ estimate, we may also provide estimates on aggregates over the attribute, based on bin-wise linear approximation.

There is a simple modification of the algorithm for membership queries (see Section 3.1 for details about membership queries). On top of the two sided range indices $P_{N,S}$ and $C_{N,S}$ for attribute queries, we keep equality indices and iterate through the attribute constraint. For dimension membership queries, we precompute an index for all dimension values (within a single chunk), as opposed to the buckets corresponding to child nodes that are used in $P'_{S,d}$ and $C'_{S,d}$.

6.    EXPERIMENTAL EVALUATION

We have tested our implementation against several other solutions, none of which is specifically tailored to mixed attribute and dimension range queries, but those are the only readily available solutions involving bitmap indices and capable of executing range queries.

We measured the time and space efficiencies for each individual query, i.e. total query execution time, and space requirements for the index. Timing was measured as an average of 3 runs with data preloaded into memory. For Fastbit queries, we use their internal wall time measuring systems, meaning certain pre and post processing steps, such as query string parsing, are not included in the time measurements. Space requirements were measured based on the disk space required to store the bitmap index together with all relevant metadata.

The experiments were run on a single physical machine: Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz, 16 GB RAM, 1TB 7.2K RPM SATA 6Gbps; running Ubuntu 14.04.1 (3.19.0-32 kernel).

We use a synthetic dataset to test our queries on: a randomly generated multidimensional sum of Gaussian distributions, SumGauss. Its only attribute $a_G$ is a sum of $G$ randomly initialized Gaussian distributions in $D$ dimensions:

$$a_G(\vec{d}) = \sum_{i=1}^{G} \frac{1}{\sqrt{(2\pi)^D |\Sigma_i|}} \exp\left(-\frac{(\vec{d} - \mu_i)^T \Sigma_i^{-1} (\vec{d} - \mu_i)}{2}\right)$$

where $\mu_i$ and $\Sigma_i$ are the randomly generated mean vector and a bounded symmetric positive definite covariance matrix of the $i$-th Gaussian. For sparse arrays, a threshold for the Gaussian functions is used. The attribute is treated as empty if the value is below this threshold. Only partitions with at least one non-empty value are generated.

6.1    Fastbit Integration

Fastbit [31] is an open source library that implements bitmap indexing. It is not a complete database management system, but rather a data processing tool, as its main purpose is to facilitate selection queries and estimates. Fastbit's key technological features are WAH bitmap compression and multi-component and multi-level indices with many different combinations of encoding and binning schemes.

We use Fastbit's partitions to set up the lowest level of our indices (leaves), and base our binning indices on Fastbit's single-level binning index. This approach requires preprocessing of the data into evenly shaped partitions, generating empty bitmasks and shape metadata. Once a table is preprocessed into even partitions, it is indexed as described in Section 4. The index generation processes one partition at a time, and once processed, the partition is never accessed again during the index generation.

6.2    Bitmap Indexing Methods

BoxClip represents a naive algorithm using 32 equi-depth binned indices, interval encoding and WAH compression. The result bitmask from the attribute query is transformed to a set of "line" hyperrectangles (the size of the hyperrectangle in all but one dimension is 1), which are filtered by the dimension query, then merged into a set of result hyperrectangles. All the steps except filtering are built on top of Fastbit's mesh query. The filtering is implemented using a recursive sweeping line algorithm.
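To make the "line" hyperrectangles concrete, the following sketch turns a result bitmask into runs along the last dimension. It is an illustration under stated assumptions, not Fastbit's mesh query: the mask is assumed to be a flat sequence of 0/1 values in row-major order, and the name `line_boxes` and its output format `(prefix, start, stop)` are hypothetical.

```python
from itertools import product


def line_boxes(mask, shape):
    """Return maximal runs of set bits along the last dimension.

    mask  -- flat sequence of 0/1 in row-major order, len == prod(shape)
    shape -- array shape, e.g. (rows, cols)
    Each run is (prefix, start, stop): fixed leading coordinates plus a
    half-open interval [start, stop) on the last dimension.
    """
    *outer, last = shape
    boxes = []
    # Enumerate every "row": one combination of the leading coordinates.
    for row_no, prefix in enumerate(product(*(range(n) for n in outer))):
        base = row_no * last
        start = None
        for j in range(last):
            if mask[base + j]:
                if start is None:
                    start = j              # a new run begins
            elif start is not None:
                boxes.append((prefix, start, j))
                start = None               # the run ended at j
        if start is not None:              # run reaches the end of the row
            boxes.append((prefix, start, last))
    return boxes


mask = [0, 1, 1, 0,
        1, 1, 1, 1]
print(line_boxes(mask, (2, 4)))  # [((0,), 1, 3), ((1,), 0, 4)]
```

Each run has extent 1 in every dimension except the last, which matches the description above; clipping such runs against the dimension ranges and merging adjacent ones would be the subsequent steps.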
DimsAtts uses indexed uint auxiliary attributes made from dimensions (see Section 4). The dimension query is preprocessed into attributes, then run as a multi constraint query in Fastbit. The configuration is the same as in BoxClip, using 32 binned indices, range encoding and WAH compression on all attributes.

ArrayBit represents our hierarchical multidimensional index. We use 16 equi-depth binned indices, range encoding and WAH compression to index the partitions, and 16 approximately equi-depth binned indices (described in Section 4.4) with two sided range encoding and no compression for the hierarchical index. Note that compared to BoxClip and DimsAtts, we only use half of the bins in the partition index. It is sufficient in our algorithm, because the bin boundaries are adapted to the actual data in each partition, and because we need to store the bin boundaries within the partitions.

6.3    Range Queries

In our work, we focus on mixed attribute and dimension queries. Regardless of the dataset, we categorize the queries based on the ratio of the size of the query result to the total size of the array.

[Figure 5 plots: query time [s] (left) and index space [MB] (right) against array size (8MB, 128MB, 1GB) for BoxClip, DimsAtts and ArrayBit.]

Figure 5: Query execution time and disk space required to store the indices for different array sizes.

Figure 5 shows the time required to return all results. The index file is preloaded into memory prior to the test for all the systems used. We used a 2D array for this experiment and a query with ≈ 10% hit ratio. Both BoxClip and DimsAtts run slower than ArrayBit. In the case of BoxClip, the reason is that all the attribute query results had to be processed, while for DimsAtts the reason is that the attribute made from the second dimension did not compress effectively. In terms of space requirements, all of the algorithms store the attribute index. ArrayBit uses fewer bins in the leaves, but stores bin boundaries for all leaves and internal nodes, plus bitmaps for internal nodes, effectively taking up the same space as BoxClip. On the other hand, DimsAtts stores indices for all dimension attributes. Row major ordering is used in this measurement.

Figure 6 demonstrates the dependency of the query processing time on the hit ratio of the query, i.e., the ratio of selected cells vs total cells in the array. The BoxClip algorithm does not prune the search space based on the dimensions, resulting in a number of hits dependent on the attribute only. Filtering these is time intensive. DimsAtts depends linearly on the total number of dimensions. This is because there is an additional attribute for each dimension. There is also a small dependency on the hit ratio, where the increase is due to the results retrieval. ArrayBit achieves very good results for low or high hit rate queries. This is due to a large number of complete matches, and due to fast pruning of the search space. For medium hit rate queries, the algorithm has a relatively high number of candidate nodes to explore, but still manages to prune the search space faster.

6.4    Parameterization

We also experimented with different setups of our hierarchical index. The major objectives remain the same: query execution time and space requirements of the index.

First, the partition size determines the ratio of partition index vs hierarchical index. We set this in equilibrium with the number of index bins, which increases the precision of the binning and results in a higher probability of pruning the search space earlier.

Another important parameter is the fanout of nodes. If we use a smaller fanout (the smallest possible is $2^D$), we may not fill a single memory word with the index, which significantly impairs bit parallelism; furthermore, the index size will be larger due to a much deeper indexing tree. If the fanout is too high, we will not prune infeasible candidates fast enough. We got optimal results with a fanout close to a multiple of the word size, such as $8^2 = 64$ for 2D arrays, $4^3 = 64$ for 3D, $4^4 = 256$ for 4D, $3^5 = 243$ for 5D, etc.

7.    CONCLUSIONS AND FUTURE WORK

Most of the work on bitmap indexing to date focuses on improving the space efficiency and speed, while a few works applied bitmap indices to multidimensional data. However, the linear form of bitmap indices was never adapted to support multidimensional array data.

We have proposed a bitmap indexing method that is designed for multidimensional arrays and focuses on overcoming the dimensionality issue. The hierarchical nature of the proposed method allows for continuous results and estimates to be output as intermediate results. Our approach effectively prunes the search space and uses data-adaptive, approximately equi-depth binning. Furthermore, the index supports partitioned array data and allows distributed storage.

Our experimental results show that the proposed bitmap indexing method outperforms standard linearized approaches for mixed attribute and dimension range query processing.

There is a possible caveat that more complex multi-level and multi-component indices exist. None of these indices overcomes the problem of dimensionality; rather, due to their effectiveness, they delay the threshold where the drawbacks become noticeable (in terms of number of dimensions and size of the array).

Future work includes adapting the tree structure based on dimensions, such as adaptive mesh refinement widely used in physical simulations [4]. Another interesting possibility is a multi-attribute index in a single hierarchical structure. Last, we want to use better approximation algorithms to determine feasible regions from finer attribute bins.

8.    ACKNOWLEDGEMENT

This research was supported in part by AcRF Grant RG-18/14.
[Figure 6 plots: three panels of query time [s] against result/array ratio [%] (0 to 100) for 2D, 3D and 4D arrays; series: BoxClip, DimsAtts, ArrayBit.]

Figure 6: Query execution time for 2D, 3D and 4D queries of various hit ratios. Queries contained an attribute constraint and all dimension constraints, each constraint with approximately the same domain reduction.

9.    REFERENCES
 [1] G. Antoshenkov. Byte-aligned bitmap compression. In Data Compression Conference, 1995. DCC'95. Proceedings, page 476. IEEE, 1995.
 [2] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The multidimensional database system rasdaman. In ACM SIGMOD Record, volume 27, pages 575–577. ACM, 1998.
 [3] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles, volume 19. ACM, 1990.
 [4] M. J. Berger and P. Colella. Local adaptive mesh refinement for shock hydrodynamics. Journal of Computational Physics, 82(1):64–84, 1989.
 [5] C. Chan and Y. Ioannidis. An efficient bitmap encoding scheme for selection queries. ACM SIGMOD Record, 1999.
 [6] C.-Y. Chan and Y. E. Ioannidis. Bitmap index design and evaluation. In ACM SIGMOD Record, volume 27, pages 355–366. ACM, 1998.
 [7] J. Chmiel, T. Morzy, and R. Wrembel. Time-HOBI: indexing dimension hierarchies by means of hierarchically organized bitmaps. In Proceedings of the ACM 13th international workshop on Data warehousing and OLAP - DOLAP '10, page 69, New York, New York, USA, oct 2010. ACM Press.
 [8] J. Chou, M. Howison, B. Austin, K. Wu, J. Qiang, E. Bethel, A. Shoshani, O. Rübel, R. D. Ryne, et al. Parallel index and query for large scale data analysis. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 30. ACM, 2011.
 [9] L. Gosink, J. Shalf, K. Stockinger, K. Wu, and W. Bethel. Hdf5-fastquery: Accelerating complex queries on hdf datasets using fast bitmap indices. In Scientific and Statistical Database Management, 2006. 18th International Conference on, pages 149–158. IEEE, 2006.
[10] A. Guttman. R-trees: a dynamic index structure for spatial searching, volume 14. ACM, 1984.
[11] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, volume 98, pages 24–27, 1998.
[12] A. Kalinin, U. Cetintemel, and S. Zdonik. Searchlight: enabling integrated search and exploration over large multidimensional data. Proc. of the VLDB Endowment, 8(10):1094–1105, 2015.
[13] J. Lawder and P. King. Querying multi-dimensional data indexed using the Hilbert space-filling curve. ACM SIGMOD Record, 2001.
[19] H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR), 16(2):187–260, 1984.
[20] H. Samet. Applications of spatial data structures. 1990.
[21] H. Samet. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.
[22] R. R. Sinha, S. Mitra, and M. Winslett. Bitmap indexes for large scientific data sets: A case study. In Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, pages 10–pp. IEEE, 2006.
[23] R. R. Sinha and M. Winslett. Multi-resolution bitmap indexes for scientific data. ACM Transactions on Database Systems (TODS), 32(3):16, 2007.
[24] T. L. L. Siqueira, C. D. de Aguiar Ciferri, V. C. Times, and R. R. Ciferri. The sb-index and the hsb-index: efficient indices for spatial data warehouses. Geoinformatica, 16(1):165–205, 2012.
[25] K. Stockinger. Bitmap indices for speeding up high-dimensional data analysis. In Database and Expert Systems Applications, pages 881–890. Springer, 2002.
[26] K. Stockinger and K. Wu. Bitmap indices for data warehouses. Data Warehouses and OLAP: Concepts, Architectures and Solutions, page 57, 2006.
[27] M. Stonebraker, P. Brown, D. Zhang, and J. Becla. SciDB: A database management system for applications with complex analytics. Computing in Science and Engineering, 15(3):54–62, 2013.
[28] Y. Su, Y. Wang, and G. Agrawal. In-situ bitmaps generation and efficient data analysis based on bitmaps. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, pages 61–72. ACM, 2015.
[29] Y. Wang, Y. Su, and G. Agrawal. A novel approach for approximate aggregations over arrays. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management, page 4. ACM, 2015.
[30] Y. Wang, Y. Su, G. Agrawal, and T. Liu. Scisd: Novel subgroup discovery over scientific datasets using bitmap indices. Proceedings of Ohio State CSE Technical Report, 2015.
[31] K. Wu, S. Ahern, E. W. Bethel, J. Chen, H. Childs, E. Cormier-Michel, C. Geddes, J. Gu, H. Hagen, B. Hamann, et al. Fastbit: interactively searching massive data. In Journal of Physics: Conference Series, volume 180, page 012053. IOP Publishing, 2009.
[32] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap
[14] J. K. Lawder and P. J. King. Using space-filling curves for             indices with efficient compression. ACM Transactions on
     multi-dimensional indexing. In Advances in Databases, pages             Database Systems (TODS), 31(1):1–38, 2006.
     20–35. Springer, 2000.                                             [33] K. Wu, A. Shoshani, and K. Stockinger. Analyses of multi-level
[15] T. L. Lopes Siqueira, R. R. Ciferri, V. C. Times, and C. D.             and multi-component compressed bitmap indexes. ACM
     de Aguiar Ciferri. A spatial bitmap-based index for                     Transactions on Database Systems (TODS), 35(1):2, 2010.
     geographical data warehouses. In Proceedings of the 2009           [34] K. Wu, K. Stockinger, and A. Shoshani. Breaking the curse of
     ACM symposium on Applied Computing, pages 1336–1342.                    cardinality on bitmap indexes. In International Conference on
     ACM, 2009.                                                              Scientific and Statistical Database Management, pages
[16] T. Lungu and P. S. Callahan. QuikSCAT science data product              348–365. Springer, 2008.
     user’s manual: Overview and geophysical data products.             [35] K.-L. Wu and P. S. Yu. Range-based bitmap indexing for high
     D-18053-Rev A, version, 3:91, 2006.                                     cardinality attributes with skew. In COMPSAC’98.
[17] P. Nagarkar, K. Candan, and A. Bhat. Compressed spatial                 Proceedings. The Twenty-Second Annual International, pages
     hierarchical bitmap (cSHB) indexes for efficiently processing           61–66. IEEE, 1998.
     spatial range query workloads. Proceedings of the VLDB             [36] G. Zhu, Y. Wang, and G. Agrawal. Scicsm: novel contrast set
     Endowment, 2015.                                                        mining over scientific datasets using bitmap indices. In
[18] P. O’Neil and D. Quass. Improved query performance with                 Proceedings of the 27th International Conference on Scientific
     variant indexes. In ACM Sigmod Record, volume 26, pages                 and Statistical Database Management, page 38. ACM, 2015.
     38–49. ACM, 1997.