Hierarchical Bitmap Indexing for Range and Membership Queries on Multidimensional Arrays - arXiv
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Hierarchical Bitmap Indexing for Range and Membership
Queries on Multidimensional Arrays
Luboš Krčál Shen-Shyang Ho Jan Holub
Czech Technical University in Rowan University, Glassboro, Czech Technical University in
Prague, Czech Republic NJ, USA Prague, Czech Republic
lubos.krcal@fit.cvut.cz hos@rowan.edu jan.holub@fit.cvut.cz
arXiv:2108.13735v1 [cs.DB] 31 Aug 2021
ABSTRACT 1. INTRODUCTION
Traditional indexing techniques commonly employed in da- Research in many areas, such as geoscience or model sim-
tabase systems perform poorly on multidimensional array ulations, produces large scientific datasets, which are stored
scientific data. Bitmap indices are widely used in commer- in multidimensional arrays of arbitrary size, dimensional-
cial databases for processing complex queries, due to their ef- ity and cardinality, such as QuikSCAT satellite data [16].
fective use of bit-wise operations and space-efficiency. How- Efficient processing of such data is challenging because of
ever, bitmap indices apply natively to relational or linearized their multidimensional nature. However, most of the analy-
datasets, which is especially notable in binned or compressed sis techniques apply to relational datasets or require a strict
indices. linearization of the data.
We propose a new method for multidimensional array in- To query multidimensional array data, one needs an effec-
dexing that overcomes the dimensionality-induced inefficien- tive system index and subsequently query the data. Major-
cies. The hierarchical indexing method is based on n-dimen- ity of the current systems rely on linearization of the array
sional sparse trees for dimension partitioning, with bound data, i.e., mapping the data into one dimension, enabling
number of individual, adaptively binned indices for attribute many one-dimensional access methods to be used. Others,
partitioning. This indexing performs well on range involv- such as array databases [27, 2], work natively with multidi-
ing both dimensions and attributes, as it prunes the search mensional arrays.
space early, avoids reading entire index data, and does at A popular and very effective method of indexing arbitrary
most a single index traversal. Moreover, the indexing is eas- data is bitmap indexing, which is an index consisting of a set
ily extensible to membership queries. of bitmaps (bitvectors) with associated metadata. Bitmap
The indexing method was implemented on top of a state indices leverage hardware support for fast bit-wise opera-
of the art bitmap indexing library Fastbit. We show that the tions (AND, OR, NOT, XOR), and are very space-efficient,
hierarchical bitmap index outperforms conventional bitmap especially for low-cardinality attributes, although this was
indexing built on auxiliary attribute for each dimension. partially overcome by sophisticated multi-level and multi-
Furthermore, the adaptive binning significantly reduces the component indices. Bitmap indices are used in majority of
amount of bins and therefore memory requirements. commercial relational databases [9, 22, 23, 8].
The major disadvantage of bitmap indices for multidimen-
sional array data indexing is their linear nature. Even with a
variation of run-length compression, of which the most well-
Categories and Subject Descriptors known is WAH, that only partially suppresses the issue.
H.2.8 [Information Systems]: Database Management—Database Our major contribution is a new method of bitmap index-
Applications, Scientific databases ing for multidimensional arrays that overcomes the dimen
-sionality-induced inefficiencies. The method is based on
n-dimensional sparse trees for dimension partitioning, and
General Terms on attribute partitioning using adaptively binned indices.
We demonstrate the performance on range queries involving
Keywords both dimensions and attributes. We also show the effec-
tiveness of our hierarchical indexing method as it prunes
bitmap indexing, multidimensional arrays, range queries,
the search space early, avoids reading entire index data, and
scientific datasets, Fastbit
does at most a single index traversal.
The paper is organized as follows. In Section 2, we briefly
describe previous work on bitmap indexing, scientific ap-
plications and multidimensional arrays. In Section 3, we
describe the preliminaries to our work, including bitmap in-
dexing, array data model and array queries. In Section 4,
we introduce our hierarchical bitmap array index, discuss its
concepts, and explain its construction. In Section 5, we de-
scribe the query evaluation process for mixed attribute and
dimension range queries. In Section 6, we demonstrate theeffectiveness on multiple queries and compare our index to sion and attribute in one of the following formats. A one-
other solutions. In Section 7, we conclude with several notes sided range query: y ≤ 45; two-sided range query: 23.4 ≤
on future research and development directions. y < 73.2, equality query: y = 89; membership query: y ∈
{2, 4, 6, 8, 10}, where y is either dimension or attribute of the
array. Figure 1 shows a query that has a two-sided constraint
2. RELATED WORK on an attribute a and a one-sided constraint on dimension d2
Traditional indexing methods like B-trees and hashing are on a 2-dimensional array and the (shaded) query outcome.
not effectively applicable to index multiple attributes in a Note that equality query is a special case of membership
single index, being replaced by multidimensional indexing query, and that all queries can be rewritten to a set of range
methods, such as R-trees [10], R*-trees [3], KD-trees, n- queries. Mixed queries are queries that pose constrains on
dimensional trees (quadtrees, octrees, etc.) [19, 20]. These at least one dimension and one attribute.
methods are not very effective for high dimension arrays and An example query on array SatelliteArray [latitude, longitude, altitude, time]
indexing algorithms is in [21], though majority of the focus may look like this:
is on traditional spatial data instead of multidimensional SELECT * FROM SatelliteArray WHERE 50.68 ≤ latitude ≤
arrays. 50.88 AND 14.37 ≤ longitude ≤ 14.57 AND 30.0 ≤ snowf all.
The drawbacks of traditional indexing algorithms led to The result would then be a possibly empty subarray of
the introduction of bitmap indices [6] and their applications the same format as SatelliteArray.
for scientific data [25]. Bitmap indices are naturally based
on linear data, ideal for relational databases. Space fill- A[d1,d2] A'[d1,d2]
ing curves, such as Z-order curve and Hilbert curves [14, 3 3 2 ~ ~ SELECT * FROM A 3 3 2 ~ ~
WHERE 2 ≤ a ≤ 4
13] were used for linearization and subsequent querying of 2 4 2 1 5 2 4 2 1 5
d1 AND d2 ≤ 2; d1
multidimensional data. Hilbert curves were used in [13], 1 4 7 3 2 1 4 7 3 2
while Z-order curves were used in [17], which is a system 0 ~ 5 4 1 0 ~ 5 4 1
0 1 2 3 0 1 2 3
for querying spatial data (not arrays) using compressed hi- d2 d2
erarchical bitmap indices. Hierarchically organized bitmap Figure 1: An example of a range query on a two-dimensional
indices were also used for star queries on data with hier- array.
archically organized dimensions [7]. Bitmap indices have
also been used for approximating aggregations [29], contrast
set mining [36], subgroup discovery [30], correlation analysis 3.2 Distributed Arrays
[28]. All of which use bitmap indices on auxiliary attributes
made from dimensions (see Section 3.3). Other works utilize Due to the large size of scientific data, it is often necessary
bitmap indexing for spatial applications, but do not model to split the data into subarrays called chunks.
the data as multidimensional arrays [15, 24, 26]. There are two commonly used strategies. Regularly grid-
The boom of multidimensional, scientific array data gave ded chunking, where all chunks are of equal shape and do
birth to open-source multidimensional array-based data man- not overlap. This array data model is known in SciDB as
agement and analytics systems, namely RasDaMan [2] and MAC (Multidimensional Array Clustering) [27]. This ar-
SciDB [27]. These databases work natively with multidi- ray model works well for coarse dimension-based queries,
mensional arrays, but lack some of the effective query pro- but requires either additional indexes or filtering for fine
cessing methods implemented in other databases. On the dimension-bases and for any attribute-based queries. This
other hand, SciDB has been established as a foundation for array data model is the foundation (the lowest level) of our
many multidimensional array processing tasks. Searchlight hierarchical bitmap array index. The second strategy is ir-
[12] is a SciDB based system for range queries with aggre- regularly gridded chunking, which is one of the chunking
gation constraints, using constraints programming on top of option in RasDaMan [2].
array synopsis – lossy representation of small array chunks.
3.3 Bitmap Indexing
Bitmap indices, originally introduced in [6], were shown
3. PRELIMINARIES to be very effective for read-only or append-only data, we
We first introduce the multidimensional array data model, used in many relational databases and for scientific data
then describe types of commonly used queries on arrays, management [9, 22, 23, 8].
with some examples. Next, we introduce bitmap indexing on Bitmaps can either be created for a single attribute value,
linear data, binning types, encoding types and compression called low-level bitmaps, or for multiple values, called high-
schemes. level bitmaps, where the bitmap is set to 1 for the cell of
the arrays whose indexed value is in the value range of such
3.1 Array Data Model bitmap.
An array A consists of cells with dimensions indexed by The structure of high-level bitmaps is determined by a
d1 , . . . , dn . Each cell is a tuple of several attributes a1 , . . . , am . binning strategy. For high cardinality attributes, binning is
We assume the structure of the attributes is the same for all the essential minimum to keep the size of the index reason-
cells in the array. The array is denoted as A < a1 , . . . , am > able [35, 34]. Binning effectively reduces the overall number
[d1 , . . . , dn ]. For example, satellite data may have latitude, of bitmaps required to index the data, but increases the
longitude, altitude and time as dimensions, and precipita- number of cells that have to be later verified. This is called
tion, temperature, wind speed, etc. as attributes. a candidate check. Two most common binning strategies
We form a query on arrays based on constraints. A di- are equi-width binning, which divides the attribute domain
mension and attribute constraint is a constraint on a dimen- into equal intervals, and equi-depth binning, which dividesthe attribute domain into intervals covering equal (or near 4. HIERARCHICAL BITMAP ARRAY IN-
equal) number of cells. Equi-width binning is highly prone DEX
to excessive candidate checks, especially on skewed data.
We now briefly discuss a common way of indexing multi-
dimensional arrays using additional bitmap indexes for each
d1 d2 a EBM E[1] E[2] E[3] E[4] E[5] E[6] E[7] R[1,1] R[1,2] R[1,3] R[1,4] R[1,5] R[1,6] R[1,7] I[1,4] I[2,5] I[3,6] I[4,7] dimension. Then we describe the structure of our hierarchi-
0
0
0
1
3
2
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
cal bitmap array index.
0 2 ~ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Arrays Aha1 , . . . , am i[d1 , . . . , dn ] are usually stored in a
0 3 ~ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 4 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 linearized representation, most commonly C-style row-major
1 1 2 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0
1 2 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 array representation. Creating one index Idi =k (d1 , . . . , dn )
1 3 5 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 1 1
2 0 4 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 for each dimension d, which is set to 1 for cells of array A
2 1 7 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1
2 2 3 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0
where d is equal to a value k. This allows filtering out results
2
3
3
0
2
~
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
0
0
0
0
based on dimensions using binary AND.
3
3
1
2
5
4
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
Note that the dimensions index Idi =k (d1 , . . . , dn ) does not
3 3 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 necessarily have to use equality encoding, but based on the
expected queries, we may choose a better combination of
Figure 2: Bitmap index for attribute a of the array A from binning, encoding and compression. This approach is used in
Figure 1: empty bitmask EBM, equality encoded index E, [30, 36] with equi-depth binning or in [29] with v-optimized
range encoded index R and interval encoded index I. binning based on v-optimal histograms [11] and C-style row-
major linearization in [28].
Unfortunately, dimension bitmap index is not effectively
compressible. Consider an example of row-major ordering
Another crucial aspect of bitmap indexing is encoding [6].
on 5x5 array. Then the row dimension index for column = 1
which determines how a set of bins, B, of attribute do-
is 01000 01000 01000 01000 01000, which cannot be ef-
main is encoded in each bitmap and consecutively into a
fectively compressed using either BCC or WAH, since the
bitmap index. The simplest encoding, called equality en-
compression context of both is a single bit. This can be
coding, encodes each bin with one bitmap for a total of
partially mitigated by stretching dimensions to multiples of
|B| bitmaps. Processing of equality queries reads a single
bytes or words, and extending the run-length compression to
bitmap, but processing of range queries has to read at most
use byte or word in its compression context, instead of single
half of all the bitmaps. Range encoding uses B − 1 bitmaps,
bits. Another option is to use either Z-order or Hilbert space
each bitmap Ri encodes a range of bins [B1 , Bi ]. The pro-
filling curves to further increase locality of the dimensions.
cessing of range encoded bitmap index for range queries
Neither, however, solves the problem entirely.
reads at most two bitmaps. Interval encoding [5] uses |B| 2
bitmaps, each bitmap Ii is based on range encoded bitmaps 4.1 Partitioning of Arrays
Ri ⊕ Ri+ |B| . Interval encoding uses at most two bitmaps to
2 Non-partitioned data require much finer binning and the
process range queries. Compared to range encoding, it uses domain of the dimension is higher than its partitioned coun-
only half the space. Figure 2 shows an example of equality, terpart, thus higher amount of bins is required. By partition-
range and interval bitmaps for the array in Figure 1. ing the array Aha1 , . . . , am i[d1 , . . . , dn ] into a set of regularly
Bitmap indices, based on the number of bins, may take gridded chunks C in the Multidimensional Array Clustering
up to |B| · C, where C is the cardinality of the indexed fashion described in Section 3.2, such that:
attribute, leading to very small number of bins needed to
exceed the size of the raw data. Binary run-length com- Ci [o1 , o2 , . . . on , e1 , e2 , . . . , en ] =
pression algorithms are usually applied on bitmap indices Aha1 , . . . , am i[o1 ≤ d1 < e1 , . . . , on ≤ dn < en ]
to reduce the overall size. However, another requirement is
posed to these compression algorithms, such that it must be All chunks in our data model are of the same shape, i.e.,
possible to run bit-wise operations effectively on the com- for all chunks Ci , Cj of array A, it holds that
pressed bitmaps. There are two representative compression Ci [ek ] − Ci [ok ] = Cj [ek ] − Cj [ok ]
algorithms, namely Byle-aligned Bitmap Code – BCC [1]
and Word-Aligned Hybrid (WAH) compression [32]. for all dimensions k, and chunks are not overlapping and
In order to facilitate effectively high cardinality attributes completely cover the whole array A. In the chunk notation,
with space efficient indices and fast querying, two compos- ok stands for offset and ek stands for end of the chunk along
ite methods were introduced. The first method is multi- that dimension (exclusive boundary).
component, where the attribute value is decomposed into By chunking the array, we limit the domain of both at-
multiple components, which are then indexed independently. tributes and dimensions in a given partition. In our adaptive
An example of multi-component index is a bit-sliced index binning indices, we use the fact that the domain of the at-
[18], where each component corresponds to a bit of the value. tribute varies based on the location.
Second composite method is called multi-level indexing [23], The first problem arising from the equal size chunking
where the binning of the attribute becomes progressively model is that within a single chunk, we are still required to
more precise with increasing levels. use either indexing or at least aggregate information on the
Thorough performance analysis of bitmap indexing, espe- attributes, such as min and max for precise queries or his-
cially multi-level and multi-component both uncompressed tograms for probabilistic queries, or data exploration. We
and compressed is presented in [33]. An open-source bitmap choose to use bitmap indexing on both attributes and dimen-
indexing framework Fastbit [31] implements most of cur- sions within the chunk. Note that the dimension indices are
rently existing indexing schemes, mainly two-level indices. the same for all chunks in the array, since for each chunk,we can simply subtract its offset from the dimensions query The overall internal node fanout F can be expressed in
constraints. terms of a fanout Fdk for a single dimension k as
The second problem lies in the overall structure of the Yn n
chunks. There is no direct, high level index of the attributes F = Fdk ≤ max Fdk
for the chunks. It is necessary to scan through the synop- 1≤k≤n
k=1
sis of all the individual chunks, or generate a hierarchical
Assuming that the dimension fanout Fdk is the same for
synopsis. The latter has been utilized in [12] in a form of a
all dimensions, we can get
graph generated over merging sub-arrays. j 1k
We propose a unified solution that solves both the problem Fd k = F n
with dimension attributes and with synopsis of array chunks.
Our solution is in a form of hierarchical bitmap index on top As we will see in Section 5.2, in order to facilitate efficient
of a n-dimensional tree (such as octree for 3 dimensions) with dimension range queries, the size of F cannot be too large,
variable binning for each node in the tree. since the size of precomputed dimension clipping bitmaps
depends on F .
The index tree construction works in a bottom-up fashion,
4.2 Structure of the Array Chunk Index where the leaf nodes are indexed at first. This allows both
The index is done separately for each attribute of the array data appending and modification (see Section 4.7). Each
A. Let’s fix an attribute α. All the following functions refer internal node is constructed from at most F direct children
to this attribute. and with at most BINS attribute bins, with one additional
Each chunk C(o1 , o2 , . . . , on ) of array Aha1 , . . . , am i index for empty bitmask. Each child node Ni of internal
[d1 , . . . , dn ] is associated with exactly one leaf node N provides its attribute’s min(Ni ) and max(Ni ) val-
N` (o1 , o2 , . . . , on ). Independently, each leaf uses an equi- ues. These values are used for the construction of the bitmap
depth binning index with a total of at most BINS bins, where index of N .
bin boundaries bins(N` of the index are based on an exact Let B = (min(N1 ), max(N1 )), . . . , (min(NF ), max(NF ))
chunk values histogram. Note that this assumes uniform be the set of all intervals ranging from the minimum to the
distribution of queries. If we had any prior knowledge of maximum value of the indexed attribute α among all the
the queries based on the attribute, we would instead opt child nodes Ni . The set B is the set of bins – the individual
for weighted histogram to construct the binning. The leaf’s interval boundaries are delimiters, where the attribute’s α
dimension boundaries correspond to its associated chunk’s value a is in the attribute domain of different child nodes.
boundaries, clipped by the global shape of the array A. Formally, let nodesin(a) ⊂ Ni be a function of a value a ∈ α
Accounting for empty values (missing cells in A) is done of attribute α, which returns a subset of child nodes.
using a special bitmask, known as empty bitmask, for a total
of BINS + 1 indices. Only leaves with at least E · BINS non- Ni ∈ nodesin(a) ⇐⇒ min(Ni ) ≤ a ≤ max(Ni )
empty cells are indexed, where the constant E is dependent The set nodesin(a) is used to construct the binning for
on the data structure used for the leaf representation, i.e., index of this internal node. We describe the encoding of
do not use bitmap indexing if listing the values is more space this bitmap index in Section 4.5.
efficient. The index bins are aligned with the bins from B. This
Encoding of the leaf indices is left as a parameter to the guarantees that no two indices for different bins will be iden-
user, as the bitmap indexing performance heavily depends tical, i.e., represent the same set of children. It also directly
on the cardinality of the array attribute, desired number of implies that adding more boundaries to B would be point-
bins, and query types. For generality, we assume high car- less.
dinality attributes, such as integers and doubles and small
number of bins such as BINS ≤ 16. 4.4 Bin Boundaries Merging in Parent Nodes
Except for very narrow dimension range queries, a dimen- The number of bins from all F child nodes is higher than
sion query will either cover the whole span of a leaf node, or BINS for majority of the internal nodes N , therefore it is
result in a one-sided dimension range query once the query necessary to reduce the size of the set of bins, B. There are
processing reaches a single chunk. Thus, the ideal encod- several strategies to choose B ⊂ D such that |B| = BINS.
ings for chunks are range and interval encodings [5]. Our An example of such binning reduction is in Figure 3.
default encoding is interval encoding since it uses half the The first strategy is to use an equi-width distribution of
memory range encoding does. Encoding of inner nodes is the bins. This is the ideal choice assuming the attribute
more complicated and we describe it in Section 4.5. part of the query is uniformly distributed or when there is
no prior knowledge about the attribute query and assuming
4.3 Structure and Construction of the Hierar- the data distribution is not skewed.
chical Bitmap Array Index The second strategy is to use equi-depth binning. This
is ideal if the attribute distribution of the child nodes is
To deal with the higher level index, we create a special
skewed. It is possible to maintain the weights of the bins
composite index on tree similar to n-dimensional tree. Each
for leaf nodes, since those have direct access to the data.
internal node of the index has at most F children, where F
However, internal nodes can only make estimates about the
is called a fanout. Note that, unlike in quadtrees, octrees or
weight of merged bins. In each internal node and leaf, we
n-dimensional trees, F is not necessarily 2n , where n is the
store the weight estimate w(b), where b ∈ B. The weighted
number of dimensions. Our bitmap indices are based on the
square error of a bin b is
fanout and we want to utilize binary operations as much as
2
possible. For this reason, the fanout F should be a multiple w(D)
wse(b) = w(b) −
of the processor word size W , or as close to it as possible. BINSN1 N1 R+(-∞,1)= 0000 Input: set of bins B, set of weights w(b), b ∈ B,
N2 N2 R+[1,3) = 0101
R+[3,+∞)= 1111
number of output bins BINS
N3 N3
Result: approx equi-depth bins R ⊂ B, |R| =BINS
N4 N4 R-(-∞,6]= 1111
B
R-(6,8] = 0011
R-(8,+∞)= 0000
1 R ← eq-width bins from B, |B| =BINS ;
R R 2 BS ← all possible split bins of R;
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Bitmap index of
nodes that have 3 BM ← all possible merged bins of R;
R -- Approximate bins for attribute index False positive attribute ranges started / ended
4 QSP LIT ← priority queue();
5 QM ERGE ← priority queue();
Figure 3: Example of merging |B| = 8 bin boundaries to 6 for s ∈ BS do // bins to split
|R| = 4 bin boundaries for 4 child nodes. False positive 7 add (s, ∆wse(s)) to QSP LIT ;
ranges are marked in red. Two sided range encoded bitmaps 8 end
are generated for R. 0
9 for (m, m ) ∈ BM do // bins to merge
10 add ((m, m0 ), ∆wse((m, m0 )) to QM ERGE ;
11 end
and the weighted sum square error is // split that decreases wsse the most
X 12 (s, ∆wse(s) ← min(QSP LIT );
wsse(B) = wse(b) // merge that increases wsse the least
b∈B 0 0
13 ((m, m ), ∆wse((m, m ))) ← min(QM ERGE );
. 0
14 while ∆wse((m, m )) > ∆wse(b) do
To estimate the weight of merged bin r ∈ R ⊂ B, we 15 split b;
assume uniform distribution of values over the intervals of 16 merge (b, b0 );
bins b ∈ B. Then the estimated weight of r is 17 update R, BS , BM , QM ERGE , QSP LIT ;
18 end
X
w(r) = w(b) · sizeof(b ∩ r)
b∈B Algorithm 1: Iterative equi-depth binning approximation
where sizeof(b ∩ r) is the size of the intersection of r and b.
We cannot use the trivial algorithm for equi-depth bin-
ning, because we can only iterate by bins of variable weight, added, and we add r0 to a set R+ . Else, if nodesin(r0 ) ⊂
instead of iterating by single data points. This is why we nodesin(r), then nodes are removed in set nodesin(r0 ), and
need to approximate the equi-depth using a simple iterative we add r0 to set R− . Otherwise, some nodes are added and
algorithm. Details on selecting R ⊂ B approximately equi- some are removed and we add r0 to both R+ and R− . In
depth bins are shown in Algorithm 1. We first start with our example in Figure 3, R+ = {[1, 3), [3, 6)} and R− =
equi-width binning (line 1). Then, we generate sets of all {(3, 6], (6, 8]}.
possible bin splits and merges (lines 2-3), setup two priority There is no guarantee that |R+ | = |R− |. If we wanted,
queues and evaluate all possible splits and merges in terms we could run Algorithm 1 separately on boundaries B+ and
of weighted sum square error (lines 4-11). After that, we B− (likewise defined) and with BINS
2
bins, but then we’d lose
perform one valid split and one merge on the binning as the equi-width approximation.
long as this leads to an improvement of the overall binning Now, we encode |R+ | + 1 bitmaps using range encoding,
(lines 14-18). This preserves the total number of bins. so that the index for bin r+ ∈ R+ corresponds to children,
In case a node has either a low cardinality attribute throu- whose attribute range minimum min(Ni ) is ≤ to the up-
ghout all the child nodes, we create bins mapped to single per boundary of interval r+ . In our example, bitmap cor-
values of the attribute and their corresponding bitmaps. responding to r = [1, 3) ∈ R+ is 0101, indicating that N1
Note that v-optimal binning does not work in our case, and N3 have started in or before this interval. Similarly,
since we don’t have the individual data values available dur- we encode |R− | + 1 bitmaps for values r− using inverse
ing construction of the internal nodes, although we could range encoding, i.e., children, whose attribute range max-
approximate this using uniformly or normally distributed imum max(Ni ) is > to r− are encoded by 0 in the bitmap,
estimates within the bins of child nodes, or by propagating representing children that have already ended before or in
at least basic data synopsis. the interval r− .
These two bitmaps easily allow evaluation of partial and
4.5 Double Range Encoding of Bitmap Indices complete matches (see Section 5.1) using only two bitmap
in Internal Nodes reads and one logical operation for both partial and complete
Unlike in bitmap indexing in leaves where one encodes query.
positions of individual values, we encode sets of child nodes
nodesin(a) for attribute values a in the internal nodes. Our 4.6 Locality of the Hierarchical Index
binning B has the property that for all attribute values In order to preserve locality of the data during queries, we
ab , a00b ∈ b ∈ B it holds that nodesin(ab ) = nodesin(a0b ). store the whole index in a locality preserving linearization of
Note that this does not hold for intervals r ∈ R (See Figure an n-dimensional tree. For each query, blocks of the index
3 for an example). are loaded sequentially and sparsely, based on the parame-
We will now describe an effective bitmap encoding of ters in the query. Thus, only one traversal, possibly incom-
nodesin(a), a ∈ r ∈ R. Let’s have two adjacent intervals plete, of the index data is needed. The index data consist
r ∈ R and r0 ∈ R, such that rh = r`0 Note that since of bin boundaries, weight estimates and bitmap indices.
R ⊂ B, we have nodesin(r) 6= nodesin(r0 ). If nodesin(r0 ) ⊃ We use space filling curves, namely the Z-order curve to
nodesin(r), then r0 corresponds to a bin, where nodes are linearize the multidimensional array index. We choose notto use recursive multi-level Z-order curves, as this would dimensions, in which case we fill all q with remaining di-
force the query processing to be based on pre-order traver- mensions, to a complete query. Dimensions, that were not
sal of the index tree. We also choose not to use row major specified, are filled with (dj , min(dj ), max(dj )) triples. One-
ordering, since it has poor locality and it would slow down sided range constraints are also extended in similar manner.
retrieving locations child nodes and partitions. Hilbert curve The core of the query algorithm is a breadth-first descent
has perfect locality, but it does not preserve dimensions or- through the index tree. At each level, the search space is
dering. This means we would need to precompute bitmaps pruned according to both dimension and attribute values.
for dimension constraints for each block of Hilbert curve Let N be the currently searched node, Ni be its child
separately. Z-order curve allows for fast child and parent nodes, where 0 ≤ i < F ; multidimensional range DN be the
node index computations, preserves dimensionality between set of dimension boundaries in the format [DN [d]` , DN [d]h ],
different level and has a good locality. where d is dimension, ` designates lower bound, h upper
The order Z` of the Z-order curve of level ` is determined bound, associated with node N .
by the maximal fanout Fmax = max1≤k≤n Fdk , where Fdk Throughout the query processing, we maintain a queue of
is a fanout of dimension k. partially matched nodes P and a set of completely matched
nodes C. We start at a root node Nr , setting P = {Nr },
Z` = ` · dlog2 Fmax e assuming that both: node N ’s boundaries and query dimen-
Assuming Fdk is the same for all dimensions, the order of sions are not disjoint: DN ∩ QD 6= ∅ and
Z-order curve is then (min(N ), max(N )) ∩ QA 6= ∅, otherwise node N ∈ / P and
l j 1 km N∈ / C.
Z` = ` · log2 F n Let p, p0 , p∗ and c, c0 , c∗ be zero bitmaps of size F ;
the bitmaps p indicates partial attribute matches among the
and such a Z-order curve has length of (Z` )n . children of node N , p0 indicated partial dimensions matches,
Several of the higher levels are stored in a dense vector, as p∗ indicates partial matches, similarly the vectors c, c0 , c∗
specified by a user parameter. These vectors are expected indicate complete matches. We will now set these vectors
to be densely filled. The remaining levels are stored as non- according to the query Q for the first node in queue P . The
overlapping intervals on a Z-order dimension (1D) in con- partial and complete matches bitmap computation is also
tinuous blocks, indexed by a binary search tree. This is a described in Algorithm 2 and in Figure 4.
compromise between sparse single node map and full vec-
tor used for higher levels. Note that the blocks may not be Input: query q = {(a` , ah ), (d1 , d` , dh ), . . .} with DIMS
sequential in memory, but at most a single transition is guar- dimension constraints; node N ; node children
anteed, i.e., no blocks are read twice during the processing N1 , . . . , NF ; boundaries [DN [d]` , DN [n]h ] for N and
of a single query. all Ni and dimensions d;
Result: partial matches p∗; complete matches c∗;
4.7 Appending and Modifying Data
1 PN,S , CN,S ← load index for node N ;
Scientific data is often considered either fixed or append 2 PS,d0 0
, CS,d ; // precomputed;
only, our indexing approach allows for both appending and F 0 F
3 p ← {0} , p ← {0} , p∗;
data modification, although the latter is not convenient. F 0 F
4 c ← {1} , c ← {1} , c∗;
To append data along any dimension, we apply the same
5 if ah < min(N ) or a` > max(N ) then
bottom-up procedure to update the index. It is necessary to
update the dimension bounds of internal nodes (that were 6 return p∗ ← {0}F , c∗ ← {0}F
possibly previously clipped by the global shape of the array) 7 c = c & CN,S (a` , ah );
and bitmap indices (to include the new child nodes). Note 8 p = p | PN,S (a` , ah ) & ∼c;
that we do not have to update the weight estimates and bin 9 for dimensions d, 1 ≤ d ≤ DIMS do
boundaries (except min and max) in order to assure index 10 if dh < DNi [d]` or a` > DNi [d]h then
correctness. However, in order to assure the equi-depth op- 11 return p∗ ← {0}F , c∗ ← {0}F
timal binning, we need to run the bin merge algorithm again 12 if d` > DN [d]` then
on affected nodes. 13 p0 = p0 | PS,d 0
(d` );
14 if dh < DN [d]h then
15 p0 = p0 | PS,d 0
(dh );
5. QUERYING DIMENSIONS AND 16 c0 = c0 & CS,d 0
(d` , dh );
ATTRIBUTES 17 end
0 0 0
In this work, we focus on selection queries over dimensions 18 p ← p & c ;
0 0 0
and attributes of an array. Such query consists of a set 19 c ← c & ∼p ;
0
of dimension constraints and attribute constraints. Let’s 20 c∗ ← c & c ;
0 0
specify a query q over an array Aha1 , . . . , am i[d1 , . . . , dn ] as 21 p∗ ← (p | c) & (p | c ) & ∼c∗;
a set of ranges over dimensions qD and attributes qA . 22 return p∗, c∗
Algorithm 2: Evaluation of partial and complete match
q = qA ∪ qD = {(a, a` , ah )} ∪ {(dj , j` , jh ), . . .} bitmaps for a single node.
where (a, a` , ah ) is a triple specifying attribute constraint:
attribute, its lower bound and its (exclusive) upper bound;
same goes for dimensions. In this work, we focus on a single 5.1 Attribute based Matches
attribute query. Therefore, we simplify qA to (a` , ah ). It In this subsection, we explain how attribute bitmask is set.
is possible for a query to not specify constraints for some This subsection further describes lines 5–8 in Algorithm 2.If ah < min(N ), or a` > max(N ), there are neither par- The second expression is similar, but for dh . Third and forth
tial nor complete attribute matches and we terminate pro- expression combine the partial matches over both query lim-
cessing the current node. its and all dimensions. Note that this results in excessive
Let PN,S (a` , ah ) be a partial attribute match bitmasks partial candidates since all child nodes that intersect the
specific to node N of for an array of shape S, with bits set query constraints along at least one dimension qualify as
to one corresponding to children Ni so that the intersection partial candidates.
[a` , ah ] ∩ [min(Ni ), max(Ni )] 6= ∅. Partial dimension matches are evaluated using one pre-
computed bitmap index corresponding to
PN,S (a` , ah )[i] = 1 ⇐⇒ PB|N,S (ah )[i] ∧ ¬PE|N,S (a` )[i]
0
PB|N,S (a)[i] = 1 ⇐⇒ min(Ni ) ≤ a PS,d (b)[i] = 1 ⇐⇒ b = DNi [d]
PE|N,S (a)[i] = 1 ⇐⇒ max(Ni ) ≥ a where b is a bucket corresponding to the chunking of the ar-
The second expression describes bitmap set to 1 for children ray A. There are a total of Fd such buckets along dimension
that have started before or at value a, the third one describes d, resulting in a total of Fd · d bitmaps of size F . We query
children that have ended at or after a. The first expression these bitmaps for all dimensions and combine them using
then combines both. OR into p0
To evaluate PN,S (a` , ah ), we first use binary search on There is a special case of false negative dimension result.
R+ and R− to find two bins L ∈ R+ and H ∈ R− such that If d` or dh is equal to the d’th dimension range border of
a` ∈ L and ah ∈ H. These bins L and H mark the attribute a child node Ni , and at the same time the other end of
boundary bins. Then, PB|N,S (ah ) is identical to R+ [H] and d` or dh causes the dimension to be fully covered in Ni ,
¬PE|N,S (a) is identical to R− [L], where R+ and R− are the i.e. d` = DNi [d]` and dh ≥ DNi [d]h or dh = DNi [d]h and
bitmap indices described in Section 4.3, each queried for a d` ≤ DNi [d]` , the query is evaluated as partial match for Ni
single bin. Then we add PN,S (a` , ah ) to p using bitwise OR. and dimension d, while in fact dimension d contributes to
Now, we process complete candidates in a similar fashion. complete matches. A check for this scenario requires com-
Let CN,S (a` , ah ) be a complete attribute match bitmask spe- paring the dimension ranges of child nodes to the query
cific to node N for array of shape S, so that the intersection range, and was ignored on purpose, as it complicates and
[a` , ah ] ∩ [min(Ni ), max(Ni )] = [a` , ah ]. slows down the query process.
For complete candidates, we will slightly modify the defi-
CN,S (a` , ah )[i] = 1 ⇐⇒ PB|N,S (a` )[i] ∧ ¬PE|N,S (ah )[i] 0
nition of C used for attributes. Let CS,d (d` , dh ) be a complete
This expression is very similar to PN,S (a` , ah ), describing dimension match for array of shape S, indicating which child
children that have started at or before a` and have not ended nodes Ni are partially or fully covered by interval [d` , dh ].
at or before ah . To evaluate CN,S (a` , ah ), we query R+ [L] Despite the semantics indicating partially matches should
and R− [H]. Then, we add the result to c using bitwise OR not be included, we later trim the complete dimension match
and remove those from p, i.e., p = p ∧ ¬c. bitmap accordingly.
Note that both partial and complete attribute candidates 0
CS,d (d` , dh )[i] = 1 ⇐⇒ [d` , dh ] ∩ DNi [n] 6= ∅
use a total of 4 index queries. An example of attribute query \
0 0
is displayed in the bottom row in Figure 4. CS (d` , dh )[i] = CS,d [i]
1≤n≤DIMS
5.2 Dimension based Matches
Complete dimension matches are evaluated using two pre-
Next, we explain how the dimension masks are set. This
computed bitmap indices corresponding to
subsection further describes lines 9–17 in Algorithm 2.
If for any dimension d it holds that dh < DNi [d]` or a` > 0
CB|S,d (b)[i] = 1 ⇐⇒ b ≤ DNi [d]
DNi [d]h , there are neither partial nor complete dimension 0
matches and we terminate processing the current node. CE|S,d (b)[i] = 1 ⇐⇒ b ≥ DNi [d]
Unlike attribute query, the evaluation of dimension query similarly to bitmaps used for partial matches. There is a
is the same for all nodes N , so all the bitmaps for processing total of 2 · Fd · d bitmaps of size F for complete matches. We
dimensions queries are precomputed. query these bitmaps for all dimensions and combine them
0
Let PS,d (d` , dh ) be a partial dimension match, where d is a using AND into c0 .
dimension in the query constraint (d, d` , dh ), for an array of We now combine the partial dimension matches bitmap c0
shape S, indicating child nodes Ni such that the intersection with p0 , such that p0 = p0 ∧ c0 . Then, we clip the complete
[DNi [d]` , DNi [d]h , ] ∩ [d` , dh ] 6= ∅. dimension bitmap by the partial bitmap as c0 = c0 ∧ ¬p0 .
Let’s fix a dimension d for which we evaluate partial mat- During the evaluation of dimension matches, we used a total
0
c0hes PS,d (d` , dh ): of 3 · d index queries. An example of dimension query is
0
PS,d (d` )[i] = 1 ⇐⇒ d` ∈ DNi [d] ∧ d` 6= DNi [d]` displayed in the top row in Figure 4.
0
PS,d (dh )[i] = 1 ⇐⇒ dh ∈ DNi [d] ∧ dh 6= DNi [d]h 5.3 Partial and Complete Matches
0 0 0
PS,d (d` , dh )[i] = 1 ⇐⇒ PS,d (d` )[i] ∨ PS,d (dh )[i] Now that we have both attribute and dimension, and both
0
[ 0 partial and complete candidates, we may proceed to merging
PS (d` , dh )[i] = PS,d [i]
the candidates and generating a bitmap representing the set
1≤d≤DIMS ∗
of result node children CN,S and a bitmap representing the
∗
The first expression describes which children Ni have di- set of potential node children PN,S that will be recursively
mension d range such that the query limit d` falls inside the explored. This subsection further describes lines 18–22 in
range, but it is not equal to the lower limit of that range. Algorithm 2.∗
The CN,S bitmap is easier to obtain, as it is the intersec- 6. EXPERIMENTAL EVALUATION
tion of both complete bitmaps without partial candidates We have tested our implementation against several other
bitmaps. solutions, of which none is specifically tailored to mixed at-
∗
CN,S = CN,S ∧ CS0 tribute and dimensions range queries, but those are the only
readily available solutions involving bitmap indices and be-
∗
We obtain the set of partial candidates PN,S by joining ing capable of executing range queries.
the dimension-based partial candidates with the attribute- We measured the time and space efficiencies for each in-
based candidates and clipping both by complete candidates dividual query, i.e. total query execution time, and space
∗
requirements for the index. Timing was measured as an
PN,S = (PN,S ∨ CN,S ) ∧ (PS0 ∨ CS0 ) ∧ ¬CN,S
∗
average of 3 runs with data preloaded into memory. For
Fastbit queries, we use their internal wall time measuring
We then iterate through the results, adding child nodes
∗ systems, meaning certain pre and post processing steps are
from CN,S to the result set C and the partial candidates
∗ not included in the time measurements, such as query string
PN,S into the queue P to be processed subsequently. This
parsing. Space requirements were measured based on the
process is done on top of Z-order indices, as it is trivial
disk space required to store the bitmap index together with
to generate Z-order indices corresponding to nodes in the
all relevant metadata.
lower levels. The Z-order ordering of the inner nodes and
The experiments were run on a single physical machine
breadth-first traversal also ensures single traversal through
– Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz, 16 GB
the index.
RAM, 1TB 7.2K RPM SATA 6Gbps; running Ubuntu 14.04.1
SELECT * FROM A WHERE 2 ≤ a ≤ 4 AND 1.3 ≤ d1 AND d2 ≤ 2.5; (3.19.0-32 kernel).
We use a synthetic dataset to test our queries on – ran-
C' = C' AND NOT P'
3-5 2-3 ~ ~ 3-5 2-3 ~ ~ 3-5 2-3 ~ ~ domly generated multidimensional sum gaussian distribution
4-5 2-4 5-7 1-3 4-5 2-4 5-7 1-3 4-5 2-4 5-7 1-3 P' = P' AND C' SumGauss. Its only attribute aG is a sum of G randomly
4-6 7-8 2-2 3-5 4-6 7-8 2-2 3-5 4-6 7-8 2-2 3-5
initialized Gaussian distribution in D dimensions:
(P OR C) AND
(P' OR C') AND G !
~ 5-6 4-6 1-1 ~ 5-6 4-6 1-1 ~ 5-6 4-6 1-1
(d − µi )T Σ−1
NOT (C AND C') 1 i (d − µi )
X
~ =
aG (d) exp −
Partial dimension Complete dimension p
matches - P' matches - C'
C AND C'
i=1 (2π)D |Σi | 2
3-5 2-3 ~ ~ 3-5 2-3 ~ ~ 3-5 2-3 ~ ~ 3-5 2-3 ~ ~ where µi and Σi are randomly generated distribution mean
4-5 2-4 5-7 1-3 4-5 2-4 5-7 1-3 4-5 2-4 5-7 1-3 4-5 2-4 5-7 1-3 vector and a bounded symmetric positive definite covariance
4-6 7-8 2-2 3-5 4-6 7-8 2-2 3-5 4-6 7-8 2-2 3-5 4-6 7-8 2-2 3-5
matrix for dimension i. For sparse arrays, a threshold for the
Gaussian functions is used. Attribute is treated as empty if
~ 5-6 4-6 1-1 ~ 5-6 4-6 1-1 ~ 5-6 4-6 1-1 ~ 5-6 4-6 1-1
the value is below this threshold. Only partitions with at
Partial attribute Complete attribute Node query
matches - P matches - C
C P = P AND NOT C
output
least one non empty value are generated.
Figure 4: Processing of a query in a single node of the hi-
6.1 Fastbit Integration
erarchical index. Top row represents dimension constraints, Fastbit [31] is an open source library that implements
bottom row represents attribute constraints. Bottom right bitmap indexing. It’s not a complete database management
is the final product. Blue nodes represent partial matches system, rather a data processing tool, as its main purpose
and green node represent complete matches. is to facilitate selection queries and estimates. Fastbit’s key
technological features are WAH bitmap compression multi-
component and multi-level indices with many different com-
Running the algorithm for multiple queries or multiple binations of encoding and binning schemes.
attribute constraints in a single query can be implemented We use Fastbit’s partitions to setup the lowest level of our
using iteration through the constraints in the worst case. indices (leaves), and base our binning indices on Fastbit’s
single-level binning index. This approach requires prepro-
5.4 Estimating Cardinality of Results; Mem- cessing of the data into evenly shaped partitions, generating
bership Queries empty bitmasks and shape metadata. Once a table is pre-
It is fairly straightforward to output estimates on minimal processed into even partitions, it is indexed as described in
and maximal number of matching cells by iterating some Section 4. The index generation processes one partition at
bounded number of levels of the index. The minimal number a time, and once processed, the partition is never accessed
outputs the size of nodes in C, while the maximum outputs again during the index generation.
the size of nodes in C ∪ P . Using the w(b) estimate, we
may also provide estimates on aggregates over the attribute, 6.2 Bitmap Indexing Methods
based on bin-wise linear approximation. BoxClip represents a naive algorithm using 32 equi-depth
There is a simple modification of the algorithm for mem- binned indices, interval encoding and WAH compression.
bership queries. (See Section 3.1 for details about member- The result bitmask from the attribute query is transformed
ship queries). On top of two sided range indices PN,S and to a set of “line” hyperrectangles (size of the hyperrectan-
CN,S for attribute queries, we keep equality indices and it- gle in all but one dimensions is 1), which are filtered from
erate through the attribute constraint. For dimension mem- the dimension query, then merged into a set of result hy-
bership queries, we precompute an index for all dimension perrectangles. All the steps except filtering are built on top
values (within a single chunk), as opposed to buckets corre- for Fastbit’s mesh query. The filtering is implemented using
0 0
sponding to child nodes, that are used in PS,d and CS,d . recursive sweeping line algorithm.crease is due to the results retrieval. ArrayBit achieves
BoxClip very good results for low or high hit rate queries. This is
2 1,000
space [MB]
DimsAtts due to a large number of complete matches, and due to fast
time [s]
ArrayBit pruning of search space. For medium hit rate queries, the
1 500 algorithm has relatively high number of candidate nodes to
explore, but still manages to prune the search space faster.
0 0 6.4 Parameterization
We also experimented with different setups of our hierar-
8MB 128MB 1GB 8MB 128MB 1GB chical index. The major objectives remain the same: query
array size array size execution time and space requirements of the index.
First, the partition size determines the ratio of partition
Figure 5: Query execution time and disk space required to
index vs hierarchical index. We set this in equilibrium with
store the indices for different array sizes.
number of index bins, which increases the precision of the
binning and results in higher probability of pruning the
search space earlier.
DimsAtts uses indexed uint auxiliary attributes made Another important parameter is a fanout of nodes. If we
from dimensions (see Section 4). The dimension query is use a smaller fanout (the smallest possible is 2D ), we may
preprocessed into attributes, then run as a multi constraint not fill a single memory word with the index, significantly
query in Fastbit. The configuration is the same as in Box- impair bit parallelism, furthermore the index size will be
Clip, using 32 binned indices, range encoding and WAH larger due to much deeper indexing tree. If the fanout is too
compression on all attributes. high, we will not prune infeasible candidates fast enough.
ArrayBit represents our hierarchical multidimensional We got optimal results with a fanout close to a multiple of
index. We use 16 equi-depth binned indices, range encod- the word size, such as 82 = 64 for 2D arrays, 43 = 64 for
ing and WAH compression to index the partitions, and 16 3D, 44 = 256 for 4D, 35 = 243 for 5D, etc.
approximately equi-depth binned indices (described in Sec-
tion 4.4) with two sided range encoding and no compression 7. CONCLUSIONS AND FUTURE WORK
for the hierarchical index. Note that compared to BoxClip Most of the work on bitmap indexing to date focus on
and DimsAtts, we only use half of the bins in the parti- improving the space efficiency and speed, while a few applied
tion index. It is sufficient in our algorithm, because the bin the bitmap indices to multidimensional data. However, the
boundaries are adapted to the actual data in each partition, linear form of bitmap indices was never adapted to support
and because we need to store the bin boundaries within the multidimensional array data.
partitions. We have proposed a bitmap indexing method that is de-
signed for multidimensional arrays and focuses on overcom-
6.3 Range Queries ing the dimensionality issue. The hierarchical nature of the
In our work, we focus on mixed attribute and dimension proposed method allows for continuous results and estimates
queries. Regardless of the dataset, we categorize the queries to be output as intermediate results. Our approach effec-
based on the overall ratio of the size of the query result to tively prunes the search space, uses data adaptive, approx-
the size of the total array size. imate equi-depth binning. Furthermore, the index supports
Figure 5 shows the time required to return all results. The partitioned array data and allows distributed storage.
index file is preloaded into memory prior to the test for all Our experimental results show that the proposed bitmap
the systems used. We used 2D array for this experiment. indexing method outperforms standard linearized approaches
and a query with ≈ 10% hit ratio. Both BoxClip and Dim- for mixed attribute and dimension range query processing.
sAtts run slower than ArrayBit. In case of BoxClip, the There is a possible caveat that more complex multi-level
reason is that all the attribute query results had to be pro- and multi-component indices exist. None of these indices
cessed, while for DimsAtts the reason is that the attribute overcome the problem of dimensionality, rather due to their
made from second dimension didn’t effectively compress. In effectiveness delay the threshold where the drawbacks be-
terms of space requirements, all of the algorithms save at- came noticeable (in terms of number of dimensions and size
tribute index. ArrayBit uses less bins in the leaves, but of the array).
stores bin boundaries for all leaves and internal nodes, plus Future work includes adapting the tree structure based on
bitmaps for internal nodes, effectively taking up the same dimensions, such as adaptive mesh refinement widely used
space as BoxClip. On the other hand, DimsAtts stores in physical simulations [4]. Another interesting possibility
indices for all dimension attributes. Row major ordering is is multi-attribute index in a single hierarchical structure.
used in this measurement. Last, we want to use better approximation algorithms to
Figure 6 demonstrates the dependency of the query pro- determine feasible regions from finer attribute bins.
cessing time on a hit ratio of the query, i.e., the ratio of
selected cells vs total cells in the array. BoxClip algorithm 8. ACKNOWLEDGEMENT
does not prune the search space based on the dimensions,
resulting in number of hits dependent on the attribute only. This research was supported in part by AcRF Grant RG-
Filtering these is is time intensive. DimsAtts depends lin- 18/14.
early on the total number of dimensions. This is because
there is an additional attribute for each dimension. There
is also a small dependency on the hit ratio, where the in-4 8 BoxClip
10 DimsAtts
6 ArrayBit
time [s]
2 4
5
2
0 0 0
0 20 40 60 80 100 0 50 100 0 50 100
2D – result/array ratio [%] 3D – result/array ratio [%] 4D – result/array ratio [%]
Figure 6: Query execution time for 2D, 3D and 4D queries of various hit ratios. Queries contained an attribute constraint
and all dimension constraints, each constraint with approximately the same domain reduction.
9. REFERENCES [19] H. Samet. The quadtree and related hierarchical data
[1] G. Antoshenkov. Byte-aligned bitmap compression. In Data structures. ACM Computing Surveys (CSUR), 16(2):187–260,
Compression Conference, 1995. DCC’95. Proceedings, page 1984.
476. IEEE, 1995. [20] H. Samet. Applications of spatial data structures. 1990.
[2] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and [21] H. Samet. Foundations of multidimensional and metric data
N. Widmann. The multidimensional database system rasdaman. structures. Morgan Kaufmann, 2006.
In Acm Sigmod Record, volume 27, pages 575–577. ACM, 1998.
[22] R. R. Sinha, S. Mitra, and M. Winslett. Bitmap indexes for
[3] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The large scientific data sets: A case study. In Proceedings 20th
R*-tree: an efficient and robust access method for points and IEEE International Parallel & Distributed Processing
rectangles, volume 19. ACM, 1990. Symposium, pages 10–pp. IEEE, 2006.
[4] M. J. Berger and P. Colella. Local adaptive mesh refinement for [23] R. R. Sinha and M. Winslett. Multi-resolution bitmap indexes
shock hydrodynamics. Journal of computational Physics, for scientific data. ACM Transactions on Database Systems
82(1):64–84, 1989. (TODS), 32(3):16, 2007.
[5] C. Chan and Y. Ioannidis. An efficient bitmap encoding scheme [24] T. L. L. Siqueira, C. D. de Aguiar Ciferri, V. C. Times, and
for selection queries. ACM SIGMOD Record, 1999. R. R. Ciferri. The sb-index and the hsb-index: efficient indices
[6] C.-Y. Chan and Y. E. Ioannidis. Bitmap index design and for spatial data warehouses. Geoinformatica, 16(1):165–205,
evaluation. In ACM SIGMOD Record, volume 27, pages 2012.
355–366. ACM, 1998. [25] K. Stockinger. Bitmap indices for speeding up high-dimensional
[7] J. Chmiel, T. Morzy, and R. Wrembel. Time-HOBI: indexing data analysis. In Database and Expert Systems Applications,
dimension hierarchies by means of hierarchically organized pages 881–890. Springer, 2002.
bitmaps. In Proceedings of the ACM 13th international [26] K. Stockinger and K. Wu. Bitmap indices for data warehouses.
workshop on Data warehousing and OLAP - DOLAP ’10, Data Warehouses and OLAP: Concepts, Architectures and
page 69, New York, New York, USA, oct 2010. ACM Press. Solutions, page 57, 2006.
[8] J. Chou, M. Howison, B. Austin, K. Wu, J. Qiang, E. Bethel, [27] M. Stonebraker, P. Brown, D. Zhang, and J. Becla. SciDB: A
A. Shoshani, O. Rübel, R. D. Ryne, et al. Parallel index and database management system for applications with complex
query for large scale data analysis. In Proceedings of 2011 analytics. Computing in Science and Engineering,
International Conference for High Performance Computing, 15(3):54–62, 2013.
Networking, Storage and Analysis, page 30. ACM, 2011.
[28] Y. Su, Y. Wang, and G. Agrawal. In-situ bitmaps generation
[9] L. Gosink, J. Shalf, K. Stockinger, K. Wu, and W. Bethel. and efficient data analysis based on bitmaps. In Proceedings of
Hdf5-fastquery: Accelerating complex queries on hdf datasets the 24th International Symposium on High-Performance
using fast bitmap indices. In Scientific and Statistical Parallel and Distributed Computing, pages 61–72. ACM, 2015.
Database Management, 2006. 18th International Conference
[29] Y. Wang, Y. Su, and G. Agrawal. A novel approach for
on, pages 149–158. IEEE, 2006.
approximate aggregations over arrays. In Proceedings of the
[10] A. Guttman. R-trees: a dynamic index structure for spatial 27th International Conference on Scientific and Statistical
searching, volume 14. ACM, 1984. Database Management, page 4. ACM, 2015.
[11] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, [30] Y. Wang, Y. Su, G. Agrawal, and T. Liu. Scisd: Novel subgroup
K. C. Sevcik, and T. Suel. Optimal histograms with quality discovery over scientific datasets using bitmap indices.
guarantees. In VLDB, volume 98, pages 24–27, 1998. Proceedings of Ohio State CSE Technical Report, 2015.
[12] A. Kalinin, U. Cetintemel, and S. Zdonik. Searchlight: enabling [31] K. Wu, S. Ahern, E. W. Bethel, J. Chen, H. Childs,
integrated search and exploration over large multidimensional E. Cormier-Michel, C. Geddes, J. Gu, H. Hagen, B. Hamann,
data. Proc. of the VLDB Endowment, 8(10):1094–1105, 2015. et al. Fastbit: interactively searching massive data. In Journal
[13] J. Lawder and P. King. Querying multi-dimensional data of Physics: Conference Series, volume 180, page 012053. IOP
indexed using the Hilbert space-filling curve. ACM Sigmod Publishing, 2009.
Record, 2001. [32] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap
[14] J. K. Lawder and P. J. King. Using space-filling curves for indices with efficient compression. ACM Transactions on
multi-dimensional indexing. In Advances in Databases, pages Database Systems (TODS), 31(1):1–38, 2006.
20–35. Springer, 2000. [33] K. Wu, A. Shoshani, and K. Stockinger. Analyses of multi-level
[15] T. L. Lopes Siqueira, R. R. Ciferri, V. C. Times, and C. D. and multi-component compressed bitmap indexes. ACM
de Aguiar Ciferri. A spatial bitmap-based index for Transactions on Database Systems (TODS), 35(1):2, 2010.
geographical data warehouses. In Proceedings of the 2009 [34] K. Wu, K. Stockinger, and A. Shoshani. Breaking the curse of
ACM symposium on Applied Computing, pages 1336–1342. cardinality on bitmap indexes. In International Conference on
ACM, 2009. Scientific and Statistical Database Management, pages
[16] T. Lungu and P. S. Callahan. QuikSCAT science data product 348–365. Springer, 2008.
user’s manual: Overview and geophysical data products. [35] K.-L. Wu and P. S. Yu. Range-based bitmap indexing for high
D-18053-Rev A, version, 3:91, 2006. cardinality attributes with skew. In COMPSAC’98.
[17] P. Nagarkar, K. Candan, and A. Bhat. Compressed spatial Proceedings. The Twenty-Second Annual International, pages
hierarchical bitmap (cSHB) indexes for efficiently processing 61–66. IEEE, 1998.
spatial range query workloads. Proceedings of the VLDB [36] G. Zhu, Y. Wang, and G. Agrawal. Scicsm: novel contrast set
Endowment, 2015. mining over scientific datasets using bitmap indices. In
[18] P. O’Neil and D. Quass. Improved query performance with Proceedings of the 27th International Conference on Scientific
variant indexes. In ACM Sigmod Record, volume 26, pages and Statistical Database Management, page 38. ACM, 2015.
38–49. ACM, 1997.You can also read