Impala: A Modern, Open-Source SQL Engine for Hadoop

               Marcel Kornacker Alexander Behm Victor Bittorf Taras Bobrovytsky
              Casey Ching Alan Choi Justin Erickson Martin Grund Daniel Hecht
              Matthew Jacobs Ishaan Joshi Lenni Kuff Dileep Kumar Alex Leblang
              Nong Li Ippokratis Pandis Henry Robinson David Rorke Silvius Rus
            John Russell Dimitris Tsirogiannis Skye Wanderman-Milne Michael Yoder
                                                                     Cloudera
                                                                  http://impala.io/

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2015. 7th Biennial Conference on Innovative Data Systems Research (CIDR'15), January 4-7, 2015, Asilomar, California, USA.

ABSTRACT

Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Apache Hive. This paper presents Impala from a user's perspective, gives an overview of its architecture and main components, and briefly demonstrates its superior performance compared against other popular SQL-on-Hadoop systems.

1. INTRODUCTION

Impala is an open-source¹, fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage the flexibility and scalability of Hadoop. Impala's goal is to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise. Impala's beta release was in October 2012 and it GA'ed in May 2013. The most recent version, Impala 2.0, was released in October 2014. Impala's ecosystem momentum continues to accelerate, with nearly one million downloads since its GA.

¹ https://github.com/cloudera/impala

Unlike other systems (often forks of Postgres), Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, YARN, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. The result is performance that is on par with or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload.

This paper discusses the services Impala provides to the user and then presents an overview of its architecture and main components. The highest performance that is achievable today requires using HDFS as the underlying storage manager, and therefore that is the focus of this paper; when there are notable differences in terms of how certain technical aspects are handled in conjunction with HBase, we note that in the text without going into detail.

Impala is the highest performing SQL-on-Hadoop system, especially under multi-user workloads. As Section 7 shows, for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average – or nearly three times faster on average for multi-user queries than for single-user ones.

The remainder of this paper is structured as follows: the next section gives an overview of Impala from the user's perspective and points out how it differs from a traditional RDBMS. Section 3 presents the overall architecture of the system. Section 4 presents the frontend component, which includes a cost-based distributed query optimizer; Section 5 presents the backend component, which is responsible for query execution and employs runtime code generation; and Section 6 presents the resource/workload management component. Section 7 briefly evaluates the performance of Impala. Section 8 discusses the roadmap ahead and Section 9 concludes.

2. USER VIEW OF IMPALA

Impala is a query engine which is integrated into the Hadoop environment and utilizes a number of standard Hadoop components (Metastore, HDFS, HBase, YARN, Sentry) in order to deliver an RDBMS-like experience. However, there are some important differences that will be brought up in the remainder of this section.

Impala was specifically targeted for integration with standard business intelligence environments, and to that end supports most relevant industry standards: clients can connect via ODBC or JDBC; authentication is accomplished with Kerberos or LDAP; authorization follows the standard SQL roles and privileges². In order to query HDFS-resident data, the user creates tables via the familiar CREATE TABLE statement, which, in addition to providing the logical schema of the data, also indicates the physical layout, such as file format(s) and placement within the HDFS directory structure. Those tables can then be queried with standard SQL syntax.

² This is provided by another standard Hadoop component called Sentry [4], which also makes role-based authorization available to Hive and other components.
2.1 Physical schema design

When creating a table, the user can also specify a list of partition columns:

CREATE TABLE T (...) PARTITIONED BY (day int, month int) LOCATION '<hdfs-path>' STORED AS PARQUET;

For an unpartitioned table, data files are stored by default directly in the root directory³. For a partitioned table, data files are placed in subdirectories whose paths reflect the partition columns' values. For example, for day 17, month 2 of table T, all data files would be located in directory <root>/day=17/month=2/. Note that this form of partitioning does not imply a collocation of the data of an individual partition: the blocks of the data files of a partition are distributed randomly across HDFS data nodes.

³ However, all data files that are located in any directory below the root are part of the table's data set. That is a common approach for dealing with unpartitioned tables, employed also by Apache Hive.

Impala also gives the user a great deal of flexibility when choosing file formats. It currently supports compressed and uncompressed text files, sequence file (a splittable form of text files), RCFile (a legacy columnar format), Avro (a binary row format), and Parquet, the highest-performance storage option (Section 5.3 discusses file formats in more detail). As in the example above, the user indicates the storage format in the CREATE TABLE or ALTER TABLE statements. It is also possible to select a separate format for each partition individually. For example, one can specifically set the file format of a particular partition to Parquet with:

ALTER TABLE T PARTITION(day=17, month=2) SET FILEFORMAT PARQUET;

As an example of when this is useful, consider a table with chronologically recorded data, such as click logs. The data for the current day might come in as CSV files and get converted in bulk to Parquet at the end of each day.

2.2 SQL Support

Impala supports most of the SQL-92 SELECT statement syntax, plus additional SQL-2003 analytic functions, and most of the standard scalar data types: integer and floating point types, STRING, CHAR, VARCHAR, TIMESTAMP, and DECIMAL with up to 38 digits of precision. Custom application logic can be incorporated through user-defined functions (UDFs) in Java and C++, and user-defined aggregate functions (UDAs), currently only in C++.

Due to the limitations of HDFS as a storage manager, Impala does not support UPDATE or DELETE, and essentially only supports bulk insertions (INSERT INTO ... SELECT ...)⁴. Unlike in a traditional RDBMS, the user can add data to a table simply by copying/moving data files into the directory location of that table, using HDFS's API. Alternatively, the same can be accomplished with the LOAD DATA statement.

⁴ We should also note that Impala supports the VALUES clause. However, for HDFS-backed tables this will generate one file per INSERT statement, which leads to very poor performance for most applications. For HBase-backed tables, the VALUES variant performs single-row inserts by means of the HBase API.

Similarly to bulk insert, Impala supports bulk data deletion by dropping a table partition (ALTER TABLE DROP PARTITION). Because it is not possible to update HDFS files in-place, Impala does not support an UPDATE statement. Instead, the user typically recomputes parts of the data set to incorporate updates, and then replaces the corresponding data files, often by dropping and re-adding the partition.

After the initial data load, or whenever a significant fraction of the table's data changes, the user should run the COMPUTE STATS <table> statement, which instructs Impala to gather statistics on the table. Those statistics will subsequently be used during query optimization.

3. ARCHITECTURE

Impala is a massively-parallel query execution engine, which runs on hundreds of machines in existing Hadoop clusters. It is decoupled from the underlying storage engine, unlike traditional relational database management systems where the query processing and the underlying storage engine are components of a single tightly-coupled system. Impala's high-level architecture is shown in Figure 1.

An Impala deployment is comprised of three services. The Impala daemon (impalad) service is dually responsible for accepting queries from client processes and orchestrating their execution across the cluster, and for executing individual query fragments on behalf of other Impala daemons. When an Impala daemon operates in the first role by managing query execution, it is said to be the coordinator for that query. However, all Impala daemons are symmetric; they may all operate in all roles. This property helps with fault-tolerance and with load-balancing.

One Impala daemon is deployed on every machine in the cluster that is also running a datanode process - the block server for the underlying HDFS deployment - and therefore there is typically one Impala daemon on every machine. This allows Impala to take advantage of data locality and to read blocks from the filesystem without having to use the network.

The Statestore daemon (statestored) is Impala's metadata publish-subscribe service, which disseminates cluster-wide metadata to all Impala processes. There is a single statestored instance, which is described in more detail in Section 3.1 below.

Finally, the Catalog daemon (catalogd), described in Section 3.2, serves as Impala's catalog repository and metadata access gateway. Through the catalogd, Impala daemons may execute DDL commands that are reflected in external catalog stores such as the Hive Metastore. Changes to the system catalog are broadcast via the statestore.

All these Impala services, as well as several configuration options, such as the sizes of the resource pools, the available memory, etc. (see Section 6 for more details about resource and workload management), are also exposed to Cloudera Manager, a sophisticated cluster management application⁵. Cloudera Manager can administer not only Impala but also pretty much every other service, providing a holistic view of a Hadoop deployment.

⁵ http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager.html
[Figure 1: Impala is a distributed query processing system for the Hadoop ecosystem; the figure also shows the flow during query processing. A SQL application submits a query over ODBC to one impalad (1, "send SQL"); each impalad contains a query planner, query coordinator and query executor; the coordinating impalad distributes plan fragments to the other impalads (2, 3), the executors read from the colocated HDFS DataNodes and HBase (4) and exchange intermediate results (5), and the query results are returned to the client (6). The Hive Metastore, HDFS NameNode, Statestore and Catalog services are shown alongside.]

3.1 State distribution

A major challenge in the design of an MPP database that is intended to run on hundreds of nodes is the coordination and synchronization of cluster-wide metadata. Impala's symmetric-node architecture requires that all nodes must be able to accept and execute queries. Therefore all nodes must have, for example, up-to-date versions of the system catalog and a recent view of the Impala cluster's membership so that queries may be scheduled correctly.

We might approach this problem by deploying a separate cluster-management service, with ground-truth versions of all cluster-wide metadata. Impala daemons could then query this store lazily (i.e. only when needed), which would ensure that all queries were given up-to-date responses. However, a fundamental tenet in Impala's design has been to avoid synchronous RPCs wherever possible on the critical path of any query. Without paying close attention to these costs, we have found that query latency is often compromised by the time taken to establish a TCP connection or by load on some remote service. Instead, we have designed Impala to push updates to all interested parties through a simple publish-subscribe service called the statestore, which disseminates metadata changes to a set of subscribers.

The statestore maintains a set of topics, which are arrays of (key, value, version) triplets called entries, where 'key' and 'value' are byte arrays and 'version' is a 64-bit integer. A topic is defined by an application, and so the statestore has no understanding of the contents of any topic entry. Topics are persistent through the lifetime of the statestore, but are not persisted across service restarts. Processes that wish to receive updates to any topic are called subscribers, and express their interest by registering with the statestore at start-up and providing a list of topics. The statestore responds to registration by sending the subscriber an initial topic update for each registered topic, which consists of all the entries currently in that topic.

After registration, the statestore periodically sends two kinds of messages to each subscriber. The first kind of message is a topic update, and consists of all changes to a topic (new entries, modified entries and deletions) since the last update was successfully sent to the subscriber. Each subscriber maintains a per-topic most-recent-version identifier which allows the statestore to only send the delta between updates. In response to a topic update, each subscriber sends a list of changes it wishes to make to its subscribed topics. Those changes are guaranteed to have been applied by the time the next update is received.
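To make the topic mechanism concrete, the following minimal C++ sketch (types and names are hypothetical, not Impala's statestore code) models a topic as versioned (key, value, version) entries and computes the delta to send to a subscriber from the most recent version that subscriber has acknowledged; deletions are retained as versioned tombstones so they appear in the delta.

// Hypothetical sketch of the statestore's per-topic state: entries are
// (key, value, version) triplets, and a subscriber's acknowledged version
// lets the statestore send only the changes it has not yet seen.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct TopicEntry {
  std::string key;      // opaque byte array from the statestore's point of view
  std::string value;    // opaque byte array
  int64_t version = 0;  // 64-bit version assigned by the statestore
  bool deleted = false;
};

class Topic {
 public:
  // Apply a change reported back by a subscriber; bump the topic version.
  void Upsert(const std::string& key, const std::string& value) {
    TopicEntry& e = entries_[key];
    e.key = key;
    e.value = value;
    e.version = ++last_version_;
    e.deleted = false;
  }

  void Delete(const std::string& key) {
    auto it = entries_.find(key);
    if (it == entries_.end()) return;
    it->second.deleted = true;  // tombstone, so the deletion reaches subscribers
    it->second.version = ++last_version_;
  }

  // Build the update for one subscriber: only entries newer than the version
  // that subscriber last acknowledged (new, modified and deleted entries).
  std::vector<TopicEntry> DeltaSince(int64_t subscriber_version) const {
    std::vector<TopicEntry> delta;
    for (const auto& kv : entries_) {
      if (kv.second.version > subscriber_version) delta.push_back(kv.second);
    }
    return delta;
  }

 private:
  std::map<std::string, TopicEntry> entries_;
  int64_t last_version_ = 0;
};

In this sketch, the initial topic update sent at registration is simply DeltaSince(0), i.e., all entries currently in the topic.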
The second kind of statestore message is a keepalive. The statestore uses keepalive messages to maintain the connection to each subscriber, which would otherwise time out its subscription and attempt to re-register. Previous versions of the statestore used topic update messages for both purposes, but as the size of topic updates grew it became difficult to ensure timely delivery of updates to each subscriber, leading to false positives in the subscriber's failure-detection process.

If the statestore detects a failed subscriber (for example, by repeated failed keepalive deliveries), it will cease sending updates. Some topic entries may be marked as 'transient', meaning that if their 'owning' subscriber should fail, they will be removed. This is a natural primitive with which to maintain liveness information for the cluster in a dedicated topic, as well as per-node load statistics.

The statestore provides very weak semantics: subscribers may be updated at different rates (although the statestore tries to distribute topic updates fairly), and may therefore have very different views of the content of a topic. However, Impala only uses topic metadata to make decisions locally, without any coordination across the cluster. For example, query planning is performed on a single node based on the catalog metadata topic, and once a full plan has been computed, all information required to execute that plan is distributed directly to the executing nodes. There is no requirement that an executing node should know about the same version of the catalog metadata topic.

Although there is only a single statestore process in existing Impala deployments, we have found that it scales well to medium-sized clusters and, with some configuration, can serve our largest deployments. The statestore does not persist any metadata to disk: all current metadata is pushed to the statestore by live subscribers (e.g. load information). Therefore, should a statestore restart, its state can be recovered during the initial subscriber registration phase. If the machine that the statestore is running on fails, a new statestore process can be started elsewhere, and subscribers may fail over to it. There is no built-in failover mechanism in Impala; instead, deployments commonly use a retargetable DNS entry to force subscribers to automatically move to the new process instance.

3.2 Catalog service

Impala's catalog service serves catalog metadata to Impala daemons via the statestore broadcast mechanism, and executes DDL operations on behalf of Impala daemons. The catalog service pulls information from third-party metadata stores (for example, the Hive Metastore or the HDFS Namenode), and aggregates that information into an Impala-compatible catalog structure. This architecture allows Impala to be relatively agnostic of the metadata stores for the storage engines it relies upon, which allows us to add new metadata stores to Impala relatively quickly (e.g. HBase support). Any changes to the system catalog (e.g. when a new table has been loaded) are disseminated via the statestore.

The catalog service also allows us to augment the system catalog with Impala-specific information. For example, we register user-defined functions only with the catalog service (without replicating this to the Hive Metastore, for example), since they are specific to Impala.

Since catalogs are often very large, and access to tables is rarely uniform, the catalog service only loads a skeleton entry for each table it discovers on startup. More detailed table metadata can be loaded lazily in the background from its third-party stores. If a table is required before it has been fully loaded, an Impala daemon will detect this and issue a prioritization request to the catalog service. This request blocks until the table is fully loaded.

4. FRONTEND

The Impala frontend is responsible for compiling SQL text into query plans executable by the Impala backends. It is written in Java and consists of a fully-featured SQL parser and cost-based query optimizer, all implemented from scratch. In addition to the basic SQL features (select, project, join, group by, order by, limit), Impala supports inline views, uncorrelated and correlated subqueries (that are rewritten as joins), all variants of outer joins as well as explicit left/right semi- and anti-joins, and analytic window functions.

The query compilation process follows a traditional division of labor: query parsing, semantic analysis, and query planning/optimization. We will focus on the latter, most challenging, part of query compilation. The Impala query planner is given as input a parse tree together with query-global information assembled during semantic analysis (table/column identifiers, equivalence classes, etc.). An executable query plan is constructed in two phases: (1) single-node planning and (2) plan parallelization and fragmentation.

In the first phase, the parse tree is translated into a non-executable single-node plan tree, consisting of the following plan nodes: HDFS/HBase scan, hash join, cross join, union, hash aggregation, sort, top-n, and analytic evaluation. This step is responsible for assigning predicates at the lowest possible plan node, inferring predicates based on equivalence classes, pruning table partitions, setting limits/offsets, applying column projections, as well as performing some cost-based plan optimizations such as ordering and coalescing analytic window functions and join reordering to minimize the total evaluation cost. Cost estimation is based on table/partition cardinalities plus distinct value counts for each column⁶; histograms are currently not part of the statistics. Impala uses simple heuristics to avoid exhaustively enumerating and costing the entire join-order space in common cases.

⁶ We use the HyperLogLog algorithm [5] for distinct value estimation.

The second planning phase takes the single-node plan as input and produces a distributed execution plan. The general goal is to minimize data movement and maximize scan locality: in HDFS, remote reads are considerably slower than local ones. The plan is made distributed by adding exchange nodes between plan nodes as necessary, and by adding extra non-exchange plan nodes to minimize data movement across the network (e.g., local aggregation nodes). During this second phase, we decide the join strategy for every join node (the join order is fixed at this point). The supported join strategies are broadcast and partitioned. The former replicates the entire build side of a join to all cluster machines executing the probe, and the latter hash-redistributes both the build and probe side on the join expressions. Impala chooses whichever strategy is estimated to minimize the amount of data exchanged over the network, also exploiting existing data partitioning of the join inputs.
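The join-strategy decision can be illustrated with the following hypothetical C++ sketch (our own simplification, not the planner's code): it compares the estimated bytes a broadcast join and a partitioned join would move over the network and picks the cheaper strategy, reusing an input's existing partitioning where possible.

// Hypothetical illustration of the decision rule described above: choose the
// join strategy that is estimated to exchange the least data over the network.
#include <cstdint>

enum class JoinStrategy { BROADCAST, PARTITIONED };

// Estimated size of one join input, e.g. derived from table/partition
// cardinalities and per-column statistics.
struct JoinInput {
  int64_t bytes;
  bool partitioned_on_join_exprs;  // existing partitioning that can be reused
};

JoinStrategy ChooseJoinStrategy(const JoinInput& build, const JoinInput& probe,
                                int num_probe_nodes) {
  // Broadcast: the entire build side is replicated to every node executing the probe.
  int64_t broadcast_cost = build.bytes * num_probe_nodes;

  // Partitioned: both sides are hash-redistributed on the join expressions,
  // except for a side that is already partitioned that way and need not move.
  int64_t partitioned_cost =
      (build.partitioned_on_join_exprs ? 0 : build.bytes) +
      (probe.partitioned_on_join_exprs ? 0 : probe.bytes);

  return broadcast_cost <= partitioned_cost ? JoinStrategy::BROADCAST
                                            : JoinStrategy::PARTITIONED;
}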
All aggregation is currently executed as a local pre-aggregation followed by a merge aggregation operation. For grouping aggregations, the pre-aggregation output is partitioned on the grouping expressions and the merge aggregation is done in parallel on all participating nodes. For non-grouping aggregations, the merge aggregation is done on a single node. Sort and top-n are parallelized in a similar fashion: a distributed local sort/top-n is followed by a single-node merge operation. Analytic expression evaluation is parallelized based on the partition-by expressions. It relies on its input being sorted on the partition-by/order-by expressions. Finally, the distributed plan tree is split up at exchange boundaries. Each such portion of the plan is placed inside a plan fragment, Impala's unit of backend execution. A plan fragment encapsulates a portion of the plan tree that operates on the same data partition on a single machine.
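The two-phase scheme can be sketched as follows; the example is a hypothetical grouping COUNT(*) in C++ and only illustrates the pre-aggregation/merge split, not Impala's actual operators.

// Hypothetical sketch of two-phase distributed aggregation for a grouping
// COUNT(*): every node pre-aggregates the rows it scanned, the partial results
// are hash-exchanged on the grouping key, and a merge aggregation combines them.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using PartialCounts = std::unordered_map<std::string, int64_t>;

// Phase 1: local pre-aggregation over the rows this node scanned.
PartialCounts PreAggregate(const std::vector<std::string>& local_group_keys) {
  PartialCounts partial;
  for (const auto& key : local_group_keys) ++partial[key];
  return partial;
}

// Phase 2: merge aggregation over the partials routed to this node by the
// hash exchange on the grouping expression.
PartialCounts MergeAggregate(const std::vector<PartialCounts>& partials) {
  PartialCounts merged;
  for (const auto& partial : partials) {
    for (const auto& kv : partial) merged[kv.first] += kv.second;
  }
  return merged;
}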
Figure 2 illustrates the two phases of query planning with an example. The left side of the figure shows the single-node plan of a query joining two HDFS tables (t1, t2) and one HBase table (t3) followed by an aggregation and order by with limit (top-n). The right-hand side shows the distributed, fragmented plan. Rounded rectangles indicate fragment boundaries and arrows data exchanges. Tables t1 and t2 are joined via the partitioned strategy. The scans are in a fragment of their own since their results are immediately exchanged to a consumer (the join node) which operates on a hash-based partition of the data, whereas the table data is randomly partitioned. The following join with t3 is a broadcast join placed in the same fragment as the join between t1 and t2 because a broadcast join preserves the existing data partition (the results of joining t1, t2, and t3 are still hash partitioned based on the join keys of t1 and t2). After the joins we perform a two-phase distributed aggregation, where a pre-aggregation is computed in the same fragment as the last join. The pre-aggregation results are hash-exchanged based on the grouping keys, and then aggregated once more to compute the final aggregation result. The same two-phased approach is applied to the top-n, and the final top-n step is performed at the coordinator, which returns the results to the user.
[Figure 2: Example of the two-phase query optimization. The left side shows the single-node plan: Scan: t1 and Scan: t2 feed a HashJoin, which joins with Scan: t3 in a second HashJoin, followed by Agg and TopN. The right side shows the distributed, fragmented plan: the scans of t1 and t2 are hash-exchanged (Hash(t1.id1), Hash(t2.id)) into a partitioned HashJoin, Scan: t3 is broadcast into the next HashJoin, a Pre-Agg feeds a hash exchange on Hash(t1.custid) into MergeAgg and TopN, and the final Merge/TopN runs at the coordinator. Fragments are annotated as executing at HDFS, at HBase, and at the coordinator.]

5. BACKEND

Impala's backend receives query fragments from the frontend and is responsible for their fast execution. It is designed to take advantage of modern hardware. The backend is written in C++ and uses code generation at runtime to produce efficient codepaths (with respect to instruction count) and small memory overhead, especially compared to other engines implemented in Java.

Impala leverages decades of research in parallel databases. The execution model is the traditional Volcano-style with Exchange operators [7]. Processing is performed batch-at-a-time: each GetNext() call operates over batches of rows, similar to [10]. With the exception of "stop-and-go" operators (e.g. sorting), the execution is fully pipeline-able, which minimizes the memory consumption for storing intermediate results. When processed in memory, the tuples have a canonical in-memory row-oriented format.

Operators that may need to consume lots of memory are designed to be able to spill parts of their working set to disk if needed. The operators that are spillable are the hash join, (hash-based) aggregation, sorting, and analytic function evaluation.

Impala employs a partitioning approach for the hash join and aggregation operators. That is, some bits of the hash value of each tuple determine the target partition and the remaining bits are used for the hash table probe. During normal operation, when all hash tables fit in memory, the overhead of the partitioning step is minimal, within 10% of the performance of a non-spillable non-partitioning-based implementation. When there is memory pressure, a "victim" partition may be spilled to disk, thereby freeing memory for other partitions to complete their processing. When building the hash tables for the hash joins and there is a reduction in cardinality of the build-side relation, we construct a Bloom filter which is then passed on to the probe-side scanner, implementing a simple version of a semi-join.
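A minimal sketch of that bit split follows; the constants are illustrative choices of ours, not Impala's actual partition counts or hash widths.

// Hypothetical sketch: the top bits of a tuple's hash select the partition,
// the remaining bits are used when probing that partition's hash table.
#include <cstdint>

constexpr int kPartitionBits = 4;  // illustrative: 2^4 = 16 partitions

struct HashSplit {
  uint32_t partition;   // which partition the tuple is routed to
  uint64_t probe_hash;  // hash used within that partition's hash table
};

inline HashSplit SplitHash(uint64_t hash) {
  HashSplit s;
  s.partition = static_cast<uint32_t>(hash >> (64 - kPartitionBits));
  s.probe_hash = hash & ((UINT64_C(1) << (64 - kPartitionBits)) - 1);
  return s;
}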
[Figure 3: Interpreted vs. codegen'd code in Impala. The interpreted path evaluates the expression (col1 + 10) * 7 / col2 by following a tree of function pointers; the codegen'd path collapses the tree into a single generated function: IntVal my_func(const IntVal& v1, const IntVal& v2) { return IntVal(v1.val * 7 / v2.val); }]

5.1 Runtime Code Generation

Runtime code generation using LLVM [8] is one of the techniques employed extensively by Impala's backend to improve execution times. Performance gains of 5x or more are typical for representative workloads.

LLVM is a compiler library and collection of related tools. Unlike traditional compilers that are implemented as stand-alone applications, LLVM is designed to be modular and reusable. It allows applications like Impala to perform just-in-time (JIT) compilation within a running process, with the full benefits of a modern optimizer and the ability to generate machine code for a number of architectures, by exposing separate APIs for all steps of the compilation process.

Impala uses runtime code generation to produce query-specific versions of functions that are critical to performance. In particular, code generation is applied to "inner loop" functions, i.e., those that are executed many times (for every tuple) in a given query, and thus constitute a large portion of the total time the query takes to execute. For example, a function used to parse a record in a data file into Impala's in-memory tuple format must be called for every record in every data file scanned. For queries scanning large tables, this could be billions of records or more. This function must therefore be extremely efficient for good query performance, and even removing a few instructions from the function's execution can result in large query speedups.
Without code generation, inefficiencies in function execution are almost always necessary in order to handle runtime information not known at program compile time. For example, a record-parsing function that only handles integer types will be faster at parsing an integer-only file than a function that handles other data types such as strings and floating-point numbers as well. However, the schemas of the files to be scanned are unknown at compile time, and so a general-purpose function must be used, even if at runtime it is known that more limited functionality is sufficient.

A major source of runtime overhead is virtual functions. Virtual function calls incur a large performance penalty, particularly when the called function is very simple, as the calls cannot be inlined. If the type of the object instance is known at runtime, we can use code generation to replace the virtual function call with a call directly to the correct function, which can then be inlined. This is especially valuable when evaluating expression trees. In Impala (as in many systems), expressions are composed of a tree of individual operators and functions, as illustrated in the left-hand side of Figure 3. Each type of expression that can appear in a tree is implemented by overriding a virtual function in the expression base class, which recursively calls its child expressions. Many of these expression functions are quite simple, e.g., adding two numbers. Thus, the cost of calling the virtual function often far exceeds the cost of actually evaluating the function. As illustrated in Figure 3, by resolving the virtual function calls with code generation and then inlining the resulting function calls, the expression tree can be evaluated directly with no function call overhead. In addition, inlining functions increases instruction-level parallelism, and allows the compiler to make further optimizations such as subexpression elimination across expressions.
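The contrast can be sketched in C++ as follows (a hypothetical, simplified mirror of Figure 3, not Impala's actual expression classes): the interpreted path walks a tree of virtual calls for every row, whereas code generation emits the equivalent of one flat function that the JIT can inline into the scan loop.

// Hypothetical sketch of interpreted vs. generated expression evaluation.
#include <cstdint>
#include <memory>

struct IntVal { int64_t val; };

// Interpreted path: every operator is a virtual call that cannot be inlined.
class Expr {
 public:
  virtual ~Expr() = default;
  virtual IntVal Eval(const int64_t* row) const = 0;
};

class SlotRef : public Expr {  // reads a column from the row
 public:
  explicit SlotRef(int idx) : idx_(idx) {}
  IntVal Eval(const int64_t* row) const override { return {row[idx_]}; }
 private:
  int idx_;
};

class Literal : public Expr {
 public:
  explicit Literal(int64_t v) : v_(v) {}
  IntVal Eval(const int64_t*) const override { return {v_}; }
 private:
  int64_t v_;
};

class Add : public Expr {
 public:
  Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : l_(std::move(l)), r_(std::move(r)) {}
  IntVal Eval(const int64_t* row) const override {
    return {l_->Eval(row).val + r_->Eval(row).val};
  }
 private:
  std::unique_ptr<Expr> l_, r_;
};

// ... Mul and Div nodes would look the same; evaluating (col1 + 10) * 7 / col2
// through such a tree costs one virtual call per node, for every row.

// Codegen'd path: for this specific query, runtime code generation emits the
// equivalent of the following single function, which can be inlined into the
// scan loop -- no virtual calls, no tree walk.
inline IntVal EvalGenerated(const int64_t* row) {
  return {(row[0] + 10) * 7 / row[1]};
}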
Overall, JIT compilation has an effect similar to custom-coding a query. For example, it eliminates branches, unrolls loops, propagates constants, offsets and pointers, and inlines functions. Code generation has a dramatic impact on performance, as shown in Figure 4. For example, on a 10-node cluster with each node having 8 cores, 48GB RAM and 12 disks, we measured the impact of codegen using an Avro TPC-H database of scaling factor 100 and running simple aggregation queries. Code generation speeds up the execution by up to 5.7x, with the speedup increasing with the query complexity.

[Figure 4: Impact of run-time code generation on performance in Impala. Query time (in secs) with codegen off vs. on for three queries: select count(*) from lineitem (1.19x speedup), select count(l_orderkey) from lineitem (1.87x), and TPC-H Q1 (5.7x).]

5.2 I/O Management

Efficiently retrieving data from HDFS is a challenge for all SQL-on-Hadoop systems. In order to perform data scans from both disk and memory at or near hardware speed, Impala uses an HDFS feature called short-circuit local reads [3] to bypass the DataNode protocol when reading from local disk. Impala can read at almost disk bandwidth (approx. 100MB/s per disk) and is typically able to saturate all available disks. We have measured that with 12 disks, Impala is capable of sustaining I/O at 1.2GB/sec. Furthermore, HDFS caching [2] allows Impala to access memory-resident data at memory bus speed and also saves CPU cycles as there is no need to copy data blocks and/or checksum them.

Reading/writing data from/to storage devices is the responsibility of the I/O manager component. The I/O manager assigns a fixed number of worker threads per physical disk (one thread per rotational disk and eight per SSD), providing an asynchronous interface to clients (e.g. scanner threads). The effectiveness of Impala's I/O manager was recently corroborated by [6], which shows that Impala's read throughput is from 4x up to 8x higher than that of the other tested systems.
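The following is a minimal sketch of such a per-disk queue (our own illustration, not Impala's I/O manager code), using the one-thread-per-rotational-disk, eight-per-SSD sizing mentioned above and exposing an asynchronous submit/callback interface to scanner threads.

// Hypothetical per-disk queue: a fixed set of worker threads services read
// requests asynchronously on behalf of clients such as scanner threads.
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct ReadRequest {
  std::string path;
  int64_t offset = 0;
  int64_t len = 0;
  std::function<void(const ReadRequest&)> done;  // invoked when the read completes
};

class DiskQueue {
 public:
  // One worker thread per rotational disk, eight per SSD (the ratio above).
  explicit DiskQueue(bool is_ssd) {
    int num_threads = is_ssd ? 8 : 1;
    for (int i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }

  ~DiskQueue() {
    {
      std::lock_guard<std::mutex> l(mu_);
      shutdown_ = true;
    }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }

  // Asynchronous interface: the caller enqueues a request and returns immediately.
  void Submit(ReadRequest req) {
    {
      std::lock_guard<std::mutex> l(mu_);
      queue_.push(std::move(req));
    }
    cv_.notify_one();
  }

 private:
  void WorkerLoop() {
    for (;;) {
      ReadRequest req;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return shutdown_ || !queue_.empty(); });
        if (shutdown_ && queue_.empty()) return;
        req = std::move(queue_.front());
        queue_.pop();
      }
      // A real implementation would issue the (short-circuit) read here,
      // then hand the data back to the client via the callback.
      if (req.done) req.done(req);
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<ReadRequest> queue_;
  std::vector<std::thread> workers_;
  bool shutdown_ = false;
};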
5.3 Storage Formats

Impala supports most popular file formats: Avro, RC, Sequence, plain text, and Parquet. These formats can be combined with different compression algorithms, such as snappy, gzip, and bz2.

In most use cases we recommend using Apache Parquet, a state-of-the-art, open-source columnar file format offering both high compression and high scan efficiency. It was co-developed by Twitter and Cloudera with contributions from Criteo, Stripe, Berkeley AMPlab, and LinkedIn. In addition to Impala, most Hadoop-based processing frameworks, including Hive, Pig, MapReduce and Cascading, are able to process Parquet.

Simply described, Parquet is a customizable PAX-like [1] format optimized for large data blocks (tens, hundreds, or thousands of megabytes) with built-in support for nested data. Inspired by Dremel's ColumnIO format [9], Parquet stores nested fields column-wise and augments them with minimal information to enable re-assembly of the nesting structure from column data at scan time. Parquet has an extensible set of column encodings. Version 1.2 supports run-length and dictionary encodings, and version 2.0 added support for delta and optimized string encodings. The most recent version (Parquet 2.0) also implements embedded statistics: inlined column statistics for further optimization of scan efficiency, e.g. min/max indexes.

As mentioned earlier, Parquet offers both high compression and scan efficiency. Figure 5 (left) compares the size on disk of the Lineitem table of a TPC-H database of scaling factor 1,000 when stored in some popular combinations of file formats and compression algorithms. Parquet with snappy compression achieves the best compression among them. Similarly, Figure 5 (right) shows the Impala execution times for various queries from the TPC-DS benchmark when the database is stored in plain text, Sequence, RC, and Parquet formats. Parquet consistently outperforms all the other formats by up to 5x.
[Figure 5: (Left) Comparison of the compression ratio of popular combinations of file formats and compression: database size (in GB) for Text, Text w/ LZO, Seq w/ Snappy, Avro w/ Snappy, RCFile w/ Snappy, and Parquet w/ Snappy. (Right) Comparison of the query efficiency of plain text, Sequence w/ Snappy, RC w/ Snappy, and Parquet w/ Snappy in Impala: query runtime (in secs) over a set of TPC-DS queries and their geometric mean.]

6. RESOURCE/WORKLOAD MANAGEMENT

One of the main challenges for any cluster framework is careful control of resource consumption. Impala often runs in the context of a busy cluster, where MapReduce tasks, ingest jobs and bespoke frameworks compete for finite CPU, memory and network resources. The difficulty is to coordinate resource scheduling between queries, and perhaps between frameworks, without compromising query latency or throughput.

Apache YARN [12] is the current standard for resource mediation on Hadoop clusters, which allows frameworks to share resources such as CPU and memory without partitioning the cluster. YARN has a centralized architecture, where frameworks make requests for CPU and memory resources which are arbitrated by the central Resource Manager service. This architecture has the advantage of allowing decisions to be made with full knowledge of the cluster state, but it also imposes a significant latency on resource acquisition. As Impala targets workloads of many thousands of queries per second, we found the resource request and response cycle to be prohibitively long.

Our approach to this problem was two-fold: first, we implemented a complementary but independent admission control mechanism that allowed users to control their workloads without costly centralized decision-making. Second, we designed an intermediary service to sit between Impala and YARN with the intention of correcting some of the impedance mismatch. This service, called Llama for Low-Latency Application MAster, implements resource caching, gang scheduling and incremental allocation changes while still deferring the actual scheduling decisions to YARN for resource requests that don't hit Llama's cache.

The rest of this section describes both approaches to resource management with Impala. Our long-term goal is to support mixed-workload resource management through a single mechanism that supports both the low-latency decision making of admission control and the cross-framework support of YARN.

6.1 Llama and YARN

Llama is a standalone daemon to which all Impala daemons send per-query resource requests. Each resource request is associated with a resource pool, which defines the fair share of the cluster's available resources that a query may use.

If resources for the resource pool are available in Llama's resource cache, Llama returns them to the query immediately. This fast path allows Llama to circumvent YARN's resource allocation algorithm when contention for resources is low. Otherwise, Llama forwards the request to YARN's resource manager and waits for all resources to be returned. This is different from YARN's 'drip-feed' allocation model, where resources are returned as they are allocated. Impala's pipelined execution model requires all resources to be available simultaneously so that all query fragments may proceed in parallel.

Since resource estimations for query plans, particularly over very large data sets, are often inaccurate, we allow Impala queries to adjust their resource consumption estimates during execution. This mode is not supported by YARN; instead, we have Llama issue new resource requests to YARN (e.g. asking for 1GB more of memory per node) and then aggregate them into a single resource allocation from Impala's perspective. This adapter architecture has allowed Impala to fully integrate with YARN without itself absorbing the complexities of dealing with an unsuitable programming interface.
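The fast path and slow path can be sketched as follows; the interfaces are hypothetical and only illustrate the control flow described above: a cache hit returns immediately, while a cache miss blocks until YARN has granted the entire gang of resources.

// Hypothetical sketch of Llama's fast path: satisfy a gang request from the
// resource cache if possible, otherwise fall back to YARN and wait until the
// whole gang has been granted before returning to the query.
#include <cstdint>
#include <string>
#include <vector>

struct ResourceRequest {
  std::string pool;                // resource pool the query belongs to
  std::vector<std::string> hosts;  // every node that must receive resources
  int64_t memory_per_node_bytes = 0;
  int vcores_per_node = 0;
};

struct Allocation {
  std::vector<std::string> granted_hosts;  // filled only when complete
};

class ResourceCache {
 public:
  virtual ~ResourceCache() = default;
  // Returns true and fills 'out' if previously granted, cached resources can
  // cover the entire request.
  virtual bool TryReserve(const ResourceRequest& req, Allocation* out) = 0;
};

class YarnGateway {
 public:
  virtual ~YarnGateway() = default;
  // Blocks until resources have been granted on every requested host, hiding
  // YARN's incremental ("drip-feed") allocation from the caller.
  virtual Allocation AllocateGang(const ResourceRequest& req) = 0;
};

// Impala's pipelined execution needs all fragments' resources at once, so a
// partial grant is never handed back to the query.
Allocation Acquire(ResourceCache& cache, YarnGateway& yarn,
                   const ResourceRequest& req) {
  Allocation alloc;
  if (cache.TryReserve(req, &alloc)) return alloc;  // fast path, no YARN round trip
  return yarn.AllocateGang(req);                    // slow path via the resource manager
}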
6.2 Admission Control

In addition to integrating with YARN for cluster-wide resource management, Impala has a built-in admission control mechanism to throttle incoming requests. Requests are assigned to a resource pool and admitted, queued, or rejected based on a policy that defines per-pool limits on the maximum number of concurrent requests and the maximum memory usage of requests. The admission controller was designed to be fast and decentralized, so that incoming requests to any Impala daemon can be admitted without making synchronous requests to a central server. State required to make admission decisions is disseminated among Impala daemons via the statestore, so every Impala daemon is able to make admission decisions based on its aggregate view of the global state without any additional synchronous communication on the request execution path. However, because shared state is received asynchronously, Impala daemons may make decisions locally that result in exceeding limits specified by the policy.
the policy. In practice this has not been problematic because state is typically updated faster than non-trivial queries. Further, the admission control mechanism is designed primarily to be a simple throttling mechanism rather than a resource management solution such as YARN.
   Resource pools are defined hierarchically. Incoming requests are assigned to a resource pool based on a placement policy, and access to pools can be controlled using ACLs. The configuration is specified with a YARN fair scheduler allocation file and Llama configuration, and Cloudera Manager provides a simple user interface to configure resource pools, which can be modified without restarting any running services.
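   As an illustration of how a client interacts with this mechanism, the sketch below routes a session's queries to a particular resource pool and caps per-node memory before submitting a query. The pool and table names are hypothetical, and the sketch relies on the REQUEST_POOL and MEM_LIMIT query options available in recent Impala releases; it illustrates the throttling behavior described above only at a high level.

   -- Route this session's queries to a specific resource pool
   -- (pool and table names are hypothetical).
   SET REQUEST_POOL=root.reporting;
   -- Cap per-node memory for subsequent queries (value in bytes, roughly 4GB);
   -- admission control weighs such limits against the pool's memory cap.
   SET MEM_LIMIT=4000000000;
   SELECT COUNT(*) FROM sales WHERE sold_date >= '2014-01-01';

If the pool has already reached its concurrency or memory limit, the query is queued or rejected according to the pool's policy rather than admitted immediately.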
7.    EVALUATION
   The purpose of this section is not to exhaustively evaluate the performance of Impala, but mostly to give some indications. There are independent academic studies that have derived similar conclusions, e.g. [6].

   Figure 6: Comparison of query response times on single-user runs (geometric mean of response times, in seconds, over the interactive, reporting, and analytic buckets for Impala, SparkSQL, Presto, and Hive 0.13).
7.1    Experimental setup
   All the experiments were run on the same 21-node cluster. Each node in the cluster is a 2-socket machine with a 6-core Intel Xeon CPU E5-2630L at 2.00GHz. Each node has 64GB RAM and twelve 932GB disk drives (one for the OS, the rest for HDFS).
   We run a decision-support style benchmark consisting of a subset of the queries of TPC-DS on a 15TB scale factor data set. In the results below we categorize the queries based on the amount of data they access into interactive, reporting, and deep analytic queries. In particular, the interactive bucket contains queries q19, q42, q52, q55, q63, q68, q73, and q98; the reporting bucket contains queries q27, q3, q43, q53, q7, and q89; and the deep analytic bucket contains queries q34, q46, q59, q79, and ss_max. The kit we use for these measurements is publicly available.7
   For our comparisons we used the most popular SQL-on-Hadoop systems for which we were able to show results:8 Impala, Presto, Shark, SparkSQL, and Hive 0.13. Due to the lack of a cost-based optimizer in all tested engines except Impala, we tested all engines with queries that had been converted to SQL-92 style joins. For consistency, we ran those same queries against Impala, although Impala produces identical results without these modifications.
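   To make the rewrite concrete, the pair of queries below (which follows the TPC-DS schema but is not one of the benchmark queries) shows an implicit, comma-style join with its join predicate in the WHERE clause and the equivalent SQL-92 formulation with an explicit JOIN ... ON clause:

   -- Implicit join, with the join predicate in the WHERE clause:
   SELECT d.d_year, SUM(ss.ss_ext_sales_price) AS revenue
   FROM store_sales ss, date_dim d
   WHERE ss.ss_sold_date_sk = d.d_date_sk AND d.d_year = 2002
   GROUP BY d.d_year;

   -- Equivalent SQL-92 style join:
   SELECT d.d_year, SUM(ss.ss_ext_sales_price) AS revenue
   FROM store_sales ss JOIN date_dim d
     ON ss.ss_sold_date_sk = d.d_date_sk
   WHERE d.d_year = 2002
   GROUP BY d.d_year;

Both forms produce identical results in Impala, as noted above.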
   Each engine was assessed on the file format that it performs best on, while consistently using Snappy compression to ensure fair comparisons: Impala on Apache Parquet, Hive 0.13 on ORC, Presto on RCFile, and SparkSQL on Parquet.

7.2    Single User Performance
   Figure 6 compares the performance of the four systems on single-user runs, where a single user is repeatedly submitting queries with zero think time. Impala outperforms all alternatives on single-user workloads across all queries run. Impala's performance advantage ranges from 2.1x to 13.0x and on average is 6.7x. This is a wider gap than with earlier versions of Impala: the average advantage over Hive 0.13 grew from 4.9x to 9x, and over Presto from 5.3x to 7.5x.9

7.3    Multi-User Performance
   Impala's superior performance becomes more pronounced in multi-user workloads, which are ubiquitous in real-world applications. Figure 7 (left) shows the response time of the four systems when there are 10 concurrent users submitting queries from the interactive category. In this scenario, Impala's average performance advantage over the other systems grows from 6.7x to 18.7x when going from single-user to concurrent-user workloads. The speedup varies from 10.6x to 27.4x depending on the comparison. Note that Impala's speed under the 10-user load was nearly half of its single-user speed, whereas the average across the alternatives dropped to just one-fifth of their single-user speed.
   Similarly, Figure 7 (right) compares the throughput of the four systems. Impala achieves from 8.7x up to 22x higher throughput than the other systems when 10 users submit queries from the interactive bucket.

7.4    Comparing against a commercial RDBMS
   From the above comparisons it is clear that Impala is at the forefront among the SQL-on-Hadoop systems in terms of performance. But Impala is also suitable for deployment in traditional data warehousing setups. In Figure 8 we compare the performance of Impala against a popular commercial columnar analytic DBMS, referred to here as "DBMS-Y" due to a restrictive proprietary licensing agreement. We use a TPC-DS data set of scale factor 30,000 (30TB of raw data) and run queries from the workload presented in the previous paragraphs. We can see that Impala outperforms DBMS-Y by up to 4.5x, and by an average of 2x, with only three queries performing more slowly.

7 https://github.com/cloudera/impala-tpcds-kit
8 There are several other SQL engines for Hadoop, for example Pivotal HAWQ and IBM BigInsights. Unfortunately, as far as we know, these systems take advantage of the DeWitt clause and we are legally prevented from presenting comparisons against them.
9 http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/
   Figure 7: Comparison of query response times and throughput on multi-user runs (left: completion time in seconds for single-user and 10-user runs on Impala, SparkSQL, Presto, and Hive 0.13; right: queries per hour in the interactive bucket).

   Figure 8: Comparison of the performance of Impala and a commercial analytic RDBMS, DBMS-Y (completion time in seconds per TPC-DS query).

8.    ROADMAP
   In this paper we gave an overview of Cloudera Impala. Even though Impala has already had an impact on modern data management and is the performance leader among SQL-on-Hadoop systems, there is much left to be done.
Our roadmap items roughly fall into two categories: the addition of yet more traditional parallel DBMS technology, which is needed in order to address an ever increasing fraction of the existing data warehouse workloads, and solutions to problems that are somewhat unique to the Hadoop environment.

8.1    Additional SQL Support
   Impala's support of SQL is fairly complete as of the 2.0 version, but some standard language features are still missing: the set operations MINUS and INTERSECT; ROLLUP and GROUPING SETS; dynamic partition pruning; and the DATE/TIME/DATETIME data types. We plan on adding those over the next releases.
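   Until these land, the missing set operations can usually be emulated with constructs Impala already supports. A minimal sketch, over hypothetical tables t1 and t2 with a common key column k, rewrites an INTERSECT as a semi-join:

   -- Standard SQL, not yet supported:
   --   SELECT k FROM t1 INTERSECT SELECT k FROM t2;
   -- Equivalent rewrite using a semi-join:
   SELECT DISTINCT t1.k
   FROM t1 LEFT SEMI JOIN t2 ON t1.k = t2.k;

Note that the two forms treat NULL keys differently, since the equality predicate in the join never matches NULLs.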
   Impala is currently restricted to flat relational schemas, and while this is often adequate for pre-existing data warehouse workloads, we see increased use of newer file formats that allow what are in essence nested relational schemas, with the addition of complex column types (structs, arrays, maps). Impala will be extended to handle those schemas in a manner that imposes no restrictions on the nesting levels or number of nested elements that can be addressed in a single query.
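   As a sketch of the intended direction (the exact query syntax was still under design at the time of writing, so the following is an assumption rather than final syntax), a nested ARRAY column might be exposed as a correlated table reference:

   -- Hypothetical table:
   --   customers(id BIGINT, name STRING,
   --             orders ARRAY<STRUCT<oid: BIGINT, total: DECIMAL(12,2)>>)
   SELECT c.name, o.oid, o.total
   FROM customers c, c.orders o   -- the nested collection is referenced like a table
   WHERE o.total > 100;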
8.2    Additional Performance Enhancements
   Planned performance enhancements include intra-node parallelization of joins, aggregation and sort, as well as more pervasive use of runtime code generation for tasks such as data preparation for network transmission, materialization of query output, etc. We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions [11, 13].
   Another area of planned improvements is Impala's query optimizer. The plan space it explores is currently intentionally restricted for robustness/predictability, in part due to the lack of sophisticated data statistics (e.g. histograms) and additional schema information (e.g. primary/foreign key constraints, nullability of columns) that would enable more accurate costing of plan alternatives. We plan on adding histograms to the table/partition metadata in the near-to-medium term to rectify some of those issues. Utilizing such additional metadata and incorporating complex plan rewrites in a robust fashion is a challenging ongoing task.

8.3    Metadata and Statistics Collection
   The gathering of metadata and table statistics in a Hadoop environment is complicated by the fact that, unlike in an RDBMS, new data can show up simply by moving data files into a table's root directory. Currently, the user must issue a command to recompute statistics and update the physical metadata to include new data files, but this has turned out to be problematic: users often forget to issue that command or are confused about when exactly it needs to be issued. The solution to that problem is to detect new data files automatically by running a background process which also updates the metadata and schedules queries which compute the incremental table statistics.
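   Concretely, after new files have been moved into a table's directory, the user today has to refresh the file metadata and recompute statistics by hand, along the following lines (the table name is hypothetical):

   -- Make newly added data files visible to Impala:
   REFRESH sales;
   -- Recompute table and column statistics; the INCREMENTAL variant
   -- only rescans partitions whose data has changed:
   COMPUTE INCREMENTAL STATS sales;

The planned background process would issue equivalent operations automatically.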
8.4    Automated Data Conversion
   One of the more challenging aspects of allowing multiple data formats side-by-side is the conversion from one format into another. Data is typically added to the system in a
structured row-oriented format, such as JSON, Avro or XML, or as text. On the other hand, from the perspective of performance a column-oriented format such as Parquet is ideal. Letting the user manage the transition from one to the other is often a non-trivial task in a production environment: it essentially requires setting up a reliable data pipeline (recognition of new data files, coalescing them during the conversion process, etc.), which itself requires a considerable amount of engineering. We are planning on adding automation of the conversion process, such that the user can mark a table for auto-conversion; the conversion process itself is piggy-backed onto the background metadata and statistics gathering process, which additionally schedules conversion queries that run over the new data files.
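   The conversion itself can already be expressed as a query today; a minimal sketch over hypothetical tables is shown below, and the planned automation would essentially schedule statements of this shape in the background:

   -- One-off conversion of a text-format staging table into Parquet:
   CREATE TABLE clicks_parquet STORED AS PARQUET
   AS SELECT * FROM clicks_text;

   -- Appending newly arrived rows to an existing Parquet table:
   INSERT INTO clicks_parquet SELECT * FROM clicks_text;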
8.5    Resource Management
   Resource management in an open multi-tenancy environment, in which Impala shares cluster resources with other processing frameworks such as MapReduce, Spark, etc., is as yet an unsolved problem. The existing integration with YARN does not currently cover all use cases, and YARN's focus on having a single reservation registry with synchronous resource reservation makes it difficult to accommodate low-latency, high-throughput workloads. We are actively investigating new solutions to this problem.

8.6    Support for Remote Data Storage
   Impala currently relies on collocation of storage and computation in order to achieve high performance. However, cloud data storage such as Amazon's S3 is becoming more popular. Also, legacy storage infrastructure based on SANs necessitates a separation of computation and storage. We are actively working on extending Impala to access Amazon S3 (slated for version 2.2) and SAN-based systems. Going beyond simply replacing local with remote storage, we are also planning on investigating automated caching strategies that allow for local processing without imposing an additional operational burden.
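   For example, once S3 support lands, a remote data set could be exposed with nothing more than an external table definition pointing at a bucket; the table, bucket, and path below are placeholders:

   CREATE EXTERNAL TABLE logs_s3 (
     ts      TIMESTAMP,
     user_id BIGINT,
     url     STRING
   )
   STORED AS PARQUET
   LOCATION 's3a://example-bucket/logs/';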
9.    CONCLUSION
   In this paper we presented Cloudera Impala, an open-source SQL engine that was designed to bring parallel DBMS technology to the Hadoop environment. Our performance results showed that despite Hadoop's origin as a batch processing environment, it is possible to build an analytic DBMS on top of it that performs as well as or better than current commercial solutions, but at the same time retains the flexibility and cost-effectiveness of Hadoop.
   In its present state, Impala can already replace traditional, monolithic analytic RDBMSs for many workloads. We predict that the gap to those systems with regard to SQL functionality will disappear over time and that Impala will be able to take on an ever increasing fraction of pre-existing data warehouse workloads. However, we believe that the modular nature of the Hadoop environment, in which Impala draws on a number of standard components that are shared across the platform, confers some advantages that cannot be replicated in a traditional, monolithic RDBMS. In particular, the ability to mix file formats and processing frameworks means that a much broader spectrum of computational tasks can be handled by a single system without the need for data movement, which itself is typically one of the biggest impediments for an organization to do something useful with its data.
   Data management in the Hadoop ecosystem is still lacking some of the functionality that has been developed for commercial RDBMSs over the past decades; despite that, we expect this gap to shrink rapidly, and that the advantages of an open modular environment will allow it to become the dominant data management architecture in the not-too-distant future.

References
[1] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving relations for cache performance. In VLDB, 2001.
[2] Apache. Centralized cache management in HDFS. Available at https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html.
[3] Apache. HDFS short-circuit local reads. Available at http://hadoop.apache.org/docs/r2.5.1/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html.
[4] Apache. Sentry. Available at http://sentry.incubator.apache.org/.
[5] P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In AOFA, 2007.
[6] A. Floratou, U. F. Minhas, and F. Ozcan. SQL-on-Hadoop: Full circle back to shared-nothing database architectures. PVLDB, 2014.
[7] G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In SIGMOD, 1990.
[8] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
[9] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 2010.
[10] S. Padmanabhan, T. Malkemus, R. C. Agarwal, and A. Jhingran. Block oriented processing of relational database operations in modern computer architectures. In ICDE, 2001.
[11] V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, T. Malkemus, R. Mueller, I. Pandis, B. Schiefer, D. Sharpe, R. Sidle, A. Storm, and L. Zhang. DB2 with BLU Acceleration: So much more than just a column store. PVLDB, 6, 2013.
[12] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet another resource negotiator. In SOCC, 2013.
[13] T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. PVLDB, 2, 2009.