Towards a Polyglot Data Access Layer for a Low-Code Application Development Platform

Page created by Jane Lang
 
CONTINUE READING
Towards a Polyglot Data Access Layer for a Low-Code Application Development Platform
Towards a Polyglot Data Access Layer for a Low-Code
                                                             Application Development Platform∗
                                                         Ana Nunes Alonso1 , João Abreu2 , David Nunes2 , André Vieira2 ,
arXiv:2004.13495v1 [cs.DB] 28 Apr 2020

                                                                    Luiz Santos2 , Tércio Soares2 , and José Pereira1
                                                1
                                                    INESC TEC and U. Minho, ana.n.alonso@inesctec.pt, jop@di.uminho.pt
                                                                2
                                                                  OutSystems, first.last @outsystems.com

                                         Abstract                                                       1   Introduction
                                         Low-code application development as proposed by        According to Forrester Research, that defines low-
                                         the OutSystems Platform enables fast mobile and        code as “enabl[ing] rapid delivery of business applica-
                                         desktop application development and deployment. It     tions with a minimum of hand-coding and minimal
                                         hinges on visual development of the interface and      upfront investment in setup, training, and deploy-
                                         business logic but also on easy integration with data  ment,” OutSystems is a leading low-code platform for
                                         stores and services while delivering robust applica-   application development and delivery [19]. It empha-
                                         tions that scale.                                      sizes drag-and-drop to define the functionality for UI,
                                            Data integration increasingly means accessing a va- business processes, logic, and data models to create
                                         riety of NoSQL stores. Unfortunately, the diversity of full-stack, cross-platform applications.
                                         data and processing models, that make them useful in      Integrating with existing systems increasingly
                                         the first place, is difficult to reconcile with the sim-
                                                                                                means connecting to a variety of NoSQL data stores,
                                         plification of abstractions exposed to developers in   deployed as businesses take advantage of Big Data.
                                         a low-code platform. Moreover, NoSQL data stores       Our goal is to enable interactive applications built
                                         also rely on a variety of general purpose and custom   with the OutSystems Platform [16] to query data in
                                         scripting languages as their main interfaces.          NoSQL stores.
                                            In this paper we propose a polyglot data access        The current standard for integrating NoSQL stores
                                         layer for the OutSystems Platform that uses SQL        with available low-code platforms is for developers
                                         with optional embedded script snippets to bridge       to manually define how the available data must be
                                         the gap between low-code and full access to NoSQL      imported and consumed by the platform, requiring
                                         stores. In detail, we characterize the challenges for  expertise in each particular NoSQL store, especially
                                         integrating a variety of NoSQL data stores; we de-     if performance is a concern.
                                         scribe the architecture and proof-of-concept imple-       Conversely, an ideal OutSystems experience for
                                         mentation; and evaluate it with a sample applica-      leveraging Big Data should be: (1) create an inte-
                                         tion.                                                  gration with the Big Data repository, providing the
                                           ∗ This work was supported by Lisboa2020, Compete2020 connection details including credentials; (2) the plat-
                                         and FEDER through Project RADicalize (LISBOA-01-0247- form introspects the data in the repository and cre-
                                         FEDER-017116 | POCI-01-0247-FEDER-017116).             ates a representation for it; (3) the developer trims

                                                                                                    1
down the mapping to the information that is valuable           NoSQL stores to interactive applications creating us-
for the application; (4) the developer is able to query        ing the OutSystems platform, preserving the low-
this information with some transformation capabili-            code developer experience, i.e. without requiring spe-
ties which include filtering, sorting, grouping; (5) the       cific NoSQL knowledge for a developer to success-
platform handles all the requirements for providing            fully write applications that leverage this type of data
this information with enterprise-grade non-functional          stores.
requirements (NFRs) such as security, performance,                Enabling the seamless integration of a multitude
and scalability; and (6) the developer is able to cre-         NoSQL stores with the OutSystems platform will of-
ate applications that can interact with the Big Data           fer its more than 200 000 developers a considerable
repository and leverage this information in business           competitive advantage over other currently available
logic and processes, and also to create visualizations.        low-code offers. Moreover, Gartner predicts low code
   Delivering the ideal experience for Big Data does           application platforms will be used for 65% of all appli-
however raise significant challenges. First, step (2)          cation development activity in 5 years time[25], am-
is challenged by NoSQL systems not having a stan-              plifying the impact of our contribution.
dardized data model, a standard method to query                   In this paper we summarize our work on a proof-
metadata, or even in many cases by not enforcing a             of-concept polyglot data access layer for the OutSys-
schema at all. In fact, even if stored data conform to         tems Platform that addresses these challenges, thus
a well defined schema, it may be only implicit in the          making the following contributions:
code of applications that manipulate it.
   Second, the value added by NoSQL data stores                  • We describe in detail the challenge in integrat-
rests precisely on a diversity of query operations and             ing NoSQL data stores in a low-code develop-
query composition mechanisms, that exploit specific                ment platform targeted at relational data. This
data models, storage, and indexing structures. Ex-                 is achieved mainly by surveying how data and
posing these as visual abstractions for manipulation               query models in a spectrum of NoSQL data
in step (4) risks polluting the low-code platform with             stores match the abstractions that underlie the
multiple particular and overlapping concepts, instead              low-code approach in OutSystems.
of general purpose abstractions. On the other hand,              • We propose to use a polyglot query engine, based
if we expose the minimal common factor between all                 on extended relational data and query models,
NoSQL data stores, we are likely to end up with                    with embedded NoSQL query script fragments
minimal filtering capabilities that prevent developers             as the approach that reconciles the expecta-
from fully exploiting NoSQL integration. In either                 tion of low-code integration with the reality of
case, some NoSQL data stores offer only very mini-                 NoSQL diversity.
mal query processing capabilities and thus force client
applications to code all other data manipulation op-             • We describe a proof-of-concept implementation
erations, which also conflicts with the low-code ap-                that leverages an off-the-shelf SQL query engine
proach. Finally, step (5) requires ensuring that per-               that implements the SQL/MED standard [10] for
formance is compatible with interactive applications                managing external data.
means that one cannot resort to built-in MapReduce             As a result, we describe various lessons learned, that
to cope with missing query functionality, as it leads          are relevant to the integration of NoSQL data stores
to high latency and resource usage. Also, coping with          with low-code tools in general, to how NoSQL data
large scale data sets means full data traversals should        stores can evolve to make this integration easier and
be avoided. This can be done by exposing relevant              more effective, and to research and development in
indexing mechanisms and resorting to approximate               polyglot query processing systems in general.
and incomplete data, for instance, when displaying a             The rest of the paper is structured as follows. In
developer preview during step (3).                             Section 2 we briefly describe the OutSystems plat-
   Our goal is to add the ability to query data in             form, how it handles data access, and integrates with

                                                           2
external SQL databases. Section 3 presents the main
results of an analysis of target data stores focusing
on how their characteristics impact our goal. Sec-
tion 5 describes our proposal to integrate NoSQL
data stores in the OutSystems platform, including
our current proof-of-concept implementation. This
proposal is then evaluated in Section 6 and compared
to related work in Section 7. Section 8 concludes the
paper by discussing the main lessons learned.

2     Background
The OutSystems Platform is used to develop, deploy
and operate a large number of custom applications                                                                  .
through a model driven approach and based on a low
                                                                Figure 1: OutSystems Platform Architecture
code visual language [16]. A key trait of the approach
is that the most common tasks executed in the cre-
ation of an Information System are visually modeled.         be ASP.Net and SQL code). The compiled applica-
This is a major contribution for providing accelera-         tion is then deployed to the Application Server.
tion to customers. Visual models abstract a lot of              Application Server. Runs on top of IIS. The
the complexity of low level coding, are easier to read       server stores and runs the developed applications
and therefore reduce the time needed for knowledge
                                                                Service Center. Provides a web interface for
transfer. Having a common modeling language favors
                                                             sysadmins, managers and operations teams. Connec-
skill reuse either when implementing back end logic,
                                                             tions to external databases are configured here.
business processes, web UIs, or mobile UIs. A second
                                                                Integration Studio. A desktop environment,
key trait is that customers should never hit a wall,
                                                             targeted at technical developers, used to integrate ex-
in the sense that the not so common tasks can also
                                                             ternal libraries, services, and databases.
be achieved, while sometimes not as easily, but still
using visual modelling and/or extensibility to 3GL
languages.                                                   2.2    Data Access
                                                             One of the most common operations in data-driven
2.1    Architecture                                          applications is to fetch and display data from the
                                                             database. The OutSystems Platform models the data
The OutSystems solution architecture [13] features           access through the concept of Entities, elements that
five main components, as shown in Figure 1:                  enable information to be persisted in the database
   Service Studio. The development environment               and the implementation of a database model, follow-
for all the DSLs supported by OutSystems. When               ing the relational model to represent data and its re-
the developer publishes an application, Service Stu-         lationships. A visual editor that allows development
dio saves a document with the application’s model            teams to query and aggregate data visually is also
and sends it to the Platform Server.                         provided, so that developers with any skill set can
   Platform Server. Takes care of all the steps re-          work with the complex data needed for any applica-
quired to generate, build, package, and deploy na-           tion.
tive C# web applications on top of a particular stack           Using this platform, developers can create inte-
(e.g., for a Windows Server using SQL Server this will       grations of local and external data sources without

                                                         3
having to write custom code, significantly reducing            ally offering complex declarative languages, that do
time and effort, and eliminating errors. OutSystems            not extend SQL but are tailored to their data mod-
integrates natively with several of the major rela-            els, as is the case with Neo4j. In all of these, the
tional database systems: SQL Server, SQL Azure,                increasing complexity and expressiveness of the lan-
Oracle, MySQL, and DB2 iSeries. This allows the                guage means that these have actually evolved to have
development of applications that access data in ex-            query engines that, to some extent, perform query
ternal databases using the OutSystems entity model             optimization and execution.
without having to worry about data migration. Inte-               The selection of data stores to be analysed in the
gration with external databases involves: (1) In the           context of this work was based on two criteria. First,
Service Center, defining a connection to the external          variety, to include the most common NoSQL mod-
database; (2) In the Integration Studio, mapping ta-           els, such as document-based, wide-column, pure key-
bles or views in the external database to entities in          value and graph-based. Second, utility, targeting the
an extension module; (3) In the Service Studio, ref-           most popular data stores, in each class, considering
erencing and using the extension in an application.            the client base of the OutSystems platform.
                                                                  The minimum requirements for NoSQL data stores
                                                               to be compatible with the OutSystems platform are:
3    Data stores                                               (1) the ability to expose the database schema, to
                                                               enable the platform user to select the appropriate
NoSQL data stores, in general, forgo the require-              databases, tables, attributes or equivalent constructs,
ment for a strict structure, sacrificing query process-        explored in Section 4; (2) a compatible model (and
ing capabilities for flexibility and efficiency. While         syntax) for expressing queries considering filtering
SQL became an ANSI standard, allowing queries                  operators such as selection and projection, ordering
to be executed in compatible RDBMS with few id-                and grouping operators such as sort by and group by,
iomatic changes, NoSQL data stores offer a variety             and aggregation operators for counting, summation
of data models, also supporting different data types           and average; a compatible structure for consuming
and query capabilities. Additionally, most support             query results, currently restricted to scalar values or
heterogeneity at each hierarchical level, allowing, for        flat structures, i.e., no nested data.
example, rows of a given table to have different at-              Additionally, it would be desirable to: enable visual
tributes. Values can be any type of object: For doc-           construction of queries based on online partial results;
ument stores, values can be semi-structured JSON               have the ability to take advantage of NoSQL-specific
documents, for example; for wide-column data stores,           features; and provide client-side query simplification.
values can be variable sets of columns. This flexibility          While the platform also supports a hierarchical
harms predictability, as additional knowledge (meta-           model currently geared towards consuming data from
data) of what is being stored is required, as well as          REST endpoints, data manipulation using the rela-
the ability to handle similar data items with irregular        tional model is better supported, as the current query
structures.                                                    construction interface uses a tabular representation
   NoSQL data stores have initially provided only              for data. Also, it is a more natural fit for the guiding
minimal query capabilities. In fact, pure key-value            principles of a low-code platform as it is more likely to
stores offer only minimal get, put, and delete oper-           be familiar to developers. Focusing on this route for
ations. However, query capabilities in NoSQL data              integration, for each examined data store, we eval-
stores have been enriched over time: by embedding              uate: (1) the available query operations, including
different subsets and extensions of SQL, as is the case        traversal of data, filtering, and computing different
with Cassandra, Couchbase, Hive and Cloudera; by               aggregations (Section 3.1); (2) how such operations
evolving increasingly complex native query languages           can be combined (Section 3.2); and (3) how to map
out of initially simplistic query mechanisms, as is the        such operators and compositions to the OutSystems
case with MongoDB and Elasticsearch; and by actu-              model (Section 3.3).

                                                           4
HyperLogLogs, a probabilistic data structure used to
                                                               count unique items. Selection (filtering rows/records)
3.1    Query Operators                                         and projection (filtering columns/attributes) opera-
                                                               tors are generally supported in NoSQL data stores,
The fitness of the relational model for interfacing with       even if projections require defining paths on possibly
each NoSQL data store rests on: (1) the ability to ex-         nested structures.
press a given operation in the data store’s model and             The commands/keywords to make use of a given
query language, even if some translation is required           feature in the particular data store are presented, in
and (2) on results being either scalar values or con-          Figure 4, as a command where available, or a short
vertible to a tabular form.                                    explanation of what is available, where convenient.
   In order to use a relational notation with both                MongoDB is a JSON-centric database system, us-
document-based and row-based models supporting                 ing it both as its data model and as its query lan-
nested structures (including collections) these must           guage. It has thus been highly successful for Web
be flattened. One way to do this is to promote the             applications where used together with JavaScript.
fields of the nested structure to outermost attributes         Objects are grouped in collections, that are stored
and unnest lists, creating a separate row for each el-         in databases. Each object is identified in the con-
ement of the list.                                             text of a collection by the _id field for direct access.
   As an example, consider the MongoDB document                This field can be provided by the application with
shown in Figure 2. This document can be mod-                   any relevant content, or is otherwise automatically
elled as a relation, where the name of each attribute          generated with a unique ObjectId. MongoDB can
is prefixed with a path to reach it in the original            store relational data by introducing reference fields
model. The result from converting the document to              that hold the value in the_id of the referenced docu-
this model is depicted in Figure 3. Data from other            ment. Nonetheless, MongoDB encourages denormal-
documents in the same collection could be added as             ization and storing related entities as a hierarchical
rows. If the schema is partial (or probabilistic), some        structure in a single document. The native API for
rows will likely be missing some attributes. Natu-             MongoDB is a JavaScript library, where queries are
rally, this step requires the schema to be retrievable,        expressed as JSON documents. Similar APIs exist
whether explicit or inferred, a concern addressed in           however for a variety of programming languages and
Section 4.                                                     platforms. First, this interface allows the application
   This method can be extended to graph databases,             to insert, retrieve, and delete documents directly ref-
for acyclic schemas. Using Neo4j as an example, con-           erenced by their _id. Second, it allows searching for
sidering the information that can be retrieved regard-         documents providing a document for matching. At
ing node and edge properties, a relational schema              its simplest, this includes key-value pairs that need to
could be exposed with a relation per node type, with           exist in the target documents. It allows also several
properties as attributes and a relation per edge type          relational and boolean operators to be specified with
(relationship type in Neo4j). However, taking full             keywords prefixed by the $ symbol, allowing com-
advantage of the capabilities of the graph model may           plex expressions to be composed. In update opera-
not be as straightforward, such as, for example, using         tions, the new value can also be computed by using a
the length of the path in a query.                             set of functions accessed also with $ keywords. The
   Consider, instead, a key-value data model with col-         db.collection.find() method retrieves all docu-
lections, such as Redis. While extracting a high-level         ments that match a set of criteria defined over the
schema for the stored data from Redis is not gen-              fields of the document. While, by default, all fields
erally feasible, it can be done on a per key basis,            of matching documents are retrieved, the set of fields
first by listing the keys and querying the type of the         to be projected can be defined in a projection doc-
associated value. Each value type can be straight-             ument. Projection of elements in nested arrays can
forwardly converted to a tabular format, except for            also be controlled. Couchbase is, like MongoDB, a

                                                           5
{
 _id:                 (hidden)
 id: "store::1",
 location: "Braga",
 sells: [
  { widget: { id: "Widget1", color: "red" }, qty: 5 },
  { widget: { id: "Widget2", color: "blue" }, qty: 2 },
 ]
}

                              Figure 2: Example MongoDB document for a widget store.

   id       location   sells.widget.id   sells.widget.color   sells.widget.qty       Redis is used primarily for caching, with optional per-
 store::1   Braga         Widget1               red                  5               sistence. Another usage pattern is to use Redis for
                                                                                     write-intensive data items, using a backing data store
 store::1   Braga         Widget2              blue                  2
                                                                                     for other data.
 store::2   Braga         Widget1              blue                  2
                                                                                        Apache Cassandra is a wide-column store inspired
                                                                                     by BigTable. Cassandra, however, supports super
Figure 3: Representation of a MongoDB document                                       column families as an additional aggregate. It sup-
in a relational model.                                                               ports CQL, a SQL-like query language. Queries that
                                                                                     use relation traversal, i.e. the equivalent to JOINs in
                                                                                     the relational model, require either custom indices or
document store that uses JSON as its data model.                                     secondary indices on would-be JOIN attributes to be
It provides a simple key-value interface and N1QL,                                   available. Index creation is asynchronous.
a rich SQL-like query language. The outermost data                                      Apache HBase is also a wide-column store, mod-
organization units are buckets, which group key-value                                elled on Google BigTable and implemented on the
pairs (items). Keys are immutable unique identi-                                     Hadoop stack. HBase does not constrain data stored
fiers, mapped to either binary or JSON values. Like                                  in keys, qualifiers (attributes), and values. Data
MongoDB, relations between documents can be cre-                                     items have therefore to be serialized to byte arrays
ated either by referencing other documents by key                                    before storage and deserialized back upon retrieval.
or through embedding, creating a hierarchical struc-                                 HBase natively offers a simple Java client API and
ture. Selection and projection are similar to SQL’s                                  no query language. Selection is implemented by ei-
but dealing with nested structures such as arrays, re-                               ther retrieving a single or multiple rows, given the
quires explicitly unnesting these, using the UNNEST                                  keys and an optional upper bound on the timestamp,
clause.                                                                              or by scanning a range of rows, given (optional) start
   Redis is an in-memory key-value store using strings                               and stop keys, optionally providing a filter expression
as keys and where values can be strings or data struc-                               comparing a value to a provided constant. Projec-
tures such as lists, sorted or unsorted sets, hash                                   tion is supported through the definition of a subset
maps, bit arrays and HyperLogLogs, probabilistic                                     of qualifiers to be retrieved.
data structures used to count unique items, trading                                     Amazon DynamoDB is a large scale distributed
in precision for space savings. Querying over values                                 data store offered as part of Amazon Web Services.
is not supported, only queries by key. The alterna-                                  Data is stored in tables, where rows are identified by
tive is costly: a full scan, returning key-value pairs                               simple or composite primary keys. If using a simple
that match a specified condition. In terms of projec-                                primary key, it serves as the partition key. Composite
tion, either a single attribute is retrieved or all are.                             primary keys are limited to two attributes, where the

                                                                                 6
first attribute is also used as a partition key and the       Cypher, a declarative graph query and manipulation
second is used for sorting the rows. Selection and pro-       language. It natively supports graph concepts such as
jection capabilities are similar to those described for       paths and matching on graph structure. Along with
HBase. Answering queries that use relation traversal          SQL-like selection and projection capabilities on the
efficiently requires creating secondary indexes. How-         data attached to nodes and edges, attributes such as
ever, the number of indices that can be created is            the length of a path can also be specified.
limited. The costly alternative is to use table scans
and combine data client-side.                                   Apache Hive and Cloudera Impala are analytic
   Elasticsearch is a document store focusing on text         query engines for a SQL-like query language using,
indexing and retrieval, primarily developed as a dis-         preferably, data stored in HDFS files. Hive is built
tributed version of the well known Lucene text in-            on top of Hadoop and translates queries to MapRe-
dexing library. It uses the well known JSON format            duce jobs. Impala translates queries to a traditional
as its base data model and groups JSON objects in             Volcano-style parallel execution plan. They share the
collections called “indexes". Each object in an index         same client API, most of the language, and some of
has a unique identifier used for direct access. It is         the infrastructure, namely, for storage and metadata.
expected that each of these indexes contains a fairly         Data can be organized in rich hierarchical structures,
homogeneous set of objects, with mostly the same at-          that are however strongly typed. Nevertheless, pro-
tributes and being used for the same purpose. Elas-           cessing is done in terms of flat rows.
ticsearch allows a very limited form of relational data
by introducing document parent-child relationships.             Figure 4 also presents the output format for query
This allows splitting very large documents that con-          results. While some of the analysed data stores
tain many sub-elements and would be inefficiently             present results in document or row-based data models
managed, while maintaining the ability to traverse            with nested structures, these can be easily converted
them together in some operations. This possibility            to a tabular format using the already discussed tech-
has however many restrictions and the documenta-              niques, fulfilling requirement (2).
tion emphatically discourages its generalized use. Se-
lection and projection are supported either in the na-           Regarding sorting, grouping and aggregation op-
tive API or an experimental SQL-like API.                     erators, Figure 5 shows the commands/keywords to
   Neo4j is a graph data management system based              make use of the given feature in the particular data
on the Java platform. The graph data model and op-            store, presented as a command where available, or
erations make it a very different alternative to most         a short explanation of known limitations where con-
other well known NoSQL systems. Information is                venient. Redis only sorts elements in a data struc-
represented as a directed graph defined as a set of           ture (list, set, sorted set) stored for a specific key.
named nodes and a set of edges, each connecting a             Again, support for SQL-like commands in Elastic-
source node to a target node. Additional informa-             search is experimental. Cassandra supports sorting
tion can then be attached to nodes and edges in one           only on one of a table’s attributes, the first cluster-
of two ways: as a single tag, that is a symbolic iden-        ing column, while grouping can only be done using
tifier with particular significance as metadata and for       the partition key. DynamoDB has similar restric-
indexing; or as a set of named properties, that can           tions on ordering. Aggregates require creating in-
have various types and hold arbitrary data. Values            dexes/materialized views, precluding ad-hoc queries.
of types attached to properties, that can be manipu-          Results in HBase are sorted in a fixed order, using the
lated in queries, can be both primitive types such as         row key for the outermost comparison. Aggregates
numbers or strings, but also composite values such as         can be implemented in HBase using coprocessors.
lists and maps. This means that the data attached to          Remaining data stores implement sorting, grouping
nodes and edges can be nested in structures as typical        and aggregation operators that are compatible with
of JSON documents. The main interface to Neo4j is             SQL’s, indicated in the column header.

                                                          7
3.2     Query Construction
                                                                                                   Elementary operators such as projection and selec-
                     Selection              Projection           Output                            tion can be combined into complete query expres-
 MongoDB        db.collection.find()             projection document                    JSON       sions. On one extreme, data stores with limited query
                                                    $, $elemMatch
                                                                                                   capabilities allow only simple fixed combinations of
 Couchbase                WHERE​ clause                  SELECT                         JSON
                                                                                                   supported operators. For instance, HBase allows pro-
 Redis                      by key or
                      ​SCAN ​with​ MATCH
                                                 single attribute or all             scalar or
                                                                                 flattened array
                                                                                                   jection and selection to be optionally specified for
 Cassandra         WHERE​ clause (partial)               SELECT                        tabular
                                                                                                   each query. On the other extreme, data stores allow
                                                                                                   arbitrary combinations of operators to be submitted
 HBase                      by key ​or               addFamily()                  flattened map
                     ​SCAN ​with​ filter             addColumn()                                   and often perform an optimization step, reordering
 DynamoDB           by key ​or​ ​SCAN ​with    ProjectionExpression                     JSON       and transforming the proposed expression for efficient
                    FilterExpression
                            Selection                  Projection                      Output      execution. This section focuses on the latter, namely
 ElasticSearch
 MongoDB
                          match​ clause
                 db.collection.find()
                              WHERE
                                                        _source
                                                 projection document
                                                         SELECT
                                                                                        JSON
                                                                                        JSON
                                                                                       tabular     on the MapReduce, pipeline and tree query construc-
                                                   $, $elemMatch
 Neo4j                  MATCH … WHERE                  RETURNING               tabular or JSON
                                                                                                   tion paradigms.
 Couchbase                WHERE​ clause                  SELECT                         JSON
 Hive/Cloudera            WHERE​ clause                  SELECT                        tabular        MapReduce is a particular query construction ap-
 Redis                      by key or            single attribute or all             scalar or
                       ​SCAN ​with​ MATCH                                        flattened array   proach as it always assumes an intermediate grouping
 Cassandra
Figure             WHERE​ clause (partial)
              4: Sorting
                      Selection
                             (​ORDER BY​) and    Projection
                                            Grouping
                                                         SELECT
                                                     (​GROUP BY​)         operators
                                                                         Aggr.
                                                                                      tabular
                                                                               (​COUNT, SUM,in
                                                                                                   operation. Briefly, it works as follows: The first step,
NoSQL data​SCAN
 HBase
                         stores.
                            by key ​or               addFamily()
                                                                                    …​)
                                                                                 flattened map     Map, allows an arbitrary function to translate each
                              ​with​ filter
 MongoDB                  $orderby
                                                     addColumn()
                                                  $group                            yes            data item in an input collection to zero or more out-
                    by key ​or​ ​SCAN ​with
 DynamoDB
 Couchbase          FilterExpression
                          ORDER BY
                                               ProjectionExpression
                                                 GROUP BY                           yes
                                                                                        JSON
                                                                                                   put keys and values, usually, to filter and transform
 ElasticSearch            match​ clause                 _source                         JSON
                                                                                                   data. In a second step, data items are grouped by
 Redis          SORT  ​(partial)                no                              no
                              WHERE                      SELECT                       tabular      key and Reduced, allowing an arbitrary function to
 Cassandra
 Neo4j            ORDER MATCHBY …​(partial)
                                     WHERE              ​(partial)
                                             GROUP BYRETURNING                      yes or JSON
                                                                               tabular             process the set of values associated with each key,
 Hive/Cloudera
 HBase                   not      clause
                              ad-hoc
                          WHERE​                not ad-hoc
                                                         SELECT                       tabular
                                                                             not ad-hoc            usually to aggregate them. Often, multiple map and
 DynamoDB                partial/fixed          not ad-hoc                   not ad-hoc            reduce stages can be used, thus allowing more com-
 Elasticsearch
                  Sorting (​ORDER BY​)
                  sort/terms.order
                                            Grouping (​GROUP BY​)
                                                   terms
                                                                         Aggr. (​COUNT, SUM,
                                                                             sort.mode
                                                                                                   plex queries that group by different criteria in differ-
                                                                                    …​)
                          ORDER BY               GROUP BY                           yes            ent stages. MapReduce jobs are mostly used to asyn-
 MongoDB
 Neo4j
                          $orderby
                          ORDER BY
                                                  $group
                                                  RETURN
                                                                                    yes
                                                                                    yes
                                                                                                   chronously perform table scans and build secondary
 Couchbase                ORDER BY               GROUP BY                           yes            indexes, that enable these stores to offer querying ca-
 Hive/Cloudera                                                                      yes
 Redis
                          ORDER BY
                 SORT  ​(partial)
                                                 GROUP BY
                                                     no                             no
                                                                                                   pabilities that are expected of the relational model:
                                                                                                   materialised views, aggregations, etc. MapReduce
 Cassandra        ORDER BY ​(partial)        GROUP BY ​(partial)                    yes
                                                                                                   can be used to execute relational equi-JOINs by using
 HBase                   not ad-hoc             not ad-hoc                   not ad-hoc
                                                                                                   the join key for grouping, as generating all possible
 DynamoDB                partial/fixed          not ad-hoc                   not ad-hoc            matches with a reducer. In this perspective, con-
 Elasticsearch    sort/terms.order                 terms                     sort.mode             suming MapReduce results can be done within the
                          ORDER BY               GROUP BY                           yes            relational model, as long as these come in (or are
 Neo4j                    ORDER BY                RETURN                            yes            easily translatable to) tabular form, possibly with an
 Hive/Cloudera            ORDER BY               GROUP BY                           yes
                                                                                                   inferred schema. First, MapReduce focuses on ana-
                                                                                                 . lytical workloads, that read most of or all input data,
                                                                                                   resulting in poor interactive performance. Moreover,
Figure 5: Sorting, Grouping and Aggregation opera- limiting the specification of MapReduce tasks to SQL
tors in NoSQL data stores.                                                                         may be undesirable. An argument can be made for
                                                                                                   considering that the specification of MapReduce jobs
                                                                                                   requires sufficient expertise to fall outside of the typ-
                                                                                                   ical utilization of the platform, making it available

                                                                             8
through a low-level extension.                                    Figure 7 summarizes the supported query con-
   A more general query construction approach is to            struction paradigms per data store, presenting either
allow arbitrary operators to be chained in a linear            the method or concept that provides the capability.
pipeline. The first operator reads data, often with               Redis supports only a key-value interface, there-
an implicit selection and projection. Each operator            fore, having no query construction capabilities. Ama-
does some transformation, such as grouping and ag-             zon’s DynamoDB supports only simple queries by pri-
gregation, and sends data to the next. An example of           mary key (or a range thereof) with optional filtering
this approach is provided by MongoDB’s aggregation             applied just before the data is returned.
pipeline. Figure 6 demonstrates how this approach
can be used over a denormalized schema, in which all
                                                               3.3    Discussion
data for a store is kept in a single document:
   While this is not the most natural way to express           The main conclusion of the analysis is that the rela-
this query in MongoDB, it is however a simple ex-              tional model is to a large extent able to map enough of
ample of an aggregation pipeline. The query first              NoSQL data stores to fit the OutSystems’ low code
filters stores located in “Braga" and projects only the        platform. This means that the current aggregation
required fields. Then it unwinds the “sells" array,            model, designed primarily for relational data, can be
producing a document for each item sold. It then               adequately used to describe most query operators and
projects the required fields and filters widgets with          construction schemas found in NoSQL data stores.
color “red". This approach has several advantages.             This is a consequence of two aspects. First, of the
First, it can express a large subset of all queries pos-       completeness of the relational model, that can map
sible with SQL. In fact, the linear chain of operators         most operators and query construction approaches
precludes only JOIN operations where more than one             found in NoSQL data stores, requiring only an ex-
data source would be required, and some sub-queries,           tension to support nested data. Second, of the trend
i.e., mainly those that are equivalent to JOIN oper-           for NoSQL data stores to increasingly provide SQL
ations. Second, the pipeline transformation of data            support, in particular, by using unwind/unnest oper-
is intuitive, as the same paradigm is used in various          ations to deal with denormalized and nested data.
forms in computer interfaces. Finally, data stores                An exception to this is the ability to directly de-
supporting this paradigm also tend to include query            scribe a MapReduce operation (MongoDB, Couch-
optimizers, that reorder operators to produce an effi-         base, HBase). The MapReduce paradigm fits in this
cient execution plan. This means that a query can be           vision as a background workhorse to pre-process data,
incrementally built without being overly concerned             e.g. creating secondary indexes or materialized views,
with its efficiency.                                           to endow NoSQL data stores with sorting, grouping
   The most general approach is to allow arbitrary             or aggregation operations required for approximating
data query expressions that translate to operator              SQL support. Some data stores provide this func-
trees. This allows expressing relational JOIN op-              tionality automatically, which others require user in-
erations between data resulting from any two sub-              tervention. In any case, as long as the results of a
queries. It also allows a general use of sub-queries           MapReduce job can be converted to a nested tabular
whose results are used as inputs to other operators.           format, these can be consumed in the context of the
The main advantage of this approach is that it is what         relational model. When not supported by the data
has traditionally been done with SQL databases.                store, some SQL features can be implemented with
Ironically, NoSQL data stores therefore increasingly           client-side processing or a middleware solution. For
provide SQL-like dialects and query optimizers. This           example, projection can be fully supported in Redis
allows existing low-code query builders designed for           with client-side filtering before returning data to the
SQL databases to be reused with minor modifica-                platform user. A second exception is needed to take
tions, mainly, to what functions and operators are             advantage of specific NoSQL features. For example,
available.                                                     to take advantage of features such as the added query-

                                                           9
{ aggregate: "stores", pipeline: [
        { $match: { location: { $eq: "Braga" } } },
        { $project: { id: 1, location: 1, sells: 1 } },
        { $project: { expr000: "$sells", expr001: "$id" } },
        { $unwind: "$expr000" },
        { $project: { sid: "$expr001", ITEM: "$expr000.widget.color" } },
        { $match: { ITEM: { $eq: "red" } } },
        { $project: { sid: 1 } }
    ], cursor: { batchSize: 4095
    }, allowDiskUse: true, $readPreference: { mode: "secondaryPreferred"
    }, $db: "storesdb"
}

                                   Figure 6: Pipeline query construction.

ing features the graph model brings Neo4j, more ex-        is a formal description that exposes the underlying
pertise than intended for using a low-code platform        structure of data.
may be required. Still, this can be supported by en-          Currently, for integration with a data source, the
abling advanced users to edit the query itself. On a       platform must be able to list available databases, list
different note, Redis exposes its data as simple data      tables per database and to retrieve a schema for the
structures. It might make sense to allow the user to       data. Providing a schema to the platform is key to
compose several of these structures into a material-       the low-code approach for data integration, as pre-
ized view, providing, in fact, an external schema for      senting the schema to the user, along with data sam-
the stored data.                                           ples, is required for queries to be defined visually.
   To summarize the expected development effort:              For relational databases, the schema consists of re-
                                                           lations or entities (typically represented as tables),
 1. There is no need to modify (any of these) NoSQL        their attributes (including data types) and, poten-
    data stores for integration with the OutSystems        tially, the cardinality associated to the referential
    platform.                                              relationships between them. For NoSQL databases
                                                           based on a document model, using MongoDB as an
 2. Client-side or in-middleware computation should
                                                           example, the schema for a database should include
    be limited to: (a) when needed, converting re-
                                                           existing collections, document fields for each collec-
    sults to a tabular format (b) when needed, filling
                                                           tion, including data types and, potentially, referential
    in for operators that are either missing from the
                                                           relationships. However, while relational schemas are
    data store or cannot be used ad-hoc; and (c) con-
                                                           flat, strict and deterministic, the flexibility of adding
    vert SQL to native query languages or SQL-like
                                                           documents with different fields to the same collection,
    dialects.
                                                           having fields with the same name holding values of
 3. Add support for nesting and unnesting opera-           different types and the ability to define nested struc-
    tions to the OutSystems platform (see Figure 4).       tures, make it harder to elicit a complete and accurate
                                                           schema of the data.

4    Schema Discovery                                      4.1    Third-party tools
Offering a uniform view of unstructured data requires Figure 8 summarizes the schema-related metadata
data types to be well defined and the definition of a that can be retrieved or inferred from each of the
schema to which the data must conform. A schema analysed data stores, using currently available tools.

                                                      10
Cells marked N/A indicate there is no comparable
                                                                                                        concept in the store’s data model. Partial schema
                                                                                                        inference refers to cases where nested structures are
                                                                                                        opaque.
                        MapReduce                      Pipeline                     Tree                   A wide range of NoSQL data stores explicitly store
 MongoDB         db.collection.mapReduce()    db.collection.aggregate()               no                and provide metadata, a confirmation of the JSON
 Couchbase                "views"                          no                       N1QL
                                                                                                        data model, that allows nested data collections, as a
                                                                                                        common trend for data structuring and representa-
 Redis                       no                            no                         no
                                                                                                        tion. There are however minor differences in what
 Cassandra                   no                            no                       CQL                 data types can be attached to values and even how
 HBase           co-processors(no grouping)                no                         no                much nesting is allowed.
 DynamoDB                    no                   execute, then filter                no                   Second, NoSQL data stores vary widely in which
                                                                                                        metadata can be obtained. On one end, systems
 Elasticsearch               no                     “aggregations”                    no
                                                                                                        such as Cassandra, Hive, and Cloudera Impala store
                        MapReduce                       Pipeline                    TreeQL
 Neo4j                     no                             no                     Cypher
                                                                                                        and enforce a user-defined schema that unambigu-
MongoDB
Hive/Cloudera    db.collection.mapReduce()
                              no              db.collection.aggregate()
                                                          no                          no
                                                                                     SQL                ously describes data. On the other end, systems
                                                                                                   .
Couchbase                 "views"                            no                     N1QL                such as HBase do not provide any support for setting
Figure
 Redis     7: Query noconstruction paradigms no             in NoSQL
                                                                   no                                   or enforcing a schema. As they store just arbitrary
data   stores.
 Cassandra      list databases
                       no      list tables/collections
                                             no        retrievable schema
                                                                  CQL
                                                                                                        byte sequences, the schema cannot also be easily in-
MongoDB                     yes                        yes               inferred, probabilistic
                                                                                                        ferred. They are however frequently used with third-
HBase            co-processors(no grouping)                  no                       no
Couchbase                   yes                       yes              inferred, probabilistic
                                                                                                        party query mechanisms, that provide the missing
DynamoDB
Redis
                             no
                            yes
                                                  execute, then filter
                                                   yes (keys)
                                                                                    no
                                                                           generally N/A
                                                                                                        metadata. HBase, for example, is often used in the
Elasticsearch                no                     “aggregations”                     no               Hadoop stack with Hive. In between, systems such as
DynamoDB                    N/A                       yes                   explicit, partial
Neo4j                        no
                                                                                                        MongoDB, Couchbase, Elasticsearch and Neo4j allow
Cassandra                   yes                        yes no                    Cypher QL
                                                                                explicit
                                                                                                        the schema to be inferred from currently stored data
HBase
Hive/Cloudera               N/A
                             no                        yes no                      noSQL
                                                                                                        and even provide optional mechanisms to enforce it.
 Hive/Cloudera              yes                        yes                      explicit
                                                                                                        For example, using MongoDB it is possible to define
 ElasticSearch              N/A                  yes (indexes)              inferred, partial
                                                                                                        a set of rules over the attributes of documents be-
 Neo4J                       N/A                 yes (nodes/rels.)          inferred, partial
                       list databases         list tables/collections     retrievable schema            longing to a given collection. Validation is defined
MongoDB                     yes                        yes               inferred, probabilistic        at the collection level and conditions can require a
Couchbase                   yes                        yes               inferred, probabilistic        document to contain a given set of attributes, with
Redis                       yes                    yes (keys)                generally N/A              a given type, set attribute-level boundary conditions,
DynamoDB                    N/A                        yes                  explicit, partial           or require values to match a given regular expression.
Cassandra                   yes                        yes                       explicit               In effect, introducing validation limits the variability
HBase                       N/A                        yes                         no                   of the structure of JSON documents.
Hive/Cloudera               yes                        yes                       explicit                  Mechanisms used to infer schema from MongoDB
ElasticSearch               N/A                  yes (indexes)              inferred, partial           and Couchbase can, in principle, be applied to other
Neo4J                       N/A                 yes (nodes/rels.)           inferred, partial
                                                                                                        schema-less data stores, by adapting existing tools or
                                                                                                   .    by implementing similar ones. In short, tools such
                                                                                                        as mongodb-schema1 analyze a sample of documents
Figure 8: Retrievable schema-related metadata from
                                                                                                        stored in a given collection and provide, as outcome,
each NoSQL data store.
                                                                                                        a probabilistic schema. Fields are annotated with a
                                                                                                        probability according to how frequently these occur
                                                                                                        in that collection’s sampled documents. Fields are
                                                                                                          1 https://github.com/mongodb-js/mongodb-schema

                                                                                                   11
also annotated with a set of data types: each data          port for nested data structures, integrating schema
type is itself annotated with a probability according       inference/extraction techniques, as proposed and de-
the mapping’s occurrence in the sampled documents.          scribed in Section 5.1.
There is also mongodrdl2 , a tool for inferring a rela-
tional schema from a MongoDB database or collec-            4.2    Improving current tools
tion, which, however, in our experiments, fell short of
accurately representing some relationships between          Here, we briefly consider research results, for which
unnested array elements and top-most attributes.            runnable systems are not generally available, but
   A similar concern holds for using data sources with      which nonetheless can contribute relevant ideas and
polyglot query engines, as these enable expressing          techniques. Generic schema inference/discovery fo-
data processing operations over systems that expose         cuses on discovering, for a set of semi-structured
multiple native data models and query languages.            data, how these can be generically represented as
   Dremio OSS performs schema inference, but treats         sets of attributes (entities) of a given type (or set
nested structures as opaque and, therefore, does not        thereof), optionally identifying relationships between
completely support low-code construction of unnest-         entities (e.g., references) and constraints (e.g. that a
ing operations, in the sense that the user still needs      given attribute is required to be non-null). Proposals
to explicitly handle these. Still, it provides the abil-    are typically motivated by the necessity of designing
ity to impose a table schema ad-hoc or flexibly adapt       client applications that can take advantage of a semi-
data types which is a desirable feature for overriding      structured data source for which a schema is unknown
incorrect schema inference.                                 with earlier work focused mainly on XML data. One
   With PostgreSQL FDW, it is possible to declare           approach, is to involve the user in schema discovery
tables for which query and manipulation operations          by exposing generated relational views of the JSON
are delegated on adapters. The wrapper interface in-        data to users so that these can help clusters records
cludes the ability to either impose or import a schema      as collections and mark or validate relationships be-
for the foreign tables. Imposing a schema requires the      tween entities [22]. While the goal of this partic-
user to declare data types and structure and it is up       ular work is to ultimately provide a flat relational
to the wrapper to make it fit by using automatic type       schema of the JSON data, concerns such as minimis-
conversions as possible. If this automatic process is       ing the number of collections in the final schema and
not successful the user will need to change the spec-       introducing relationships between entities might have
ified data type to provide a closer type match. The         a significant impact in the effectiveness of querying
wrapper can also (optionally) advertise the possibil-       JSON data, an aspect that is not assessed by the
ity of importing a schema. In this case, the user sim-      authors. This type of approach seems to be better
ply instructs PostgreSQL to import meta-data from           suited for data exploration than for integrating data
the wrapper and use it for further operations. This         sources with applications in a low code setting.
capability is provided by the wrapper and currently,           It has also been proposed that machine learning
this is only supported for SQL databases, for which         can be used to generate a relational schemas from
the schema can be easily queried. Furthermore, Post-        semi-structured data [6]. However, while the au-
greSQL FDW can export the schema of the created             thors did perform a preliminary assessment of the
foreign tables. Both for Dremio and PostgreSQL,             query performance on the generated schemas, queries
limitations in schema imposition/inference do not im-       were performed on the data loaded onto a relational
pact querying capabilities, only the required talent        database. Results from this assessment do not nec-
to use the system. For PostgreSQL FDW, this can             essarily hold when queries (operations) are pushed
be mitigated by extending adapters to improve sup-          down to a NoSQL store.
                                                               A commonly identified pattern is that the same
   2 https://docs.mongodb.com/bi-connector/current/         conceptual object can be represented by documents
reference/mongodrdl/                                        that differ in a subset of fields, or have fields with

                                                       12
different types, as a consequence of the schema-less         of abstractions that have to be learned by the devel-
nature of semi-structured data. This effect can be           opers to fully use it. Moreover, these abstractions
captured as coalescing these slightly different effective    change with support for additional NoSQL systems
schemas as different versions of the same document           and are not universally applicable. In fact, support
schema. In [18], the authors propose a method based          for different NoSQL systems would be very different,
on model-driven engineering to do just that.                 making it difficult to use the same know-how to de-
   A significantly different approach for schema dis-        velop applications on them all. Finally, building and
covery is to analyse application source code to dis-         maintaining the platform itself would require a lot
cover the schema implicitly imposed by the applica-          of talent and effort in the long term, as support for
tion on schema-less data sources. In [4], the authors        additional systems could not be neatly separated in
propose such a method for applications that use re-          plugins with simple, abstract interfaces.
lational and NoSQL data sources. While currently                On the other hand, we can map all data in different
out-of-scope, it might be interesting to offer this type     NoSQL systems to a relational schema with standard
of capability to ease the migration of applications to       types and allow queries to be expressed in SQL. This
the OutSystems platform.                                     results in a mediator/wrapper architecture that al-
                                                             lows the same queries to be executed over all data
                                                             regardless of its source, even if by the query engine
5    Architecture                                            at the mediator layers.
                                                                This approach also has drawbacks. First, mapping
Considering the conclusions from surveying a vari-           NoSQL data models to a relational schema requires
ety of NoSQL systems, in particular, regarding their         developer intervention to extract the view that is ade-
supported data and query models, we describe the             quate to the queries that are foreseen. This will most
architecture for a polyglot data access layer for a low-     likely require NoSQL-specific talent to write target
code application platform, and then discuss a proof-         queries and conversion scripts. Moreover, query ca-
of-concept implementation based on existing open             pabilities in NoSQL systems will remain largely un-
source components.                                           used, as only simple filters and projections are pushed
   Our proposal is based on two main criteria. First,        down, meaning the bulk of data processing would
how it contributes to the vision of NoSQL data inte-         need to be performed client-side.
gration in the low-code platform outlined in Section 1          Our proposal is a compromise between these two
and how it fits the low-code approach in general.            extreme approaches, that can be summed up as: sup-
Second, the talent and effort needed for developing          port for nested data and its manipulation in the ab-
such integrations and then, later, for each additional       stractions shown to the low-code developer, along
NoSQL system that needs to be supported.                     with the ability to push aggregation operations down
   We can consider two extreme views. On the one             to NoSQL stores from a mediator query engine, will
hand, we can enrich the abstractions that are ex-            account for the vast majority of use cases. In addi-
posed to the developer to encompass the data and             tion, the ability to embed native query fragments in
query processing models. This includes: data types           queries will allow fully using the NoSQL store when
and structures, such as nested tuples, arrays, and           talent is available, without disrupting the overall inte-
maps; query operations, ranging from general pur-            gration. The result is a polyglot query engine, where
pose data manipulation (e.g., flattening a nested            SQL statements are combined with multiple foreign
structure) to domain-specific operations (e.g., regard-      languages for different NoSQL systems.
ing search terms in a text index); and finally, where           The proposed architecture is summarized in Fig-
applicable, query composition (e.g., with MapReduce          ure 9, highlighting the proposed NoSQL data ac-
or a pipeline).                                              cess layer. To the existing OutSystems platform, en-
   This approach has however several drawbacks.              compassing development tools and runtime compo-
First, it pollutes the low-code platform with a variety      nents, we add a new Polyglot connector, using the

                                                        13
times. The Job Scheduler enables periodically re-
                                                           freshing materialized views by re-executing their cor-
                                                           responding queries.

                                                           5.1     Implementation
                                                           We base our proof-of-concept implementation on
                                                           open source components. In this section we start
                                                           by describing how we selected those components and
                                                           then describe the additional development needed to
                                                           make it fit the proposed architecture. We base
                                                           our proof-of-concept implementation on open source
                                                           components.

                                                           5.1.1   Component selection
                                                           The main component to select is the SQL query
                                                           engine used as the mediator. Besides its features
                                                           as a query engine, we focus on: the availability of
                                                           wrappers for different NoSQL systems and the talent
                                                           needed to implement additional features; the com-
          Figure 9: Architecture overview                  patibility of the open source license with commer-
                                                           cial distribution; the maturity of the code-base and
                                                           supporting open source community; and finally, on
                                                           its compatibility with the OutSystems low-code plat-
Database Integration API to connect to the Polyglot
                                                           form. We consider two options.
Query Engine (QE) through standard platform APIs.
                                                              PostgreSQL with FDW[17]. It is an option as
The Polyglot QE acts as a mediator. It exposes an
                                                           it supports foreign data wrappers according to the
extended relational database schema for connected
                                                           SQL/MED standard (ISO/IEC 9075-9:2008). The
NoSQL stores and is able to handle SQL and poly-
                                                           main attractive for PostgreSQL is that it is a very
glot queries.
                                                           mature open source product, with a business friendly
   For each NoSQL Store, there is a Wrapper, com-          license, a long history of deployment in production,
posed of three sub-components: metadata extraction,        and an unparalleled developer and user community.
responsible for determining the structure of data in       There is also support for .NET and Java client appli-
the corresponding store using an appropriate method        cation platforms. In terms of features, PostgreSQL
and mapping it to the extended SQL data model of           provides a robust optimizer and an efficient query
the Polyglot QE; a query push-down component, able         engine, that has recently added parallel execution,
to translate a subset of SQL query expressions, to         with excellent support for SQL standards and mul-
relay native query fragments, or produce a combina-        tiple useful extensions. It supports nested data
tion of both in a store-specific way; and finally, the     structures both with the json/jsonb data types, as
cursor, able to iterate on result data and to translate    well as by natively supporting arrays and composite
and convert it as required to fit the common extended      types. It has extensive support for traversing and
SQL data model.                                            unnesting them. Regarding support for foreign data
   The Polyglot QE makes use of Local storage for the      sources, besides simple filters and projections, the
configuration of NoSQL store adapters and for hold-        PostgreSQL Foreign Data Wrapper (FDW) interface
ing materialized views of data to improve response         can interact with the optimizer to push down joins

                                                      14
and post-join operations such as aggregations. With          but treats nested structures as opaque and, therefore,
PostgreSQL FDW, it is possible to declare tables for         does not completely support low-code construction of
which query and manipulation operations are dele-            unnesting operations, in the sense that the user still
gated on adapters. The wrapper interface includes            needs to explicitly handle these. Still, it provides the
the ability to either impose or import a schema for          ability to impose a table schema ad-hoc or flexibly
the foreign tables. Imposing a schema requires the           adapt data types which is a desirable feature for over-
user to declare data types and structure and it is up        riding incorrect schema inference. Also, Dremio OSS
to the wrapper to make it fit by using automatic type        adds a distributed parallel execution engine, based on
conversions as possible. If this automatic process is        the Arrow columnar format, and a convenient way to
not successful the user will need to change the spec-        manage materialized views (a.k.a., “reflections”), that
ified data type to provide a closer type match. The          are automatically used in queries. Unfortunately, one
wrapper can also (optionally) advertise the possibil-        cannot define or use indexes on theses views, which
ity of importing a schema. In this case, the user sim-       reduces their usefulness in our target application sce-
ply instructs PostgreSQL to import meta-data from            narios.
the wrapper and use it for further operations. This             Although Calcite has a growing user and devel-
capability is provided by the wrapper and currently,         oper community, its maturity is still far behind Post-
this is only supported for SQL databases, for which          greSQL. The variety of adapters for different NoSQL
the schema can be easily queried. Furthermore, Post-         systems is also lagging behind PostgreSQL FDW, al-
greSQL FDW can export the schema of the created              though some are highly developed. For instance, the
foreign tables. In addition to already existing wrap-        MongoDB adapter in Dremio OSS is able to exten-
pers for many NoSQL data sources, with variable fea-         sively translate SQL queries to MongoDB’s aggrega-
tures and maturity, the Multicorn3 framework allows          tion pipeline syntax, thus being able to push down
exposing the Python scripting language to the devel-         much of the computation and reduce data transfer.
oper, to complement SQL and express NoSQL data               The talent and effort needed for exploiting this in ad-
manipulation operations.                                     ditional data wrappers is, however, substantial. Both
   In terms of our goals, PostgreSQL falls short on          for Dremio and PostgreSQL, limitations in schema
automatically using existing materialized views in           imposition/inference do not impact querying capa-
queries. The common workaround is to design queries          bilities, only the required talent to use the system.
based on views and later decide whether to materi-           For PostgreSQL FDW, this can be mitigated by ex-
alize them, which is usable in our scenario. Another         tending adapters to improve support for nested data
issue is that schema inference is currently offered for      structures, integrating schema inference/extraction
relational data sources only. The workaround is for          techniques. Finally, the main drawback of this option
the developer to explicitly provide the foreign table        is that, as we observed in preliminary tests, resource
definition.                                                  usage and response time for simple queries is much
   Calcite[1] (in Dremio OSS[7]). The Calcite                higher than for PostgreSQL.
SQL compiler, featuring an extensible optimizer, is             Choosing PostgreSQL with FDW. In the end,
used in a variety of modern data processing systems.         we found that our focus on interactive operational
We focus on Dremio OSS as its feature list most              applications and the maturity of the PostgreSQL op-
closely matches our goal. Calcite is designed from           tion, outweigh, for now, the potential advantages
scratch for data integration and focuses on the abil-        from Calcite’s extensibility.
ity to use the optimizer itself to translate parts of the       Additional development Completing a proof-of-
query plan to different back end languages and APIs.         concept implementation based on PostgreSQL as a
It also supports nested data types and corresponding         mediator requires additional development in the low-
operators. Dremio OSS performs schema inference,             code platform itself, an external database connector,
                                                             and in the wrappers. As examples, we describe sup-
  3 https://github.com/Kozea/Multicorn                       port for two NoSQL systems. The first is Cassandra,

                                                        15
You can also read