Dell Reference Configuration for DataStax Enterprise powered by Apache Cassandra

A Quick Reference Configuration Guide

Kris Applegate – kris_applegate@dell.com
Solution Architect
Dell Solution Centers

Dave Jaffe – dave_jaffe@dell.com
Solution Architect
Dell Solution Centers

Armando Acosta – armando_acosta@dell.com
Big Data Product Manager
Dell Revolutionary Cloud and Big Data Group

Rob Wilbert – robert_wilbert@dell.com
Solution Architect
Dell Solution Centers
Executive Summary
      This document details the configuration set-up for DataStax Enterprise (DSE) software on
      the PowerEdge R-Series servers. The intended audiences for this document are customers
      and solution architects looking for information on configuring DSE clusters within their
      information technology environment for “always on” transaction processing.

      The reference configuration introduces the server set-ups that can run the DataStax
      Enterprise stack. The document will only focus on configuration; it will not go into detail
      about DSE or Apache Cassandra solution software components or resiliency, performance,
      or software considerations. This document does not focus on best practices or complete
      architecture for a DSE Solution. Additional DataStax Enterprise installation, administration,
      and optimization guides are available on the websites referenced below.

      Dell developed this document to help streamline configuration for the DataStax Enterprise
      software.

      THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN
      TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS,
      WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.
      © 2014 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without
      the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
      Dell, the DELL logo, and the DELL badge are trademarks of Dell Inc. Intel and Xeon are registered
      trademarks of Intel Corp. Red Hat is a registered trademark of Red Hat Inc. Linux is a registered
      trademark of Linus Torvalds. Other trademarks and trade names may be used in this document to
      refer to either the entities claiming the marks and names or their products. Dell Inc. disclaims any
      proprietary interest in trademarks and trade names other than its own.

Introduction
       In the age of Big Data, applications operate on a global scale, and they must meet the
       always-on demands of their developers and their users. DataStax Enterprise is uniquely
       suited to address the database demands of continuously available, globally distributed
       online applications.

       Over the last two to three years, customers have used Hadoop to analyze large volumes
       of structured, semi-structured, and unstructured data. Hadoop is a valuable tool, yet as
       customers’ use cases evolve, new tools are emerging that add more value to the Big Data
       ecosystem. NoSQL database technologies are a prime example: integrated with Hadoop,
       they allow low-latency read/write access to data. Apache Cassandra is one such NoSQL
       database. It provides a single database distributed geographically over multiple data
       centers, which gives it a high level of reliability. Cassandra’s architecture captures data
       at extremely high ingest rates, which is valuable for Internet-of-Things applications that
       capture large quantities of time-series data that is then analyzed to provide value to the
       community of users. DataStax Enterprise enhances the capabilities of Apache Cassandra
       by providing management services that facilitate cluster operations and maintenance.

       DataStax Enterprise and Hadoop are very complementary. There are a number of use
       cases where a NoSQL database such as DataStax Enterprise serves as the real-time
       read/write, always-available database while Hadoop serves as the backend engine that
       helps users analyze large volumes of structured, semi-structured, and unstructured data
       in more of a batch methodology. Within this integrated data hub, customers can run
       algorithms on integrated, disparate data from relational database management systems,
       enterprise data warehouses, and other sources. Additionally, a data science workbench
       may be layered on top to provide analytics tools that transform the results into
       actionable information using search, data visualizations, and reporting/analysis. These
       new environments are applicable across multiple vertical markets, including Government
       Intelligence, Healthcare, Financials, Manufacturing, Telco/Media, Retail, Web 2.0, and more.

       To help support this customer use case, Dell is partnering with DataStax to
       execute a reference configuration in the Dell Solution Center.

       DataStax Enterprise is a NoSQL big data platform powered by production-certified Apache
       Cassandra that is architected for today's line-of-business applications and designed to
       securely manage real-time, analytic, and search data all in the same database cluster.

       DataStax Enterprise encapsulates a peer-to-peer distributed architecture model where all
       nodes inside a cluster are the same. Data is automatically partitioned and distributed
       among all the nodes. Often, two or more data center locations are used and nodes are
       distributed among the physical locations.

       OpsCenter is a global management and monitoring tool that administers Cassandra and
       DSE clusters.

Reference Configuration

       Apache Cassandra is an open source, massively scalable NoSQL (non-relational) database.
       DataStax is a Dell partner that, in addition to contributing to the Apache Cassandra
       project, offers a commercialized version in both community and enterprise editions.
       DataStax Enterprise is available for multiple distributions of Linux. This initial configuration
       targets deployment on bare-metal servers running DataStax Enterprise 3.2.1 on Red Hat
       Enterprise Linux 6.4.

       DataStax Enterprise can be used to provide a mechanism to rapidly ingest transactional
       data to facilitate a variety of emerging workloads. These workloads share a common need
       to provide a continuously-available, distributed, read/write capable database that does not
       have any single point of failure.

       Use Cases for NoSQL Online Data Ingestion:
           Time-series data
           Device/Sensor/Data “exhaust” systems
           Distributed applications
           Media streaming
           Online Web retail (transactional, shopping carts, etc.)
           Online gaming
           Recommendation engines
           Real-time data analytics
           Social media capture and analysis
           Web click-stream analysis
           Write-intensive transactional systems

       The Cassandra ring topology allows multiple nodes to service both read and write requests
       with a tunable consistency mechanism (both the number of replicas and the point at which
       a write is acknowledged).
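
To make the tunable consistency trade-off concrete, the short Python sketch below computes how many replicas form a quorum and checks whether a given pair of write and read acknowledgement counts guarantees that a read overlaps the most recent write. The function names and the replication factor of 3 are assumptions chosen for illustration; they are not taken from the tested configuration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: a majority of all replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when any set of read replicas must share a node with any set of written replicas."""
    return write_acks + read_acks > replication_factor


rf = 3  # assumed replication factor, for illustration only
print(quorum(rf))                                        # 2
print(read_overlaps_write(rf, quorum(rf), quorum(rf)))   # True
print(read_overlaps_write(rf, 1, 1))                     # False
```

Acknowledging after fewer replicas lowers latency at the cost of weaker read guarantees; requiring a quorum on both reads and writes is one common middle ground.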

Figure 1.    Logical Diagram of Cassandra Ring

[Diagram: five data nodes arranged in a ring, with data replicated n times across the
nodes. Application server(s) read from and write to the ring.]

Server Roles

Cassandra Data Node(s) – The data nodes conduct the principal functions in a Cassandra
cluster (a cluster contains multiple nodes). In order to provide rapid response times during
data ingestion, these nodes are configured to allow for rapid input/output (IO) to disk. As
IO arrives, the following process commences:

    1. Incoming data is assigned to a data node, using a data key determined by hashing
       the incoming data. Each data node owns a specific hash range, and the incoming
       data is assigned to the data node that owns the hash range the data key falls into

    2. IO is written to a disk-based commit log on the assigned node

    3. IO is also simultaneously written to a table in memory

    4. Steps 1-3 are repeated on one or more additional data nodes in order to meet
       replication/durability requirements, if any

    5. IO is acknowledged back to the requestor

This process allows the cluster to maintain a tunable number of replicas across nodes,
racks, and datacenters. Since the IO isn’t acknowledged until it is written to a disk-based
commit log, the commit log should reside on high-performance storage, such as solid-state
drives (SSDs). SSDs are also common for read-heavy workloads, since reads can involve
many random IOs. Performance may be increased by adding data nodes to the
cluster/ring, since Cassandra is linearly scalable.
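
The five-step write path described above can be sketched as a toy, single-process Python model. This is a simplified illustration, not Cassandra's implementation: the class names are hypothetical, MD5 is only a convenient stand-in for the cluster's partitioner, a Python list stands in for the disk-based commit log, and replicas are written one after another rather than in parallel.

```python
import bisect
import hashlib


def token_for(text: str) -> int:
    """Hash text to a position on the ring (MD5 is only a stand-in partitioner)."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class DataNode:
    """Toy data node: an append-only commit log plus an in-memory table."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.token = token_for(name)  # upper end of the hash range this node owns
        self.commit_log = []          # stands in for the disk-based commit log
        self.memtable = {}            # the table in memory

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # step 2: record in the commit log
        self.memtable[key] = value            # step 3: update the in-memory table


class Ring:
    def __init__(self, node_names, replicas: int) -> None:
        self.nodes = sorted((DataNode(n) for n in node_names), key=lambda n: n.token)
        self.tokens = [n.token for n in self.nodes]
        self.replicas = replicas

    def owners(self, key: str):
        """Step 1: find the node whose hash range holds the key, then its successors."""
        first = bisect.bisect_left(self.tokens, token_for(key)) % len(self.nodes)
        return [self.nodes[(first + i) % len(self.nodes)] for i in range(self.replicas)]

    def write(self, key: str, value: str):
        acknowledged = []
        for node in self.owners(key):   # step 4: repeat on additional data nodes
            node.write(key, value)
            acknowledged.append(node.name)
        return acknowledged             # step 5: acknowledge back to the requestor


ring = Ring(["node1", "node2", "node3", "node4", "node5"], replicas=3)
print(ring.write("sensor-42", "21.5"))  # names of the three nodes that stored the write
```

Because ownership is decided purely by the hash of the key, any node can work out where a write belongs, which is what lets every node in the ring accept requests.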

Application Server(s) – Application servers reside on the outer edge of the cluster/ring.
They are the interface between the Cassandra ring and the outside world. Data may be
streamed from an application server programmatically (via APIs for all the popular
languages) or through Cassandra’s built-in query language (CQL).
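
As a small illustration of the CQL route, the Python sketch below only builds the text of a CQL `CREATE KEYSPACE` statement that sets a replica count per datacenter; it does not connect to a cluster. The keyspace name `sensor_data` and the datacenter names `dc_east` and `dc_west` are hypothetical, and in a real cluster the datacenter names would have to match those the cluster itself reports.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CQL CREATE KEYSPACE statement with per-datacenter replica counts."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {count}" for dc, count in replicas_per_dc.items()]
    return f"CREATE KEYSPACE {keyspace} WITH replication = {{{', '.join(parts)}}};"


# Hypothetical keyspace and datacenter names, for illustration only.
print(create_keyspace_cql("sensor_data", {"dc_east": 3, "dc_west": 2}))
```

The resulting statement could then be run from a CQL shell or passed to a client driver from the application server.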

DataStax OpsCenter Node – The DataStax OpsCenter node runs the management
interface. In a production environment, the OpsCenter server may need to run on a
dedicated physical node; however, for the purposes of this document’s testing, OpsCenter
was installed on a virtual machine (VM).

Figure 2.   DataStax OpsCenter Interface

Node Count Recommendations

Dell recognizes that use-cases for Cassandra range from early-stage development and
testing clusters through large multi-datacenter installations. Dell and DataStax have
services that can help appropriately size a cluster based on customer budget,
performance, security, and data consistency requirements. All node-count
recommendations are for the Data Nodes only. DataStax OpsCenter, application servers,
and additional infrastructure services may be needed to complete the environment.

As a starting point, three cluster configurations can be defined for typical use:

DataStax Recommended Starter Cluster – The low-tier configuration is targeted at basic
usage for online database applications and, in some cases, may even be built from existing
equipment; however, the performance of these clusters can be significantly
increased if SSDs are added. For this configuration, only a single processor is defined.
If more services (such as DataStax Search) are added, performance may suffer.

DataStax Recommended Standard Cluster – This configuration is a good starting point for
clusters that have the potential to scale. This configuration includes dual processors to
improve performance when using DataStax’s search capabilities.

DataStax Recommended Professional Cluster – This configuration represents the top tier
of hardware recommended to run Cassandra. Adding more performance to individual
nodes (e.g. four processors, additional memory, etc.) will result in diminishing benefit.
Rather, adding nodes yields a greater return on investment when scaling the
cluster.

Table 1.       Recommended Cluster Sizes

                          DataStax Recommended      DataStax Recommended      DataStax Recommended      Dell Tested
                          Starter Cluster (3)       Standard Cluster (3)      Professional Cluster (3)  Configuration

    Server Model (1)      (5) PowerEdge R320        (5) PowerEdge R420        (5) PowerEdge R620        (5) PowerEdge R720

    Processor(s)          Single Intel Xeon         Dual Intel Xeon           Dual Intel Xeon           Dual Intel Xeon
                          E5-2420 v2                E5-2430 v2                E5-2650 v2                E5-2650

    RAM                   64 GB                     128 GB                    256 GB                    128 GB

    Storage (2)           (4) 1 TB SATA Drives      (6) Intel 3700 Series     (6) Intel 3700 Series     (6) Intel 3700 Series
                          Read Intensive            SSD 400GB 3Gbps           SSD 400GB 6Gbps           SSD 800GB 6Gbps
                          Application

    Network Cards         (2) Intel X520 DP         (2) Intel X520 DP         (2) Intel X520 DP         (2) Intel X520 DP
                          10GbE DA/SFP+             10GbE DA/SFP+             10GbE DA/SFP+             10GbE DA/SFP+

    Data Switches         (2) Dell Networking       (2) Dell Force 10         (2) Dell Force 10         (2) Dell Force 10
                          8164F 10GbE SFP+          S4810 10GbE SFP+          S4810 10GbE SFP+          S4810 10GbE SFP+

    Management Switches   (2) Dell Networking       (2) Dell Networking       (2) Dell Networking       (2) Dell Force 10
                          6248                      6248                      6248                      S60 1GbE

    Rack Units            9U                        9U                        9U                        14U

    DataStax Edition      DataStax Enterprise       DataStax Enterprise       DataStax Enterprise       DataStax Enterprise
                          Standard                  Pro                       Max                       Standard

    (1) Any Dell server that is capable of running the supported operating systems should
        work. These specific models were selected for their targeted price brackets.
    (2) Only SSDs should be considered for high-ingestion use cases.
    (3) The recommended hardware is for Data Nodes only. DataStax OpsCenter, application
        servers, and additional infrastructure services may be needed to complete the
        environment.
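
The drive counts and sizes in Table 1 translate into raw cluster capacity by simple multiplication. The Python sketch below works through that arithmetic and then divides by a replication factor to estimate how much unique data each configuration could hold. The replication factor of 3 and the 1 TB = 1000 GB conversion are assumptions for illustration, and the result ignores file system overhead, commit log space, and the free space needed for routine maintenance, so practical capacity will be lower.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    name: str
    nodes: int
    drives_per_node: int
    drive_gb: int

    def raw_gb(self) -> int:
        """Total drive capacity across all data nodes."""
        return self.nodes * self.drives_per_node * self.drive_gb

    def unique_data_gb(self, replication_factor: int) -> float:
        """Raw capacity divided by the number of copies kept of each piece of data."""
        return self.raw_gb() / replication_factor


# Node counts, drive counts, and drive sizes are taken from Table 1.
CONFIGS = [
    ClusterConfig("Starter", 5, 4, 1000),
    ClusterConfig("Standard", 5, 6, 400),
    ClusterConfig("Professional", 5, 6, 400),
    ClusterConfig("Dell Tested", 5, 6, 800),
]

ASSUMED_RF = 3  # assumption: three replicas of every row

for config in CONFIGS:
    print(f"{config.name}: raw {config.raw_gb()} GB, "
          f"about {config.unique_data_gb(ASSUMED_RF):.0f} GB of unique data at RF={ASSUMED_RF}")
```

For the Dell tested configuration this works out to 24,000 GB raw, or roughly 8,000 GB of unique data under the assumed replication factor.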

Figure 3.   Physical networking diagram

Tested Configuration

For the purposes of this document, a small DataStax cluster was deployed as shown in
Table 1. The specific software revisions used in the test are shown in Table 2. The
hardware listed should be used as initial guidance only. Additional configurations are
possible and will likely be required as each customer’s environment and use-case is
unique. Customers should consult with DataStax Professional Services to come up with an
optimal design that has been customized to their use-case. Common parameters that
could differ include:

1. Node Count – Adding nodes is the best way to scale capacity and performance for a
   Cassandra cluster. In most cases, the benefit of adding nodes outweighs that of
   increasing disk size or memory in existing nodes

2. Disks – SSD technology is critical for maintaining the performance necessary to ingest
   data at a high rate. Keeping both the initial commit log and the sorted string table (SST)
   disk space on SSDs is strongly recommended

3. Memory – Memory should be sized relative to the use case. The cluster will benefit
   from additional memory when using DataStax Solr Search or other memory-intensive
   features

4. Processors – Data ingestion is not particularly CPU intensive in and of itself. However,
   additional processing power is required as additional capability is added (e.g. Solr
   Search, etc.) or as the workload on a DataStax cluster increases
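
As a rough illustration of how node count follows from the parameters above, the Python sketch below estimates the number of data nodes from a target write rate and a target data volume, taking whichever requirement is larger. It assumes throughput and capacity scale linearly with node count; every input value is hypothetical and would need to come from measurements on the actual hardware and workload. It is not a substitute for a sizing engagement.

```python
import math


def estimate_data_nodes(
    target_writes_per_sec: float,
    per_node_writes_per_sec: float,
    unique_data_gb: float,
    usable_gb_per_node: float,
    replication_factor: int = 3,
) -> int:
    """Estimate data nodes needed, assuming linear scaling (a simplification)."""
    # Each client write lands on replication_factor nodes, multiplying cluster-wide work.
    by_throughput = math.ceil(
        target_writes_per_sec * replication_factor / per_node_writes_per_sec
    )
    # Each piece of data is stored replication_factor times.
    by_capacity = math.ceil(unique_data_gb * replication_factor / usable_gb_per_node)
    # A cluster needs at least as many nodes as replicas.
    return max(by_throughput, by_capacity, replication_factor)


# Hypothetical inputs: 50,000 writes/s target, 20,000 writes/s per node,
# 6,000 GB of unique data, 4,000 GB usable per node.
print(estimate_data_nodes(50_000, 20_000, 6_000, 4_000))  # 8
```

In this example the write rate, not the data volume, drives the node count, which is typical of ingestion-heavy workloads.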

Table 2.     Software Revisions (As Tested)

    Component                                 Revision

    Red Hat Enterprise Linux                  6.4

    DataStax Enterprise                       3.2.4

    Cassandra Version                         1.2.13.2

Integration with Other Solutions

For customers interested in using DataStax Cassandra to complement other Big Data
solutions, DataStax Cassandra can act as a low-latency point of ingestion for data that
can later be fed to other tools, including data warehouses and Dell’s Apache Hadoop
solutions for running deep and heavy analytics.

Displaying data directly from Cassandra is also possible via Dell’s robust tool-belt of data
visualization tools like Dell Kitenga Analytics Suite and the Dell Quest TOAD BI Suite.

Figure 4.    Physical networking diagram

Dell Solution Centers

The Dell Solution Centers are a global network of connected labs that allow Dell to help
customers architect, validate and build solutions. With multiple footprints in every region,
they help customers understand anything from simple hardware platforms, to more
complex solutions. These engagements range from an informal 30-60 minute briefing,
through a longer half-day workshop, and on to a proof-of-concept that allows customers
to kick the tires of their solution prior to signing on the dotted line. Customers may engage
with their account team and have them submit a request to take advantage of these free
services.

Links

DataStax Enterprise Cassandra – http://DataStax.com/
Planet Cassandra Community – http://planetcassandra.org/
Apache Cassandra Open Source Project – http://cassandra.apache.org/
