5 SIGNS YOU MIGHT BE OUTGROWING YOUR MYSQL DATA WAREHOUSE* - *AND WHY VERTICA MAY BE THE RIGHT FIT

Page created by Matthew Cummings
 
Whitepaper

5 Signs You Might Be
Outgrowing Your MySQL
Data Warehouse*

*And Why Vertica May Be the Right Fit

Like Outgrowing Old Clothes...

Most of us remember a favorite pair of pants or shirt we had as kids that seemed to fit
fine one day, and the next time we put it on, we realized that it was suddenly much
too small. You might let the hems out, or cut the arm holes, but you knew that it was
soon going to be time to put it in the “too small” pile, and a trip to the store with your
mom was around the corner. Outgrowing things was a way of life back then, an
inevitable step in the grand scheme and one that always seemed to lead to the next
favorite shirt or toy. This is not an attempt to trivialize data warehouse and data mart
systems, but they too evolve and mature, and one day you might wake up and realize
that the MySQL data warehouse that you have so faithfully supported and maintained is
just too small for your current analytics needs. Data volumes keep increasing, new data
sources are added to the system and performance starts to degrade to the point that
your users are reporting that queries are taking too long or never returning. Or maybe
your users are starting to run more and more sophisticated queries that you (and the
database) weren’t quite ready for. Nobody wants to get to that point, so it is useful to
know a few signs that you are starting to outgrow your current system so you can start
planning the transition to a new system. This paper details the five most common signs
that it may be time to consider replacing a MySQL system.

1. You are considering implementing sharding/partitioning.

Your big tables are getting REALLY big, and you’ve started to look at sharding as a way
to spread out the load over multiple machines and eke out the most performance you
can get. Sharding can be a useful tool; however, the process to manage this exercise
can soon outweigh the gains being made. According to the MySQL Performance Blog,
the complexity comes down to two factors. First, the application developer will have to
write more code to be able to make use of the sharding logic. You will need to rewrite
most of your application and queries to point them to the correct data. Second,
operational issues become more difficult (backing up, adding indexes, changing
schema). It can take a significant amount of work to build an application that works
correctly when you are rolling through an upgrade where the schema will not be the
same on all nodes. Many of these tasks remain only semi-automated, so from an
operations perspective, there can often be a lot more work to be done. (Tocker, 2009)
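
To make the first factor concrete, here is a minimal Python sketch of the kind of routing logic an application takes on once its tables are sharded. The shard names, the customer_id shard key, and the CRC32-based hash are all assumptions chosen for illustration; they are not taken from the MySQL Performance Blog post or from any particular sharding framework.

```python
import zlib
from typing import List, Optional

# Hypothetical shard names, for illustration only.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]


def shard_for(customer_id: int) -> str:
    """Pick the shard that owns a customer's rows using a stable hash."""
    key = str(customer_id).encode("utf-8")
    return SHARDS[zlib.crc32(key) % len(SHARDS)]


def route_query(customer_id: Optional[int]) -> List[str]:
    """Return the shards a query must visit.

    A query filtered on the shard key touches exactly one shard.
    Anything else (for example, a report across all customers) has to
    fan out to every shard, and the application must merge the results.
    """
    if customer_id is None:
        return list(SHARDS)
    return [shard_for(customer_id)]
```

Every query path in the application has to go through something like route_query, and any schema change or backup has to be repeated once per entry in SHARDS, which is where the operational overhead described above comes from.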

Vertica implements a fundamentally better alternative to sharding called segmentation.
Segmentation allows you to distribute contiguous pieces of your physical data, called
segments, for fact and large dimension tables across database nodes. This maximizes
database performance by distributing the load. But unlike MySQL, this is managed
completely by the Vertica engine. When you create your physical tables, you specify if
you want to segment, and Vertica does the rest. Queries do not need to be aware of
the segments, so no changes to your existing SQL are necessary. Without introducing
any maintenance headaches, segmentation can be used to provide high availability for
your system. Redundant physical storage can be configured to provide performance
optimization for different query types. The distribution is then offset so that
segments that contain the same data are placed on different nodes. This ensures
that if a node goes down, all of the data is available on the remaining nodes. Again, this
is managed automatically by the Vertica engine and only requires a single keyword in
the table creation DDL.
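
The idea of offsetting redundant segments can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Vertica's actual hashing or placement algorithm: the three node names, the CRC32 hash, and the offset of one node are all hypothetical.

```python
import zlib
from typing import Dict, Iterable, List, Set

# Hypothetical three-node cluster.
NODES = ["node0", "node1", "node2"]


def segment_of(key: str, node_count: int) -> int:
    """Map a row key to a segment number with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def place(keys: Iterable[str], nodes: List[str], offset: int = 1) -> Dict[str, Set[str]]:
    """Store each key twice: once on its own node, once on an offset node."""
    placement: Dict[str, Set[str]] = {node: set() for node in nodes}
    count = len(nodes)
    for key in keys:
        seg = segment_of(key, count)
        placement[nodes[seg]].add(key)                     # primary copy
        placement[nodes[(seg + offset) % count]].add(key)  # redundant copy
    return placement


def available_after_failure(placement: Dict[str, Set[str]], failed: str) -> Set[str]:
    """Return every key still readable when one node is down."""
    surviving: Set[str] = set()
    for node, keys in placement.items():
        if node != failed:
            surviving |= keys
    return surviving
```

Because the two copies of any key always land on different nodes, losing any single node still leaves every key readable from the survivors.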

2. File sizes are too large.

In MySQL, all database interactions are managed at the file system level. Eventually,
the size of the files in MySQL becomes too large for the machine to manage effectively.
There is more and more I/O required to sift through the data in the file, and forget being
able to load them into memory. Depending on your operating system and file store
choice, the file size may be limiting the size of your tables. Now, you are being forced to
make some fundamental architecture decisions. Maybe you are considering moving to
InnoDB, enabling Large File Support on MyISAM, or even having to move to a different
operating system. All of these options have expensive price tags in terms of time and
DBA resources.

Wouldn’t it be nice if there was some way that bringing more data into a system didn’t
cause database structures and files to bloat? Well, Vertica engineers thought so too.
Vertica automatically compresses each column using one of fifteen different methods,
depending on the data type and distribution. Customers see data compression rates
of 10x to 60x as they load their raw data into Vertica. The engine is fully aware of
these compression algorithms, and can process compressed data until the last possible
moment. This gives you a double bang for your hardware buck. You use less disk
space to store the data, and less CPU and memory to process the data. As far as
actual file size goes, Vertica continuously monitors file structures to remove and “merge
out” deleted data and reorganize the file for maximum space efficiency. Tables can be
broken up into smaller storage units (called partitions), usually by some business
construct like month or year. That way, data can be easily rotated out by dropping
individual partitions, and partitions can be used during query execution to “prune”
irrelevant data or to improve parallelism.
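
To illustrate what processing compressed data can look like, here is a small Python sketch that uses run-length encoding on a sorted column and answers a count query without decompressing it. Run-length encoding is used here only as one familiar example of a column encoding; the function names and sample values are assumptions, and this is not a description of how the Vertica engine is implemented.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(column: List[str]) -> List[Tuple[str, int]]:
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def count_where(runs: List[Tuple[str, int]], target: str) -> int:
    """Count matching rows by reading run lengths, not individual rows."""
    return sum(length for value, length in runs if value == target)


# Hypothetical column of state codes, sorted so equal values are adjacent.
states = sorted(["MA", "CA", "MA", "NY", "CA", "MA", "MA", "NY"])
runs = rle_encode(states)
ma_rows = count_where(runs, "MA")
```

Eight stored values shrink to three runs, and the count touches only those three runs. Sorting matters: the same values in arrival order would produce more, shorter runs and compress less well.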

3. The number and size of the indexes is beginning to get cumbersome.

Indexes are good, right? They are to a point, but eventually you are going to find that
you are using the majority of your disk space for these adjunct structures. And more
disk space means less availability for growth, more complicated (read: expensive)
maintenance, and the need for more and larger hardware. MySQL loads indexes into
memory at execution time, so if your indexes no longer fit, the performance benefit of
having them disappears and query run times get longer. Again, the possible
solutions are smaller indexes (meaning smaller tables) or more memory. Getting this
free database up and running strong is starting to look very expensive.

Vertica doesn’t have indexes. It doesn’t need them. Data is physically stored in
compressed and sorted columns called projections, which essentially act as a traditional
index would, but without the extra I/O overhead required for performing lookups.
Projections can use all the columns in a table, or just a subset. They can be sorted
differently to provide optimization for different types of queries. Since they actually store
the physical data, not a pointer, having multiple projections on a table means they can
be used to support high availability, since they will either be replicated or segmented
and offset on each node (see #1 above). And don’t forget about the compression
explained in #2; this means that even with multiple copies of the data, you are still
storing a smaller amount than the actual raw data.
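
The claim that sorted columns can stand in for an index can be sketched with a binary search over a sorted column. The sketch below is a simplified Python illustration under stated assumptions: the column names and values are hypothetical, and a real projection also involves compression and segmentation that are left out here.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def range_positions(sorted_column: List[int], low: int, high: int) -> Tuple[int, int]:
    """Return (start, stop) row positions whose values fall in [low, high]."""
    return bisect_left(sorted_column, low), bisect_right(sorted_column, high)


# Hypothetical projection sorted by order_date (YYYYMMDD as an integer).
# The amount column is stored in the same row order.
order_date = [20110101, 20110115, 20110201, 20110210, 20110305]
amount = [10.0, 20.0, 30.0, 40.0, 50.0]

# Total for February 2011: find the slice by binary search, then read
# only that slice of the amount column.
start, stop = range_positions(order_date, 20110201, 20110228)
february_total = sum(amount[start:stop])
```

Because the data itself is stored in sorted order, the range is located directly in the column, with no separate index structure to maintain or to fit into memory.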

4. Tables are getting wider.

It’s bound to happen. Users are doing more complicated analysis, and ask for pre-
computed columns to be added to the fact table. Or, you are bringing in another data
source, so your dimension tables start getting wider. MySQL is a row-based database,
so every time a query asks for just one column in a table, all the other columns in the
table need to come along for the ride. This can get very expensive in I/O and overall
query efficiency.

Vertica is a native column-store database. Column stores offer significant gains in
performance, I/O, storage footprint, and efficiency when it comes to analytic workloads.
Why read and retrieve all columns in the table if you don’t need them? Unlike
traditional database vendors who struggle to retrofit columnar storage into their legacy
code for marginal gains, Vertica’s columnar orientation was deliberately designed into
the core platform from day one. This means that all Vertica components are columnar-
aware, so the platform delivers superior compression and encoding and more efficient
relational join performance, and the engine is able to operate on compressed columnar
data without having to unpack it.
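
A back-of-the-envelope comparison shows why wide tables hurt a row store more than a column store. The Python sketch below is simple arithmetic under stated assumptions: the table layout, column widths, and row count are hypothetical, and the model ignores compression, page layout, and caching.

```python
from typing import Dict, List

# Hypothetical fact table: column name -> width in bytes.
COLUMN_WIDTHS: Dict[str, int] = {
    "order_id": 8,
    "customer_id": 8,
    "order_date": 4,
    "amount": 8,
    "notes": 200,
}


def bytes_scanned_row_store(rows: int, widths: Dict[str, int]) -> int:
    """A full scan of a row store reads every column of every row."""
    return rows * sum(widths.values())


def bytes_scanned_column_store(rows: int, widths: Dict[str, int], needed: List[str]) -> int:
    """A column store reads only the columns the query references."""
    return rows * sum(widths[name] for name in needed)


ROWS = 1_000_000
row_bytes = bytes_scanned_row_store(ROWS, COLUMN_WIDTHS)
col_bytes = bytes_scanned_column_store(ROWS, COLUMN_WIDTHS, ["amount"])
```

Under these assumptions, a query that sums only the amount column scans 228 MB in the row layout and 8 MB in the column layout, and the gap widens with every column added to the table.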

5. You keep maxing out your servers.

Dan Khasis, a leading MySQL performance and scaling expert, says he sees clients
“reaching the threshold (of MySQL) when there are a few billion rows and people want
reports (or queries) instantly, with slicing dicing and drill down, sorting and grouping.
Their servers start running out of ram and start writing to disk or temp tables.” Adding
more and more hardware can get expensive. Even though you are saving in license
fees with MySQL, you are sinking a lot of money into your infrastructure/cloud
resources.

We have discussed Vertica’s pervasive use of column compression as one way of
beating the data bloat seen on other RDBMSs. Combine that with Vertica’s true shared-
nothing MPP architecture, and customers see better-than-linear scalability when adding new
servers to the cluster (see diagram below). And this isn’t proprietary hardware or an
appliance. Any well spec’d Linux server will do just fine. Vertica’s built-in high
availability also reduces the need for redundant hardware, because even if any node in
the Vertica cluster goes down, the database will still be available and active, with
minimal performance impact to user queries and data loads. The total cost of
ownership of your data warehouse as it grows, including hardware and the technical
resources to manage that hardware, should be an important factor in any long-term
maintenance plan. Using a commercial RDBMS that can fully utilize all the hardware to
the maximum extent might be the better financial choice moving forward.

“So, I may be showing some signs of outgrowing my current data warehouse database,”
you might say, “but migrating a production data warehouse is no trivial matter. I would
rather go back to clothes shopping with my mom when I was in junior high.” But it
doesn’t have to be that painful. Vertica has many features that make a migration project
a lot easier than you might think. Vertica is ANSI SQL-99 compliant, which means that
your DDL and current reports will run with few changes needed. In most customer
engagements, all the needed table DDL and query SQL are converted within hours. Vertica
also has a built-in Database Designer that, once pointed to your logical schema, some
sample data, and the queries, will tell you exactly what projections (the Vertica
physical storage mechanism) need to be built to get the optimal performance out of your
new database, as well as the DDL needed to build them. Adding new hardware as your
system continues to grow won’t be an issue either. A single command adds a new node to
the Vertica cluster and automatically rebalances the system for performance and high
availability. As of April 2011, Vertica’s largest deployment was on 230 nodes managing
over 1.5 petabytes of data, growing by a terabyte each month. Rest assured, you won’t
need a new data warehouse for a long, long time.

About Vertica
Vertica, an HP Company, is the leading provider of next-generation analytics platforms enabling customers to monetize ALL of their
data. The elasticity, scale, performance, and simplicity of the Vertica Analytics Platform are unparalleled in the industry, delivering
50x-1000x the performance of traditional solutions at 30% the total cost of ownership. With data warehouses and data marts
ranging from hundreds of gigabytes to multiple petabytes, Vertica’s 600+ customers are redefining the speed of business and
competitive advantage. Vertica powers some of the largest organizations and most innovative business models globally including
Zynga, Groupon, Twitter, Verizon, Guess Inc., Admeld, Capital IQ, Mozilla, AT&T, and Comcast.

Vertica, An HP Company 8 Federal Street, Billerica, MA 01821 +1.978.600.1000 TEL +1.978.600.1001 FAX                          www.vertica.com
© Vertica 2012. All rights reserved. All other company, brand and product names may be trademarks or registered trademarks of their respective holders.