CAN RAID TECHNOLOGY CONTINUE TO
PROTECT OUR DATA IN THE YEAR 2020?
     Bruce Yellin
     Advisory Technology Consultant
     Bruce.Yellin@emc.com
     EMC Corporation
Table of Contents

What’s the Issue?
The Danger
Tackling the Issue Head On
   Erasure Coding
   What About Solid State Disks?
   Triple parity
   RAID 1
   Big Data
Conclusion
Appendix – Background on How A Disk Handles I/O Requests
Footnotes

Disclaimer: The views, processes, or methodologies published in this article are those of the
author. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.

The digital data tsunami is coming. It is not a once in a decade storm, but a continuous daily
event that can overwhelm the largest of data centers. Like a giant surge of rising water, data
continues to grow at 60% a year and you need to be prepared. IDC
reports over a zettabyte (one trillion gigabytes) of information was
created and replicated worldwide in 2010, growing to 1.8 zettabytes
just a year later. Data has increased by a factor of nine in just the
last five years, and the number of files will grow 8-fold over the next five years1. Over 90%
of that data surge is unstructured, and “big data” is by far the largest producer of unstructured
data.

One of the challenging issues is how to store and protect all that data. No one is suggesting
you’ll have a zettabyte or even an exabyte, but in the next eight years, you could easily be
storing a petabyte of data. Will you use RAID to protect it?

RAID is the “Swiss Army” tool invented in 19882 to protect and improve the
performance of sub-gigabyte drives. Meanwhile, drive capacities have doubled
every year or two since then according to Kryder’s Law3. Similar to Moore’s
Law, Mark Kryder’s work suggests that by 2015, 6TB drives will be common and by 2020, we will
have inexpensive 14TB4 hard drives. Is RAID still the right tool to use?

The numbers can be scary. Years ago, 50-100TB seemed like a lot of storage. These days,
250TB of usable capacity minimally needs 320 x 1TB drives using RAID 6 (14+2) protection. At
the projected growth rate, you will have a petabyte of usable capacity and almost 1,300 of those
drives three years from now. RAID would be protecting over 13,300 x 1TB drives by 2020.
Clearly, that is a lot of drives, and eight years is beyond their useful life, so it is a good
thing those larger 6TB drives are coming, right? Well, maybe. Whether you believe data will
grow at 40%, 50%, or 60%, or if drive sizes double every year or two or three, one of the issues
facing the storage industry is whether today’s parity RAID concepts are up to the task of
protecting those large drives. Many have their doubts.

Drives have gotten larger because of advances in areal density – the number of bits in a square
inch of magnetic media. The original IBM RAMAC in 19565 had a density of 2,000 bits/in², while
today 625 gigabits fit in a square inch of a 4TB drive6. The downside to increased areal density
is the potentially harmful increase in bit error rate (BER).

Larger drives also have longer rebuild times. We also know that when we consolidate data onto
fewer, larger drives, such as two 1TB drives combined into a single 2TB drive, the performance
profile per terabyte decreases significantly. Solid-state drives (SSDs) can offset the performance deficiencies of
mechanical drives, but they have their own unique reliability issues. With that in mind, this
Knowledge Sharing article examines how RAID holds up.

What’s the Issue?
Mechanical hard disk drives, which will be called HDD throughout this article, use a magnetically
coated platter with “flying” disk heads, while SSDs use
memory chips. Drive platters are made from aluminum or
glass/ceramic that resists expansion and contraction when
they get hot or cold. That material is ground down and
base coated for an ultra flat surface. Additional polishing is
done before various magnetic layers are deposited on the
platter through processes such as “sputtering”7, all before
a protective coating is applied.

The disk drive housing contains one or more platters, a read/write head for each platter surface,
a head positioning coil, a motor to spin the platters, and electronics to translate the disk
controller’s commands and administer the buffer space. The head, literally flying over the
spinning platter at a fraction of an inch at speeds greater than 150 M.P.H. (15,000 RPM drive),
imparts a magnetic field or detects the existing magnetic field on the platter. The positioning coil
receives instructions from the drive electronics as to where to position the head over the platter.
Before a single bit of data is stored on a drive, it is first low-level formatted at the factory and
high-level formatted by your storage frame’s operating system. This allows any “bad” sectors to
be remapped to spare sectors.

With the drive operational, when a server issues a write to a disk sector, the data is broken up
into 512 or 4,096 byte sectors. Extra Error Correcting Code (ECC) check bits are calculated at
the same time and written along with the sector to the platter. For example, on EMC VMAX®8
and VNX® arrays, a sector is 520 bytes in size with 512 bytes for data and an 8 byte CRC (cyclic
redundancy check) checksum to maintain the integrity of the sector’s data. When the drive
subsequently reads that sector, it also reads the ECC and uses it to verify that the correct data
was read. Data are Run-Length Limited (RLL) encoded to maximize the number of bits stored
on the media, well beyond the coding of just a zero or a one bit.
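
To make the sector layout concrete, here is a minimal sketch in Python of a 520-byte sector built from 512 bytes of data plus a check field. It uses CRC-32 as a stand-in for the array's 8-byte checksum – the actual algorithm is internal to the array – so treat it purely as an illustration:

    import struct
    import zlib

    SECTOR_DATA = 512   # user data bytes per sector
    CHECK_BYTES = 8     # check bytes stored alongside the data (520 bytes total)

    def format_sector(data: bytes) -> bytes:
        """Append a check field to 512 bytes of user data."""
        assert len(data) == SECTOR_DATA
        crc = zlib.crc32(data)                  # stand-in for the array's checksum
        return data + struct.pack(">Q", crc)    # pad the 32-bit CRC into 8 bytes

    def read_sector(sector: bytes) -> bytes:
        """Verify the check field before returning the data, as the drive/array would."""
        data = sector[:SECTOR_DATA]
        stored = struct.unpack(">Q", sector[-CHECK_BYTES:])[0]
        if zlib.crc32(data) != stored:
            raise IOError("unrecoverable read error: checksum mismatch")
        return data

    sector = format_sector(b"\x00" * SECTOR_DATA)
    assert read_sector(sector) == b"\x00" * SECTOR_DATA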

2012 EMC Proven Professional Knowledge Sharing                                                    4
CAN RAID TECHNOLOGY CONTINUE TO PROTECT OUR DATA IN THE YEAR 2020? - Bruce Yellin Advisory Technology Consultant EMC Corporation
With data packed so densely, ECC frequently corrects the data given the fluctuating signal
strengths and electronic noise of a flying head over a rotating platter. It is when ECC is unable
to correct the data that your operating system displays a disk read error message.

As mentioned earlier, drive capacity is a function of its areal density, and the denser you can
pack the zeroes and ones on the magnetic media, the greater the BER and likelihood of read
failures. Areal density is ultimately limited by the BER, especially when heat, commonly called a
thermodynamic effect9, becomes a factor.

Some manufacturers are tackling the BER problem
by formatting drives with larger 4,096 byte sectors
and 100 bytes of ECC. Larger ECCs distribute the
calculation overhead from eight sectors to one sector
thereby reducing the error checking overhead
caused by 512 byte sectors on dense drives10. The
4KB ECC helps address BER problems that are in some cases approaching 1 bit of error for
every 12.5TB read – more on this in a later section.

Another benefit of the 4KB sector is a reduction in the usable capacity overhead on the drive.
The improvement ranges from “…seven to eleven percent in physical platter space” and “The 4K
format provides enough space to expand the ECC field from 50 to 100 bytes to accommodate
new ECC algorithms.”11
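
A back-of-the-envelope calculation, using only the ECC byte counts quoted above, shows where most of that improvement comes from; real drives also spend space on gaps, sync marks, and address marks, which is likely where the rest of the quoted seven to eleven percent lives:

    # Efficiency of storing 4,096 bytes of user data under each format.
    legacy_ecc_per_sector = 50                        # bytes of ECC per 512-byte sector
    advanced_ecc = 100                                # bytes of ECC per 4,096-byte sector

    legacy_total = 8 * (512 + legacy_ecc_per_sector)  # eight legacy sectors
    advanced_total = 4096 + advanced_ecc              # one 4K sector

    print(f"512-byte format efficiency: {4096 / legacy_total:.1%}")    # ~91%
    print(f"4K format efficiency:       {4096 / advanced_total:.1%}")  # ~98%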

Some systems proactively test drive reliability before failures occur by “sniffing” or “scrubbing”
them and making sure all the data can be “test read” without error. This happens during idle
moments and has very little performance impact on the rest of the system.

Monitoring the well-being of every zero and one on a disk and noting the quantity of recoverable
read errors is just one way of predicting a possible future disk failure. These statistics,
depending on the system, allow it to proactively move data from troubled disk areas to safe
locations on other drives before serious problems arise. This monitoring is just one way a
system can perform “self-healing”.

Another tool used to predict failure is S.M.A.R.T., or Self-Monitoring Analysis and Reporting
Technology. S.M.A.R.T. tracks drive conditions and alerts the user to pending disk failures with
the goal of giving you enough time to back up your data before it is lost. This tool incorporates
metrics such as reallocated sector counts, temperature, raw read errors, spin-up time, and
others. In theory, if the temperature rises or any other triggering event occurs, data can be
preemptively moved off the suspect drive or all the data can be copied to a spare. Copying is
much faster than parity rebuilding. However, in a Google study of over 100,000 drives, there
was “…little predictive value of S.M.A.R.T.“ and “…in the 60 days following the first scan error
on a drive, the drive is, on average, 39 times more likely to fail than it would have been had no
such error occurred.”12 For more on S.M.A.R.T., please read the article “Playing It S.M.A.R.T.”13.
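
As a rough illustration of the kind of check a monitoring layer performs with these metrics, consider the sketch below. The attribute names and thresholds are made up for the example – real S.M.A.R.T. attributes and trigger points vary by drive and vendor:

    # Illustrative only: attribute names and thresholds are hypothetical, not vendor values.
    SMART_THRESHOLDS = {
        "reallocated_sector_count": 50,    # sectors already remapped to spares
        "raw_read_error_rate": 100,        # recoverable read errors since the last poll
        "temperature_celsius": 55,
        "spin_up_time_ms": 9000,
    }

    def drive_at_risk(attributes: dict) -> list:
        """Return the attributes over threshold that warrant proactively copying the drive."""
        return [name for name, limit in SMART_THRESHOLDS.items()
                if attributes.get(name, 0) > limit]

    sample = {"reallocated_sector_count": 120, "temperature_celsius": 41}
    print(drive_at_risk(sample))   # ['reallocated_sector_count'] -> copy data to a spare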

As a disk controller issues read and write drive requests, it may find a sector to be unreliable –
i.e., the magnetic coding is not retaining data or is reporting the wrong information. For example,
after a write, the drive reads the information again to make sure it was written properly. If it can’t
fix the error through ECC, the data may be unrecoverable, in which case the sector can be
“marked” as unreliable and mapped to one of many spare sectors on a drive. This is called bad
block remapping.

Some drive “failures” are caused by extreme heat or vibration, so some systems can spin down
a drive, wait a small amount of time, and then spin it up again. This can sometimes allow the
drive to continue without the need for rebuilding. Clearly, any drive that is “revived” is also
reported to the storage array system so you can proactively replace it. One study found
“between 15-60% of drives considered to have failed at the user site are found to have no
defect by the manufacturers upon returning the unit.”14 In other words, the failure could be
environmental.

Should these approaches fail to correct the problem, you could restore lost or mangled data
from a data backup. However, if your backup is from last midnight and it is now 3 P.M., you
probably are guaranteed to lose valuable data. That’s where RAID comes in. RAID has the
ability to rebuild a drive that has unreadable sectors, or even one that has stopped spinning
entirely.

RAID can also improve performance. With processors getting faster and more scalable with
multiple cores, mechanical disks have struggled to provide adequate performance for read and
write requests. RAID 1 can help improve performance by offering independent read
functionality, and RAID 5 and 6 do so to a lesser degree through distribution of the requests
across the drives in the RAID group. SSDs provide dramatically faster response than HDDs.

There are three popular forms of RAID – mirrored, single parity, and double
parity. Mirrored RAID or RAID-1 maintains an exact copy of one drive on
another such that if one needs to be replaced, the “good” drive is copied to a
spare drive.

Single parity, or RAID 5, is a single-dimensional parity written as a diagonal stripe to distribute it
and balance performance among the drives in the RAID group. It is often written as R5(4+1)
meaning four data drives and one parity drive, although the number of data drives is up to each
vendor’s implementation. In the event of a single rebuild activity, single parity plus the surviving
“good” elements can restore the failed element to a spare drive.

Double parity or RAID 6 is like RAID 5 plus a dedicated second parity drive enabling it to rebuild
two unique errors at the same time. It is commonly expressed as R6(6+2), or six data drives
and two parity drives. Like RAID 5, the number of data drives is up to the vendor. Conceptually,
RAID 5 rotates a single diagonal parity stripe from lower left to upper right across the group,
while RAID 6 uses two such stripes.
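
Single-parity protection rests on a simple XOR property: the parity block is the XOR of the data blocks, so any one missing block can be regenerated by XORing the survivors. A minimal sketch, with short byte strings standing in for drive stripes (RAID 6 adds a second, differently computed parity, which is what lets it survive two failures):

    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-length byte blocks together, column by column."""
        return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

    # A 4+1 stripe: four data blocks plus one parity block.
    data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
    parity = xor_blocks(data)

    # Drive 2 fails: rebuild its block from the survivors plus the parity.
    surviving = [data[0], data[1], data[3], parity]
    assert xor_blocks(surviving) == data[2]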

Parity rebuilding certainly comes
in handy when data becomes
unreadable. Using data from the
surviving drives plus the parity,
unreadable data is reconstituted
to a spare drive (if available). The
rebuilding does take up disk
controller resources and many systems have priority rebuilding schemes to reduce or lengthen
the rebuild times. The larger the drive and the greater the number of them in the group, the
longer the rebuild takes. Conversely, the less data on the drive, the faster the rebuild occurs.

An extreme example of a problem that can happen is when a flying disk head crashes into a
disk platter. This is most likely to occur when a drive is just starting up, there is shock or
vibration to the drive, or if a microscopic particle gets caught between the head and the platter.
With the platters rotating at high speed, if the disk head scratches the surface, it can not only
destroy the magnetic media and the data in those sectors, but magnetic material can be kicked
up and contaminate the rest of the drive. Fortunately, head crashes are rare, with some studies
indicating it is 30 times more likely to have a media failure than a disk crash15, but should it
happen, the drive and the data needs to be replaced. The issue is how fast can a drive be
replaced and might another error appear before then?

The Danger
What happens when another drive fails before the first is repaired? How about the impact of a
drive rebuild on other RAID group members, other drives on the same disk controller, or even
the entire storage frame? It is hard to predict rebuild times which makes it difficult to assess the
risk. Rebuild factors include:
    1. drive attributes such as sizes and speeds (i.e., larger and slower drives take longer)
    2. each storage vendor offers different priority schemes
    3. the failed drive and its replacement spare may not be on the same controller
    4. the architecture of the storage frame and the version of operating code

Hours to repair a failed drive based on rebuild priority, with spare sync:

  Priority    300GB 10K   600GB 15K   1TB 7.2K   2TB 7.2K   4TB 7.2K   8TB 7.2K   12TB 7.2K   16TB 7.2K
  Low             52          69         230        460        921      1,842       2,763       3,684
  Medium          28          37         122        245        490        980       1,470       1,960
  High            14          19          63        127        253        507         760       1,014
  ASAP             2           2           8         15         31         61          92         123

Example: With an EMC VNX running OE 31, a 600GB 15K drive protected by RAID 5(4+1)
can be rebuilt in just over two hours at the ASAP priority, including the time to replace the
defective drive with a new one and having the hot spare sync its contents to it16. SSD rebuild
times would be on par with 15K HDDs. If your system uses permanent spares such as an EMC
VMAX, once the drive is rebuilt, the data stays on that drive—i.e., there is no need to return that
drive to a spare drive pool, which can in some cases cut the rebuild time almost in half.
Example: Seagate’s Cheetah 15K.7 600GB drive17 has an average sustained transfer rate of
163MB/s. However, trying to achieve that rate during a rebuild would probably bring your
storage unit to its knees since the specification of a SAS disk controller loop is 400-600MB/s. In
other words, 27-40% of the bandwidth would be consumed by just a single disk rebuild, never
mind servicing possibly hundreds of other I/O requests for unaffected drives on the same loop.

Strategies that reduce the impact of a rebuild on a disk controller include the use of smaller
drive loops. Small loops will become more practical as controllers leverage faster multi-core
technology. Data paths for 12Gb SAS 3.018 are also on the drawing board and will allow the use
of higher rebuild priorities while having a lower impact on the storage frame’s other work. Until
that happens, a more conservative calculation that is kinder to the rest of the storage frame is to
pick a priority that yields a drive rebuild of 30-50MB/s.
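
The arithmetic behind these figures is simply capacity divided by the effective rebuild rate, so a “frame-friendly” 30-50MB/s target translates directly into hours per drive size. A quick sketch (decimal terabytes, ignoring verification passes and queueing):

    def rebuild_hours(capacity_tb: float, rate_mb_per_s: float) -> float:
        """Hours to rebuild a drive of the given capacity at a sustained rebuild rate."""
        capacity_mb = capacity_tb * 1_000_000          # decimal TB -> MB
        return capacity_mb / rate_mb_per_s / 3600

    for size_tb in (0.6, 1, 2, 4, 6):
        print(f"{size_tb} TB: {rebuild_hours(size_tb, 50):5.1f} h at 50 MB/s, "
              f"{rebuild_hours(size_tb, 30):5.1f} h at 30 MB/s")
    # A 600GB drive takes roughly 3-6 hours at these rates; a 6TB drive takes 1.4-2.3 days.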

[Chart: Rebuild time in hours by priority (Low, Medium, High, ASAP) for each drive size/RPM,
from 300GB 10K through 16TB 7.2K]

From our earlier example, setting the rebuild priority to “low” equates to 69 hours, or almost 3
days, to rebuild that 600GB drive. Increase the priority to “medium” and the rebuild time is reduced to 37
hours. Use “high” priority and it takes 19 hours. Notice that as the size of the drive increases, so
does the rebuild time. A 2TB drive takes 127 hours or more than 5 days at “high”! If this
progression continues, 8TB drives in the year 2015 might take 21 days to rebuild and perhaps
impact the storage frame’s response time during that time.

No one wants to wait days to rebuild a drive, but you also don’t want to impact other
applications running on your storage frame. When the drive is running at its maximum transfer
rate, it basically uses so much of the disk controller’s loop capacity that other I/O processing to
surviving drives is effectively reduced to a trickle. That is why using the ASAP priority for
rebuilds is often the exception and not the rule. Many companies use a “high” priority to rebuild
a failed drive, since it does an adequate job with smaller 300GB and 600GB drives while having
only some impact on overall performance. The rebuild-priority table shown earlier gives a sense
of what an “acceptable” rebuild rate costs in hours for each drive size. Keep in mind that the larger the RAID group, the longer the
rebuild process can take – i.e., RAID 5(8+1) takes longer to rebuild than RAID 5(4+1).

You should not focus on the precise number of hours or days to rebuild a drive since it depends
on so many factors. Nor should you use these to time your own rebuild rates. Each storage
vendor will have similar specific drive timings given the specifications from Seagate, Western
Digital, Hitachi, etc. are so similar. Instead, use the general rebuilding concepts being discussed.
For example, IBM says the RAID 6(14+2) rebuild of 16 x 146GB 15K takes 90 minutes, or a rate
of 27MB/s19, which allows you to make drive protection decisions based on your risk exposure.
Extrapolating IBM’s analysis to larger drives, the following table can be constructed20. There
was no mention of what priority scheme was used, but a 3TB drive might take more than 2 ½
days to rebuild.

IBM/NetApp Rebuild Extrapolation

  Drive (GB)   RPM    Hours
       146      15      1.5
       300      15      3.1
       300      10      4.6
       450      15      4.6
       600      15      6.2
       450      10      6.9
       600      10      9.2
     1,000      7.2    21.4
     2,000      7.2    42.8
     3,000      7.2    64.2

The larger the drive, the longer it takes to rebuild and the greater the risk of a second error
occurring during that time. For example, with RAID 5, a second drive failure in a group that is
actively being rebuilt likely leads to data loss. That is the main reason why RAID 6 was
introduced in 198921. RAID 6 can tolerate two drive failures in the same group and rebuild them
without data loss, or a single drive failure and a read error on a second drive. As you can
imagine, rebuilding two drives concurrently, while not taking twice the time, is nonetheless a
lengthy process. And the longer it takes, the greater the probability of yet another loss or a
system that would run very slowly with all the rebuild activity going on.

On top of slow rebuild times, storage frames these days can easily have a thousand or more
disks, perhaps of a similar manufacturing age. If older HDDs begin to fail, there could be an
increased likelihood that others of the same vintage will experience the same fate. This is
sometimes called “bit rot”22, or the corruption of data caused by the decay of magnetic media
over time. While unlikely, it is possible that rebuilding a failed drive can produce a bit rot failure
in a good drive in the same RAID group since they all have the same usage patterns. To
preemptively battle bit rot, manufacturers use continuous fault detection through sniffing and
scrubbing, and then employ self-healing to safely move data out of danger before the media
fails.

The dramatic worldwide increase in the volume of data dictates that storage frames will need to
hold many more drives. From our earlier example, assume your 2012 storage frame had 250TB
of usable data and it grew at 60% a year. In three years, you would need 1,280 x 1TB drives
and by 2020, the frame would need over 13,000 x 1TB drives! Fortunately, drive capacity
increases every year or so, and staying on top of the data explosion means leveraging these
multi-terabyte drives along with techniques such as compression and deduplication. With 6TB
drives expected to be common by 2015, you would still need 224 of them to hold a petabyte of
capacity.

                                       2012   2013   2014   2015   2016   2017   2018   2019    2020
  Usable TB in your frame (60%/yr)      250    400    640  1,024  1,638  2,621  4,194  6,711  10,737
  # of 1TB drives, R6(14+2)             320    512    800  1,280  2,048  3,280  5,232  8,368  13,392
  # of 2TB drives, R6(14+2)             160    256    400    640  1,024  1,648  2,624  4,192   6,704
  # of 3TB drives, R6(14+2)             112    176    272    432    688  1,104  1,744  2,800   4,464
  # of 4TB drives, R6(14+2)              80    128    208    320    512    832  1,312  2,096   3,360
  # of 5TB drives, R6(14+2)              64    112    160    256    416    656  1,056  1,680   2,688
  # of 6TB drives, R6(14+2)              64     96    144    224    352    560    880  1,408   2,240
  # of 7TB drives, R6(14+2)              48     80    128    192    304    480    752  1,200   1,920
  # of 8TB drives, R6(14+2)              48     64    112    160    256    416    656  1,056   1,680
  # of 9TB drives, R6(14+2)              48     64     96    144    240    368    592    944   1,488
  # of 10TB drives, R6(14+2)             32     64     80    128    208    336    528    848   1,344
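
A sketch of how a table like this can be generated is shown below. It assumes the usable requirement is in binary TiB, drives are decimal TB, and capacity is added in whole RAID 6 (14+2) groups; vendor right-sizing and hot spares push real-world counts somewhat higher, so the output approximates rather than reproduces the table:

    import math

    def drives_needed(usable_tib: float, drive_tb: float, data_drives=14, group_size=16) -> int:
        """Drives required when capacity is added in whole RAID 6 (14+2) groups."""
        drive_tib = drive_tb * 1e12 / 2**40                  # decimal TB -> binary TiB
        groups = math.ceil(usable_tib / (data_drives * drive_tib))
        return groups * group_size

    usable = 250.0
    for year in range(2012, 2021):
        print(year, round(usable), drives_needed(usable, 1), drives_needed(usable, 6))
        usable *= 1.6                                        # 60% annual growth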

Keep in mind that trying to replace a 6TB drive protected with RAID 6(14+2) at an acceptable
priority could take weeks, and the risk of a second or third simultaneous failure increases. That
is why the 1988/1989 concepts of RAID protection may be coming to an end. By 2020, with
10TB drives and 10 petabytes of usable capacity, the risk of more than two simultaneous
mechanical failures among 1,344 drives – and the crippling data loss that would follow – is very real.

  Seagate Drive    Size/speed       MTBF        AFR      Unrecoverable read errors
  Cheetah          300GB FC 15K     1,600,000   0.55%    1 in 10^16 bits
  Constellation    3TB SAS 7.2K     1,200,000   0.73%    1 in 10^15 bits
  Barracuda XT     3TB SATA 7.2K      750,000   0.34%    1 in 10^14 bits

There are many variables that factor into HDD reliability, including the drive itself, temperature,
humidity, vibration, cables, controllers, firmware, operating system, and others. In an effort to
advise the marketplace on disk reliability, manufacturers like Seagate
publish statistical specifications such as Mean Time Between Failure (MTBF) and Annualized
Failure Rate (AFR)23. MTBF is a calculation of how many hours a drive will continue to operate
before there is a failure, estimated for a large population of drives versus an individual drive. For
example, a Cheetah drive with a MTBF of 1,600,000 hours indicates that with a large number of
these model drives, half will fail in 1,600,000 hours or 183 years. If your storage system has
1,000 drives, the results look a lot different and a failure can occur in just 67 days. Drive
reliability can be calculated by this formula:
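
A reasonable reconstruction of that formula, consistent with the 67-day example:

    $$\text{Expected hours to a failure in the population} \approx \frac{\text{MTBF}}{\text{number of drives}} = \frac{1{,}600{,}000}{1{,}000} = 1{,}600 \text{ hours} \approx 67 \text{ days}$$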

With “half” failing, it is reasonable to expect some will have problems earlier or later than this
calculation. It also says with 1,000 Barracuda drives, you can experience 12 failures a year or
more than twice as often as the Cheetah because the MTBF is lower.

Manufacturers also offer an AFR or Annualized Failure Rate calculation. It is based on the
MTBF and relates the likelihood of failure to the number of hours the device is used in a year.
For example, the Cheetah has a MTBF of 1,600,000 hours and an AFR of 0.55%, so at first
glance it appears the AFR of the Barracuda should be much higher given its MTBF is less than
half at 750,000. It turns out that Seagate rates the Barracuda against 2,400 hours a year and
not 8,760 used by the Cheetah, or a 27% duty cycle.

So if you used your Barracuda for a full year, 8,760 hours, its AFR would increase to 1.17%.

You can expect the AFR to increase every year it is used, so a Barracuda XT used for a full
8,760 hours/year would have an AFR of 2.34% by year two, 3.5% by year three, and if you used
it for five years, almost 6%. And if you have hundreds of them, you have a problem on your
hands.
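
The AFR arithmetic can be checked with a couple of lines – to a first approximation, AFR is just the powered-on hours per year divided by the MTBF (the exponential form only matters when that ratio gets large):

    def afr(mtbf_hours: float, powered_on_hours: float = 8760) -> float:
        """Approximate annualized failure rate for a given duty cycle."""
        return powered_on_hours / mtbf_hours

    print(f"Cheetah, 24x7:         {afr(1_600_000):.2%}")     # ~0.55%
    print(f"Barracuda, 2,400 h/yr: {afr(750_000, 2400):.2%}") # ~0.32%, near the 0.34% spec
    print(f"Barracuda, 24x7:       {afr(750_000):.2%}")       # ~1.17%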

Some studies have shown that spec sheet MTBF and AFR numbers actually underestimate real
world observations, in some cases, by orders of magnitude. A 2006 study at Carnegie Mellon
University of 70,000 drives found the “annual disk replacement rates exceed 1%, with 2-4%
common and up to 12% observed on some systems.”24. However, it should be noted that a
small number of drives in their study were seven years old and beyond any reasonably
expected useful service life.

In an attempt to gain a practical understanding of MTBF and AFR, Google did their own study25
in 2007 and also found the AFR for drives in their
environment to be significantly higher during the
drive’s initial use, settling down to 1.7% after the first
year, but still roughly 3X higher than Seagate’s
claim. The AFR spiked to 8% and higher in years two

through five. They also found the “MTBF could degrade by as much as 50% when going from
operating temperatures of 30C to 40C”. The Google study also found “…after their first scan
error, drives are 39 times more likely to fail within 60 days than drives with no such errors.” They
were referring to a sniffing or scrubbing scan that tries to detect errors before they become
catastrophic failures.

Some might question the validity of Google’s report since the study involved consumer and not
enterprise grade drives, used models from various manufacturers, and may not have been
mounted or used in accordance with manufacturer’s specifications. For example, in this picture
of a Google server26, each drive is attached to the server using Velcro straps rather than
mounting screws, which brings up vibration questions. The study was also done six years ago
against drives that were already five years old. Nevertheless, the question of reliability is a
concern as drives get older and larger, and storage frames have more of them – i.e., your AFR
“mileage may vary”.

Tackling the Issue Head On
What is the magnitude of the problem? Are there useful, practical approaches to deal with it?
For example, a petabyte of usable storage with 1TB drives and RAID 5 protection requires
1,400 drives. If the AFR is 0.5%, then seven drives statistically could fail per year. If the Google
analysis has any merit and the AFR of a two year old storage frame is 8%, then 112 drives
could fail per year. Somewhere between these calculations lies your impact. As discussed
earlier, the rebuild time for this size drive without impacting a large portion of the storage frame
would be 122 hours at medium priority, or five days. With seven failures per year, rebuilds are
running 35 days a year. With 112 failed drives, you might have to give up because you would
have almost two years of drive repairs for every year of operation. And that assumes you don’t
get a second error while repairing a RAID group. Older drives would dramatically increase the
percent of rebuilding per year that was occurring, and in the case of Google’s findings, could
mean a significant amount of storage resources were spent just on rebuilding. That is why the
useful life of a disk drive is usually less than five years.
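
The “days per year spent rebuilding” figure follows directly from drive count, AFR, and per-drive rebuild time. A sketch using the assumptions above:

    def rebuild_days_per_year(drive_count: int, afr: float, days_per_rebuild: float) -> float:
        """Expected failures per year multiplied by the rebuild time of each failure."""
        return drive_count * afr * days_per_rebuild

    print(rebuild_days_per_year(1400, 0.005, 5))   # ~35 rebuild-days/year at a 0.5% AFR
    print(rebuild_days_per_year(1400, 0.08, 5))    # ~560 rebuild-days/year at an 8% AFR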

Multiple drive failures in a RAID group participating in a thin provisioned storage layout could
also lead to a multi-day outage given the failed drive could contain logical members from an
enormous number of servers. With traditional thick LUNs, the number of servers using a RAID
group was limited to the size of the LUN divided into the usable capacity of the group. In some
cases, the LUN could be spread out across multiple LUNs, but again, to support a relatively
limited number of servers. Thin provisioning increases utilization and performance, but
architecturally, the slices used in the RAID group are tiny. For example, with a 300GB RAID
5(4+1) yielding a little more than a terabyte of usable capacity, traditional LUNs of 50GB or so
would mean about 20 disk slices in that group. With thin provisioning, the same five drives
would create over a thousand slices and could easily be logically part of each and every virtual
and physical server on that storage frame. Larger drives in a thin pool increase the impact of
data loss. If there was a double failure, such as one bad drive and a second one with a read
error, RAID 5 would lose data and would impact just those hosts in a thick LUN arrangement,
versus nearly every host in your data center which had a thin member on the failed drive. RAID
6 would afford you more protection, but there is a limit to that protection as well. With thin
provisioning RAID data loss, the time to restore the drive from tape could be excessive and
severely impact your company’s business.

[Diagram: a RAID group carved into a handful of large thick LUNs, contrasted with the same
group carved into a very large number of small thin-LUN slices]

A 4TB drive takes 4X as long to rebuild as a 1TB drive, so with RAID 5(4+1) and 490 hours to
rebuild it at medium priority (20 days), the storage frame could risk serious data loss. Even
without lost data, there could be a performance impact should other drives also need
replacement during that time. Fortunately, RAID 6 addresses the issue of a second
simultaneous failure. You should never use RAID 5 with large drives – the risk of data loss is
just too high.

With larger drives, the impact of losing a disk drive could have even more far reaching
implications when administrators use data compression and deduplication on them in their
archiving tier. Compressed, a 1TB drive could easily contain 2TB of data. That means the drive
is twice as likely to be in demand. With deduplication, even more data is kept on that critical
drive.

The issue here is not just MTBF and AFR, it is the number of unrecoverable read errors. In the
earlier Seagate chart, the rate can be anywhere from 10^14 to 10^15 depending on the quality
of the drive, or statistically an error trying to read 100,000,000,000,000 to 1,000,000,000,000,000
bits. A 1TB drive contains 8,796,000,000,000 bits and a futuristic 10TB drive has
87,960,000,000,000 bits. This roughly equates to every ten full reads of a 1TB drive producing a
single uncorrectable BER with a commercial grade drive. With an enterprise quality drive, you
can expect a hundred such operations with a 1TB drive before you get an unrecoverable error.

  Drive     Bits on the drive        BER          Bits read per error
  1TB       8,796,000,000,000        1 in 10^14   100,000,000,000,000
  10TB      87,960,000,000,000       1 in 10^15   1,000,000,000,000,000

When the drive is ten times larger, you will only need to read the entire drive once to produce a
single error on a commercial quality disk. In fact, if there needed to be a rebuild with 10TB
drives, you would be guaranteed multiple read errors because every sector on every disk must
be read to rebuild the image, thereby increasing the chances of another read error (perhaps
recoverable, perhaps not). At that point, it will likely interrupt the rebuild unless additional logic is
employed. Worst case, you would restart the rebuild session and hope for the best. If you are
still not convinced of the severity of a BER failure, think about the encryption of the data on the
drive – if you cannot read each bit, you will not decrypt that information.

The RAID-5 problem with large drives in a nutshell

Example: With 1TB drives (each with a BER of 10^14) and RAID-5 (8+1) protection, to rebuild a
failed 1TB drive, the eight surviving drives must be completely read to recreate the data on a
spare drive. If other failures occur before the spare is rebuilt, data can be lost.

The MTBF specification of a 1TB SATA drive is between 750,000 and 1,200,000 hours, but
studies put this closer to 400,000 hours. If the spare rebuilds at 30MB/s, it takes 10 hours to
rebuild with permanent sparing. With a 400,000-hour MTBF and a rebuild involving 9 drives,
there is a 0.025% chance of another drive failure – pretty low risk.

A BER of 10^14 means 1 error every 12.5TB. With 8 drives being fully read, there is an 8/12.5
or 64% chance of hitting an unreadable sector. That means roughly every 1 ½ rebuilds will fail.
If the BER improves to 10^15, the probability drops to 6.4%. Larger drives increase the
probability.
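
The 64% figure is the simple ratio of data read during the rebuild to the BER interval; treating each bit as an independent trial gives a somewhat lower but similarly alarming number. A sketch:

    import math

    def p_unrecoverable_error(drives_read: int, drive_tb: float, ber_exponent: int) -> float:
        """Probability of at least one unrecoverable read error during a rebuild."""
        bits_read = drives_read * drive_tb * 8e12         # decimal TB -> bits
        return 1 - math.exp(-bits_read / 10 ** ber_exponent)

    print(f"{p_unrecoverable_error(8, 1, 14):.0%}")   # ~47% for 8 x 1TB at 10^14 (64% by the simple ratio)
    print(f"{p_unrecoverable_error(8, 1, 15):.1%}")   # ~6.2% at 10^15
    print(f"{p_unrecoverable_error(8, 10, 15):.0%}")  # ~47% again once the drives are 10TB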

The BER is also significantly different for fibre-channel and SAS drives compared to SATA. In
the chart comparing the Seagate Cheetah to the Barracuda XT, the Cheetah has a BER of
10^16, or 100 times better than the Barracuda at 10^14. Their Constellation series of SAS drives,
available in 2TB or 3TB capacities, has a BER of 10^15, and would be a better choice for
higher capacity drives than their SATA Barracuda, albeit at a higher price.

A great deal has been written about the statistical Mean Time To Data Loss (MTTDL)
calculation over the last 20 years, and like most estimations, it is accompanied by arguments
that either serve as an affirmation or a negation of the reliability of disk storage. One thing is
certain – the risk of data loss increases as HDDs get larger simply because their performance
has not kept up with Moore’s Law. In other words, if drive rebuilds were super-fast or perhaps
instantaneous, the drive size would not be a factor. Given drives fail, all that can be done is to
provide adequate protection, either through RAID or another copy of the data, or both.

As already discussed, it is risky to have hundreds or thousands of large drives protected with
RAID 5. RAID 6 is much safer, but far from perfect “…with 1 petabyte of storage and RAID 5 …
the odds of annual data loss are around 80%. RAID 6 extends the window and is fine for 1TB
drives, but with 2TB drives in a 1,000 disk system there’s a 5% chance of annual data loss. That
figure increases to 40% with 8TB drives, and eventually the chance of annual data loss will
approach 100%.”27 With the data explosion and limited budgets, you will simply have to use
these very large drives.

So why not use RAID 6 for all your disks? The tradeoff you may need to make when choosing
between RAID 5 and RAID 6 is MTTDL versus performance. From a performance perspective,
depending on how your vendor implements RAID 6, it can have a 10-20% higher write overhead
than RAID 5. That means your RAID 6 system does fewer useful write I/Os per second than
RAID 5 – i.e., each write takes longer to complete. Therefore, RAID 6 should be evaluated
carefully in small-block, random, write-intensive environments. For read-intensive environments, RAID 5 and
RAID 6 have equivalent performance.
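
The write overhead difference is easiest to see with the common write-penalty rule of thumb – roughly four back-end I/Os per small random write for RAID 5 and six for RAID 6, before any cache optimizations a vendor may apply. A sketch:

    WRITE_PENALTY = {"RAID 1": 2, "RAID 5": 4, "RAID 6": 6}

    def front_end_iops(drives: int, iops_per_drive: float, raid: str, read_fraction: float) -> float:
        """Host-visible IOPS a group can sustain for a given read/write mix."""
        backend = drives * iops_per_drive
        write_fraction = 1 - read_fraction
        return backend / (read_fraction + write_fraction * WRITE_PENALTY[raid])

    # Eight 15K drives at ~180 IOPS each, 70% reads / 30% writes:
    for level in ("RAID 5", "RAID 6"):
        print(level, round(front_end_iops(8, 180, level, 0.7)))   # ~758 vs. ~576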

[Chart: Hourly compensation costs due to data loss – outages of 1, 4, and 8 hours at loaded
costs of $25, $50, and $100 per employee-hour, for companies of 1,000 to 50,000 employees]

Losing data is obviously a serious problem. Rebuilding a drive is a fact of life, but losing data
because of a second or third failure, or trying to rebuild it too fast and causing severe
performance issues, can lead to a loss of employee productivity. If an employer of 20,000
workers loses just one hour of productivity because of data unavailability, it could cost them a
million dollars in salary and other benefits with a loaded cost of $50/hour/employee. After four
hours, the company would have spent $4M if a key system is unavailable. For example, if Delta
Airlines is unable to process reservations for eight hours while two drives are being rebuilt from
backup tape, they could lose a portion of their customer base. If the business is an online
reseller, there could also be a loss of web revenue. A military operation such as mid-air jet
refueling could put lives or even a nation at risk.

Another approach, while clearly more expensive, is to make multiple full copies of data.
Snapshots probably won’t help since if there are multiple drive failures, the loss of primary data
could render snapshots worthless.

Let’s examine some other approaches.

Erasure Coding
While RAID 5 and RAID 6 have their place in a storage frame, using them with 4TB and larger
drives dramatically increases the risk of data loss. Can “erasure codes” provide better
protection? Invented by Irving Reed and Gustave Solomon in 1960, their Reed-Solomon
erasure code was popular with telecommunications workloads, especially on long distance
transmissions and noisy circuits. A form of Forward Error Correction (FEC), their approach grew
in popularity with the entertainment industry’s introduction of the compact disc in 1982 as
erasure coding became “…the first example of the introduction of Reed-Solomon codes in a
consumer product.”28 DVDs came to market in 1995 and also used Reed-Solomon erasure
coding29 to allow us to enjoy music and movies, even when the discs get scratched.

Erasure coding breaks the data into fragments and adds extra blocks capable of rebuilding any
defective blocks such as those caused by an unreadable sector. They say you can literally drill a
hole in a DVD and it will play fine because of Reed-Solomon coding. A disk drive rebuild, which
might have taken days with RAID 5 or RAID 6, can now be accomplished in minutes. Here is an
example from Permabit’s RAIN-EC (Erasure Coding) white paper on how it works:30

     now is the time for all good men to come to the aid of their country

The sentence is split into four data chunks, A through D, plus two protection chunks, P and Q:

   A = “now is the time ”
   B = “for all good men t”
   C = “o come to the aid ”
   D = “of their country”

A chunk can be rebuilt from any 4 remaining chunks:
   P = A + B + C + D
   Q = A + 2B + 3C + 4D

Example: If A=1, B=2, C=3 and D=4, then P=10 and Q=30. If A and B are lost, then:
   A = 2P - Q + C + 2D = 1
   B = P - A - C - D = 2

In this simple example, the data “now is the time
for all good men to come to the aid of their
country” is broken into four data chunks with two
additional protection chunks. If “now is the time ”
and “for all good men t” are lost, they can be quickly
rebuilt from calculations using the P and Q chunks
without a lengthy rebuild of an entire drive or two.
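
A minimal sketch of the P and Q arithmetic from the example above, using plain integers for clarity – production Reed-Solomon implementations do the same thing with Galois-field arithmetic so chunk values stay within a byte:

    def encode(a, b, c, d):
        """Compute the two protection chunks for four data chunks."""
        p = a + b + c + d
        q = a + 2 * b + 3 * c + 4 * d
        return p, q

    def recover_a_b(c, d, p, q):
        """Rebuild lost chunks A and B from the surviving chunks plus P and Q."""
        a = 2 * p - q + c + 2 * d      # derived by eliminating B from the P and Q equations
        b = p - a - c - d
        return a, b

    p, q = encode(1, 2, 3, 4)          # p = 10, q = 30
    print(recover_a_b(3, 4, p, q))     # -> (1, 2), matching the example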

The Reed-Solomon data protection approach is
used in solutions from EMC (Atmos®31 and
Isilon®32), CleverSafe33, Amplidata34, Permabit and
others. With Atmos, GeoParity35 breaks data into chunks (or fragments), adds erasure codes,
and geographically separates them on different disks and different frames. In this GeoParity
example, data is broken into nine chunks and protected by adding three more chunks, allowing
there to be three simultaneous drive failures without data loss. Protection is written as 9/12.36

Isilon’s FlexProtect is another Reed-Solomon erasure coding method that protects against
multiple drive or node failures and can employ a 14/16 scheme, the same efficiency as RAID 6
(14+2). Their OneFS® operating system breaks a file into 8KB blocks, and the block is striped
across however many nodes are in the protection scheme. FlexProtect can rebuild the failed
chunk quickly without hot spares by writing the new chunk to another drive that has free space.
It can also handle multiple failures up to the degree of protection specified by the administrator.
With very small files, OneFS uses mirrored protection.

Cleversafe spreads the chunks over many geographically separated nodes without overlap.
They add physical and node location metadata to form a dispersal pattern for the data, all
derived from Reed-Solomon. They use a 10/16 arrangement that, if spread amongst three data
centers, could reconstruct the data if one data center were unavailable.

The downside to erasure coding is that it needs linear algebra calculations which take longer to
compute than simple RAID 6 double parity. Also, the overhead is high for small writes. When
used with geographically separated storage frames, WAN latency also becomes an issue. It is a
good fit for microsecond response time object-based systems, such as social networking and
some forms of cloud computing, but not practical for millisecond response time-oriented
applications like OLTP. As multi-core processors gain power, erasure coding may become
universally popular.

What About Solid State Disks?
Up to this point, we’ve been discussing
HDDs. SSDs use memory chips, and while
they don’t have any moving parts, they can
also be subject to problems reading data. In
some companies, the introduction of SSDs
has led to the mothballing of their HDDs.
SSDs    are     delivering   10X-100X     higher
performance without the issues of motors
that lock up or heads that scratch a platter’s
surface. However, they have a different set of issues – again, no free lunch. SSDs are known
for having a limited lifetime of write activity, so SSD controllers avoid premature failure by
spreading out write activity to excess capacity in the drive using a “wear-leveling” algorithm.
From a performance standpoint, the SSD has a higher write I/O rate when the drive is newer,
not because the drive is wearing out, but because of the drive’s reserve capacity
management37.

The nature of Single-Level Cell (SLC) and Multi-Level Cell (MLC) designs results in an increasing
BER as the drive handles more write activity38. Enterprise-quality SSDs have a BER of 10^15 or
even 10^16, so their reliability is on par with if not better than fibre-channel and SAS drives, at least
when they are new. SLCs, which are twice the cost of MLCs, have around 10-20 times the
expected life of MLCs39. For example, STEC’s small card MLC SSD is rated at 10,000
program/erase operations while the SLC version is rated at 100,000. Companies such as STEC,
using wear leveling, extend a drive’s life to 2 million program/erase cycles40.

  4KB Random Reads                                        IOPS
  15K RPM HDD                                          175-210
  Intel X25-M 3.5” SATA                                 35,000
  OCZ 2x SuperScale Z-Drive R4 PCI-Express SSD       1,200,000

SSDs are also packaged as PCIe server plug-in cards. Without the host bus adapter packetizing
overhead for use with the fibre-channel transport, they leverage the PCI bus and deliver even
more impressive results than “disk form factor” SSDs. For example, a 15K HDD can deliver
175-210 IOPS with 4KB random reads,
while an Intel X25-M 3.5” SATA drive can deliver 35,000 4KB random reads, and a OCZ 2x
SuperScale Z-Drive R4 PCI-Express SSD can deliver 1,200,000 4KB random reads41.

But PCIe SSDs are often not used in pairs, and even if they were, the PCIe bus clearly does not
have the capabilities of a RAID controller. However, it doesn’t rule out a software RAID
service42. Currently, the only way to protect the device is to incorporate extra components on
the single card. Clearly, if the server fails while the card is being written to, or the card itself has
a serious malfunction, there could be data loss. It is also not immune to BER issues.

In storage arrays, because of their performance profile, SSDs can often be seen doing the bulk
of the I/O intensive work. For example, if you want your Oracle transaction databases to “fly”,
put your redo logs and indexes on SSD. However, as these devices handle more and more
write activity, their error rate increases and they fail more frequently. Every write brings them
one step closer to their end-of-life. Some businesses, such as those that process stock
transactions, require 1-2 ms SSD response time. They could be in big trouble and possibly lose
billions of dollars if these devices fail and are unable to process a trade.

When used with traditional RAID and thin provisioning, data is striped evenly across the SSDs
in the RAID group. That means as a group, their error rate collectively increases, and given they
are likely of the same vintage, will all tend to wear out at the same time. So while it was a good
practice to level out the I/O load with HDDs, it is the worst thing you could do with SSDs. You
want your SSD in a RAID group to have uneven wear so they all don’t fail at the same time.

One way being explored to deal with this issue is called Differential RAID43. Differential RAID
tracks the age of each drive in terms of write cycles and distributes write activity unevenly so
that all the drives do not hit unrecoverable errors at the same time. It balances the high BER of
older devices in the group against the low BER of newer drives. As drives are replaced using
hot spares, it redistributes the data to minimize the risk of a multiple drive failure in the group. It
may also move the parity to different locations, since the parity portion of a drive by definition
gets involved in every write in a RAID group. This chart shows the efficacy of Differential RAID
over RAID 5.
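
As a rough illustration of the idea, the Python sketch below biases new writes toward the drives
with the most remaining endurance. The age metric and weighting are simplifying assumptions
of mine, not the algorithm published in the Differential RAID paper.

    # Illustrative age-aware write distribution: drives that have consumed more
    # of their P/E budget receive a proportionally smaller share of new writes.

    def write_shares(pe_used, pe_rated):
        """Return each drive's share of future writes, weighted toward
        drives with the most remaining endurance."""
        remaining = [rated - used for used, rated in zip(pe_used, pe_rated)]
        total = sum(remaining)
        return [r / total for r in remaining]

    # Four SSDs with the same rated endurance but different ages (cycles used):
    pe_used = [9_000, 6_000, 3_000, 500]
    pe_rated = [10_000] * 4
    for drive, share in enumerate(write_shares(pe_used, pe_rated)):
        print(f"drive {drive}: {share:.0%} of new writes")
    # The youngest drive absorbs the largest share, so the group's drives
    # drift apart in wear instead of converging on the same failure point.

Skewing the write shares this way keeps the drives’ wear levels spread apart, so encountering a
second worn-out drive during a rebuild becomes far less likely.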

Charts from the same Differential RAID study also show that it can extend the useful life of
SSDs and requires less error correction overhead.

MLC devices have it worse than SLC devices. While MLC is significantly cheaper than SLC, an
MLC device’s write lifespan is often only about 10% of the write cycles of an SLC device. That is
why most storage vendors opt for SLCs in their storage frames – they cost more, but are far
more reliable.

As the industry creates denser SSDs (Moore’s Law), the BER is also increasing. This places a
bigger demand on error correction algorithms to ensure data integrity. SandForce’s SSD
controller with RAISE44 (Redundant Array of Independent Silicon Elements) technology
achieves a BER of 10^29. RAISE uses the flash chips on the circuit board in a similar manner to
drives in a RAID group. Micron also raised the protection bar with ClearNAND45. Instead of a
smarter controller, Micron put the ECC controller on the drive itself. Again, if you lose the circuit
board, you lose data, so additional protection is still required, but the device itself is more
robust.

When faced with a design choice, four 100GB SSDs utilized the same way collectively have a
lower BER impact on reliability than a single 400GB SSD, in addition to delivering 4X the
performance at less than 4X the price. Rebuild times will be lower, and four drives give you the
ability to apply some type of RAID protection.
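
A simple way to see the reliability side of that choice is to estimate the chance of hitting an
unrecoverable read error when an entire drive must be read back, as happens during a rebuild.
The sketch below uses the 10^16 BER figure mentioned earlier; the independence assumption
and the capacities are illustrative, so treat the output as an approximation rather than a vendor
reliability model.

    import math

    # Approximate probability of at least one unrecoverable read error (URE)
    # when reading a full drive, assuming one error per `ber` bits read.

    def p_ure_full_read(capacity_gb, ber=1e16):
        bits = capacity_gb * 1e9 * 8
        # P(at least one URE) = 1 - (1 - 1/ber)**bits, computed stably
        return -math.expm1(bits * math.log1p(-1 / ber))

    print(f"single 400GB SSD : {p_ure_full_read(400):.1e}")   # ~3.2e-04 per full read
    print(f"single 100GB SSD : {p_ure_full_read(100):.1e}")   # ~8.0e-05 per full read

Each of the four smaller drives exposes a quarter of the bits per rebuild and finishes in roughly a
quarter of the time, and splitting the capacity across four devices is what makes a RAID layout
possible in the first place.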

Triple parity
The idea behind triple parity is simple. Single parity RAID 5 does a fine job with small to medium
size HDDs. Double parity RAID 6 offers improved protection for larger drives. Continuing this
very efficient paradigm, a triple parity
design can make sense with 4TB
and larger drives, especially with
their even longer rebuild times.
Adam Leventhal in “Triple-Parity
RAID and Beyond”46 went as far as
calling it RAID 7. While there is no “official” RAID 7 standard, it is designed to survive three
simultaneous errors without losing data. Mr. Leventhal’s RAID 7 chart, which uses the
nomenclature RAID-7.3 for triple parity, shows it provides RAID 6 levels of protection through
2019 when factoring in future drive capacities and ever-present bit error rates.

RAID 7 will not shorten rebuild times, and it may even increase them a little. However, triple
parity offers a much lower-risk future for HDDs. Clearly, three parity drives impose a high space
overhead, but as disk prices continue to fall, perhaps reaching $40 for a 14TB47 disk, it might be
a fair trade-off. Triple parity is a good fit for very large RAID groups, such as RAID 7(16+3),
where the overhead of three parity drives can be better amortized. When used on a small RAID
7(4+3) layout, it imposes a high parity overhead for just four data disks, and that data would
perhaps be better served by RAID 1 with one more drive added.
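
The space trade-off is simple arithmetic. The short sketch below compares the usable fraction of
the wide and narrow triple parity groups mentioned above against mirroring the same four data
disks; the layouts are the ones from the text, and the calculation is just data drives divided by
total drives.

    # Usable capacity fraction for the layouts discussed above.

    def usable_fraction(data_drives, parity_or_mirror_drives):
        return data_drives / (data_drives + parity_or_mirror_drives)

    layouts = {
        "RAID 7(16+3)": usable_fraction(16, 3),             # wide triple parity group
        "RAID 7(4+3)": usable_fraction(4, 3),               # narrow triple parity group
        "RAID 1 (four mirrored pairs)": usable_fraction(4, 4),
    }
    for name, fraction in layouts.items():
        print(f"{name}: {fraction:.0%} usable")             # 84%, 57%, 50%

At 57% usable, the narrow triple parity group gives up most of parity’s space advantage over
mirroring, which is why RAID 1 is the suggested alternative for small layouts.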

RAID 1
Keeping two copies of data is the best way to ensure reliability. If one drive begins to fail or
completely fails, 100% of the data can be copied very quickly from the surviving good drive to a
spare or replacement drive. Copying data is much faster than rebuilding a drive from parity.
Plus, there is a hidden benefit that many storage systems offer – parallel disk reading. For
example, the Dynamic Mirror Service Policy of an EMC VMAX allows it to retrieve data from
whichever disk’s heads are “closest” to the requested data. This advanced feature can increase
throughput by up to 25%48. Disk writes can also complete faster since there is no parity to
calculate. The downside to RAID 1 is that you need double the capacity, along with twice the
power, cooling, footprint, and so on. However, with drive prices continuing to fall, the reliability
can’t be beat, and the only way to get better performance is with SSDs.
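
To illustrate why copying from a mirror is the simpler recovery path, here is a toy Python sketch
using small byte blocks rather than real drives: the RAID 1 case reads one surviving copy, while
the parity case must read every surviving member of the group and XOR them together.

    from functools import reduce

    # Toy contrast: RAID 1 recovery copies the mirror, while RAID 5-style
    # recovery XORs every surviving member to rebuild the lost block.

    def xor_blocks(blocks):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data_blocks = [b"\x11" * 4, b"\x22" * 4, b"\x33" * 4, b"\x44" * 4]
    parity = xor_blocks(data_blocks)                 # written at RAID 5 stripe time

    # RAID 1: the lost drive's contents are simply the surviving mirror copy.
    recovered_raid1 = data_blocks[0]                 # one read, no computation

    # RAID 5: rebuild the lost block from the other data blocks plus parity.
    recovered_raid5 = xor_blocks(data_blocks[1:] + [parity])

    assert recovered_raid1 == recovered_raid5 == data_blocks[0]

The arithmetic is trivial either way; the cost difference on a real array comes from the parity
rebuild having to stream every surviving spindle in the group while the mirror copy touches only
one.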

There are two basic ways of implementing RAID 1: physical mirroring of one drive to another,
and distribution of mirrored members throughout a system. Mirroring one drive to another is
straightforward, while distribution can be accomplished in many ways. If a disk pool is used,
chunks of data can be spread between members of that RAID 1 type. For example, IBM’s XIV
breaks a block of data into 1MB chunks and protects them with RAID 1. The XIV goal is to
“…randomly distribute each mirrored data “chunk” across data modules, making sure that no
mirrored copy of a chunk exists within the same module as the original.”49
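
The placement rule in that quote can be sketched in a few lines of Python. The six-module count
and the purely random choice below are my own simplifying assumptions, not XIV’s actual
distribution algorithm; the only rule enforced is the one in the quote, that a chunk’s mirror never
lands on the module holding the original.

    import random

    # Illustrative mirrored-chunk placement across data modules. The module
    # count and random policy are assumptions, not IBM XIV's real algorithm.

    MODULES = list(range(6))          # assume six data modules

    def place_chunk(rng=random):
        primary = rng.choice(MODULES)
        mirror = rng.choice([m for m in MODULES if m != primary])
        return primary, mirror

    placements = [place_chunk() for _ in range(8)]    # eight 1MB chunks
    for chunk_id, (primary, mirror) in enumerate(placements):
        print(f"chunk {chunk_id}: primary module {primary}, mirror module {mirror}")

    # Losing any single module still leaves one copy of every chunk it held.
    assert all(primary != mirror for primary, mirror in placements)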

Most of these intra-frame chunk dispersal approaches do well when the storage frame is
underutilized. When the frame becomes full, it can be difficult to find free capacity on other disk
spindles, and as a result the dispersal pattern tends to lump logical pieces onto the same drive.

Whether physical mirroring or dispersal is used, you are balancing RAID 1’s better protection
and performance against the frugality and lesser protection of parity RAID.

Big Data
Big data is all about processing large, unstructured datasets that often fall outside the traditional
relational database environment. Unstructured data is the catch-all for e-mail, Word, Excel, and
PowerPoint files, blog contents, sensor data, searches, social networking, charts and pictures,
voice data, video, and so forth, while structured data is typically transactional and stored in a
relational database. Unstructured data is usually ingested into your storage frame over
NAS/Ethernet while structured data often enters with block mode/fibre channel, so not only will
data types change, but so will the reliance on network topologies.

With most of the growth coming from big data, does this mean everything will be unstructured?
No, of course not. Structured data, databases, and block mode fibre channel still provide
unmatched performance and will continue to be the backbone of business transactions. Even
with drives getting larger every year, higher performance “smaller” 10K and 15K RPM drives will
continue to be manufactured and play a significant role in storing structured data. SSDs will also
help with their unparalleled transactional I/O rates, and by 2015-2020 they will become
ubiquitous in servers and storage frames.

Forrester Research offers a graphic50 that ties the big data definition to multiple criteria. They
point out that a large data volume by itself does not necessarily qualify as big data, but when
coupled with other factors, it soon becomes clear the data cannot be processed in a traditional
manner. Consider an example from a large, futuristic department store. Let’s say a shopper with
a GPS smart phone or loyalty shopper card spends twenty minutes browsing before purchasing
bath towels. The transactional data captured in a traditional database is the sale of the towels.
Big data captures their GPS location as they traverse the aisles and notes that they also spent
ten minutes in front of the frying pan display. Big data could be used to automatically summon a
clerk to demonstrate the frying pan or to send the shopper an immediate, electronic, personal
sale coupon for it.

A pioneer in big data, Google analyzes an enormous amount of Internet data every few days to
update its search engine indices. This type of problem doesn’t fit into a typical relational
database, so Google created MapReduce to break the voluminous task into parallel efforts on
hundreds or thousands of customized, inexpensive servers with as few as two disk drives, as
shown earlier. Two disks fit a RAID 1 scenario, yielding high performance and high availability.
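
As a flavor of that programming model, here is a toy, single-process word count in Python: the
map phase emits key/value pairs from each input split and the reduce phase aggregates them
per key. In a real deployment the splits would be mapped in parallel across those hundreds or
thousands of servers; this sketch is illustrative, not Google’s implementation.

    from collections import defaultdict

    # Toy MapReduce-style word count, run in a single process for illustration.

    def map_phase(split):
        """Emit (word, 1) pairs for every word in an input split."""
        return [(word.lower(), 1) for word in split.split()]

    def reduce_phase(pairs):
        """Sum the emitted counts for each distinct word."""
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return dict(totals)

    splits = ["big data needs big storage", "storage needs protection"]
    mapped = [pair for split in splits for pair in map_phase(split)]
    print(reduce_phase(mapped))
    # {'big': 2, 'data': 1, 'needs': 2, 'storage': 2, 'protection': 1}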

When big data is processed by highly scalable, massively parallel architectures such as EMC’s
Greenplum™ Data Computing Appliance (DCA), it can still leverage RAID since the underlying
data protection relies on relatively small drives and dispersal. Greenplum uses twelve 600GB
10K drives in a segment server, initially protected with RAID 5(5+1). The data is then mirrored
elsewhere, perhaps across dozens of Data Computing Appliances51.

When your big data is petabytes in size, it practically and economically mandates the use of the
largest disk drives available. As we have seen, the rebuild times of big drives in large RAID 6
groups increase the risk of data loss and would, over time, render RAID 6 impractical. If you
needed to store one petabyte of data with RAID 6(14+2), you would need 320 x 4TB drives.
With ten petabytes of usable data, you would need 3,200 drives, and data loss would be almost
guaranteed. EMC’s Isilon can hold 15 petabytes in a single file system and protect it, as we’ve
seen earlier, without the need for traditional RAID.

Conclusion
Since 1988, RAID has stood watch over our data, guarding it from loss. Mirrored RAID 1, single
parity RAID 5, and double parity RAID 6 have done a yeoman’s job providing superior
performance and great protection technology. But the battle is shifting, and the 2020 storage
world will look completely different. So can we continue to wage tomorrow’s storage protection
battles with yesterday’s technology?

IDC says an organization in 2012 has about 30% of its data classified as structured, with the
rest unstructured52. Structured data needs fast response times, while unstructured data is not
transactional in nature and can be stored on larger, slower drives since it usually has less
stringent response time requirements. IDC predicts structured data will represent an even
smaller percentage of stored data, dropping to 23% by 2014.

With IDC’s prediction that digital information will grow by a factor of 44 through 202053, 90% of
your data could be unstructured. As discussed, unstructured data can be economically stored
on very large, slower, compressed, and even deduplicated drives. Today that is a 4TB drive, but
over time the standard will be 8TB, 12TB, and even 16TB drives. And as we’ve seen, protecting
many very large drives is RAID’s Achilles heel. With today’s BER and 4TB drives protected by
RAID 6, the risk of data loss ranges from low to medium. The chances of annual data loss with
larger drives can exceed 40%.

RAID has evolved over the last 24 years from single parity RAID 5 protection to double parity
RAID 6. Along the way, additional technologies such as Reed-Solomon FEC have helped RAID
evolve. RAID may not totally disappear by 2020 if storage arrays can offer multiple protection
mechanisms. Mirrored RAID would still prove an effective way to boost I/O performance and
maintain a low risk profile. Parity RAID should still excel with structured data on small to
moderate size drives since it is fast and reliable. As storage processors become faster and
bandwidth to the drives increases, larger drives will be rebuilt faster and with less impact on the
rest of the storage frame. Protection technologies such as dispersed RAID, which requires
more mathematical computation to store and retrieve data than simple parity calculations, could
find their way into the mainstream. Perhaps by then, a storage system will automatically apply
“classic” RAID or dispersed RAID based on the nature of the data. Stay tuned!
