Scalable Internet Services

Lessons from Giant-Scale Services

Eric A. Brewer
University of California, Berkeley, and Inktomi Corporation

Giant Web services require new tools and methods for issues of scale, availability, and evolution.

Web portals and ISPs such as AOL, Microsoft Network, and Yahoo have grown more than tenfold in the past five years. Despite their scale, growth rates, and rapid evolution of content and features, these sites and other "giant-scale" services, like instant messaging and Napster, must always be available. Many other major Web sites, such as eBay, CNN, and Wal-Mart, have similar availability requirements. In this article, I look at the basic model for such services, focusing on the key real-world challenges they face — high availability, evolution, and growth — and developing some principles for attacking these problems.

This is very much an "experience" article. Few of the points I make here are addressed in the literature, and most of my conclusions take the form of principles and approaches rather than absolute quantitative evaluations. This is due partly to my focus on high-level design, partly to the newness of the area, and partly to the proprietary nature of some of the information (which represents 15-20 very large sites). Nonetheless, the lessons are easy to understand and apply, and they simplify the design of large systems.

The Basic Model

I focus on "infrastructure services" — Internet-based systems that provide instant messaging, wireless services such as iMode, and so on. These services primarily reside remotely from the user, although they might include local access software, such as a browser. My discussion is limited primarily to single-site, single-owner, well-connected clusters, which may be part of a larger service, as in the case of e-mail servers. I do not cover wide-area issues such as network partitioning, low or intermittent bandwidth, or multiple administrative domains. There are many other important challenges that I do not address here, including service monitoring and configuration, network quality of service (QoS), security, and logging and log analysis.

This article is essentially about bridging the gap between the basic building blocks of giant-scale services and the real-world scalability and availability they require. It focuses on high availability and the related issues of replication, graceful degradation, disaster tolerance, and online evolution.

Database management systems (DBMSs) are an important part of many large sites, but I intentionally ignore them here because they are well studied elsewhere and because the issues in this article are largely orthogonal to the use of databases.

Advantages

The basic model that giant-scale services follow provides some fundamental advantages:

■ Access anywhere, anytime. A ubiquitous infrastructure facilitates access from home, work, airport, and so on.
■ Availability via multiple devices. Because the infrastructure handles most of the processing, users can access services with devices such as set-top boxes, network computers, and smart phones, which can offer far more functionality for a given cost and battery life.
■ Groupware support. Centralizing data from many users allows service providers to offer group-based applications such as calendars, teleconferencing systems, and group-management systems such as Evite (http://www.evite.com/).
■ Lower overall cost. Although hard to measure, infrastructure services have a fundamental cost advantage over designs based on stand-alone devices. Infrastructure resources can be multiplexed across active users, whereas end-user devices serve at most one user (active or not). Moreover, end-user devices have very low utilization (less than 4 percent), while infrastructure resources often reach 80 percent utilization. Thus, moving anything from the device to the infrastructure effectively improves efficiency by a factor of 20. Centralizing the administrative burden and simplifying end devices also reduce overall cost, but are harder to quantify.
■ Simplified service updates. Perhaps the most powerful long-term advantage is the ability to upgrade existing services or offer new services without the physical distribution required by traditional applications and devices. Devices such as Web TVs last longer and gain usefulness over time as they benefit automatically from every new Web-based service.

Components

Figure 1 shows the basic model for giant-scale sites. The model is based on several assumptions. First, I assume the service provider has limited control over the clients and the IP network. Greater control might be possible in some cases, however, such as with intranets. The model also assumes that queries drive the service. This is true for most common protocols, including HTTP, FTP, and variations of RPC. For example, HTTP's basic primitive, the "get" command, is by definition a query. My third assumption is that read-only queries greatly outnumber updates (queries that affect the persistent data store). Even sites that we tend to think of as highly transactional, such as e-commerce or financial sites, actually have this type of "read-mostly" traffic:1 Product evaluations (reads) greatly outnumber purchases (updates), for example, and stock quotes (reads) greatly outnumber stock trades (updates). Finally, as the sidebar "Clusters in Giant-Scale Services" explains, all giant-scale sites use clusters.

Figure 1. The basic model for giant-scale services. Clients connect via the Internet and then go through a load manager that hides down nodes and balances traffic.

The basic model includes six components:

■ Clients, such as Web browsers, standalone e-mail readers, or even programs that use XML and SOAP, initiate the queries to the services.
■ The best-effort IP network, whether the public Internet or a private network such as an intranet, provides access to the service.
■ The load manager provides a level of indirection between the service's external name and the servers' physical names (IP addresses) to preserve the external name's availability in the presence of server faults. The load manager balances load among active servers. Traffic might flow through proxies or firewalls before the load manager.
■ Servers are the system's workers, combining CPU, memory, and disks into an easy-to-replicate unit.
■ The persistent data store is a replicated or partitioned "database" that is spread across the servers' disks. It might also include network-attached storage such as external DBMSs or systems that use RAID storage.
■ Many services also use a backplane. This optional system-area network handles interserver traffic such as redirecting client queries to the correct server or coherence traffic for the persistent data store.

Nearly all services have several other service-specific auxiliary systems such as user-profile databases, ad servers, and site management tools, but we can ignore these in the basic model.

Sidebar: Clusters in Giant-Scale Services

Clusters are collections of commodity servers that work together on a single problem. Giant-scale services use clusters to meet scalability requirements that exceed those of any single machine. Table A shows some representative clusters and their traffic.

Table A. Example clusters for giant-scale services.

Service                        Nodes      Queries      Node type
AOL Web cache                  >1,000     10B/day      4-CPU DEC 4100s
Inktomi search engine          >1,000     >80M/day     2-CPU Sun Workstations
Geocities                      >300       >25M/day     PC-based
Anonymous Web-based e-mail     >5,000     >1B/day      FreeBSD PCs

Other examples include some of Exodus Communications' data centers, which house several thousand nodes that support hundreds of different services, and AOL's recently added US$520 million data center, which is larger than three football fields and filled almost entirely with clusters. Although I assume that all giant-scale services use clusters, it is useful to review the driving forces behind their use.

■ Absolute scalability. A successful network service must scale to support a substantial fraction of the world's population. Users spend increasingly more time online each day, and analysts predict that about 1.1 billion people will have some form of infrastructure access within the next 10 years.
■ Cost and performance. Although a traditionally compelling reason for using clusters, hardware cost and performance are not really issues for giant-scale services. No alternative to clusters can match the required scale, and hardware cost is typically dwarfed by bandwidth and operational costs.
■ Independent components. Users expect 24-hour service from systems that consist of thousands of hardware and software components. Transient hardware failures and software faults due to rapid system evolution are inevitable, but clusters simplify the problem by providing (largely) independent faults.
■ Incremental scalability. Clusters should allow services to scale as needed to account for the uncertainty and expense of growing a service. Nodes have a three-year depreciation lifetime, for example, and should generally be replaced only when they no longer justify their rack space compared to new nodes. Given Moore's law, a unit of rack space should quadruple in computing power over three years, and actual increases appear to be faster due to improvements in packaging and disk size.

Given these advantages, we view commodity nodes as the basic building blocks for giant-scale services. The goal is then to translate these advantages, particularly independent faults and incremental scalability, into our higher-level goals of availability, graceful degradation, and ease of evolution.

Load Management

Load management has seen rapid improvement since around 1995. The original approach used "round-robin DNS," which distributes different IP addresses for a single domain name among clients in a rotating fashion. Although this approach balances load well, it does not hide inactive servers: A client with a down node's address will continue to try to use it until the DNS mapping expires, which might take several hours. (Short "time-to-live" settings have shorter outages, but require the client to make more DNS lookups.) Many browsers exacerbate the problem by mishandling DNS expiration.

Fortunately, many vendors now sell "layer-4" switches to solve load management (for example, Cisco's CSS 11800 content switch, F5 Networks' Big-IP enterprise switch, and Foundry Networks' ServerIron Web switch). Such transport-layer switches understand TCP and port numbers, and can make decisions based on this information. Many of these are "layer-7" (application layer) switches, in that they understand HTTP requests and can actually parse URLs at wire speed. The switches typically come in pairs that support hot failover — the ability for one switch to take over for another automatically — to avoid the obvious single point of failure, and can handle throughputs above 20 Gbits per second. They detect down nodes automatically, usually by monitoring open TCP connections, and thus dynamically isolate down nodes from clients quite well.
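To make the switch's job concrete, here is a minimal Python sketch of health-checked round-robin balancing: probe each node with a TCP connect and keep only live nodes in rotation. The class name, hosts, and ports are hypothetical; this is an illustration of the idea, not any vendor's implementation.

```python
# Minimal sketch of what a layer-4 switch does for availability: detect
# down nodes (here, via TCP connect probes) and hide them from clients.
import socket
from itertools import cycle

class Layer4Balancer:
    def __init__(self, servers):
        self.servers = servers            # list of (host, port) tuples
        self.live = list(servers)         # servers currently in rotation
        self.rotation = cycle(self.live)

    def health_check(self, timeout=0.5):
        """Probe every server with a TCP connect; rebuild the rotation."""
        self.live = []
        for host, port in self.servers:
            try:
                with socket.create_connection((host, port), timeout):
                    self.live.append((host, port))
            except OSError:
                pass                      # down node: leave it out of rotation
        self.rotation = cycle(self.live) if self.live else None

    def pick(self):
        """Return the next live server, round-robin over live nodes only."""
        if self.rotation is None:
            raise RuntimeError("no live servers")
        return next(self.rotation)

balancer = Layer4Balancer([("10.0.0.1", 80), ("10.0.0.2", 80)])
balancer.health_check()
```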
Two other load-management approaches are typically employed in combination with layer-4 switches. The first uses custom "front-end" nodes that act as service-specific layer-7 routers (in software).2 Wal-Mart's site uses this approach, for example, because it helps with session management: Unlike switches, the nodes track session information for each user.

The final approach includes clients in the load-management process when possible. This general "smart client" end-to-end approach goes beyond the scope of a layer-4 switch.3 It greatly simplifies switching among different physical sites, which in turn simplifies disaster tolerance and overload recovery. Although there is no generic way to do this for the Web, it is common with other systems. In DNS, for instance, clients know about an alternative server and can switch to it if the primary disappears; with cell phones, this approach is implemented as part of roaming; and application servers in the middle tier of three-tier database systems understand database failover.
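A minimal sketch of the smart-client idea, assuming hypothetical replica URLs: the client itself holds the replica list and fails over immediately, rather than waiting for DNS changes to propagate.

```python
# Smart-client failover sketch: the client knows about replica sites and
# retries the next one on failure. URLs and timeout are hypothetical.
import urllib.request
import urllib.error

REPLICA_SITES = [
    "http://east.example.com",   # primary cluster
    "http://west.example.com",   # alternative cluster for disasters/overload
]

def smart_fetch(path, timeout=2.0):
    """Try each replica site in order; return the first successful response."""
    last_error = None
    for site in REPLICA_SITES:
        try:
            with urllib.request.urlopen(site + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as err:
            last_error = err             # site down or unreachable: try the next
    raise RuntimeError(f"all replica sites failed: {last_error}")
```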
Figures 2 and 3 illustrate systems at opposite ends of the complexity spectrum: a simple Web farm and a server similar to the Inktomi search engine cluster. These systems differ in load management, use of a backplane, and persistent data store.

Figure 2. A simple Web farm. Round-robin DNS assigns different servers to different clients to achieve simple load balancing. Persistent data is fully replicated, and thus all nodes are identical and can handle all queries.

The Web farm in Figure 2 uses round-robin DNS for load management. The persistent data store is implemented by simply replicating all content to all nodes, which works well with a small amount of content. Finally, because all servers can handle all queries, there is no coherence traffic and no need for a backplane. In practice, even simple Web farms often have a second LAN (backplane) to simplify manual updates of the replicas. In this version, node failures reduce system capacity, but not data availability.

Figure 3. Search engine cluster. The service provides support to other programs (Web servers) rather than directly to end users. These programs connect via layer-4 switches that balance load and hide faults. Persistent data is partitioned across the servers, which increases aggregate capacity but implies there is some data loss when a server is down. A backplane allows all nodes to access all data.

In Figure 3, a pair of layer-4 switches manages the load within the site. The "clients" are actually other programs (typically Web servers) that use the smart-client approach to fail over among different physical clusters, primarily based on load.

Because the persistent store is partitioned across servers, possibly without replication, node failures could reduce the store's effective size and overall capacity. Furthermore, the nodes are no longer identical, and some queries might need to be directed to specific nodes. This is typically accomplished using a layer-7 switch to parse URLs, but some systems, such as clustered Web caches, might also use the backplane to route requests to the correct node.4

High Availability

High availability is a major driving requirement behind giant-scale system design.

Other infrastructures — such as the telephone, rail, and water systems — aim for perfect availability, a goal that should apply to IP-based infrastructure services as well. All these systems plan for component failures and natural disasters, but information systems must also deal with constantly evolving features and unpredictable growth.

Figure 4 shows a cluster designed for high availability: It features extreme symmetry, no people, very few cables, almost no external disks, and no monitors. Each of these design features reduces the number of failures in practice. In addition, Inktomi manages the cluster from offsite, and contracts limit temperature and power variations.

Figure 4. 100-node, 200-CPU cluster. Key design points include extreme symmetry, internal disks, no people, no monitors, and no visible cables.

Availability Metrics

The traditional metric for availability is uptime, which is the fraction of time a site is handling traffic. Uptime is typically measured in nines, and traditional infrastructure systems such as the phone system aim for four or five nines ("four nines" implies 0.9999 uptime, or less than 60 seconds of downtime per week). Two related metrics are mean time between failures (MTBF) and mean time to repair (MTTR). We can think of uptime as:

    uptime = (MTBF − MTTR)/MTBF    (1)

Following this equation, we can improve uptime either by reducing the frequency of failures or by reducing the time to fix them. Although the former is more pleasing aesthetically, the latter is much easier to accomplish with evolving systems. For example, verifying that a component has an MTBF of one week requires well more than a week of testing under heavy, realistic load, and if the component fails, you have to start over, possibly repeating the process many times. Conversely, measuring the MTTR takes minutes or less, and achieving a 10 percent improvement takes orders of magnitude less total time because of the very fast debugging cycle. In addition, new features tend to reduce MTBF but have relatively little impact on MTTR, which makes MTTR more stable. Thus, giant-scale systems should focus on improving MTTR and simply apply best effort to MTBF.

We define yield as the fraction of queries that are completed:

    yield = queries completed/queries offered    (2)

Numerically, this is typically very close to uptime, but it is more useful in practice because it directly maps to user experience and because it correctly reflects that not all seconds have equal value. Being down for a second when there are no queries has no impact on users or yield, but it reduces uptime. Similarly, being down for one second at peak and at off-peak times generates the same uptime but vastly different yields, because there might be an order-of-magnitude difference in load between the peak second and the minimum-load second. Thus we focus on yield rather than uptime.

Because these systems are typically based on queries, we can also measure query completeness — how much of the database is reflected in the answer. We define this fraction as the harvest of the query:

    harvest = data available/complete data    (3)

A perfect system would have 100 percent yield and 100 percent harvest: Every query would complete and would reflect the entire database. Similarly, we can extend this idea to the fraction of features available; systems such as eBay can have some features, such as seller profiles, unavailable while the rest of the site works perfectly.

The key insight is that we can influence whether faults impact yield, harvest, or both. Replicated systems tend to map faults to reduced capacity (and thus to reduced yield at high utilizations), while partitioned systems tend to map faults to reduced harvest: Parts of the database temporarily disappear, but the capacity in queries per second remains the same.
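As a worked example of these metrics, the short sketch below (with made-up counts; not from the original article) computes all three and shows how a brief outage can cost uptime while barely touching yield.

```python
# Worked example of equations (1)-(3) with hypothetical numbers.
mtbf_secs = 7 * 24 * 3600            # one failure per week
mttr_secs = 10 * 60                  # ten minutes to repair
uptime = (mtbf_secs - mttr_secs) / mtbf_secs            # equation (1)

queries_offered = 10_000_000
queries_completed = 9_990_000        # some queries dropped during the outage
yield_ = queries_completed / queries_offered            # equation (2)

data_available = 98                  # partitions currently reachable
complete_data = 100                  # total partitions
harvest = data_available / complete_data                # equation (3)

print(f"uptime  = {uptime:.6f}")     # ~0.999008
print(f"yield   = {yield_:.6f}")     # 0.999000: close to uptime, but it
                                     # counts only seconds that carry queries
print(f"harvest = {harvest:.2f}")    # 0.98: each query sees 98% of the data
```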


DQ Principle

The DQ Principle is simple:

    data per query × queries per second → constant    (4)

This is a principle rather than a literal truth, but it is remarkably useful for thinking about giant-scale systems. The intuition behind it is that the system's overall capacity tends to have a particular physical bottleneck, such as total I/O bandwidth or total seeks per second, that is tied to data movement. The DQ value is the total amount of data that has to be moved per second on average, and it is thus bounded by the underlying physical limitation. At the high utilization levels typical of giant-scale systems, the DQ value approaches this limitation.

Various systems have used versions of this principle for a long time. For example, the query optimizer in System R used the number of I/O operations as the metric of optimization.5 The DQ principle focuses more on bandwidth than on the number of I/Os (or seeks), mainly because I have found that bandwidth correlates better with capacity and seems easier to affect gracefully. On a more intuitive level, service work corresponds to the amount of data moved because of data copying, presentation layers, and the many independent connections that tend to make these systems more network-bound than disk-bound. Also, the workload distribution tends to be fairly stable overall because of the scale (we are aggregating over thousands of connections), which means the ratio of bandwidth to I/Os is fairly consistent. Thus, either would be a good metric for most systems.

The DQ value is measurable and tunable. Adding nodes or implementing software optimizations increases the DQ value, while the presence of faults reduces it. The absolute value is not typically that important, but the relative value under various changes is very predictive. Thus, the first step is to define the specific DQ metric for the service by defining the target workload and using a load generator to measure a given combination of hardware, software, and database size against this workload. Every node has a DQ capacity that depends on this combination. Different nodes might have different values if they were bought separately; knowing the relative DQ values tells you how unevenly to partition data or balance load.

Given the metric and load generator, it is then easy to measure the (relative) DQ impact of faults, database sizes, and hardware and software upgrades. Overall, DQ normally scales linearly with the number of nodes, which means a small test cluster is a good predictor for DQ changes on the production system. For example, Inktomi has been able to use four-node clusters to predict the performance improvements of software optimizations on 100-node clusters. The DQ impact can and should be evaluated for all proposed hardware and software changes prior to deployment. Similarly, the best case under multiple node faults is a linear reduction in overall DQ.

We can translate future traffic predictions into future DQ requirements and thus into hardware and software targets. This lets us convert traffic predictions into capacity-planning decisions, taking into account any expected hardware or software improvements.

In terms of availability, these principles are especially valuable for analyzing fault impact. As I mentioned, the best we can do is to degrade DQ linearly with the number of (node) faults. The design goal for high availability is thus to control how DQ reductions affect our three availability metrics. The obvious limit of this principle is that it applies to data-intensive sites: Computation-bound sites (such as simulation engines) or sites with long-latency external actions (such as chat sites) behave differently. The vast majority of the top 100 sites are data intensive, however, which is not surprising given that the Internet is primarily a communication medium.
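The following sketch (all numbers hypothetical) shows what this DQ bookkeeping might look like in practice: a measured per-node DQ value scales linearly to the cluster, and a fault forces a choice between lower Q (yield) and lower D (harvest).

```python
# Hypothetical DQ bookkeeping: each node can move ~400 Mbytes/sec of
# query data (as measured with a load generator), and DQ scales linearly.
node_dq = 400.0              # Mbytes/sec per node (measured, hypothetical)
nodes = 100
cluster_dq = node_dq * nodes # total DQ capacity: 40,000 Mbytes/sec

data_per_query = 0.5         # D: Mbytes of data touched per query
max_qps = cluster_dq / data_per_query   # Q: 80,000 queries/sec at saturation

# Best case under k node faults is a linear DQ reduction:
k = 10
dq_after_faults = cluster_dq * (nodes - k) / nodes      # 36,000 Mbytes/sec

# The service must now choose: keep D and accept lower Q (lower yield),
# or shrink D (lower harvest) to hold Q steady.
qps_same_harvest = dq_after_faults / data_per_query     # 72,000 queries/sec
d_same_yield = dq_after_faults / max_qps                # 0.45 Mbytes/query
```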
Replication vs. Partitioning

Replication is a traditional technique for increasing availability, and I will compare it with partitioning from the perspective of DQ and our availability metrics. Consider a two-node cluster: The replicated version has traditionally been viewed as "better" because it maintains 100 percent harvest under a fault, whereas the partitioned version drops to 50 percent harvest. Dual analysis shows that the replicated version drops to 50 percent yield, however, while the partitioned version remains at 100 percent yield. Furthermore, both versions have the same initial DQ value and lose 50 percent of it under one fault: Replicas maintain D and reduce Q (and thus yield), while partitions keep Q constant and reduce D (and thus harvest).


Table 1. Overload due to failures.

Failures    Lost capacity    Redirected load    Overload factor
1           1/n              1/(n−1)            n/(n−1)
k           k/n              k/(n−k)            n/(n−k)
The traditional view of replication silently assumes that there is enough excess capacity to prevent faults from affecting yield. We refer to this as the load redirection problem because, under faults, the remaining replicas have to handle the queries formerly handled by the failed nodes. Under high utilization, this assumption is unrealistic.

Table 1 generalizes this analysis to replica groups with n nodes. Losing two of five nodes in a replica group, for example, implies a redirected load of 2/3 extra load (two nodes' load spread over the three remaining nodes) and an overload factor for those nodes of 5/3, or 166 percent of normal load.
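The Table 1 arithmetic is easy to encode; the sketch below (mine, not the article's) reproduces the two-of-five example.

```python
from fractions import Fraction

def overload(n, k):
    """Table 1 arithmetic: lose k of n replicas in a replica group."""
    lost_capacity = Fraction(k, n)          # fraction of the group's DQ gone
    redirected = Fraction(k, n - k)         # extra load per surviving node
    factor = Fraction(n, n - k)             # surviving nodes' total load
    return lost_capacity, redirected, factor

# Losing two of five nodes: 2/3 redirected load, 5/3 (~166%) overload.
lost, extra, factor = overload(5, 2)
print(lost, extra, factor)                  # 2/5 2/3 5/3
```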
The key insight is that replication on disk is cheap; it is accessing the replicated data that requires DQ points. For true replication you need not only another copy of the data, but also twice the DQ value. Conversely, partitioning offers no real savings over replication: Although you need more copies of the data with replication, the real cost is in the DQ bottleneck rather than in storage space, and the DQ constant is independent of whether the database is replicated or partitioned. An important exception is that replication requires more DQ points than partitioning under heavy write traffic, which is rare in giant-scale systems.

According to these principles, you should always use replicas above some specified throughput. In theory, you can always partition the database more thinly and avoid the extra replicas, but with no DQ difference, it makes more sense to replicate the data once the partitions reach a convenient size. You will enjoy more control over harvest and support for disaster recovery, and it is easier to grow systems via replication than by repartitioning onto more nodes.

We can vary the replication according to the data's importance, and generally control which data is lost in the presence of a fault. For example, for the cost of some extra disk space, we can replicate key data in a partitioned system. Under normal use, one node handles the key data and the rest provide additional partitions. If the main node fails, we can make one of the other nodes serve the key data. We still lose 1/n of the data, but it will always be one of the less important partitions. This combined approach preserves the key data, but also allows us to use our "replicated" DQ capacity to serve other content during normal operation.

Finally, we can exploit randomization to make our lost harvest a random subset of the data (as well as to avoid hot spots in partitions). Many of the load-balancing switches can use a (pseudorandom) hash function to partition the data, for example. This makes the average- and worst-case losses the same because we lose a random subset of "average" value. The Inktomi search engine uses partial replication; e-mail systems use full replication; and clustered Web caches use no replication. All three use randomization.
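A minimal sketch of such pseudorandom partitioning, assuming a hypothetical eight-partition store: a stable hash spreads keys evenly, so a down node takes out a random, average-value 1/n slice of the data.

```python
import hashlib

NUM_PARTITIONS = 8          # hypothetical partition count

def partition_for(key: str) -> int:
    """Map a key to a partition with a stable pseudorandom hash."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# If partition 3's node is down, the lost harvest is a random ~1/8 of keys,
# rather than a hot or especially important slice.
docs = [f"doc-{i}" for i in range(10_000)]
lost = sum(1 for d in docs if partition_for(d) == 3)
print(f"unavailable: {lost}/{len(docs)} documents")
```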
Graceful Degradation

It would be nice to believe that we could avoid saturation at a reasonable cost simply by good design, but this is unrealistic for three major reasons:

■ The peak-to-average ratio for giant-scale systems seems to be in the range of 1.6:1 to 6:1, which can make it expensive to build out capacity well above the (normal) peak.
■ Single-event bursts, such as online ticket sales for special events, can generate far above-average traffic. In fact, moviephone.com added 10x capacity to handle tickets for "Star Wars: The Phantom Menace" and still got overloaded.
■ Some faults, such as power failures or natural disasters, are not independent. Overall DQ drops substantially in these cases, and the remaining nodes become saturated.

Given these facts, mechanisms for graceful degradation under excess load are critical for delivering high availability. The DQ principle gives us new options for graceful degradation: We can either limit Q (capacity) to maintain D, or we can reduce D and increase Q. We can focus on harvest through admission control (AC), which reduces Q; on yield through dynamic database reduction, which reduces D; or on a combination of the two. Temporarily cutting the effective database size in half, for instance, should roughly double our capacity. Some giant-scale services have just started applying the latter strategy.

The larger insight is that graceful degradation is simply the explicit process for managing the effect of saturation on availability; that is, we explicitly decide how saturation should affect uptime, harvest, and quality of service. Here are some more sophisticated examples taken from real systems:

■ Cost-based AC. Inktomi can perform AC based on the estimated query cost (measured in DQ). This reduces the average data required per query, and thus increases Q. Note that the AC policy affects both D and Q. Denying one expensive query could thus enable several inexpensive ones, giving us a net gain in harvest and yield. (Although I know of no examples, AC could also be done probabilistically — perhaps in the style of lottery scheduling, so that retrying hard queries eventually works.)
■ Priority- or value-based AC. Datek handles stock trade requests differently from other queries and guarantees that they will be executed within 60 seconds, or the user pays no commission. The idea is to reduce the required DQ by dropping low-value queries, independently of their DQ cost.
■ Reduced data freshness. Under saturation, a financial site can make stock quotes expire less frequently. This reduces freshness but also reduces the work per query, and thus increases yield at the expense of harvest. (The cached queries don't reflect the current database and thus have lower harvest.)

As you can see, the DQ principle can be used as a tool for designing how saturation affects our availability metrics. We first decide which metrics to preserve (or at least focus on), and then we use sophisticated AC to limit Q and possibly reduce the average D. We also use aggressive caching and database reduction to reduce D and thus increase capacity.
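As an illustration of cost-based AC (a hypothetical sketch — the article describes Inktomi's policy, not this code), a gate can admit a query only if its estimated DQ cost fits the remaining per-second budget, so denying one expensive query leaves room for many cheap ones.

```python
import time

DQ_BUDGET_PER_SEC = 40_000.0     # measured cluster DQ capacity (hypothetical)

class CostBasedAC:
    """Admit queries while their estimated DQ cost fits this second's budget."""
    def __init__(self, budget=DQ_BUDGET_PER_SEC):
        self.budget = budget
        self.window = int(time.time())   # current one-second window
        self.spent = 0.0                 # DQ already committed this window

    def admit(self, estimated_dq_cost: float) -> bool:
        now = int(time.time())
        if now != self.window:           # new window: reset the spend counter
            self.window, self.spent = now, 0.0
        if self.spent + estimated_dq_cost > self.budget:
            return False                 # deny: saves DQ for cheaper queries
        self.spent += estimated_dq_cost
        return True

ac = CostBasedAC()
print(ac.admit(0.5))                     # cheap query: admitted
print(ac.admit(50_000.0))                # expensive query: denied under load
```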
Disaster Tolerance

Disaster tolerance is a simple combination of managing replica groups and graceful degradation. A disaster can be defined as the complete loss of one or more replicas. Natural disasters can affect all the replicas at one physical location, while other disasters, such as fires or power failures, might affect only one replica.

Under the giant-scale services model, the basic question is how many locations to establish and how many replicas to put at each. To examine the load redirection problem, I return to Table 1. With two replicas at each of three locations, for example, we expect to lose 2/6 of the replicas during a natural disaster, which implies that each of the remaining sites must handle 50 percent more traffic. This will almost certainly saturate the site, which will employ graceful-degradation techniques to recover. For Inktomi, the best plan would be to dynamically reduce D by 2/3 (to get 3/2 Q) on the remaining replicas. The current plan actually reduces D to 50 percent for any disaster, which is simpler, but not as aggressive.

Load management presents a harder problem for disaster recovery. Layer-4 switches do not help with the loss of whole clusters, so the external site name might become unavailable during a disaster. In that case, you must resort to DNS changes or smart clients to redirect traffic to the other replicas. As I mentioned, however, DNS might have a very slow failover response time of up to several hours. With a smart-client approach, clients can perform the higher-level redirection automatically and immediately, which seems the most compelling reason to use smart clients.

Online Evolution and Growth

One of the traditional tenets of highly available systems is minimal change. This directly conflicts with both the growth rates of giant-scale services and "Internet time" — the practice of frequent product releases. For these services, we must plan for continuous growth and frequent functionality updates. Worse still, the frequency of updates means that, in practice, software is never perfect; hard-to-resolve issues such as slow memory leaks and nondeterministic bugs tend to remain unfixed. The prevailing design philosophy aims to make the overall system tolerant of individual node failures and to avoid cascading failures. Thus, "acceptable" quality comes down to software that provides a target MTBF, a low MTTR, and no cascading failures.

We can think of maintenance and upgrades as controlled failures, and the "online evolution" process — completing upgrades without taking down the site — is important because giant-scale services are updated so frequently. By viewing online evolution as a temporary, controlled reduction in DQ value, we can act to minimize the impact on our availability metrics. In general, online evolution requires a fixed amount of time per node, u, so that the total DQ loss for n nodes is:

    ΔDQ = n · u · (average DQ per node) = DQ · u    (5)

That is, the lost DQ value is simply the total DQ value multiplied by the upgrade time per node.
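Plugging hypothetical numbers into equation (5) makes the unit concrete: the loss is an area (capacity × time), which the upgrade approaches described next merely spread differently.

```python
# Equation (5) with made-up values: a 100-node cluster where each node
# takes 10 minutes (600 s) of downtime to upgrade.
nodes = 100
node_dq = 400.0                  # Mbytes/sec per node (hypothetical)
u = 600.0                        # seconds of downtime per node upgrade

total_dq = nodes * node_dq       # 40,000 Mbytes/sec
dq_loss = total_dq * u           # 24,000,000 Mbyte-seconds of lost capacity
print(f"total DQ loss: {dq_loss:.3g} Mbyte-seconds")
```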


Figure 5. Three approaches to upgrades. The shaded regions map DQ loss (down from the ideal value) over time for a four-node cluster. Note that the area of the three regions is the same.

Upgrade time is essentially recovery time, and it depends on the kind of change. Software upgrades are the most common and can be handled quickly with a staging area, which allows both the new and old versions to coexist on the node. System administrators can do a controlled reboot (at worst) to upgrade the node within the normal MTTR. Without sufficient workspace, the downtime is much greater because the administrator must move the new version onto the node while the system is down. Staging also simplifies returning to the old version, which is helpful because live traffic tends to expose new bugs. Other upgrades, such as hardware, operating system, database schema, or repartitioning, require significantly more downtime.

Three main approaches are used for online evolution:

■ Fast reboot. The simplest approach is to quickly reboot all nodes into the new version. The fast reboot is straightforward, but it guarantees some downtime. Thus, it is more useful to measure lost yield rather than downtime. By upgrading during off-peak hours, we can reduce the yield impact. In practice, this requires a staging area because the upgrades all occur at the same time and thus need to be highly automated.
■ Rolling upgrade. In this approach, we upgrade nodes one at a time in a "wave" that rolls through the cluster. This minimizes the overall impact because only one node is down at a time. In a partitioned system, harvest will be reduced during the n upgrade windows, but in a replicated system we expect 100 percent yield and 100 percent harvest because we can update one replica at a time, and we probably have enough capacity (in off hours) to prevent lost yield. One disadvantage of the rolling upgrade is that the old and new versions must be compatible because they will coexist. Examples of incompatible versions include changes to schema, namespaces, or (intracluster) protocols. Neither of the other two approaches has this restriction.
■ Big flip. The final approach is the most complicated. The basic idea is to update the cluster one half at a time by taking down and upgrading half the nodes at once. During the "flip," we atomically switch all traffic to the upgraded half using a layer-4 switch. After waiting for connections to the old half to dissipate (the switch affects only new connections), we then upgrade the second half and bring those nodes back into use. As with the fast reboot, only one version runs at a time. Note that the 50 percent DQ loss can be translated into either 50 percent capacity for replicas (which might still be 100 percent yield off-peak) or 50 percent harvest for partitions.

The big flip is quite powerful and can be used for all kinds of upgrades, including hardware, OS, schema, networking, and even physical relocation. When used for relocating a cluster, the big flip must typically use DNS changes, and the upgrade time includes physically moving the nodes, which means the window of 50 percent DQ loss might last several hours. This is manageable over a weekend, however, and Inktomi has done it twice: first, when moving about 20 nodes from Berkeley to Santa Clara, California (about an hour apart), using DNS for the flip; and second, when moving about 100 nodes between two cages in the same data center, using a switch for the flip.

All three approaches have the same DQ loss for a given upgrade, so we can plot their effective DQ level versus time, as Figure 5 shows. The three regions have the same area, but the DQ loss is spread over time differently. All three approaches are used in practice, but rolling upgrades are easily the most popular. The heavyweight big flip is reserved for more complex changes and is used only rarely. The approaches differ in their handling of the DQ loss and in how they deal with staging areas and version compatibility, but all three benefit from DQ analysis and explicit management of the impact on availability metrics.
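As a sketch of what automating the rolling-upgrade wave might look like (hypothetical commands and hostnames — not Inktomi's actual tooling): drain one node from the load manager, activate the staged version, verify, and restore the node before moving to the next.

```python
import subprocess

NODES = [f"node{i:02d}.cluster.example.com" for i in range(100)]  # hypothetical

def run(host, command):
    """Run a command on a node via ssh (placeholder for real tooling)."""
    subprocess.run(["ssh", host, command], check=True)

def rolling_upgrade(nodes, version):
    for host in nodes:                       # the "wave": one node at a time
        run(host, "lb-drain")                # 1. remove node from load manager
        run(host, f"activate-staged {version}")  # 2. flip to the staged version
        run(host, "health-check")            # 3. verify before taking traffic
        run(host, "lb-restore")              # 4. back into rotation
        # Old and new versions now coexist, so they must be compatible.

rolling_upgrade(NODES, "v2.3.1")
```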


Conclusions

From my experience with several giant-scale services, I have defined several tools for designing and analyzing such services. Starting with the basic architectural model leads to novel ways to think about high availability, including the DQ principle and the use of harvest and yield. Experience has helped us develop several ways to control the impact of faults through combinations of replication and partitioning. These tools are useful for graceful degradation, disaster tolerance, and online evolution.

I've learned several major lessons for developing giant-scale services:

■ Get the basics right. Start with a professional data center and layer-7 switches, and use symmetry to simplify analysis and management.
■ Decide on your availability metrics. Everyone should agree on the goals and how to measure them daily. Remember that harvest and yield are more useful than just uptime.
■ Focus on MTTR at least as much as MTBF. Repair time is easier to affect for an evolving system and has just as much impact.
■ Understand load redirection during faults. Data replication is insufficient for preserving uptime under faults; you also need excess DQ.
■ Graceful degradation is a critical part of a high-availability strategy. Intelligent admission control and dynamic database reduction are the key tools for implementing the strategy.
■ Use DQ analysis on all upgrades. Evaluate all proposed upgrades ahead of time, and do capacity planning.
■ Automate upgrades as much as possible. Develop a mostly automatic upgrade method, such as rolling upgrades. Using a staging area will reduce downtime, but be sure to have a fast, simple way to revert to the old version.

These techniques seem to get to the core issues in practice, and they provide ways to think about availability and fault tolerance that are predictable, analyzable, and measurable.

Current solutions are essentially ad hoc applications of these ideas. Techniques for online evolution and graceful degradation, for example, could be greatly automated. Similarly, future systems could provide more direct ways to integrate simple application information, such as a query's cost or value, into availability and load-management mechanisms. Finally, better use of smart clients could simplify disaster recovery, online evolution, and graceful degradation.

References

1. A. Wolman et al., "On the Scale and Performance of Cooperative Web Proxy Caching," Proc. 17th Symp. Operating Systems Principles, ACM Press, New York, 1999, pp. 16-31.
2. A. Fox et al., "Cluster-Based Scalable Network Services," Proc. 16th Symp. Operating Systems Principles, ACM Press, New York, 1997, pp. 78-91.
3. C. Yoshikawa et al., "Using Smart Clients to Build Scalable Services," Proc. Usenix Annual Technical Conf., Usenix Assoc., Berkeley, Calif., Jan. 1997.
4. Traffic Server Technical Specification, Inktomi Corporation, Foster City, Calif., 2001; http://www.inktomi.com/products/cns/products/tscclass.html.
5. M.M. Astrahan et al., "System R: A Relational Approach to Database Management," ACM Trans. Database Systems, ACM Press, New York, June 1976, p. 97.

Eric A. Brewer is founder and Chief Scientist of Inktomi and an associate professor of computer science at UC Berkeley. He received a PhD in computer science from MIT. His research interests include mobile and wireless computing, scalable servers, and application- and system-level security. Brewer founded the Federal Search Foundation in 2000, which helped create the first U.S. government-wide portal, FirstGov.gov, for which he won the 2001 Federal CIO Council's Azimuth Award.

Readers can contact the author at brewer@cs.berkeley.edu.
