Wide-area cooperative storage with CFS


                  Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica
                                                         MIT Laboratory for Computer Science
                                                                   chord@lcs.mit.edu
                                                             http://pdos.lcs.mit.edu/chord/

Ion Stoica is also affiliated with the University of California, Berkeley.

This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center, San Diego, under contract N66001-00-1-8933.

Abstract

The Cooperative File System (CFS) is a new peer-to-peer read-only storage system that provides provable guarantees for the efficiency, robustness, and load-balance of file storage and retrieval. CFS does this with a completely decentralized architecture that can scale to large systems. CFS servers provide a distributed hash table (DHash) for block storage. CFS clients interpret DHash blocks as a file system. DHash distributes and caches blocks at a fine granularity to achieve load balance, uses replication for robustness, and decreases latency with server selection. DHash finds blocks using the Chord location protocol, which operates in time logarithmic in the number of servers.

CFS is implemented using the SFS file system toolkit and runs on Linux, OpenBSD, and FreeBSD. Experience on a globally deployed prototype shows that CFS delivers data to clients as fast as FTP. Controlled tests show that CFS is scalable: with 4,096 servers, looking up a block of data involves contacting only seven servers. The tests also demonstrate nearly perfect robustness and unimpaired performance even when as many as half the servers fail.

1. Introduction

Existing peer-to-peer systems (such as Napster [20], Gnutella [11], and Freenet [6]) demonstrate the benefits of cooperative storage and serving: fault tolerance, load balance, and the ability to harness idle storage and network resources. Accompanying these benefits are a number of design challenges. A peer-to-peer architecture should be symmetric and decentralized, and should operate well with unmanaged volunteer participants. Finding desired data in a large system must be fast; servers must be able to join and leave the system frequently without affecting its robustness or efficiency; and load must be balanced across the available servers. While the peer-to-peer systems in common use solve some of these problems, none solves all of them. CFS (the Cooperative File System) is a new design that meets all of these challenges.

A CFS file system exists as a set of blocks distributed over the available CFS servers. CFS client software interprets the stored blocks as file system data and meta-data and presents an ordinary read-only file-system interface to applications. The core of the CFS software consists of two layers, DHash and Chord. The DHash (distributed hash) layer performs block fetches for the client, distributes the blocks among the servers, and maintains cached and replicated copies. DHash uses the Chord [31] distributed lookup system to locate the servers responsible for a block. This table summarizes the CFS software layering:

    Layer    Responsibility
    FS       Interprets blocks as files; presents a file system interface to applications.
    DHash    Stores unstructured data blocks reliably.
    Chord    Maintains routing tables used to find blocks.

Chord implements a hash-like operation that maps from block identifiers to servers. Chord assigns each server an identifier drawn from the same 160-bit identifier space as block identifiers. These identifiers can be thought of as points on a circle. The mapping that Chord implements takes a block's ID and yields the block's successor, the server whose ID most closely follows the block's ID on the identifier circle. To implement this mapping, Chord maintains at each server a table with information about O(log N) other servers, where N is the total number of servers. A Chord lookup sends messages to O(log N) servers to consult their tables. Thus CFS can find data efficiently even with a large number of servers, and servers can join and leave the system with few table updates.

DHash layers block management on top of Chord. DHash provides load balance for popular large files by arranging to spread the blocks of each file over many servers. To balance the load imposed by popular small files, DHash caches each block at servers likely to be consulted by future Chord lookups for that block. DHash supports pre-fetching to decrease download latency. DHash replicates each block at a small number of servers, to provide fault tolerance. DHash enforces weak quotas on the amount of data each server can inject, to deter abuse. Finally, DHash allows control over the number of virtual servers per server, to provide control over how much data a server must store on behalf of others.

CFS has been implemented using the SFS toolkit [16]. This paper reports experimental results from a small international deployment of CFS servers and from a large-scale controlled test-bed. These results confirm the contributions of the CFS design:

- an aggressive approach to load balance by spreading file blocks randomly over servers;
- download performance on an Internet-wide prototype deployment as fast as standard FTP;
- provable efficiency and provably fast recovery times after failure;
- simple algorithms to achieve the above results.

CFS is not yet in operational use, and such use will likely prompt refinements to its design. One potential area for improvement is the ability of the Chord lookup algorithm to tolerate malicious participants, by verifying the routing information received from other servers. Another area that CFS does not currently address is anonymity; it is expected that anonymity, if needed, would be layered on top of the basic CFS system.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 outlines the design and design goals of CFS. Section 4 describes the Chord location protocol. Section 5 presents the detailed design of CFS. Section 6 describes implementation details, and Section 7 presents experimental results. Section 8 discusses open issues and future work. Finally, Section 9 summarizes our findings and concludes the paper.

2. Related Work

CFS was inspired by Napster [20], Gnutella [11], and particularly Freenet [6]. CFS uses peer-to-peer distributed hashing similar in spirit to a number of ongoing research projects [26, 29, 35]. In comparison to existing peer-to-peer file sharing systems, CFS offers simplicity of implementation and high performance without compromising correctness. CFS balances server load, finds data quickly for clients, and guarantees data availability in the face of server failures with very high probability. CFS, as a complete system, has individual aspects in common with many existing systems. The major relationships are summarized below.

2.1 Naming and Authentication

CFS authenticates data by naming it with public keys or content-hashes, as do many other distributed storage systems [9, 6, 7, 13, 29, 34]. The use of content-hashes to securely link together different pieces of data is due to Merkle [18]; the use of public keys to authentically name data is due to the SFS system [17].

CFS adopts naming, authentication, and file system structure ideas from SFSRO [9], which implements a secure distributed read-only file system—that is, a file system in which files can be modified only by their owner, and only through complete replacement of the file. However, SFSRO and CFS have significant differences at the architectural and mechanism levels. SFSRO defines protocols and authentication mechanisms which a client can use to retrieve data from a given server. CFS adds the ability to dynamically find the server currently holding the desired data, via the Chord location service. This increases the robustness and the availability of CFS, since changes in the set of servers are transparent to clients.

2.2 Peer-to-Peer Search

Napster [20] and Gnutella [11] are arguably the most widely used peer-to-peer file systems today. They present a keyword search interface to clients, rather than retrieving uniquely identified data. As a result they are more like search engines than distributed hash tables, and they trade scalability for this power: Gnutella broadcasts search queries to many machines, and Napster performs searches at a central facility. CFS as described in this paper doesn't provide search, but we are developing a scalable distributed search engine for CFS.

Mojo Nation [19] is a broadcast query peer-to-peer storage system which divides files into blocks and uses a secret sharing algorithm to distribute the blocks to a number of hosts. CFS also divides files into blocks but does not use secret sharing.

2.3 Anonymous Storage

Freenet [5] uses probabilistic routing to preserve the anonymity of clients, publishers, and servers. The anonymity requirement limits Freenet's reliability and performance. Freenet avoids associating a document with any predictable server, and avoids forming any globally coherent topology among servers. The former means that unpopular documents may simply disappear from the system, since no server has the responsibility for maintaining replicas. The latter means that a search may need to visit a large fraction of the Freenet network. As an example, Hong shows in his Figure 14-12 that in a network with 1,000 servers, the lookup path length can exceed 90 hops [23]. This means that if the hop count is limited to 90, a lookup may fail even though the document is available. Because CFS does not try to provide anonymity, it can guarantee much tighter bounds on lookup cost; for example, in a 4,096-node system, lookups essentially never exceed 10 hops.

CFS's caching scheme is similar to Freenet's in the sense that both leave cached copies of data along the query path from client to where the data was found. Because CFS finds data in significantly fewer hops than Freenet, and CFS' structured lookup paths are more likely to overlap than Freenet's, CFS can make better use of a given quantity of cache space.

Like Freenet, Publius [34] focuses on anonymity, but achieves it with encryption and secret sharing rather than routing. Publius requires a static, globally-known list of servers; it stores each share at a fixed location that is predictable from the file name. Free Haven [7] uses both cryptography and routing (using re-mailers [4]) to provide anonymity; like Gnutella, Free Haven finds data with a global search.

CFS does not attempt to provide anonymity, focusing instead on efficiency and robustness. We believe that intertwining anonymity with the basic data lookup mechanism interferes with correctness and performance. On the other hand, given a robust location and storage layer, anonymous client access to CFS could be provided by separate anonymizing proxies, using techniques similar to those proposed by Chaum [4] or Reiter and Rubin [27].

2.4 Peer-to-Peer Hash Based Systems

CFS layers storage on top of an efficient distributed hash lookup algorithm. A number of recent peer-to-peer systems use similar approaches with similar scalability and performance, including CAN [26], PAST [28, 29], OceanStore [13, 35], and Ohaha [22]. A detailed comparison of these algorithms can be found in [31].

The PAST [29] storage system differs from CFS in its approach to load balance. Because a PAST server stores whole files, a server might not have enough disk space to store a large file even though the system as a whole has sufficient free space. A PAST server solves this by offloading files it is responsible for to servers that do have spare disk space. PAST handles the load of serving popular files by caching them along the lookup path.

CFS stores blocks, rather than whole files, and spreads blocks evenly over the available servers; this prevents large files from causing unbalanced use of storage. CFS solves the related problem of different servers having different amounts of storage space with a mechanism called virtual servers, which gives server managers control over disk space consumption. CFS' block storage granularity helps it handle the load of serving popular large files, since the serving load is spread over many servers along with the blocks.
This is more space-efficient, for large files, than whole-file caching. CFS relies on caching only for files small enough that distributing blocks is not effective. Evaluating the performance impact of block storage granularity is one of the purposes of this paper.

OceanStore [13] aims to build a global persistent storage utility. It provides data privacy, allows client updates, and guarantees durable storage. However, these features come at a price: complexity. For example, OceanStore uses a Byzantine agreement protocol for conflict resolution, and a complex protocol based on Plaxton trees [24] to implement the location service [35]. OceanStore assumes that the core system will be maintained by commercial providers.

Ohaha [22] uses consistent hashing to map files and keyword queries to servers, and a Freenet-like routing algorithm to locate files. As a result, it shares some of the same weaknesses as Freenet.

2.5 Web Caches

Content distribution networks (CDNs), such as Akamai [1], handle high demand for data by distributing replicas on multiple servers. CDNs are typically managed by a central entity, while CFS is built from resources shared and owned by a cooperative group of users.

There are several proposed scalable cooperative Web caches [3, 8, 10, 15]. To locate data, these systems either multicast queries or require that some or all servers know about all other servers. As a result, none of the proposed methods is both highly scalable and robust. In addition, load balance is hard to achieve as the content of each cache depends heavily on the query pattern.

Cache Resolver [30], like CFS, uses consistent hashing to evenly map stored data among the servers [12, 14]. However, Cache Resolver assumes that clients know the entire set of servers; maintaining an up-to-date server list is likely to be difficult in a large peer-to-peer system where servers join and depart at unpredictable times.

3. Design Overview

CFS provides distributed read-only file storage. It is structured as a collection of servers that provide block-level storage. Publishers (producers of data) and clients (consumers of data) layer file system semantics on top of this block store much as an ordinary file system is layered on top of a disk. Many unrelated publishers may store separate file systems on a single CFS system; the CFS design is intended to support the possibility of a single world-wide system consisting of millions of servers.

3.1 System Structure

Figure 1 illustrates the structure of the CFS software. Each CFS client contains three software layers: a file system client, a DHash storage layer, and a Chord lookup layer. The client file system uses the DHash layer to retrieve blocks. The client DHash layer uses the client Chord layer to locate the servers that hold desired blocks.

[Figure 1: CFS software structure. Vertical links are local APIs; horizontal links are RPC APIs.]

Each CFS server has two software layers: a DHash storage layer and a Chord layer. The server DHash layer is responsible for storing keyed blocks, maintaining proper levels of replication as servers come and go, and caching popular blocks. The server DHash and Chord layers interact in order to integrate looking up a block identifier with checking for cached copies of the block. CFS servers are oblivious to file system semantics: they simply provide a distributed block store.

CFS clients interpret DHash blocks in a file system format adopted from SFSRO [9]; the format is similar to that of the UNIX V7 file system, but uses DHash blocks and block identifiers in place of disk blocks and disk addresses. As shown in Figure 2, each block is either a piece of a file or a piece of file system meta-data, such as a directory. The maximum size of any block is on the order of tens of kilobytes. A parent block contains the identifiers of its children.

[Figure 2: A simple CFS file system structure example. The root-block is identified by a public key and signed by the corresponding private key. The other blocks are identified by cryptographic hashes of their contents.]

The publisher inserts the file system's blocks into the CFS system, using a hash of each block's content (a content-hash) as its identifier. Then the publisher signs the root block with his or her private key, and inserts the root block into CFS using the corresponding public key as the root block's identifier. Clients name a file system using the public key; they can check the integrity of the root block using that key, and the integrity of blocks lower in the tree with the content-hash identifiers that refer to those blocks. This approach guarantees that clients see an authentic and internally consistent view of each file system, though under some circumstances a client may see an old version of a recently updated file system.

A CFS file system is read-only as far as clients are concerned. However, a file system may be updated by its publisher. This involves updating the file system's root block in place, to make it point to the new data. CFS authenticates updates to root blocks by checking that the new block is signed by the same key as the old block. A timestamp prevents replays of old updates. CFS allows file systems to be updated without changing the root block's identifier so that external references to data need not be changed when the data is updated.

CFS stores data for an agreed-upon finite interval. Publishers that want indefinite storage periods can periodically ask CFS for an extension; otherwise, a CFS server may discard data whose guaranteed period has expired. CFS has no explicit delete operation: instead, a publisher can simply stop asking for extensions. In this area, as in its replication and caching policies, CFS relies on the assumption that large amounts of spare disk space are available.
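To make this structure concrete, the sketch below builds a tiny file-system tree in a plain dictionary standing in for DHash, and shows a client verifying each fetched block: non-root blocks against the content-hash that names them, and the root via the public key that names it (the signature check is elided). The dictionary and helper names are illustrative assumptions, not CFS code.

    import hashlib

    def content_hash(data: bytes) -> bytes:
        # CFS names an immutable block by the SHA-1 hash of its contents.
        return hashlib.sha1(data).digest()

    # --- publisher side: build a two-level tree of blocks ------------------
    store = {}                              # stand-in for the distributed block store

    def insert(block: bytes) -> bytes:
        key = content_hash(block)
        store[key] = block
        return key

    b1 = insert(b"first piece of the file")
    b2 = insert(b"second piece of the file")
    inode = insert(b1 + b2)                 # parent block holds its children's identifiers

    public_key = b"publisher public key"    # placeholder for a real key pair
    root_block = b"ROOT:" + inode           # a real root block is signed with the private key
    store[content_hash(public_key)] = root_block

    # --- client side: fetch and check integrity top-down -------------------
    root = store[content_hash(public_key)]  # named by the public key; signature check omitted
    inode_key = root[len(b"ROOT:"):]
    inode_block = store[inode_key]
    assert content_hash(inode_block) == inode_key              # meta-data block is authentic
    for child_key in (inode_block[:20], inode_block[20:40]):   # SHA-1 digests are 20 bytes
        assert content_hash(store[child_key]) == child_key     # file data is authentic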
3.2 CFS Properties

CFS provides consistency and integrity of file systems by adopting the SFSRO file system format. CFS extends SFSRO by providing the following desirable properties:

- Decentralized control. CFS servers need not share any administrative relationship with publishers. CFS servers could be ordinary Internet hosts whose owners volunteer spare storage and network resources.

- Scalability. CFS lookup operations use space and messages at most logarithmic in the number of servers.

- Availability. A client can always retrieve data as long as it is not trapped in a small partition of the underlying network, and as long as one of the data's replicas is reachable using the underlying network. This is true even if servers are constantly joining and leaving the CFS system. CFS places replicas on servers likely to be at unrelated network locations to ensure independent failure.

- Load balance. CFS ensures that the burden of storing and serving data is divided among the servers in rough proportion to their capacity. It maintains load balance even if some data are far more popular than others, through a combination of caching and spreading each file's data over many servers.

- Persistence. Once CFS commits to storing data, it keeps it available for at least an agreed-on interval.

- Quotas. CFS limits the amount of data that any particular IP address can insert into the system. This provides a degree of protection against malicious attempts to exhaust the system's storage.

- Efficiency. Clients can fetch CFS data with delay comparable to that of FTP, due to CFS' use of efficient lookup algorithms, caching, pre-fetching, and server selection.

The next two sections present Chord and DHash, which together provide these properties.

4. Chord Layer

CFS uses the Chord protocol to locate blocks [31]. Chord supports just one operation: given a key, it will determine the node responsible for that key. Chord does not itself store keys and values, but provides primitives that allow higher-layer software to build a wide variety of storage systems; CFS is one such use of the Chord primitive. The rest of this section summarizes Chord and describes new algorithms for robustness and server selection to support applications such as CFS.

4.1 Consistent Hashing

Each Chord node has a unique m-bit node identifier (ID), obtained by hashing the node's IP address and a virtual node index. Chord views the IDs as occupying a circular identifier space. Keys are also mapped into this ID space, by hashing them to m-bit key IDs. Chord defines the node responsible for a key to be the successor of that key's ID. The successor of an ID is the node with the smallest ID that is greater than or equal to it (with wrap-around), much as in consistent hashing [12].

Consistent hashing lets nodes enter and leave the network with minimal movement of keys. To maintain correct successor mappings when a node n joins the network, certain keys previously assigned to n's successor become assigned to n. When node n leaves the network, all of n's assigned keys are reassigned to its successor. No other changes in the assignment of keys to nodes need occur.

Consistent hashing is straightforward to implement, with constant-time lookups, if all nodes have an up-to-date list of all other nodes. However, such a system does not scale; Chord provides a scalable, distributed version of consistent hashing.
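The sketch below spells out this straightforward, non-scalable variant: every node knows the full sorted list of node IDs, and a key's successor is found with a wrapping binary search. The ID derivation anticipates Section 4.4; everything else is an illustrative rendering, not Chord code.

    import hashlib
    from bisect import bisect_left

    M = 160                                  # Chord identifier space: a 160-bit circle

    def node_id(ip, virtual_index):
        # ID = SHA-1(IP address || virtual node index), as described in Section 4.4.
        return int.from_bytes(hashlib.sha1(f"{ip}/{virtual_index}".encode()).digest(), "big")

    def key_id(key):
        return int.from_bytes(hashlib.sha1(key).digest(), "big")

    def successor(ring, ident):
        # Smallest node ID greater than or equal to ident, wrapping around the circle.
        i = bisect_left(ring, ident)
        return ring[i % len(ring)]

    ring = sorted(node_id(f"10.0.0.{i}", 0) for i in range(1, 9))
    block = key_id(b"some 8 KByte block")
    print("block is stored at node", hex(successor(ring, block))[:12], "...")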
4.2 The Chord Lookup Algorithm

A Chord node uses two data structures to perform lookups: a successor list and a finger table. Only the successor list is required for correctness, so Chord is careful to maintain its accuracy. The finger table accelerates lookups, but does not need to be accurate, so Chord is less aggressive about maintaining it. The following discussion first describes how to perform correct (but slow) lookups with the successor list, and then describes how to accelerate them with the finger table. This discussion assumes that there are no malicious participants in the Chord protocol; while we believe that it should be possible for nodes to verify the routing information that other Chord participants send them, the algorithms to do so are left for future work.

Every Chord node maintains a list of the identities and IP addresses of its r immediate successors on the Chord ring. The fact that every node knows its own successor means that a node can always process a lookup correctly: if the desired key is between the node and its successor, the latter node is the key's successor; otherwise the lookup can be forwarded to the successor, which moves the lookup strictly closer to its destination.

A new node n learns of its successors when it first joins the Chord ring, by asking an existing node to perform a lookup for n's successor; n then asks that successor for its successor list. The r entries in the list provide fault tolerance: if a node's immediate successor does not respond, the node can substitute the second entry in its successor list. All r successors would have to simultaneously fail in order to disrupt the Chord ring, an event that can be made very improbable with modest values of r. An implementation should use a fixed r, chosen to be about 2 log N for the foreseeable maximum number of nodes N.

The main complexity involved with successor lists is in notifying an existing node when a new node should be its successor. The stabilization procedure described in [31] does this in a way that guarantees to preserve the connectivity of the Chord ring's successor pointers.

Lookups performed only with successor lists would require an average of N/2 message exchanges, where N is the number of servers. To reduce the number of messages required to O(log N), each node maintains a finger table with O(log N) entries. The i-th entry in the table at node n contains the identity of the first node that succeeds n by at least 2^(i-1) on the ID circle. Thus every node knows the identities of nodes at power-of-two intervals on the ID circle from its own position. A new node initializes its finger table by querying an existing node. Existing nodes whose finger table or successor list entries should refer to the new node find out about it by periodic lookups.

Figure 3 shows pseudo-code to look up the successor of identifier id. The main loop is in find_predecessor, which sends preceding_node_list RPCs to a succession of other nodes; each RPC searches the tables of the other node for nodes yet closer to id. Because finger table entries point to nodes at power-of-two intervals around the ID ring, each iteration will set n' to a node halfway on the ID ring between the current n' and id. Since preceding_node_list never returns an ID greater than id, this process will never overshoot the correct successor. It may under-shoot, especially if a new node has recently joined with an ID just before id; in that case the check for id in (n', n'.successor()] ensures that find_predecessor persists until it finds a pair of nodes that straddle id.

// Ask node n to find id's successor; first
// finds id's predecessor, then asks that
// predecessor for its own successor.
n.find_successor(id)
    n' = find_predecessor(id);
    return n'.successor();

// Ask node n to find id's predecessor.
n.find_predecessor(id)
    n' = n;
    while (id not in (n', n'.successor()])
        l = n'.preceding_node_list(id);
        n' = max n'' in l such that n'' is alive;
    return n';

// Ask node n for a list of nodes in its finger table or
// successor list that precede id.
n.preceding_node_list(id)
    return {n' in (fingers ∪ successors)
            such that n' in (n, id]}

Figure 3: The pseudo-code to find the successor node of an identifier id. Remote procedure calls are preceded by the remote node.
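The pseudo-code of Figure 3 can be restated as runnable code over an in-memory ring: RPCs become method calls, routing tables are built from a global view of the ring, and node failures are ignored. These simplifications, and the chosen constants, are assumptions made for illustration rather than a description of the CFS implementation.

    import hashlib
    from bisect import bisect_left

    M = 160                                   # Chord uses a 160-bit identifier circle
    MOD = 1 << M
    R = 3                                     # successor-list length for this toy ring

    def between(x, a, b):
        # True if x lies in the half-open ring interval (a, b], wrapping around zero.
        if a < b:
            return a < x <= b
        return x > a or x <= b

    class Node:
        def __init__(self, ident):
            self.id = ident
            self.successors = []              # r immediate successors on the ring
            self.fingers = []                 # first node succeeding id + 2^i, for each i

        def successor(self):
            return self.successors[0]

        def preceding_node_list(self, key):
            # Nodes from the finger table and successor list that precede key (Figure 3).
            return [n for n in self.fingers + self.successors
                    if between(n.id, self.id, key)]

    def find_predecessor(start, key):
        n = start
        while not between(key, n.id, n.successor().id):
            closer = n.preceding_node_list(key)
            n = max(closer, key=lambda c: (c.id - n.id) % MOD)   # candidate closest to key
        return n

    def find_successor(start, key):
        return find_predecessor(start, key).successor()

    def build_ring(ids):
        # Global construction of successor lists and finger tables; a stand-in for
        # Chord's join and stabilization protocols.
        nodes = sorted((Node(i) for i in ids), key=lambda node: node.id)
        ring_ids = [node.id for node in nodes]
        succ_of = lambda ident: nodes[bisect_left(ring_ids, ident % MOD) % len(nodes)]
        for idx, node in enumerate(nodes):
            node.successors = [nodes[(idx + j) % len(nodes)] for j in range(1, R + 1)]
            node.fingers = [succ_of(node.id + (1 << i)) for i in range(M)]
        return nodes

    nodes = build_ring(int.from_bytes(hashlib.sha1(b"10.0.0.%d/0" % i).digest(), "big")
                       for i in range(64))
    key = int.from_bytes(hashlib.sha1(b"an 8 KByte block").digest(), "big")
    assert find_successor(nodes[0], key) is find_successor(nodes[17], key)

Starting the lookup from different nodes reaches the same successor, which is the property CFS relies on to locate a block's server from any client.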
                                                                                                                                                                      
Two aspects of the lookup algorithm make it robust. First, an RPC to preceding_node_list on node n returns a list of nodes that n believes are between it and the desired id. Any one of them can be used to make progress towards the successor of id; they must all be unresponsive for a lookup to fail. Second, the while loop ensures that find_predecessor will keep trying as long as it can find any next node closer to id. As long as nodes are careful to maintain correct successor pointers, find_predecessor will eventually succeed.

In the usual case in which most nodes have correct finger table information, each iteration of the while loop eliminates half the remaining distance to the target. This means that the hops early in a lookup travel long distances in the ID space, and later hops travel small distances. The efficacy of the caching mechanism described in Section 5.2 depends on this observation.

The following two theorems, proved in an accompanying technical report [32], show that neither the success nor the performance of Chord lookups is likely to be affected even by massive simultaneous failures. Both theorems assume that the successor list has length r = O(log N). A Chord ring is stable if every node's successor list is correct.

THEOREM 1. In a network that is initially stable, if every node then fails with probability 1/2, then with high probability find_successor returns the closest living successor to the query key.

THEOREM 2. In a network that is initially stable, if every node then fails with probability 1/2, then the expected time to execute find_successor is O(log N).

The evaluation in Section 7 validates these theorems experimentally.

4.3 Server Selection

Chord reduces lookup latency by preferentially contacting nodes likely to be nearby in the underlying network. This server selection was added to Chord as part of work on CFS, and was not part of Chord as originally published.

At each step in find_predecessor(id) (Figure 3), the node n doing the lookup can choose the next hop from a set of nodes. Initially this set is the contents of n's own routing tables; subsequently the set is the list of nodes returned by the preceding_node_list RPC to the previous hop n'. Node n' tells n the measured latency to each node in the set; n' collected these latencies when it acquired its finger table entries. Different choices of next-hop node will take the query different distances around the ID ring, but impose different RPC latencies; the following calculation seeks to pick the best combination.

Chord estimates the total cost C(n_i) of using each node n_i in the set of potential next hops:

    C(n_i) = d_i + d_avg × H(n_i)
    H(n_i) = ones((id − n_i) >> (160 − log N))

H(n_i) is an estimate of the number of Chord hops that would remain after contacting n_i. N is node n's estimate of the total number of Chord nodes in the system, based on the density of nodes nearby on the ID ring. log N is an estimate of the number of significant high bits in an ID; n_i and its successor are likely to agree in these bits, but not in less significant bits. (id − n_i) >> (160 − log N) yields just the significant bits in the ID-space distance between n_i and the target key id, and the ones function counts how many bits are set in that difference; this is approximately the number of Chord hops between n_i and id. Multiplying by d_avg, the average latency of all the RPCs that node n has ever issued, estimates the time it would take to send RPCs for those hops. Adding d_i, the latency to node n_i as reported by node n', produces a complete cost estimate. Chord uses the node with minimum C(n_i) as the next hop.

One benefit of this server selection method is that no extra measurements are necessary to decide which node is closest; the decisions are made based on latencies observed while building finger tables. However, nodes rely on latency measurements taken by other nodes. This works well when low latencies from node a to node b, and from b to c, mean that the latency is also low from a to c. Measurements of our Internet test-bed suggest that this is often true [33].
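For concreteness, the cost estimate above can be rendered directly in code. The sketch below assumes latencies in milliseconds and takes log N as an input; the function names and example values are invented for illustration and are not taken from the CFS implementation.

    def remaining_hops(target_id, candidate_id, log_n, id_bits=160):
        # ones((id - n_i) >> (160 - log N)): count the significant bits still differing,
        # an estimate of the Chord hops left after contacting the candidate.
        distance = (target_id - candidate_id) % (1 << id_bits)
        return bin(distance >> (id_bits - log_n)).count("1")

    def estimated_cost(latency_ms, target_id, candidate_id, avg_rpc_ms, log_n):
        # C(n_i) = d_i + d_avg * H(n_i)
        return latency_ms + avg_rpc_ms * remaining_hops(target_id, candidate_id, log_n)

    # Pick, from the candidate next hops, the one with the smallest estimated cost.
    candidates = [(30.0, 0x9a00 << 144), (120.0, 0x9f00 << 144)]   # (latency, node ID)
    target = 0xa000 << 144
    best = min(candidates, key=lambda c: estimated_cost(c[0], target, c[1], 80.0, 12))

In this example the nearer node wins even though it is further from the target in ID space, because its lower RPC latency outweighs the one extra expected hop.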
4.4 Node ID Authentication

If Chord nodes could use arbitrary IDs, an attacker could destroy chosen data by choosing a node ID just after the data's ID. With control of the successor, the attacker's node could effectively delete the block by denying that the block existed.

To limit the opportunity for this attack, a Chord node ID must be of the form h(x), where h is the SHA-1 hash function and x is the node's IP address concatenated with a virtual node index. The virtual node index must fall between 0 and some small maximum. As a result, a node cannot easily control the choice of its own Chord ID.

When a new node n joins the system, some existing nodes may decide to add n to their finger tables. As part of this process, each such existing node sends a message to n's claimed IP address containing a nonce. If the node at that IP address admits to having n's ID, and the claimed IP address and virtual node index hash to the ID, then the existing node accepts n.

With this defense in place, an attacker would have to control roughly as many IP addresses as there are total other nodes in the Chord system in order to have a good chance of targeting arbitrary blocks. However, owners of large blocks of IP address space tend to be more easily identifiable (and less likely to be malicious) than individuals.
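The ID check and nonce exchange described above might look as follows in simplified form; the MAX_VIRTUAL constant, the ask_node callback, and the string encoding of the IP/index pair are assumptions for illustration.

    import hashlib, os

    MAX_VIRTUAL = 16                      # "some small maximum" for the index (assumed value)

    def chord_id(ip, virtual_index):
        # A node ID must equal SHA-1(IP address || virtual node index).
        return hashlib.sha1(f"{ip}/{virtual_index}".encode()).digest()

    def accept_new_node(claimed_ip, claimed_index, claimed_id, ask_node):
        # ask_node(ip, nonce) stands in for the RPC that sends a nonce to the claimed
        # IP address and returns the (id, nonce) the node there admits to.
        if not (0 <= claimed_index <= MAX_VIRTUAL):
            return False
        if chord_id(claimed_ip, claimed_index) != claimed_id:
            return False                  # ID is not the hash of the claimed address and index
        nonce = os.urandom(16)
        admitted_id, echoed = ask_node(claimed_ip, nonce)
        return echoed == nonce and admitted_id == claimed_id

    # Example with a well-behaved node at the claimed address:
    ip, idx = "18.26.4.9", 3
    assert accept_new_node(ip, idx, chord_id(ip, idx),
                           lambda addr, nonce: (chord_id(addr, idx), nonce))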
5.    DHash Layer
   The CFS DHash layer stores and retrieves uniquely identified
blocks, and handles distribution, replication, and caching of those
blocks. DHash uses Chord to help it locate blocks.
   DHash reflects a key CFS design decision: to split each file sys-
tem (and file) into blocks and distribute those blocks over many
servers. This arrangement balances the load of serving popular
files over many servers. It also increases the number of messages
required to fetch a whole file, since a client must look up each
block separately. However, the network bandwidth consumed by
a lookup is small compared to the bandwidth required to deliver
the block. In addition, CFS hides the block lookup latency by pre-
fetching blocks.
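As an illustration of how pre-fetching can hide per-block lookup latency, the sketch below keeps a small window of block fetches in flight ahead of the block currently being consumed. The thread pool, window size, and fetch_block callable are assumptions for illustration, not the CFS client.

    from concurrent.futures import ThreadPoolExecutor

    def read_file(block_ids, fetch_block, window=8):
        # Issue up to `window` block lookups ahead of the block currently being read,
        # so lookup latency overlaps with delivery of earlier blocks.
        with ThreadPoolExecutor(max_workers=window) as pool:
            pending = [pool.submit(fetch_block, b) for b in block_ids[:window]]
            for i, _ in enumerate(block_ids):
                if i + window < len(block_ids):
                    pending.append(pool.submit(fetch_block, block_ids[i + window]))
                yield pending[i].result()

    # Usage: fetch_block would perform the Chord lookup and block download.
    blocks = list(read_file([b"k1", b"k2", b"k3"],
                            fetch_block=lambda k: b"data for " + k))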
Systems such as Freenet [6] and PAST [29] store whole files. This results in lower lookup costs than CFS, one lookup per file rather than per block, but requires more work to achieve load balance. Servers unlucky enough to be responsible for storing very large files may run out of disk space even though the system as a whole has sufficient free space. Balancing the load of serving whole files typically involves adaptive caching. Again, this may be awkward for large files; a popular file must be stored in its entirety at each caching server. DHash also uses caching, but only depends on it for small files.

DHash's block granularity is particularly well suited to serving large, popular files, such as software distributions. For example, in a 1,000-server system, a file as small as 8 megabytes will produce a reasonably balanced serving load with 8 KByte blocks. A system that balances load by caching whole files would require, in this case, about 1,000 times as much total storage to achieve the same load balance. On the other hand, DHash is not as efficient as a whole-file scheme for large but unpopular files, though the experiments in Section 7.1 show that it can provide competitive download speeds. DHash's block granularity is not likely to affect (for better or worse) performance or load balance for small files. For such files, DHash depends on caching and on server selection among block replicas (described in Section 5.1).

Table 1 shows the API that the DHash layer exposes. The CFS file system client layer uses get to implement application requests to open files, read files, navigate directories, etc. Publishers of data use a special application that inserts or updates a CFS file system using the put_h and put_s calls.

    Function              Description
    put_h(block)          Computes the block's key by hashing its contents, and sends
                          it to the key's successor server for storage.
    put_s(block, pubkey)  Stores or updates a signed block; used for root blocks. The
                          block must be signed with the given public key. The block's
                          Chord key will be the hash of pubkey.
    get(key)              Fetches and returns the block associated with the specified
                          Chord key.

    Table 1: DHash client API; exposed to client file system software.
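The following sketch exercises the Table 1 interface against an in-memory stand-in; the class is illustrative only, and the signing and signature verification of root blocks are elided.

    import hashlib

    class DHashClientStub:
        # In-memory stand-in for the DHash client API of Table 1 (illustrative only).
        def __init__(self):
            self.store = {}                      # a real client sends blocks to the key's successor

        def put_h(self, block):
            key = hashlib.sha1(block).digest()   # key = content hash of the block
            self.store[key] = block
            return key

        def put_s(self, block, pubkey):
            # A real implementation checks the signature on `block` against `pubkey`.
            key = hashlib.sha1(pubkey).digest()  # Chord key = hash of the public key
            self.store[key] = block
            return key

        def get(self, key):
            return self.store[key]

    dhash = DHashClientStub()
    piece_keys = [dhash.put_h(p) for p in (b"block 0", b"block 1")]
    inode_key = dhash.put_h(b"".join(piece_keys))
    pubkey = b"publisher public key"              # placeholder; real root blocks are signed
    root_key = dhash.put_s(b"root -> " + inode_key, pubkey)

    inode = dhash.get(dhash.get(root_key)[len(b"root -> "):])
    assert [dhash.get(inode[i:i + 20]) for i in (0, 20)] == [b"block 0", b"block 1"]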
5.1 Replication

DHash replicates each block on k CFS servers to increase availability, maintains the replicas automatically as servers come and go, and places the replicas in a way that clients can easily find them.

DHash places a block's replicas at the k servers immediately after the block's successor on the Chord ring (see Figure 4). DHash can easily find the identities of these servers from Chord's r-entry successor list. CFS must be configured so that r ≥ k. This placement of replicas means that, after a block's successor server fails, the block is immediately available at the block's new successor.
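A minimal sketch of this placement rule, assuming the successor list is already known and using invented helper names:

    def replica_servers(successor_list, k):
        # A block stored at this server is replicated on its first k successors;
        # Chord's r-entry successor list must satisfy r >= k.
        assert len(successor_list) >= k
        return successor_list[:k]

    def ensure_replicas(block, successor_list, k, send_copy):
        # Called by the block's successor server: push a copy to each replica server.
        for server in replica_servers(successor_list, k):
            send_copy(server, block)

    # Example: replicate onto the first 3 entries of a 6-entry successor list.
    sent = []
    ensure_replicas(b"block", ["s1", "s2", "s3", "s4", "s5", "s6"], 3,
                    lambda srv, blk: sent.append(srv))
    assert sent == ["s1", "s2", "s3"]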
   The DHash software in a block’s successor server manages repli-         send a copy to just the second-to-last server in the path traversed by
cation of that block, by making sure that all of its successor             the lookup; this will reduce the amount of network traffic required
servers have a copy of the block at all times. If the successor            per lookup, without significantly decreasing the caching effective-
server fails, the block’s new successor assumes responsibility for         ness.
the block.                                                                    Since a Chord lookup takes shorter and shorter hops in ID space
   The value of this replication scheme depends in part on the in-         as it gets closer to the target, lookups from different clients for the
dependence of failure and unreachability among a block’s replica           same block will tend to visit the same servers late in the lookup.
servers. Servers close to each other on the ID ring are not likely to      As a result, the policy of caching blocks along the lookup path is
be physically close to each other, since a server’s ID is based on a       likely to be effective.
hash of its IP address. This provides the desired independence of             DHash replaces cached blocks in least-recently-used order.
failure.                                                                   Copies of a block at servers with IDs far from the block’s succes-
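A minimal sketch of how a block's replica set could be read off the successor list, assuming the Chord layer exposes the block successor's r-entry successor list (the names here are illustrative, not taken from the CFS sources):

    // Illustrative sketch (not the CFS source): a block's replicas live at the
    // k servers that immediately follow the block's successor on the ring, so
    // they can be taken directly from the successor's r-entry successor list.
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using ServerId = std::uint64_t;   // stands in for a 160-bit Chord ID

    std::vector<ServerId> replica_servers(const std::vector<ServerId>& successor_list,
                                          std::size_t k) {
        // CFS requires r >= k, so the first k entries always exist.
        assert(successor_list.size() >= k);
        return std::vector<ServerId>(successor_list.begin(),
                                     successor_list.begin() + k);
    }

Because the replicas occupy the next k positions on the ring, a new successor that takes over after a failure already holds a copy of the block.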
5.2 Caching

   DHash caches blocks to avoid overloading servers that hold popular data. Each DHash layer sets aside a fixed amount of disk storage for its cache. When a CFS client looks up a block key, it performs a Chord lookup, visiting intermediate CFS servers with IDs successively closer to that of the key's successor (see Figure 4). At each step, the client asks the intermediate server whether it has the desired block cached. Eventually the client arrives at either the key's successor or at an intermediate server with a cached copy. The client then sends a copy of the block to each of the servers it contacted along the lookup path. Future versions of DHash will send a copy to just the second-to-last server in the path traversed by the lookup; this will reduce the amount of network traffic required per lookup, without significantly decreasing the caching effectiveness.

   Since a Chord lookup takes shorter and shorter hops in ID space as it gets closer to the target, lookups from different clients for the same block will tend to visit the same servers late in the lookup. As a result, the policy of caching blocks along the lookup path is likely to be effective.

   DHash replaces cached blocks in least-recently-used order. Copies of a block at servers with IDs far from the block's successor are likely to be discarded first, since clients are least likely to stumble upon them. This has the effect of preserving the cached copies close to the successor, and expands and contracts the degree of caching for each block according to its popularity.
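The cache behaviour just described amounts to a fixed-size, least-recently-used block cache at each server. A small illustrative sketch (not the DHash code) follows; the capacity is expressed in bytes, matching the fixed amount of storage each DHash layer sets aside, though for brevity the sketch keeps the cache in memory rather than on disk.

    // Illustrative LRU block cache (not the DHash code).
    #include <cstddef>
    #include <list>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <utility>

    class BlockCache {
        std::size_t capacity_bytes_;
        std::size_t used_bytes_ = 0;
        // Front of the list = most recently used block.
        std::list<std::pair<std::string, std::string>> lru_;
        std::unordered_map<std::string,
                           std::list<std::pair<std::string, std::string>>::iterator> index_;
    public:
        explicit BlockCache(std::size_t capacity_bytes) : capacity_bytes_(capacity_bytes) {}

        std::optional<std::string> get(const std::string& key) {
            auto it = index_.find(key);
            if (it == index_.end()) return std::nullopt;
            lru_.splice(lru_.begin(), lru_, it->second);   // mark as recently used
            return it->second->second;
        }

        void put(const std::string& key, const std::string& block) {
            if (index_.count(key)) return;                 // already cached
            lru_.emplace_front(key, block);
            index_[key] = lru_.begin();
            used_bytes_ += block.size();
            // Evict least-recently-used blocks until the cache fits again.
            while (used_bytes_ > capacity_bytes_ && !lru_.empty()) {
                used_bytes_ -= lru_.back().second.size();
                index_.erase(lru_.back().first);
                lru_.pop_back();
            }
        }
    };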
   While caching and replication are conceptually similar, DHash provides them as distinct mechanisms. DHash stores replicas in predictable places, so it can ensure that enough replicas always exist. In contrast, the number of cached copies cannot easily be counted, and might fall to zero. If fault-tolerance were achieved solely through cached copies, an unpopular block might simply disappear along with its last cached copy.

   CFS avoids most cache consistency problems because blocks are keyed by content hashes. Root blocks, however, use public keys as identifiers; a publisher can change a root block by inserting a new one signed with the corresponding private key. This means that cached root blocks may become stale, causing some clients to see an old, but internally consistent, file system. A client can check the freshness of a cached root block [9] to decide whether to look for a newer version. Non-root blocks that no longer have any references to them will eventually be eliminated from caches by LRU replacement.
                                                                                   This limit is not easy to subvert by simple forging of IP ad-
5.3 Load Balance

   DHash spreads blocks evenly around the ID space, since the content hash function uniformly distributes block IDs. If each CFS server had one ID, the fact that IDs are uniformly distributed would mean that every server would carry roughly the same storage burden. This is not desirable, since different servers may have different storage and network capacities. In addition, even uniform distribution doesn't produce perfect load balance; the maximum storage burden is likely to be about log N times the average, due to irregular spacing between server IDs [12].

   To accommodate heterogeneous server capacities, CFS uses the notion (from [12]) of a real server acting as multiple virtual servers. The CFS protocol operates at the virtual server level. A virtual server uses a Chord ID that is derived from hashing both the real server's IP address and the index of the virtual server within the real server.

   A CFS server administrator configures the server with a number of virtual servers in rough proportion to the server's storage and network capacity. This number can be adjusted from time to time to reflect observed load levels.

   Use of virtual servers could potentially increase the number of hops in a Chord lookup. CFS avoids this expense by allowing virtual servers on the same physical server to examine each others' tables; the fact that these virtual servers can take short-cuts through each others' routing tables exactly compensates for the increased number of servers.

   CFS could potentially vary the number of virtual servers per real server adaptively, based on current load. Under high load, a real server could delete some of its virtual servers; under low load, a server could create additional virtual servers. Any such algorithm would need to be designed for stability under high load. If a server is overloaded because the CFS system as a whole is overloaded, then automatically deleting virtual servers might cause a cascade of such deletions.
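A sketch of the virtual-server ID derivation just described; sha1_160() is again an assumed SHA-1 helper, and the exact byte encoding that CFS hashes is not specified here.

    // Sketch: deriving virtual-server IDs from the real server's IP address
    // and the virtual server's index (illustrative only).
    #include <array>
    #include <cstdint>
    #include <string>
    #include <vector>

    using ChordId = std::array<unsigned char, 20>;
    ChordId sha1_160(const std::string& bytes);   // assumed SHA-1 helper

    ChordId virtual_server_id(const std::string& ip_address, std::uint32_t index) {
        return sha1_160(ip_address + "/" + std::to_string(index));
    }

    // A real server configured with n virtual servers simply instantiates
    // IDs for indices 0 .. n-1.
    std::vector<ChordId> virtual_server_ids(const std::string& ip, std::uint32_t n) {
        std::vector<ChordId> ids;
        for (std::uint32_t i = 0; i < n; ++i)
            ids.push_back(virtual_server_id(ip, i));
        return ids;
    }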
5.4 Quotas

   The most damaging technical form of abuse that CFS is likely to encounter is malicious injection of large quantities of data. The aim of such an attack might be to use up all the disk space on the CFS servers, leaving none available for legitimate data. Even a non-malicious user could cause the same kind of problem by accident.

   Ideally, CFS would impose per-publisher quotas based on reliable identification of publishers, as is done in the PAST system [29]. Reliable identification usually requires some form of centralized administration, such as a certificate authority. As a decentralized approximation, CFS bases quotas on the IP address of the publisher. For example, if each CFS server limits any one IP address to using 0.1% of its storage, then an attacker would have to mount an attack from about 1,000 machines for it to be successful. This mechanism also limits the storage used by each legitimate publisher to just 0.1%, assuming each publisher uses just one IP address.

   This limit is not easy to subvert by simple forging of IP addresses, since CFS servers require that publishers respond to a confirmation request that includes a random nonce, as described in Section 4.4. This approach is weaker than one that requires publishers to have unforgeable identities, but requires no centralized administrative mechanisms.

   If each CFS server imposes a fixed per-IP-address quota, then the total amount of storage an IP address can consume will grow linearly with the total number of CFS servers. It may prove desirable to enforce a fixed quota on total storage, which would require the quota imposed by each server to decrease in proportion to the total number of servers. An adaptive limit of this form is possible, using the estimate of the total number of servers that the Chord software maintains.
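Both forms of quota check can be sketched as follows; the structure and names are illustrative rather than taken from CFS.

    // Illustrative per-IP storage quotas: a fixed fraction of local storage,
    // and an adaptive variant driven by the estimated server count.
    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct QuotaTable {
        std::uint64_t local_capacity;                          // bytes this server offers
        double per_ip_fraction = 0.001;                        // e.g., 0.1% of local storage
        std::unordered_map<std::string, std::uint64_t> used;   // bytes stored per publisher IP

        // Fixed fraction of this server's storage per publisher IP.
        bool admit(const std::string& ip, std::uint64_t block_size) {
            auto limit = static_cast<std::uint64_t>(local_capacity * per_ip_fraction);
            if (used[ip] + block_size > limit) return false;
            used[ip] += block_size;
            return true;
        }

        // Adaptive variant: cap the publisher's total storage across the system
        // by shrinking the per-server limit as the estimated server count grows.
        bool admit_adaptive(const std::string& ip, std::uint64_t block_size,
                            std::uint64_t system_quota_bytes, double estimated_servers) {
            auto limit = static_cast<std::uint64_t>(system_quota_bytes / estimated_servers);
            if (used[ip] + block_size > limit) return false;
            used[ip] += block_size;
            return true;
        }
    };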
5.5 Updates and Deletion

   CFS allows updates, but in a way that allows only the publisher of a file system to modify it. A CFS server will accept a request to store a block under either of two conditions. If the block is marked as a content-hash block, the server will accept the block if the supplied key is equal to the SHA-1 hash of the block's content. If the block is marked as a signed block, the block must be signed by a public key whose SHA-1 hash is the block's CFS key.

   The low probability of finding two blocks with the same SHA-1 hash prevents an attacker from changing the block associated with a content-hash key, so no explicit protection is required for most of a file system's blocks. The only sensitive block is a file system's root block, which is signed; its safety depends on the publisher avoiding disclosure of the private key.

   CFS does not support an explicit delete operation. Publishers must periodically refresh their blocks if they wish CFS to continue to store them. A CFS server may delete blocks that have not been refreshed recently.

   One benefit of CFS' implicit deletion is that it automatically recovers from malicious insertions of large quantities of data. Once the attacker stops inserting or refreshing the data, CFS will gradually delete it.
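The two acceptance rules can be summarized in a short sketch; sha1_160() and verify_signature() are assumed helpers standing in for real cryptographic routines, not CFS functions.

    // Illustrative acceptance check for a store request (not the CFS source).
    #include <array>
    #include <string>

    using Key = std::array<unsigned char, 20>;
    Key sha1_160(const std::string& bytes);                    // assumed SHA-1 helper
    bool verify_signature(const std::string& data,             // assumed signature check
                          const std::string& signature,
                          const std::string& pubkey);

    enum class BlockType { ContentHash, Signed };

    bool accept_store(BlockType type, const Key& supplied_key,
                      const std::string& contents,
                      const std::string& signature, const std::string& pubkey) {
        if (type == BlockType::ContentHash)
            // Content-hash block: the key must equal the SHA-1 hash of the contents.
            return sha1_160(contents) == supplied_key;
        // Signed block: the key must be the hash of the public key, and the
        // contents must carry a valid signature under that key.
        return sha1_160(pubkey) == supplied_key &&
               verify_signature(contents, signature, pubkey);
    }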
6. Implementation

   CFS is implemented in 7,000 lines of C++, including the 3,000-line Chord implementation. It consists of a number of separate programs that run at user level. The programs communicate over UDP with a C++ RPC package provided by the SFS toolkit [16]. A busy CFS server may exchange short messages with a large number of other servers, making the overhead of TCP connection setup unattractive compared to UDP. The internal structure of each program is based on asynchronous events and callbacks, rather than threads. Each software layer is implemented as a library with a C++ interface. CFS runs on Linux, OpenBSD, and FreeBSD.

6.1 Chord Implementation

   The Chord library maintains the routing tables described in Section 4. It exports these tables to the DHash layer, which implements its own integrated version of the Chord lookup algorithm. The implementation uses the SHA-1 cryptographic hash function to produce CFS block identifiers from block contents. This means that block and server identifiers are 160 bits wide.

   The Chord implementation maintains a running estimate of the total number of Chord servers, for use in the server selection algorithm described in Section 4.3. Each server computes the fraction of the ID ring that the r nodes in its successor list cover; let that fraction be f. Then the estimated total number of servers in the system is r/f.
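A minimal sketch of that estimate, assuming the fraction of the ring covered by the successor list has already been computed (with 160-bit IDs the real code would need care with the arithmetic; this is illustrative only):

    // Estimate the total number of servers from the successor list: if the
    // r successors span a fraction f of the ID ring, the ring holds roughly
    // r / f servers in total.
    double estimate_total_servers(unsigned r, double fraction_covered_f) {
        return static_cast<double>(r) / fraction_covered_f;
    }

    // Example: r = 16 successors covering 0.4% of the ring suggests about
    // 16 / 0.004 = 4000 servers.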

6.2 DHash Implementation

   DHash is implemented as a library which depends on Chord. Each DHash instance is associated with a Chord virtual server and communicates with that virtual server through a function-call interface. DHash instances on different servers communicate with one another via RPC.

   DHash has its own implementation of the Chord lookup algorithm, but relies on the Chord layer to maintain the routing tables. Integrating block lookup into DHash increases its efficiency. If DHash instead called the Chord find_successor routine, it would be awkward for DHash to check each server along the lookup path for cached copies of the desired block. It would also cost an unneeded round-trip time, since both Chord and DHash would end up separately contacting the block's successor server.
   Pseudo-code for the DHash implementation of lookup(key) is shown in Figure 5; this is DHash's version of the Chord code shown in Figure 3. The function lookup(key) returns the data associated with key, or an error if it cannot be found. lookup operates by repeatedly invoking the remote procedure n'.lookup_step(key), which returns one of three possible values. If the called server (n') stores or caches the data associated with key, then n'.lookup_step(key) returns that data. If n' does not store the data, then n'.lookup_step(key) returns instead the closest predecessor of key (determined by consulting local routing tables on n'). Finally, n'.lookup_step returns an error if n' is the true successor of the key but does not store its associated data.

   If lookup tries to contact a failed server, the RPC machinery will return RPC_FAILURE. The function lookup then backtracks to the previously contacted server, tells it about the failed server with alert, and asks it for the next-best predecessor. At some point lookup(key) will have contacted a pair of servers on either side of the key. If there have been server failures, the second server of the pair may not be the key's original successor. However, that second server will be the first live successor, and will hold a replica for key, assuming that not all of its replicas have failed.
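The retry-and-backtrack loop described in the last two paragraphs can be sketched as follows. This is an illustrative C++ reconstruction of the behaviour the text describes, not the pseudo-code of Figure 5; lookup_step(), alert(), and the integer server handles are assumed stand-ins for the real RPC interface.

    // Illustrative sketch of the client-side lookup loop (not the CFS source).
    #include <optional>
    #include <string>
    #include <vector>

    enum class Status { COMPLETE, CONTINUE, NONEXISTENT, RPC_FAILURE };
    struct StepResult {
        Status status;
        std::string data;      // valid when status == COMPLETE
        int next = -1;         // next server to contact when status == CONTINUE
    };

    // Assumed RPC stubs; in CFS these would go over the SFS RPC package.
    StepResult lookup_step(int server, const std::string& key);
    void alert(int server, int failed_server);

    std::optional<std::string> lookup(int self, const std::string& key) {
        std::vector<int> path{self};                 // stack of servers contacted so far
        while (true) {
            StepResult r = lookup_step(path.back(), key);
            switch (r.status) {
            case Status::COMPLETE:
                return r.data;                       // server stored or cached the block
            case Status::CONTINUE:
                if (r.next == path.back())
                    return std::nullopt;             // dead end: server knew only itself
                path.push_back(r.next);              // explore the next hop
                break;
            case Status::RPC_FAILURE: {
                if (path.size() < 2) return std::nullopt;
                int failed = path.back();            // back up, report the failure,
                path.pop_back();                     // then retry from the previous hop
                alert(path.back(), failed);
                break;
            }
            case Status::NONEXISTENT:
                return std::nullopt;                 // true successor lacks the block
            }
        }
    }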
[Figure 5: The procedure used by DHash to locate a block. The figure gives pseudo-code for the RPC handler n'.lookup_step(key), which returns either the block with ID key or the best next server to talk to; the RPC handler n'.alert(id), which asks the Chord software to delete server id from the finger list and successor list; and the client-side lookup(key) routine, which runs on the server that invokes it, accumulates the path of contacted servers on a stack, and returns the block associated with key or an error.]

   Though the pseudo-code does not show it, the virtual servers on any given physical server look at each others' routing tables and block stores. This allows lookups to progress faster around the ring, and increases the chances of encountering a cached block.

6.3 Client Implementation

   The CFS client software layers a file system on top of DHash. CFS exports an ordinary UNIX file system interface by acting as a local NFS server, using the SFS user-level file system toolkit [16]. A CFS client runs on the same machine as a CFS server; the client communicates with the local server via a UNIX domain socket and uses it as a proxy to send queries to non-local CFS servers.

   The DHash back end is sufficiently flexible to support a number of different client interfaces. For instance, we are currently implementing a client which acts as a web proxy in order to layer its name space on top of the name space of the world wide web.

7. Experimental Results

   In order to demonstrate the practicality of the CFS design, we present two sets of tests. The first explores CFS performance on a modest number of servers distributed over the Internet, and focuses on real-world client-perceived performance. The second involves larger numbers of servers running on a single machine and focuses on scalability and robustness.

   Quotas (Section 5.4) were not implemented in the tested software. Cryptographic verification of updates (Section 5.5) and server ID authentication (Section 4.4) were implemented but not enabled. This has no effect on the results presented here.

   Unless noted, all tests were run with caching turned off, with no replication, with just one virtual server per physical server, and with server selection turned off. These defaults allow the effects of these features to be individually illustrated. The experiments involve only block-level DHash operations, with no file-system meta-data; the client software driving the experiments fetches a file by fetching a specified list of block identifiers. Every server maintains a successor list with r entries, as mentioned in Section 4, to help maintain ring connectivity. While CFS does not automatically adjust the successor list length to match the number of servers, its robustness is not sensitive to the exact value.

7.1 Real Life

   The tests described in this section used CFS servers running on a testbed of 12 machines scattered over the Internet. The machines are part of the RON testbed [2]; 9 of them are at sites spread over the United States, and the other three are in the Netherlands, Sweden, and South Korea. The servers held a one-megabyte CFS file split into 8 KByte blocks. To test download speed, client software on each machine fetched the entire file. The machines fetched the file one at a time.

   Three RPCs, on average, were required to fetch each block in these experiments. The client software uses pre-fetch to overlap the lookup and fetching of blocks. The client initially issues a window of some number of parallel block fetches; as each fetch completes, the client starts a new one.
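A sketch of the pre-fetch window follows, assuming a blocking fetch_block() helper; the real client is event-driven (Section 6) rather than thread-based, so this only illustrates the windowing logic.

    // Illustrative pre-fetch window: keep up to `window` block fetches
    // outstanding, starting a new one whenever one completes.
    #include <cstddef>
    #include <future>
    #include <string>
    #include <utility>
    #include <vector>

    std::string fetch_block(const std::string& key);   // assumed DHash fetch

    std::vector<std::string> fetch_file(const std::vector<std::string>& block_keys,
                                        std::size_t window) {
        if (window == 0) window = 1;                    // avoid an empty window
        std::vector<std::string> blocks(block_keys.size());
        std::vector<std::pair<std::size_t, std::future<std::string>>> in_flight;
        std::size_t next = 0;
        while (next < block_keys.size() || !in_flight.empty()) {
            // Fill the window with parallel block fetches.
            while (next < block_keys.size() && in_flight.size() < window) {
                in_flight.emplace_back(next, std::async(std::launch::async,
                                                        fetch_block, block_keys[next]));
                ++next;
            }
            // For simplicity, wait for the oldest outstanding fetch, store its
            // block, and go around again to start another fetch.
            blocks[in_flight.front().first] = in_flight.front().second.get();
            in_flight.erase(in_flight.begin());
        }
        return blocks;
    }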
   Figure 6 shows the average download speeds, for a range of pre-fetch window sizes, with and without server selection. A block fetch without server selection averages about 430 milliseconds; this explains why the download speed is about 20 KBytes/second when fetching one 8 KByte block at a time. Increasing the amount of pre-fetch increases the speed; for example, fetching three blocks at a time yields an average speed of about 50 KBytes/second. Large amounts of pre-fetch are counter-productive, since they can congest the client server's network connection.

[Figure 6: Download speeds achieved while fetching a one-megabyte file with CFS on the Internet testbed, for a range of pre-fetch window sizes (x-axis: pre-fetch window in KBytes; y-axis: speed in KBytes/second). Each point is the average of the times seen by each testbed machine. One curve includes server selection; the other does not.]

   Server selection increases download speeds substantially for small amounts of pre-fetch, almost doubling the download speed when no pre-fetch is used. The improvement is less dramatic for larger amounts of pre-fetch, partially because concentrating block fetches on just the nearest servers may overload those servers' links. The data shown here were obtained using a flawed version of the server selection algorithm; the correct algorithm would likely yield better download speeds.

[Figure 7: Cumulative distribution of the download speeds plotted in Figure 6, for 8, 24, and 40 KByte pre-fetch windows (x-axis: speed in KBytes/second; y-axis: cumulative fraction of downloads). Results with and without server selection are marked "w/ s.s." and "no s.s." respectively.]

   Figure 7 shows the distribution of speeds seen by the downloads from the different machines, for different pre-fetch windows, with and without server selection. The distributions of speeds without server selection are fairly narrow: every download is likely to require a few blocks from every server, so all downloads see a similar mix of per-block times. The best download speeds were from a machine at New York University with good connections to multiple
backbones. The worst download speeds for small pre-fetch windows were from sites outside the United States, which have high latency to most of the servers. The worst speeds for large amounts of pre-fetch were for fetches from cable modem sites in the United States, which have limited link capacity.

   The speed distributions with server selection are more skewed than without it. Most of the time, server selection improves download speeds by a modest amount. Sometimes it improves them substantially, usually for downloads initiated by well-connected sites. Sometimes server selection makes download speeds worse, usually for downloads initiated by sites outside the United States.

   To show that the CFS download speeds are competitive with other file access protocols, files of various sizes were transferred between every pair of the testbed machines using ordinary TCP. The files were transferred one at a time, one whole file per TCP connection. Figure 8 shows the cumulative distribution of transfer speeds over the various machine pairs, for 8 KByte, 64 KByte, and 1.1 MByte files. The wide distributions reflect the wide range of propagation delays and link capacities between different pairs of machines. The best speeds are between well-connected sites on the east coast of the United States. The worst speeds for 8 KByte transfers occur when both end-points are outside the United States; the worst for one-megabyte transfers occur when one endpoint is outside the United States and the other is a cable modem, combining high latency with limited link speed.

[Figure 8: Distribution of download speeds achieved by ordinary TCP between each pair of hosts on the Internet testbed, for three file sizes (x-axis: speed in KBytes/sec; y-axis: cumulative fraction of transfers).]

   CFS with a 40 KByte pre-fetch window achieves speeds competitive with TCP, on average. The CFS speeds generally have a distribution much narrower than those of TCP. This means that users are more likely to see repeatably good performance when fetching files with CFS than when fetching files from, for example, FTP servers.

7.2 Controlled Experiments

   The remainder of the results were obtained from a set of CFS servers running on a single machine and using the local loopback network interface to communicate with each other. These servers act just as if they were on different machines. This arrangement is appropriate for controlled evaluation of CFS' scalability and tolerance to failure.
[Figure 9: The average number of RPCs that a client must issue to find a block, as a function of the total number of servers (x-axis: number of CFS servers, log scale; y-axis: average per-query RPCs). The error bars reflect one standard deviation. This experimental data is linear on a log plot, and thus fits the expectation of logarithmic growth.]

7.2.1 Lookup Cost

   Looking up a block of data is expected to require about ½ log₂ N RPCs. The following experiment verifies this expectation. For each of a range of numbers of servers, 10,000 blocks were inserted into the system. Then 10,000 lookups were done, from a single server, for randomly selected blocks. The number of RPCs required for each lookup was recorded. The averages are plotted in Figure 9, along with error bars showing one standard deviation.

   The results are linear on a log plot, and thus fit the expectation of logarithmic growth. The actual values are about ½ log₂ N; for example, with 4,096 servers, lookups averaged about seven RPCs. The number of RPCs required is determined by the number of bits in which the originating server's ID and the desired block's ID differ [31]; this will average about half of the bits, which accounts for the factor of one half.
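As a quick check of that estimate against the figure (a worked instance of the ½ log₂ N expression above, not an additional measurement):

    \tfrac{1}{2}\log_2 4096 \;=\; \tfrac{1}{2}\times 12 \;=\; 6\ \text{RPCs},

which is within about one RPC of the measured average at 4,096 servers.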
7.2.2 Load Balance

   One of the main goals of CFS is to balance the load over the servers. CFS achieves load-balanced storage by breaking file systems up into many blocks and distributing the blocks over the servers. It further balances storage by placing multiple virtual servers per physical server, each virtual server with its own ID. We expect that about log N virtual servers per physical server will be sufficient to balance the load reasonably well [31].

   Figure 10 shows typical distributions of ID space among 64 physical servers for 1, 6, and 24 virtual servers per physical server. The crosses represent an actual distribution of 10,000 blocks over 64 physical servers each with 6 virtual servers. The desired result is that each server's fraction be 0.016. With only one virtual server per server (i.e., without using virtual servers), some servers would store no blocks, and others would store many times the average.