Facebook's Tectonic Filesystem: Efficiency from Exascale

Page created by Guy Gomez

Society

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Facebook's Tectonic Filesystem: Efficiency from Exascale

Facebook’s Tectonic Filesystem:
           Efficiency from Exascale
    Satadru Pan, Facebook, Inc.; Theano Stavrinos, Facebook, Inc. and
    Princeton University; Yunqiao Zhang, Atul Sikaria, Pavel Zakharov,
    Abhinav Sharma, Shiva Shankar P, Mike Shuey, Richard Wareing,
  Monika Gangapuram, Guanglei Cao, Christian Preseau, Pratap Singh,
  Kestutis Patiejunas, and JR Tipton, Facebook, Inc.; Ethan Katz-Bassett,
          Columbia University; Wyatt Lloyd, Princeton University
           https://www.usenix.org/conference/fast21/presentation/pan

     This paper is included in the Proceedings of the
19th USENIX Conference on File and Storage Technologies.
                           February 23–25, 2021
                              978-1-939133-20-5

                                         Open access to the Proceedings
                                        of the 19th USENIX Conference on
                                          File and Storage Technologies
                                             is sponsored by USENIX.

Facebook’s Tectonic Filesystem: Efficiency from Exascale
Satadru Pan1 , Theano Stavrinos1,2 , Yunqiao Zhang1 , Atul Sikaria1 , Pavel Zakharov1 , Abhinav Sharma1 ,
Shiva Shankar P1 , Mike Shuey1 , Richard Wareing1 , Monika Gangapuram1 , Guanglei Cao1 , Christian
Preseau1 , Pratap Singh1 , Kestutis Patiejunas1 , JR Tipton1 , Ethan Katz-Bassett3 , and Wyatt Lloyd2
1 Facebook, Inc., 2 Princeton University, 3 Columbia University

Abstract In building Tectonic, we confronted three high-level chal-
Tectonic is Facebook’s exabyte-scale distributed filesystem. lenges: scaling to exabyte-scale, providing performance isola-
Tectonic consolidates large tenants that previously used tion between tenants, and enabling tenant-specific optimiza-
service-specific systems into general multitenant filesystem tions. Exabyte-scale clusters are important for operational
instances that achieve performance comparable to the spe- simplicity and resource sharing. Performance isolation and
cialized systems. The exabyte-scale consolidated instances tenant-specific optimizations help Tectonic match the perfor-
enable better resource utilization, simpler services, and less mance of specialized storage systems.
operational complexity than our previous approach. This pa- To scale metadata, Tectonic disaggregates the filesys-
per describes Tectonic’s design, explaining how it achieves tem metadata into independently-scalable layers, similar to
scalability, supports multitenancy, and allows tenants to spe- ADLS [42]. Unlike ADLS, Tectonic hash-partitions each
cialize operations to optimize for diverse workloads. The metadata layer rather than using range partitioning. Hash
paper also presents insights from designing, deploying, and partitioning effectively avoids hotspots in the metadata layer.
operating Tectonic. Combined with Tectonic’s highly scalable chunk storage layer,
disaggregated metadata allows Tectonic to scale to exabytes
1 Introduction of storage and billions of files.
Tectonic is Facebook’s distributed filesystem. It currently Tectonic simplifies performance isolation by solving the
serves around ten tenants, including blob storage and data isolation problem for groups of applications in each tenant
warehouse, both of which store exabytes of data. Prior to with similar traffic patterns and latency requirements. Instead
Tectonic, Facebook’s storage infrastructure consisted of a of managing resources among hundreds of applications, Tec-
constellation of smaller, specialized storage systems. Blob tonic only manages resources among tens of traffic groups.
storage was spread across Haystack [11] and f4 [34]. Data
Tectonic uses tenant-specific optimizations to match the
warehouse was spread across many HDFS instances [15].
performance of specialized storage systems. These optimiza-
The constellation approach was operationally complex, re-
tions are enabled by a client-driven microservice architecture
quiring many different systems to be developed, optimized,
that includes a rich set of client-side configurations for con-
and managed. It was also inefficient, stranding resources in
trolling how tenants interact with Tectonic. Data warehouse,
the specialized storage systems that could have been reallo-
for instance, uses Reed-Solomon (RS)-encoded writes to im-
cated for other parts of the storage workload.
prove space, IO, and networking efficiency for its large writes.
A Tectonic cluster scales to exabytes such that a single
Blob storage, in contrast, uses a replicated quorum append
cluster can span an entire datacenter. The multi-exabyte ca-
protocol to minimize latency for its small writes and later
pacity of a Tectonic cluster makes it possible to host several
RS-encodes them for space efficiency.
large tenants like blob storage and data warehouse on the
same cluster, with each supporting hundreds of applications Tectonic has been hosting blob storage and data warehouse
in turn. As an exabyte-scale multitenant filesystem, Tectonic in single-tenant clusters for several years, completely replac-
provides operational simplicity and resource efficiency com- ing Haystack, f4, and HDFS. Multitenant clusters are being
pared to federation-based storage architectures [8, 17], which methodically rolled out to ensure reliability and avoid perfor-
assemble smaller petabyte-scale clusters. mance regressions.
Tectonic simplifies operations because it is a single system Adopting Tectonic has yielded many operational and effi-
to develop, optimize, and manage for diverse storage needs. It ciency improvements. Moving data warehouse from HDFS
is resource-efficient because it allows resource sharing among onto Tectonic reduced the number of data warehouse clusters
all cluster tenants. For instance, Haystack was the storage by 10⇥, simplifying operations from managing fewer clusters.
system specialized for new blobs; it bottlenecked on hard disk Consolidating blob storage and data warehouse into multi-
IO per second (IOPS) but had spare disk capacity. f4, which tenant clusters helped data warehouse handle traffic spikes
stored older blobs, bottlenecked on disk capacity but had spare with spare blob storage IO capacity. Tectonic manages these
IO capacity. Tectonic requires fewer disks to support the same efficiency improvements while providing comparable or better
workloads through consolidation and resource sharing. performance than the previous specialized storage systems.

USENIX Association 19th USENIX Conference on File and Storage Technologies 217

2 Facebook’s Previous Storage Infrastructure Tectonic cluster Tectonic cluster
geo-replication

Before Tectonic, each major storage tenant stored its data dc1:blobstore dc1:warehouse dc2:blobstore dc2:warehouse
appA appZ Metadata appZ Metadata
in one or more specialized storage systems. We focus here …
Store
…
Store

…
…
…
…
…
…
on two large tenants, blob storage and data warehouse. We Chunk Chunk
discuss each tenant’s performance requirements, their prior Store Store

storage systems, and why they were inefficient. Datacenter 1 Datacenter 2

Figure 1: Tectonic provides durable, fault-tolerant stor-
2.1 Blob Storage age inside a datacenter. Each tenant has one or more sep-
Blob storage stores and serves binary large objects. These arate namespaces. Tenants implement geo-replication.
may be media from Facebook apps (photos, videos, or mes-
sage attachments) or data from internal applications (core 2.2 Data Warehouse
dumps, bug reports). Blobs are immutable and opaque. They Data warehouse provides storage for data analytics. Data
vary in size from several kilobytes for small photos to several warehouse applications store objects like massive map-reduce
megabytes for high-definition video chunks [34]. Blob stor- tables, snapshots of the social graph, and AI training data
age expects low-latency reads and writes as blobs are often and models. Multiple compute engines, including Presto [3],
on path for interactive Facebook applications [29]. Spark [10], and AI training pipelines [4] access this data,
process it, and store derived data. Warehouse data is parti-
Haystack and f4. Before Tectonic, blob storage consisted tioned into datasets that store related data for different product
of two specialized systems, Haystack and f4. Haystack han- groups like Search, Newsfeed, and Ads.
dled “hot” blobs with a high access frequency [11]. It stored Data warehouse storage prioritizes read and write through-
data in replicated form for durability and fast reads and writes. put over latency, since data warehouse applications often
As Haystack blobs aged and were accessed less frequently, batch-process data. Data warehouse workloads tend to issue
they were moved to f4, the “warm” blob storage [34]. f4 stored larger reads and writes than blob storage, with reads averaging
data in RS-encoded form [43], which is more space-efficient multiple megabytes and writes averaging tens of megabytes.
but has lower throughput because each blob is directly acces-
sible from two disks instead of three in Haystack. f4’s lower HDFS for data warehouse storage. Before Tectonic,
throughput was acceptable because of its lower request rate. data warehouse used the Hadoop Distributed File System
(HDFS) [15, 50]. However, HDFS clusters are limited in size
However, separating hot and warm blobs resulted in poor because they use a single machine to store and serve metadata.
resource utilization, a problem exacerbated by hardware and As a result, we needed tens of HDFS clusters per datacenter
blob storage usage trends. Haystack’s ideal effective replica- to store analytics data. This was operationally inefficient; ev-
tion factor was 3.6⇥ (i.e., each logical byte is replicated 3⇥, ery service had to be aware of data placement and movement
with an additional 1.2⇥ overhead for RAID-6 storage [19]). among clusters. Single data warehouse datasets are often large
However, because IOPS per hard drive has remained steady enough to exceed a single HDFS cluster’s capacity. This com-
as drive density has increased, IOPS per terabyte of storage plicated compute engine logic, since related data was often
capacity has declined over time. split among separate clusters.
As a result, Haystack became IOPS-bound; extra hard Finally, distributing datasets among the HDFS clusters cre-
drives had to be provisioned to handle the high IOPS load ated a two-dimensional bin-packing problem. The packing
of hot blobs. The spare disk capacity resulted in Haystack’s of datasets into clusters had to respect each cluster’s capacity
effective replication factor increasing to 5.3⇥. In contrast, f4 constraints and available throughput. Tectonic’s exabyte scale
had an effective replication factor of 2.8⇥ (using RS(10,4) eliminated the bin-packing and dataset-splitting problems.
encoding in two different datacenters). Furthermore, blob stor-
age usage shifted to more ephemeral media that was stored in 3 Architecture and Implementation
Haystack but deleted before moving to f4. As a result, an in- This section describes the Tectonic architecture and imple-
creasing share of the total blob data was stored at Haystack’s mentation, focusing on how Tectonic achieves exabyte-scale
high effective replication factor. single clusters with its scalable chunk and metadata stores.
Finally, since Haystack and f4 were separate systems, each
stranded resources that could not be shared with other sys-
3.1 Tectonic: A Bird’s-Eye View
tems. Haystack overprovisioned storage to accommodate peak A cluster is the top-level Tectonic deployment unit. Tectonic
IOPS, whereas f4 had an abundance of IOPS from storing a clusters are datacenter-local, providing durable storage that is
large volume of less frequently-accessed data. Moving blob resilient to host, rack, and power domain failures. Tenants can
storage to Tectonic harvested these stranded resources and build geo-replication on top of Tectonic for protection against
resulted in an effective replication factor of ~2.8⇥. datacenter failures (Figure 1).

218 19th USENIX Conference on File and Storage Technologies USENIX Association

verts clients’ filesystem API calls into RPCs to Chunk and
Metadata Store services.
Name layer Background
Client
File layer
Key-value Services
Finally, each cluster runs stateless background services to
Library Store
maintain cluster consistency and fault tolerance (§3.5).
(stateless)
Block layer
Garbage collectors
Metadata Store
Rebalancer 3.2 Chunk Store: Exabyte-Scale Storage
Stat service
Disk inventory The Chunk Store is a flat, distributed object store for chunks,
Block repair/scan
Storage node health the unit of data storage in Tectonic. Chunks make up blocks,
Chunk Store
checker which in turn make up Tectonic files.
The Chunk Store has two features that contribute to Tec-
Figure 2: Tectonic architecture. Arrows indicate network tonic’s scalability and ability to support multiple tenants. First,
calls. Tectonic stores filesystem metadata in a key-value the Chunk Store is flat; the number of chunks stored grows
store. Apart from the Chunk and Metadata Stores, all linearly with the number of storage nodes. As a result, the
components are stateless. Chunk Store can scale to store exabytes of data. Second, it
is oblivious to higher-level abstractions like blocks or files;
A Tectonic cluster is made up of storage nodes, metadata these abstractions are constructed by the Client Library using
nodes, and stateless nodes for background operations. The the Metadata Store. Separating data storage from filesystem
Client Library orchestrates remote procedure calls to the meta- abstractions simplifies the problem of supporting good per-
data and storage nodes. Tectonic clusters can be very large: a formance for a diversity of tenants on one storage cluster
single cluster can serve the storage needs of all tenants in a (§5). This separation means reading to and writing from stor-
single datacenter. age nodes can be specialized to tenants’ performance needs
Tectonic clusters are multitenant, supporting around ten without changing filesystem management.
tenants on the same storage fabric (§4). Tenants are distributed Storing chunks efficiently. Individual chunks are stored
systems that will never share data with one another; tenants as files on a cluster’s storage nodes, which each run a local
include blob storage and data warehouse. These tenants in turn instance of XFS [26]. Storage nodes expose core IO APIs
serve hundreds of applications, including Newsfeed, Search, to get, put, append to, and delete chunks, along with APIs
Ads, and internal services, each with varying traffic patterns for listing chunks and scanning chunks. Storage nodes are
and performance requirements. responsible for ensuring that their own local resources are
Tectonic clusters support any number of arbitrarily-sized shared fairly among Tectonic tenants (§4).
namespaces, or filesystem directory hierarchies, on the same Each storage node has 36 hard drives for storing chunks [5].
storage and metadata components. Each tenant in a cluster Each node also has a 1 TB SSD, used for storing XFS meta-
typically owns one namespace. Namespace sizes are limited data and caching hot chunks. Storage nodes run a version
only by the size of the cluster. of XFS that stores local XFS metadata on flash [47]. This
Applications interact with Tectonic through a hierarchi- is particularly helpful for blob storage, where new blobs are
cal filesystem API with append-only semantics, similar to written as appends, updating the chunk size. The SSD hot
HDFS [15]. Unlike HDFS, Tectonic APIs are configurable at chunk cache is managed by a cache library which is flash
runtime, rather than being pre-configured on a per-cluster or endurance-aware [13].
per-tenant basis. Tectonic tenants leverage this flexibility to
match the performance of specialized storage systems (§4). Blocks as the unit of durable storage. In Tectonic, blocks
are a logical unit that hides the complexity of raw data storage
Tectonic components. Figure 2 shows the major compo- and durability from the upper layers of the filesystem. To the
nents of Tectonic. The foundation of a Tectonic cluster is the upper layers, a block is an array of bytes. In reality, blocks are
Chunk Store (§3.2), a fleet of storage nodes which store and composed of chunks which together provide block durability.
access data chunks on hard drives. Tectonic provides per-block durability to allow tenants to
On top of the Chunk Store is the Metadata Store (§3.3), tune the tradeoff between storage capacity, fault tolerance, and
which consists of a scalable key-value store and stateless performance. Blocks are either Reed-Solomon encoded [43]
metadata services that construct the filesystem logic over the or replicated for durability. For RS(r, k) encoding, the block
key-value store. Their scalability enables Tectonic to store data is split into r equal chunks (potentially by padding the
exabytes of data. data), and k parity chunks are generated from the data chunks.
Tectonic is a client-driven microservices-based system, a For replication, data chunks are the same size as the block
design that enables tenant-specific optimizations. The Chunk and multiple copies are created. Chunks in a block are stored
and Metadata Stores each run independent services to handle in different fault domains (e.g., different racks) for fault toler-
read and write requests for data and metadata. These services ance. Background services repair damaged or lost chunks to
are orchestrated by the Client Library (§3.4); the library con- maintain durability (§3.5).

USENIX Association 19th USENIX Conference on File and Storage Technologies 219

Layer Key Value Sharded by Mapping
Name (dir_id, subdirname) subdir_info, subdir_id dir_id dir ! list of subdirs (expanded)
(dir_id, filename) file_info, file_id dir_id dir ! list of files (expanded)
File (file_id, blk_id) blk_info file_id file ! list of blocks (expanded)
Block blk_id list blk_id block ! list of disks (i.e., chunks)
(disk_id, blk_id) chunk_info blk_id disk ! list of blocks (expanded)
Table 1: Tectonic’s layered metadata schema. dirname and filename are application-exposed strings. dir_id, file_id,
and block_id are internal object references. Most mappings are expanded for efficient updating.

3.3 Metadata Store: Naming Exabytes of Data modified without reading and then writing the entire list. In a
filesystem where mappings can be very large, e.g., directories
Tectonic’s Metadata Store stores the filesystem hierarchy and
may contain millions of files, expanding significantly reduces
the mapping of blocks to chunks. The Metadata Store uses
the overhead of some metadata operations such as file creation
a fine-grained partitioning of filesystem metadata for opera-
and deletion. The contents of a expanded key are listed by
tional simplicity and scalability. Filesystem metadata is first
doing a prefix scan over keys.
disaggregated, meaning the naming, file, and block layers
are logically separated. Each layer is then hash partitioned
Fine-grained metadata partitioning to avoid hotspots.
(Table 1). As we describe in this section, scalability and load
In a filesystem, directory operations often cause hotspots
balancing come for free with this design. Careful handling of
in metadata stores. This is particularly true for data ware-
metadata operations preserves filesystem consistency despite
house workloads where related data is grouped into directo-
the fine-grained metadata partitioning.
ries; many files from the same directory may be read in a
Storing metadata in a key-value store for scalability and short time, resulting in repeated accesses to the directory.
operational simplicity. Tectonic delegates filesystem meta- Tectonic’s layered metadata approach naturally avoids
data storage to ZippyDB [6], a linearizable, fault-tolerant, hotspots in directories and other layers by separating search-
sharded key-value store. The key-value store manages data at ing and listing directory contents (Name layer) from reading
the shard granularity: all operations are scoped to a shard, and file data (File and Block layers). This is similar to ADLS’s
shards are the unit of replication. The key-value store nodes separation of metadata layers [42]. However, ADLS range-
internally run RocksDB [23], a SSD-based single-node key- partitions metadata layers whereas Tectonic hash-partitions
value store, to store shard replicas. Shards are replicated with layers. Range partitioning tends to place related data on the
Paxos [30] for fault tolerance. Any replica can serve reads, same shard, e.g., subtrees of the directory hierarchy, making
though reads that must be strongly consistent are served by the metadata layer prone to hotspots if not carefully sharded.
the primary. The key-value store does not provide cross-shard
We found that hash partitioning effectively load-balances
transactions, limiting certain filesystem metadata operations.
metadata operations. For example, in the Name layer, the
Shards are sized so that each metadata node can host several
immediate directory listing of a single directory is always
shards. This allows shards to be redistributed in parallel to
stored in a single shard. But listings of two subdirectories
new nodes in case a node fails, reducing recovery time. It
of the same directory will likely be on separate shards. In
also allows granular load balancing; the key-value store will
the Block layer, block locator information is hashed among
transparently move shards to control load on each node.
shards, independent of the blocks’ directory or file. Around
Filesystem metadata layers. Table 1 shows the filesystem two-thirds of metadata operations in Tectonic are served by
metadata layers, what they map, and how they are sharded. the Block layer, but hash partitioning ensures this traffic is
The Name layer maps each directory to its sub-directories evenly distributed among Block layer shards.
and/or files. The File layer maps file objects to a list of blocks.
The Block layer maps each block to a list of disk (i.e., chunk) Caching sealed object metadata to reduce read load.
locations. The Block layer also contains the reverse index of Metadata shards have limited available throughput, so to re-
disks to the blocks whose chunks are stored on that disk, used duce read load, Tectonic allows blocks, files, and directories
for maintenance operations. Name, File, and Block layers are to be sealed. Directory sealing does not apply recursively, it
hash-partitioned by directory, file, and block IDs, respectively. only prevents adding objects in the immediate level of the
As shown in Table 1, the Name and File layer and disk directory. The contents of sealed filesystem objects cannot
to block list maps are expanded. A key mapped to a list is change; their metadata can be cached at metadata nodes and
expanded by storing each item in the list as a key, prefixed at clients without compromising consistency. The exception
by the true key. For example, if directory d1 contains files is the block-to-chunk mapping; chunks can migrate among
foo and bar, we store two keys (d1, foo) and (d1, bar) in d1’s disks, invalidating the Block layer cache. A stale Block layer
Name shard. Expanding allows the contents of a key to be cache can be detected during reads, triggering a cache refresh.

220 19th USENIX Conference on File and Storage Technologies USENIX Association

Providing consistent metadata operations. Tectonic re- performant way possible for applications, which might have
lies on the key-value store’s strongly-consistent opera- different workloads or prefer different tradeoffs (§5).
tions and atomic read-modify-write in-shard transactions for The Client Library replicates or RS-encodes data and writes
strongly-consistent same-directory operations. More specif- chunks directly to the Chunk Store. It reads and reconstructs
ically, Tectonic guarantees read-after-write consistency for chunks from the Chunk Store for the application. The Client
data operations (e.g., appends, reads), file and directory oper- Library consults the Metadata Store to locate chunks, and
ations involving a single object (e.g., create, list), and move updates the Metadata Store for filesystem operations.
operations where the source and destination are in the same
Single-writer semantics for simple, optimizable writes.
parent directory. Files in a directory reside in the directory’s
Tectonic simplifies the Client Library’s orchestration by allow-
shard (Table 1), so metadata operations like file create, delete,
ing a single writer per file. Single-writer semantics avoids the
and moves within a parent directory are consistent.
complexity of serializing writes to a file from multiple writers.
The key-value store does not support consistent cross-shard
The Client Library can instead write directly to storage nodes
transactions, so Tectonic provides non-atomic cross-directory
in parallel, allowing it to replicate chunks in parallel and to
move operations. Moving a directory to another parent di-
hedge writes (§5). Tenants needing multiple-writer semantics
rectory on a different shard is a two-phase process. First, we
can build serialization semantics on top of Tectonic.
create a link from the new parent directory, and then delete
the link from the previous parent. The moved directory keeps Tectonic enforces single-writer semantics with a write to-
a backpointer to its parent directory to detect pending moves. ken for every file. Any time a writer wants to add a block to a
This ensures only one move operation is active for a direc- file, it must include a matching token for the metadata write
tory at a time. Similarly, cross directory file moves involve to succeed. A token is added in the file metadata when a pro-
copying the file and deleting it from the source directory. The cess opens a file for appending, which subsequent writes must
copy step creates a new file object with the underlying blocks include to update file metadata. If a second process attempts
of the source file, avoiding data movement. to open the file, it will generate a new token and overwrite the
first process’s token, becoming the new, and only, writer for
In the absence of cross-shard transactions, multi-shard
the file. The new writer’s Client Library will seal any blocks
metadata operations on the same file must be carefully imple-
opened by the previous writer in the open file call.
mented to avoid race conditions. An example of such a race
condition is when a file named f1 in directory d is renamed 3.5 Background Services
to f2. Concurrently, a new file with the same name is created,
where creates overwrite existing files with the same name. Background services maintain consistency between metadata
The metadata layer and shard lookup key (shard(x)) are listed layers, maintain durability by repairing lost data, rebalance
for each step in parentheses. data across storage nodes, handle rack drains, and publish
A file rename has the following steps: statistics about filesystem usage. Background services are
R1: get file ID fid for f1 (Name, shard(d)) layered similar to the Metadata Store, and they operate on one
R2: add f2 as an owner of fid (File, shard(fid)) shard at a time. Figure 2 lists important background services.
R3: create the mapping f2 ! fid and delete f1 ! fid in A garbage collector between each metadata layer cleans
an atomic transaction (Name, shard(d)) up (acceptable) metadata inconsistencies. Metadata incon-
sistencies can result from failed multi-step Client Library
A file create with overwriting has the following steps:
operations. Lazy object deletion, a real-time latency optimiza-
C1: create new file ID fid_new (File, shard(fid_new))
tion that marks deleted objects at delete time without actually
C2: map f1 ! fid_new; delete f1 ! fid (Name, shard(d))
removing them, also causes inconsistencies.
Interleaving the steps in these transactions may leave the
A rebalancer and a repair service work in tandem to relocate
filesystem in an inconsistent state. If steps C1 and C2 are
or delete chunks. The rebalancer identifies chunks that need
executed after R1 but before R3, then R3 will erase the newly-
to be moved in response to events like hardware failure, added
created mapping from the create operation. Rename step R3
storage capacity, and rack drains. The repair service handles
uses a within-shard transaction to ensure that the file object
the actual data movement by reconciling the chunk list to
pointed to by f1 has not been modified since R1.
the disk-to-block map for every disk in the system. To scale
3.4 Client Library horizontally, the repair service works on a per-Block layer
shard, per-disk basis, enabled by the reverse index mapping
The Tectonic Client Library orchestrates the Chunk and Meta-
disks to blocks (Table 1).
data Store services to expose a filesystem abstraction to appli-
cations, which gives applications per-operation control over Copysets at scale. Copysets are combinations of disks that
how to configure reads and writes. Moreover, the Client Li- provide redundancy for the same block (e.g., a copyset for an
brary executes reads and writes at the chunk granularity, the RS(10,4)-encoded block consists of 14 disks) [20]. Having
finest granularity possible in Tectonic. This gives the Client too many copysets risks data unavailability if there is an un-
Library nearly free reign to execute operations in the most expected spike in disk failures. On the other hand, having too

USENIX Association 19th USENIX Conference on File and Storage Technologies 221

few copysets results in high reconstruction load to peer disks need finer-grained real-time automated management to ensure
when one disk fails, since they share many chunks. they are shared fairly, tenants are isolated from one another,
The Block Layer and the rebalancer service together at- and resource utilization is high. For the rest of this section, we
tempt to maintain a fixed copyset count that balances unavail- describe how Tectonic shares ephemeral resources effectively.
ability and reconstruction load. They each keep in memory Distributing ephemeral resources among and within ten-
about one hundred consistent shuffles of all the disks in the ants. Ephemeral resource sharing is challenging in Tectonic
cluster. The Block Layer forms copysets from contiguous because not only are tenants diverse, but each tenant serves
disks in a shuffle. On a write, the Block Layer gives the Client many applications with varied traffic patterns and perfor-
Library a copyset from the shuffle corresponding to that block mance requirements. For example, blob storage includes pro-
ID. The rebalancer service tries to keep the block’s chunks duction traffic from Facebook users and background garbage
in the copyset specified by that block’s shuffle. Copysets are collection traffic. Managing ephemeral resources at the tenant
best-effort, since disk membership in the cluster changes con- granularity would be too coarse to account for the varied work-
stantly. loads and performance requirements within a tenant. On the
4 Multitenancy other hand, because Tectonic serves hundreds of applications,
managing resources at the application granularity would be
Providing comparable performance for tenants as they move too complex and resource-intensive.
from individual, specialized storage systems to a consolidated Ephemeral resources are therefore managed within each
filesystem presents two challenges. First, tenants must share tenant at the granularity of groups of applications. These appli-
resources while giving each tenant its fair share, i.e., at least cation groups, called TrafficGroups, reduce the cardinality of
the same resources it would have in a single-tenant system. the resource sharing problem, reducing the overhead of man-
Second, tenants should be able to optimize performance as aging multitenancy. Applications in the same TrafficGroup
in specialized systems. This section describes how Tectonic have similar resource and latency requirements. For example,
supports resource sharing with a clean design that maintains one TrafficGroup may be for applications generating back-
operational simplicity. Section 5 describes how Tectonic’s ground traffic while another is for applications generating
tenant-specific optimizations allow tenants to get performance production traffic. Tectonic supports around 50 TrafficGroups
comparable to specialized storage systems. per cluster. Each tenant may have a different number of Traffic-
4.1 Sharing Resources Effectively Groups. Tenants are responsible for choosing the appropriate
TrafficGroup for each of their applications. Each TrafficGroup
As a shared filesystem for diverse tenants across Facebook,
is in turn assigned a TrafficClass. A TrafficGroup’s Traffic-
Tectonic needs to manage resources effectively. In particular,
Class indicates its latency requirements and decides which
Tectonic needs to provide approximate (weighted) fair shar-
requests should get spare resources. The TrafficClasses are
ing of resources among tenants and performance isolation
Gold, Silver, and Bronze, corresponding to latency-sensitive,
between tenants, while elastically shifting resources among
normal, and background applications. Spare resources are
applications to maintain high resource utilization. Tectonic
distributed according to TrafficClass priority within a tenant.
also needs to distinguish latency-sensitive requests to avoid
Tectonic uses tenants and TrafficGroups along with the
blocking them behind large requests.
notion of TrafficClass to ensure isolation and high resource
Types of resources. Tectonic distinguishes two types of utilization. That is, tenants are allocated their fair share of
resources: non-ephemeral and ephemeral. Storage capacity resources; within each tenant, resources are distributed by
is the non-ephemeral resource. It changes slowly and pre- TrafficGroup and TrafficClass. Each tenant gets a guaranteed
dictably. Most importantly, once allocated to a tenant, it can- quota of the cluster’s ephemeral resources, which is subdi-
not be given to another tenant. Storage capacity is managed vided between a tenant’s TrafficGroups. Each TrafficGroup
at the tenant granularity. Each tenant gets a predefined ca- gets its guaranteed resource quota, which provides isolation
pacity quota with strict isolation, i.e., there is no automatic between tenants as well as isolation between TrafficGroups.
elasticity in the space allocated to different tenants. Recon- Any ephemeral resource surplus within a tenant is shared
figuring storage capacity between tenants is done manually. with its own TrafficGroups by descending TrafficClass. Any
Reconfiguration does not cause downtime, so in case of an remaining surplus is given to TrafficGroups in other tenants
urgent capacity crunch, it can be done immediately. Tenants by descending TrafficClass. This ensures spare resources are
are responsible for distributing and tracking storage capacity used by TrafficGroups of the same tenant first before being
among their applications. distributed to other tenants. When one TrafficGroup uses
Ephemeral resources are those where demand changes resources from another TrafficGroup, the resulting traffic gets
from moment to moment, and allocation of these resources the minimum TrafficClass of the two TrafficGroups. This
can change in real time. Storage IOPS capacity and meta- ensures the overall ratio of traffic of different classes does not
data query capacity are two ephemeral resources. Because change based on resource allocation, which ensures the node
ephemeral resource demand changes quickly, these resources can meet the latency profile of the TrafficClass.

222 19th USENIX Conference on File and Storage Technologies USENIX Association

Enforcing global resource sharing. The Client Library talks to each layer directly. Since access control is on path for
uses a rate limiter to achieve the aforementioned elastic- every read and write, it must also be lightweight.
ity. The rate limiter uses high-performance, near-realtime Tectonic uses a token-based authorization mechanism that
distributed counters to track the demand for each tracked includes which resources can be accessed with the token [31].
resource in each tenant and TrafficGroup in the last small An authorization service authorizes top-level client requests
time window. The rate limiter implements a modified leaky (e.g., opening a file), generating an authorization token for the
bucket algorithm. An incoming request increments the de- next layer in the filesystem; each subsequent layer likewise
mand counter for the bucket. The Client Library then checks authorizes the next layer. The token’s payload describes the re-
for spare capacity in its own TrafficGroup, then other Traffic- source to which access is given, enabling granular access con-
Groups in the same tenant, and finally other tenants, adhering trol. Each layer verifies the token and the resource indicated
to TrafficClass priority. If the client finds spare capacity, the re- in the payload entirely in memory; verification can be per-
quest is sent to the backend. Otherwise, the request is delayed formed in tens of microseconds. Piggybacking token-passing
or rejected depending on the request’s timeout. Throttling on existing protocols reduces the access control overhead.
requests at clients puts backpressure on clients before they
make a potentially wasted request.
5 Tenant-Specific Optimizations
Tectonic supports around ten tenants in the same shared
Enforcing local resource sharing. The client rate limiter
filesystem, each with specific performance needs and work-
ensures approximate global fair sharing and isolation. Meta-
load characteristics. Two mechanisms permit tenant-specific
data and storage nodes also need to manage resources to
optimizations. First, clients have nearly full control over how
avoid local hotspots. Nodes provide fair sharing and isolation
to configure an application’s interactions with Tectonic; the
with a weighted round-robin (WRR) scheduler that provision-
Client Library manipulates data at the chunk level, the finest
ally skips a TrafficGroup’s turn if it will exceed its resource
possible granularity (§3.4). This Client Library-driven de-
quota. In addition, storage nodes need to ensure that small IO
sign enables Tectonic to execute operations according to the
requests (e.g., blob storage operations) do not see higher la-
application’s performance needs.
tency from colocation with large, spiky IO requests (e.g., data
warehouse operations). Gold TrafficClass requests can miss Second, clients enforce configurations on a per-call basis.
their latency targets if they are blocked behind lower-priority Many other filesystems bake configurations into the system or
requests on storage nodes. apply them to entire files or namespaces. For example, HDFS
configures durability per directory [7], whereas Tectonic con-
Storage nodes use three optimizations to ensure low latency
figures durability per block write. Per-call configuration is
for Gold TrafficClass requests. First, the WRR scheduler pro-
enabled by the scalability of the Metadata Store: the Meta-
vides a greedy optimization where a request from a lower
data Store can easily handle the increased metadata for this
TrafficClass may cede its turn to a higher TrafficClass if the
approach. We next describe how data warehouse and blob
request will have enough time to complete after the higher-
storage leverage per-call configurations for efficient writes.
TrafficClass request. This helps prevent higher-TrafficClass
requests from getting stuck behind a lower-priority request. 5.1 Data Warehouse Write Optimizations
Second, we limit how many non-Gold IOs may be in flight
A common pattern in data warehouse workloads is to write
for every disk. Incoming non-Gold traffic is blocked from
data once that will be read many times later. For these work-
scheduling if there are any pending Gold requests and the non-
loads, the file is visible to readers only once the creator closes
Gold in-flight limit has been reached. This ensures the disk is
the file. The file is then immutable for its lifetime. Because
not busy serving large data warehouse IOs while blob storage
the file is only read after it is written completely, applications
requests are waiting. Third, the disk itself may re-arrange the
prioritize lower file write time over lower append latency.
IO requests, i.e., serve a non-Gold request before an earlier
Gold request. To manage this, Tectonic stops scheduling non- Full-block, RS-encoded asynchronous writes for space,
Gold requests to a disk if a Gold request has been pending IO, and network efficiency. Tectonic uses the write-once-
on that disk for a threshold amount of time. These three tech- read-many pattern to improve IO and network efficiency,
niques combined effectively maintain the latency profile of while minimizing total file write time. The absence of partial
smaller IOs, even when outnumbered by larger IOs. file reads in this pattern allows applications to buffer writes
up to the block size. Applications then RS-encode blocks in
4.2 Multitenant Access Control memory and write the data chunks to storage nodes. Long-
Tectonic follows common security principles to ensure that lived data is typically RS(9,6) encoded; short-lived data, e.g.,
all communications and dependencies are secure. Tectonic map-reduce shuffles, is typically RS(3,3)-encoded.
additionally provides coarse access control between tenants Writing RS-encoded full blocks saves storage space, net-
(to prevent one tenant from accessing another’s data) and fine- work bandwidth, and disk IO over replication. Storage and
grained access control within a tenant. Access control must bandwidth are lower because less total data is written. Disk
be enforced at each layer of Tectonic, since the Client Library IO is lower because disks are more efficiently used. More

USENIX Association 19th USENIX Conference on File and Storage Technologies 223

100                                                   100

                   95
                                                                        75                                                    75
      Percentile

                                                           Percentile

                                                                                                                 Percentile
                   90
                                                                        50                                                    50
                   85
                                                                        25                    Haystack                        25
                   80                    Hedging                                       Quorum append                                                   Tectonic
                                      No Hedging                                       Standard append                                                 Haystack
                   75                                                     0                                                    0
                     800   1000 1200 1400 1600 1800 2000                      0   50       100       150   200                      0   20       40      60       80   100
                           72MB block write latency (ms)                           Write Latency (ms)                                        Read Latency (ms)
      (a) Data warehouse full block writes                               (b) Blob storage write latency                        (c) Blob storage read latency
Figure 3: Tail latency optimizations in Tectonic. (a) shows the improvement in data warehouse tail latency from hedged
quorum writes (72MB blocks) in a test cluster with ~80% load. (b) and (c) show Tectonic blob storage write latency
(with and without quorum appends) and read latency compared to Haystack.

IOPS are needed to write chunks to 15 disks in RS(9,6), but                                 storage metadata by storing many blobs together into log-
each write is small and the total amount of data written is                                 structured files, where new blobs are appended at the end of a
much smaller than with replication. This results in more effi-                              file. Blobs are located with a map from blob ID to the location
cient disk IO because block sizes are large enough that disk                                of the blob in the file.
bandwidth, not IOPS, is the bottleneck for full-block writes.                                  Blob storage is also on path for many user requests, so low
   The write-once-read-many pattern also allows applications                                latency is desirable. Blobs are usually much smaller than Tec-
to write the blocks of a file asynchronously in parallel, which                             tonic blocks (§2.1). Blob storage therefore writes new blobs
decreases the file write latency significantly. Once the blocks                             as small, replicated partial block appends for low latency. The
of the file are written, the file metadata is updated all together.                         partial block appends need to be read-after-write consistent
There is no risk of inconsistency with this strategy because a                              so blobs can be read immediately after successful upload.
file is only visible once it is completely written.                                         However, replicated data uses more disk space than full-block
                                                                                            RS-encoded data.
Hedged quorum writes to improve tail latency. For full-
block writes, Tectonic uses a variant of quorum writing which                               Consistent partial block appends for low latency. Tec-
reduces tail latency without any additional IO. Instead of                                  tonic uses partial block quorum appends to enable durable,
sending the chunk write payload to extra nodes, Tectonic first                              low-latency, consistent blob writes. In a quorum append, the
sends reservation requests ahead of the data and then writes                                Client Library acknowledges a write after a subset of storage
the chunks to the first nodes to accept the reservation. The                                nodes has successfully written the data to disk, e.g., two nodes
reservation step is similar to hedging [22], but it avoids data                             for three-way replication. The temporary decrease of durabil-
transfers to nodes that would reject the request because of                                 ity from a quorum write is acceptable because the block will
lack of resources or because the requester has exceeded its                                 soon be reencoded and because blob storage writes a second
resource share on that node (§4).                                                           copy to another datacenter.
   As an example, to write a RS(9,6)-encoded block, the Client                                 The challenge with partial block quorum appends is that
Library sends a reservation request to 19 storage nodes in                                  straggler appends could leave replica chunks at different sizes.
different failure domains, four more than required for the                                  Tectonic maintains consistency by carefully controlling who
write. The Client Library writes the data and parity chunks                                 can append to a block and when appends are made visible.
to the first 15 storage nodes that respond to the reservation                               Blocks can only be appended to by the writer that created
request. It acknowledges the write to the client as soon as a                               the block. Once an append completes, Tectonic commits the
quorum of 14 out of 15 nodes return success. If the 15th write                              post-append block size and checksum to the block metadata
fails, the corresponding chunk is repaired offline.                                         before acknowledging the partial block quorum append.
   The hedging step is more effective when the cluster is                                      This ordering of operations with a single appender provides
highly loaded. Figure 3a shows ~20% improvement in 99th                                     consistency. If block metadata reports a block size of S, then
percentile latency for RS(9,6) encoded, 72 MB full-block                                    all preceeding bytes in the block were written to at least two
writes, in a test cluster with 80% throughput utilization.                                  storage nodes. Readers will be able to access data in the block
                                                                                            up to offset S. Similarly, any writes acknowledged to the ap-
5.2          Blob Storage Optimizations                                                     plication will have been updated in the block metadata and so
Blob storage is challenging for filesystems because of the                                  will be visible to future reads. Figures 3b and 3c demonstrate
quantity of objects that need to be indexed. Facebook stores                                that Tectonic’s blob storage read and write latency is compa-
tens of trillions of blobs. Tectonic manages the size of blob                               rable to Haystack, validating that Tectonic’s generality does

224       19th USENIX Conference on File and Storage Technologies                                                                                    USENIX Association

Capacity Used bytes Files Blocks Storage Nodes Warehouse Blob storage Combined
1590 PB 1250 PB 10.7 B 15 B 4208 Supply 0.51 0.49 1.00
Table 2: Statistics from a multitenant Tectonic produc- Peak 1 0.60 0.12 0.72
tion cluster. File and block counts are in billions. Peak 2 0.54 0.14 0.68
Peak 3 0.57 0.11 0.68
not have a significant performance cost. Table 3: Consolidating data warehouse and blob storage
in Tectonic allows data warehouse to use what would oth-
Reencoding blocks for storage efficiency. Directly RS-
erwise be stranded surplus disk time for blob storage to
encoding small partial-block appends would be IO-inefficient.
handle large load spikes. This figure shows the normal-
Small disk writes are IOPS-bound and RS-encoding results
ized disk time demand vs. supply in three daily peaks in
in many more IOs (e.g, 14 IOs with RS(10, 4) instead of 3).
the representative cluster.
Instead of RS-encoding after each append, the Client Library
reencodes the block from replicated form to RS(10,4) en-
spike. For example, if a disk does 10 IOs in one second with
coding once the block is sealed. Reencoding is IO-efficient
each taking 50 ms (seek and fetch), then the disk was busy for
compared to RS-encoding at append time, requiring only a
500 out of 1000 ms. We use disk time to fairly account for
single large IO on each of the 14 target storage nodes. This
usage by different types of requests.
optimization, enabled by Tectonic’s Client Library-driven de-
For the representative production cluster, Table 3 shows
sign, provides nearly the best of both worlds with fast and
normalized disk time demand for data warehouse and blob
IO-efficient replication for small appends that are quickly
storage for three daily peaks and the supply of disk time each
transitioned to the more space-efficient RS-encoding.
would have if running on its own cluster. We normalize by
6 Tectonic in Production total disktime corresponding to used space in the cluster. The
daily peaks correspond to the same three days of traffic as
This section shows Tectonic operating at exabyte scale,
in Figures 4a and 4b. Data warehouse’s demand exceeds its
demonstrates benefits of storage consolidation, and discusses
supply in all three peaks and handling it on its own would
how Tectonic handles metadata hotspots. It also discusses
require disk overprovisioning. To handle peak data warehouse
tradeoffs and lessons from designing Tectonic.
demand over the three day period, the cluster would have
6.1 Exabyte-Scale Multitenant Clusters needed ~17% overprovisioning. Blob storage, on the other
hand, has surplus disk time that would be stranded if it ran
Production Tectonic clusters run at exabyte scale. Table 2 in its own cluster. Consolidating these tenants into a single
gives statistics on a representative multitenant cluster. All Tectonic cluster allows the blob storage’s surplus disk time to
results in this section are for this cluster. The 1250 PB of stor- be used for data warehouse’s storage load spikes.
age, ~70% of the cluster capacity at the time of the snapshot,
consists of 10.7 billion files and 15 billion blocks. 6.3 Metadata Hotspots
6.2 Efficiency from Storage Consolidation Load spikes to the Metadata Store may result in hotspots in
metadata shards. The bottleneck resource in serving meta-
The cluster in Table 2 hosts two tenants, blob storage and data operations is queries per second (QPS). Handling load
data warehouse. Blob storage uses ~49% of the used space in spikes requires the Metadata Store to keep up with the QPS
this cluster and data warehouse uses ~51%. Figures 4a and 4b demand on every shard. In production, each shard can serve a
show the cluster handling storage load over a three-day period. maximum of 10 KQPS. This limit is imposed by the current
Figure 4a shows the cluster’s aggregate IOPS during that time, isolation mechanism on the resources of the metadata nodes.
and Figure 4b shows its aggregate disk bandwidth. The data Figure 4c shows the QPS across metadata shards in the cluster
warehouse workload has large, regular load spikes triggered for the Name, File, and Block layers. All shards in the File
by very large jobs. Compared to the spiky data warehouse and Block layers are below this limit.
workload, blob storage traffic is smooth and predictable.
Over this three-day period, around 1% of Name layer shards
Sharing surplus IOPS capacity. The cluster handles hit the QPS limit because they hold very hot directories. The
spikes in storage load from data warehouse using the surplus small unhandled fraction of metadata requests are retried
IOPS capacity unlocked by consolidation with blob storage. after a backoff. The backoff allows the metadata nodes to
Blob storage requests are typically small and bound by IOPS clear most of the initial spike and successfully serve retried
while data warehouse requests are typically large and bound requests. This mechanism, combined with all other shards
by bandwidth. As a result, neither IOPS nor bandwidth can being below their maximum, enables Tectonic to successfully
fairly account for disk IO usage. The bottleneck resource in handle the large spikes in metadata load from data warehouse.
serving storage operations is disk time, which measures how The distribution of load across shards varies between the
often a given disk is busy. Handling a storage load spike re- Name, File, and Block layers. Each higher layer has a larger
quires Tectonic to have enough free disk time to serve the distribution of QPS per shard because it colocates more of a

USENIX Association 19th USENIX Conference on File and Storage Technologies 225

2000                      Warehouse                                  4.5
                                                                                                                Warehouse
                  1800                     Blob storage                                  4                     Blob storage                                    100
                  1600                                                                 3.5

                                                                    Bandwidth (TB/s)

                                                                                                                                        Percentile of shards
                  1400                                                                   3                                                                     75
                  1200
       IOPS (K)

                                                                                       2.5                                                                                                    max load >
                  1000
                                                                                         2                                                                     50
                   800
                   600                                                                 1.5
                                                                                         1                                                                     25                           Block
                   400                                                                                                                                                                      Name
                   200                                                                 0.5                                                                                                    File
                     0                                                                   0                                                                       0
                         0   10   20     30 40 50         60   70                            0   10   20     30 40 50         60   70                                0   2   4          6        8     10
                                       Time (hours)                                                        Time (hours)                                                          KQPS

                    (a) Aggregate cluster IOPS                                 (b) Aggregate cluster bandwidth                                                 (c) Peak metadata load (CDF)
Figure 4: IO and metadata load on the representative production cluster over three days. (a) and (b) show the difference
in blob storage and data warehouse traffic patterns and show Tectonic successfully handling spikes in storage IOPS and
bandwidth over 3 days. Both tenants occupy nearly the same space in this cluster. (c) is a CDF of peak metadata load
over three days on this cluster’s metadata shards. The maximum load each shard can handle is 10 KQPS (grey line).
Tectonic can handle all metadata operations at the File and Block layers. It can immediately handle almost all Name
layer operations; the remaining operations are handled on retry.

tenant’s operations. For instance, all directory-to-file lookups                                                are distributed round-robin across storage nodes [51]. Be-
for a given directory are handled by one shard. An alternative                                                  cause Tectonic uses contiguous RS encoding and the majority
design that used range partitioning like ADLS [42] would                                                        of reads are smaller than a chunk size, reads are usually di-
colocate many more of a tenant’s operations together and                                                        rect: they do not require RS reconstruction and so consist
result in much larger load spikes. Data warehouse jobs often                                                    of a single disk IO. Reconstruction reads require 10⇥ more
read many similarly-named directories, which would lead to                                                      IOs than direct reads (for RS(10,4) encoding). Though com-
extreme load spikes if the directories were range-partitioned.                                                  mon, it is difficult to predict the fraction of reads that will
Data warehouse jobs also read many files in a directory, which                                                  be reconstructed, since reconstruction is triggered by hard-
causes load spikes in the Name layer. Range-partitioning the                                                    ware failures as well as node overload. We learned that such
File layer would colocate files in a directory on the same shard,                                               a wide variability in resource requirements, if not controlled,
resulting in a much larger load spike because each job does                                                     can cause cascading failures that affect system availability
many more File layer operations than Name layer operations.                                                     and performance.
Tectonic’s hash partitioning reduces this colocation, allowing                                                     If some storage nodes are overloaded, direct reads fail and
Tectonic to handle metadata load spikes using fewer nodes                                                       trigger reconstructed reads. This increases load to the rest
than would be necessary with range partitioning.                                                                of the system and triggers yet more reconstructed reads, and
   Tectonic also codesigns with data warehouse to reduce                                                        so forth. The cascade of reconstructions is called a recon-
metadata hotspots. For example, compute engines commonly                                                        struction storm. A simple solution would be to use striped
use an orchestrator to list files in a directory and distribute                                                 RS encoding where all reads are reconstructed. This avoids
the files to workers. The workers open and process the files                                                    reconstruction storms because the number of IOs for reads
in parallel. In Tectonic, this pattern sends a large number of                                                  does not change when there are failures. However, it makes
nearly-simultaneous file open requests to a single directory                                                    normal-case reads much more expensive. We instead prevent
shard (§3.3), causing a hotspot. To avoid this anti-pattern,                                                    reconstruction storms by restricting reconstructed reads to
Tectonic’s list-files API returns the file IDs along with the file                                              10% of all reads. This fraction of reconstructed reads is typ-
names in a directory. The compute engine orchestrator sends                                                     ically enough to handle disk, host, and rack failures in our
the file IDs and names to its workers, which can open the files                                                 production clusters. In exchange for some tuning complexity,
directly by file ID without querying the directory shard again.                                                 we avoid over-provisioning disk resources.
6.4    The simplicity-performance tradeoffs                                                                     Efficiently accessing data within and across datacenters.
                                                                                                                Tectonic allows clients to directly access storage nodes; an
Tectonic’s design generally prioritizes simplicity over effi-
                                                                                                                alternative design might use front-end proxies to mediate all
ciency. We discuss two instances where we opted for addi-
                                                                                                                client access to storage. Making the Client Library accessible
tional complexity in exchange for performance gains.
                                                                                                                to clients introduces complexity because bugs in the library
Managing reconstruction load. RS-encoded data may be                                                            become bugs in the application binary. However, direct client
stored contiguously, where a data block is divided into chunks                                                  access to storage nodes is vastly more network- and hard-
that are each written contiguously to storage nodes, or striped,                                                ware resource efficient than a proxy design, avoiding an extra
where a data block is divided into much smaller chunks that                                                     network hop for terabytes of data per second.

226   19th USENIX Conference on File and Storage Technologies                                                                                                                       USENIX Association

You can also read