Facebook's Tectonic Filesystem: Efficiency from Exascale

Page created by Guy Gomez
 
CONTINUE READING
Facebook's Tectonic Filesystem: Efficiency from Exascale
Facebook’s Tectonic Filesystem:
           Efficiency from Exascale
    Satadru Pan, Facebook, Inc.; Theano Stavrinos, Facebook, Inc. and
    Princeton University; Yunqiao Zhang, Atul Sikaria, Pavel Zakharov,
    Abhinav Sharma, Shiva Shankar P, Mike Shuey, Richard Wareing,
  Monika Gangapuram, Guanglei Cao, Christian Preseau, Pratap Singh,
  Kestutis Patiejunas, and JR Tipton, Facebook, Inc.; Ethan Katz-Bassett,
          Columbia University; Wyatt Lloyd, Princeton University
           https://www.usenix.org/conference/fast21/presentation/pan

     This paper is included in the Proceedings of the
19th USENIX Conference on File and Storage Technologies.
                           February 23–25, 2021
                              978-1-939133-20-5

                                         Open access to the Proceedings
                                        of the 19th USENIX Conference on
                                          File and Storage Technologies
                                             is sponsored by USENIX.
Facebook's Tectonic Filesystem: Efficiency from Exascale
Facebook’s Tectonic Filesystem: Efficiency from Exascale
 Satadru Pan1 , Theano Stavrinos1,2 , Yunqiao Zhang1 , Atul Sikaria1 , Pavel Zakharov1 , Abhinav Sharma1 ,
  Shiva Shankar P1 , Mike Shuey1 , Richard Wareing1 , Monika Gangapuram1 , Guanglei Cao1 , Christian
    Preseau1 , Pratap Singh1 , Kestutis Patiejunas1 , JR Tipton1 , Ethan Katz-Bassett3 , and Wyatt Lloyd2
                                    1 Facebook,   Inc., 2 Princeton University, 3 Columbia University

                         Abstract                                          In building Tectonic, we confronted three high-level chal-
Tectonic is Facebook’s exabyte-scale distributed filesystem.            lenges: scaling to exabyte-scale, providing performance isola-
Tectonic consolidates large tenants that previously used                tion between tenants, and enabling tenant-specific optimiza-
service-specific systems into general multitenant filesystem            tions. Exabyte-scale clusters are important for operational
instances that achieve performance comparable to the spe-               simplicity and resource sharing. Performance isolation and
cialized systems. The exabyte-scale consolidated instances              tenant-specific optimizations help Tectonic match the perfor-
enable better resource utilization, simpler services, and less          mance of specialized storage systems.
operational complexity than our previous approach. This pa-                To scale metadata, Tectonic disaggregates the filesys-
per describes Tectonic’s design, explaining how it achieves             tem metadata into independently-scalable layers, similar to
scalability, supports multitenancy, and allows tenants to spe-          ADLS [42]. Unlike ADLS, Tectonic hash-partitions each
cialize operations to optimize for diverse workloads. The               metadata layer rather than using range partitioning. Hash
paper also presents insights from designing, deploying, and             partitioning effectively avoids hotspots in the metadata layer.
operating Tectonic.                                                     Combined with Tectonic’s highly scalable chunk storage layer,
                                                                        disaggregated metadata allows Tectonic to scale to exabytes
1   Introduction                                                        of storage and billions of files.
Tectonic is Facebook’s distributed filesystem. It currently                Tectonic simplifies performance isolation by solving the
serves around ten tenants, including blob storage and data              isolation problem for groups of applications in each tenant
warehouse, both of which store exabytes of data. Prior to               with similar traffic patterns and latency requirements. Instead
Tectonic, Facebook’s storage infrastructure consisted of a              of managing resources among hundreds of applications, Tec-
constellation of smaller, specialized storage systems. Blob             tonic only manages resources among tens of traffic groups.
storage was spread across Haystack [11] and f4 [34]. Data
                                                                           Tectonic uses tenant-specific optimizations to match the
warehouse was spread across many HDFS instances [15].
                                                                        performance of specialized storage systems. These optimiza-
   The constellation approach was operationally complex, re-
                                                                        tions are enabled by a client-driven microservice architecture
quiring many different systems to be developed, optimized,
                                                                        that includes a rich set of client-side configurations for con-
and managed. It was also inefficient, stranding resources in
                                                                        trolling how tenants interact with Tectonic. Data warehouse,
the specialized storage systems that could have been reallo-
                                                                        for instance, uses Reed-Solomon (RS)-encoded writes to im-
cated for other parts of the storage workload.
                                                                        prove space, IO, and networking efficiency for its large writes.
   A Tectonic cluster scales to exabytes such that a single
                                                                        Blob storage, in contrast, uses a replicated quorum append
cluster can span an entire datacenter. The multi-exabyte ca-
                                                                        protocol to minimize latency for its small writes and later
pacity of a Tectonic cluster makes it possible to host several
                                                                        RS-encodes them for space efficiency.
large tenants like blob storage and data warehouse on the
same cluster, with each supporting hundreds of applications                Tectonic has been hosting blob storage and data warehouse
in turn. As an exabyte-scale multitenant filesystem, Tectonic           in single-tenant clusters for several years, completely replac-
provides operational simplicity and resource efficiency com-            ing Haystack, f4, and HDFS. Multitenant clusters are being
pared to federation-based storage architectures [8, 17], which          methodically rolled out to ensure reliability and avoid perfor-
assemble smaller petabyte-scale clusters.                               mance regressions.
   Tectonic simplifies operations because it is a single system            Adopting Tectonic has yielded many operational and effi-
to develop, optimize, and manage for diverse storage needs. It          ciency improvements. Moving data warehouse from HDFS
is resource-efficient because it allows resource sharing among          onto Tectonic reduced the number of data warehouse clusters
all cluster tenants. For instance, Haystack was the storage             by 10⇥, simplifying operations from managing fewer clusters.
system specialized for new blobs; it bottlenecked on hard disk          Consolidating blob storage and data warehouse into multi-
IO per second (IOPS) but had spare disk capacity. f4, which             tenant clusters helped data warehouse handle traffic spikes
stored older blobs, bottlenecked on disk capacity but had spare         with spare blob storage IO capacity. Tectonic manages these
IO capacity. Tectonic requires fewer disks to support the same          efficiency improvements while providing comparable or better
workloads through consolidation and resource sharing.                   performance than the previous specialized storage systems.

USENIX Association                                               19th USENIX Conference on File and Storage Technologies           217
2     Facebook’s Previous Storage Infrastructure                     Tectonic cluster                                                      Tectonic cluster
                                                                                                                    geo-replication

Before Tectonic, each major storage tenant stored its data                 dc1:blobstore   dc1:warehouse               dc2:blobstore   dc2:warehouse
                                                                        appA           appZ           Metadata      appZ                          Metadata
in one or more specialized storage systems. We focus here                       …
                                                                                                         Store
                                                                                                                             …
                                                                                                                                                     Store

                                                                                                                                        …
                                                                                                                                        …
                                                                                                                                        …
                                                                                              …
                                                                                              …
                                                                                              …
on two large tenants, blob storage and data warehouse. We                                                  Chunk                                       Chunk
discuss each tenant’s performance requirements, their prior                                                 Store                                       Store

storage systems, and why they were inefficient.                      Datacenter 1                                                             Datacenter 2

                                                                    Figure 1: Tectonic provides durable, fault-tolerant stor-
2.1    Blob Storage                                                 age inside a datacenter. Each tenant has one or more sep-
Blob storage stores and serves binary large objects. These          arate namespaces. Tenants implement geo-replication.
may be media from Facebook apps (photos, videos, or mes-
sage attachments) or data from internal applications (core          2.2        Data Warehouse
dumps, bug reports). Blobs are immutable and opaque. They           Data warehouse provides storage for data analytics. Data
vary in size from several kilobytes for small photos to several     warehouse applications store objects like massive map-reduce
megabytes for high-definition video chunks [34]. Blob stor-         tables, snapshots of the social graph, and AI training data
age expects low-latency reads and writes as blobs are often         and models. Multiple compute engines, including Presto [3],
on path for interactive Facebook applications [29].                 Spark [10], and AI training pipelines [4] access this data,
                                                                    process it, and store derived data. Warehouse data is parti-
Haystack and f4. Before Tectonic, blob storage consisted            tioned into datasets that store related data for different product
of two specialized systems, Haystack and f4. Haystack han-          groups like Search, Newsfeed, and Ads.
dled “hot” blobs with a high access frequency [11]. It stored          Data warehouse storage prioritizes read and write through-
data in replicated form for durability and fast reads and writes.   put over latency, since data warehouse applications often
As Haystack blobs aged and were accessed less frequently,           batch-process data. Data warehouse workloads tend to issue
they were moved to f4, the “warm” blob storage [34]. f4 stored      larger reads and writes than blob storage, with reads averaging
data in RS-encoded form [43], which is more space-efficient         multiple megabytes and writes averaging tens of megabytes.
but has lower throughput because each blob is directly acces-
sible from two disks instead of three in Haystack. f4’s lower       HDFS for data warehouse storage. Before Tectonic,
throughput was acceptable because of its lower request rate.        data warehouse used the Hadoop Distributed File System
                                                                    (HDFS) [15, 50]. However, HDFS clusters are limited in size
   However, separating hot and warm blobs resulted in poor          because they use a single machine to store and serve metadata.
resource utilization, a problem exacerbated by hardware and            As a result, we needed tens of HDFS clusters per datacenter
blob storage usage trends. Haystack’s ideal effective replica-      to store analytics data. This was operationally inefficient; ev-
tion factor was 3.6⇥ (i.e., each logical byte is replicated 3⇥,     ery service had to be aware of data placement and movement
with an additional 1.2⇥ overhead for RAID-6 storage [19]).          among clusters. Single data warehouse datasets are often large
However, because IOPS per hard drive has remained steady            enough to exceed a single HDFS cluster’s capacity. This com-
as drive density has increased, IOPS per terabyte of storage        plicated compute engine logic, since related data was often
capacity has declined over time.                                    split among separate clusters.
   As a result, Haystack became IOPS-bound; extra hard                 Finally, distributing datasets among the HDFS clusters cre-
drives had to be provisioned to handle the high IOPS load           ated a two-dimensional bin-packing problem. The packing
of hot blobs. The spare disk capacity resulted in Haystack’s        of datasets into clusters had to respect each cluster’s capacity
effective replication factor increasing to 5.3⇥. In contrast, f4    constraints and available throughput. Tectonic’s exabyte scale
had an effective replication factor of 2.8⇥ (using RS(10,4)         eliminated the bin-packing and dataset-splitting problems.
encoding in two different datacenters). Furthermore, blob stor-
age usage shifted to more ephemeral media that was stored in        3     Architecture and Implementation
Haystack but deleted before moving to f4. As a result, an in-       This section describes the Tectonic architecture and imple-
creasing share of the total blob data was stored at Haystack’s      mentation, focusing on how Tectonic achieves exabyte-scale
high effective replication factor.                                  single clusters with its scalable chunk and metadata stores.
   Finally, since Haystack and f4 were separate systems, each
stranded resources that could not be shared with other sys-
                                                                    3.1        Tectonic: A Bird’s-Eye View
tems. Haystack overprovisioned storage to accommodate peak          A cluster is the top-level Tectonic deployment unit. Tectonic
IOPS, whereas f4 had an abundance of IOPS from storing a            clusters are datacenter-local, providing durable storage that is
large volume of less frequently-accessed data. Moving blob          resilient to host, rack, and power domain failures. Tenants can
storage to Tectonic harvested these stranded resources and          build geo-replication on top of Tectonic for protection against
resulted in an effective replication factor of ~2.8⇥.               datacenter failures (Figure 1).

218    19th USENIX Conference on File and Storage Technologies                                                                   USENIX Association
verts clients’ filesystem API calls into RPCs to Chunk and
                                                                         Metadata Store services.
                 Name layer                     Background
   Client
                  File layer
                                 Key-value        Services
                                                                           Finally, each cluster runs stateless background services to
  Library                          Store
                                                                         maintain cluster consistency and fault tolerance (§3.5).
                                                 (stateless)
                 Block layer
                                               Garbage collectors
                      Metadata Store
                                                  Rebalancer             3.2    Chunk Store: Exabyte-Scale Storage
                                                  Stat service
                                                 Disk inventory          The Chunk Store is a flat, distributed object store for chunks,
                                               Block repair/scan
                                              Storage node health        the unit of data storage in Tectonic. Chunks make up blocks,
                 Chunk Store
                                                    checker              which in turn make up Tectonic files.
                                                                            The Chunk Store has two features that contribute to Tec-
Figure 2: Tectonic architecture. Arrows indicate network                 tonic’s scalability and ability to support multiple tenants. First,
calls. Tectonic stores filesystem metadata in a key-value                the Chunk Store is flat; the number of chunks stored grows
store. Apart from the Chunk and Metadata Stores, all                     linearly with the number of storage nodes. As a result, the
components are stateless.                                                Chunk Store can scale to store exabytes of data. Second, it
                                                                         is oblivious to higher-level abstractions like blocks or files;
   A Tectonic cluster is made up of storage nodes, metadata              these abstractions are constructed by the Client Library using
nodes, and stateless nodes for background operations. The                the Metadata Store. Separating data storage from filesystem
Client Library orchestrates remote procedure calls to the meta-          abstractions simplifies the problem of supporting good per-
data and storage nodes. Tectonic clusters can be very large: a           formance for a diversity of tenants on one storage cluster
single cluster can serve the storage needs of all tenants in a           (§5). This separation means reading to and writing from stor-
single datacenter.                                                       age nodes can be specialized to tenants’ performance needs
   Tectonic clusters are multitenant, supporting around ten              without changing filesystem management.
tenants on the same storage fabric (§4). Tenants are distributed         Storing chunks efficiently. Individual chunks are stored
systems that will never share data with one another; tenants             as files on a cluster’s storage nodes, which each run a local
include blob storage and data warehouse. These tenants in turn           instance of XFS [26]. Storage nodes expose core IO APIs
serve hundreds of applications, including Newsfeed, Search,              to get, put, append to, and delete chunks, along with APIs
Ads, and internal services, each with varying traffic patterns           for listing chunks and scanning chunks. Storage nodes are
and performance requirements.                                            responsible for ensuring that their own local resources are
   Tectonic clusters support any number of arbitrarily-sized             shared fairly among Tectonic tenants (§4).
namespaces, or filesystem directory hierarchies, on the same                Each storage node has 36 hard drives for storing chunks [5].
storage and metadata components. Each tenant in a cluster                Each node also has a 1 TB SSD, used for storing XFS meta-
typically owns one namespace. Namespace sizes are limited                data and caching hot chunks. Storage nodes run a version
only by the size of the cluster.                                         of XFS that stores local XFS metadata on flash [47]. This
   Applications interact with Tectonic through a hierarchi-              is particularly helpful for blob storage, where new blobs are
cal filesystem API with append-only semantics, similar to                written as appends, updating the chunk size. The SSD hot
HDFS [15]. Unlike HDFS, Tectonic APIs are configurable at                chunk cache is managed by a cache library which is flash
runtime, rather than being pre-configured on a per-cluster or            endurance-aware [13].
per-tenant basis. Tectonic tenants leverage this flexibility to
match the performance of specialized storage systems (§4).               Blocks as the unit of durable storage. In Tectonic, blocks
                                                                         are a logical unit that hides the complexity of raw data storage
Tectonic components. Figure 2 shows the major compo-                     and durability from the upper layers of the filesystem. To the
nents of Tectonic. The foundation of a Tectonic cluster is the           upper layers, a block is an array of bytes. In reality, blocks are
Chunk Store (§3.2), a fleet of storage nodes which store and             composed of chunks which together provide block durability.
access data chunks on hard drives.                                          Tectonic provides per-block durability to allow tenants to
   On top of the Chunk Store is the Metadata Store (§3.3),               tune the tradeoff between storage capacity, fault tolerance, and
which consists of a scalable key-value store and stateless               performance. Blocks are either Reed-Solomon encoded [43]
metadata services that construct the filesystem logic over the           or replicated for durability. For RS(r, k) encoding, the block
key-value store. Their scalability enables Tectonic to store             data is split into r equal chunks (potentially by padding the
exabytes of data.                                                        data), and k parity chunks are generated from the data chunks.
   Tectonic is a client-driven microservices-based system, a             For replication, data chunks are the same size as the block
design that enables tenant-specific optimizations. The Chunk             and multiple copies are created. Chunks in a block are stored
and Metadata Stores each run independent services to handle              in different fault domains (e.g., different racks) for fault toler-
read and write requests for data and metadata. These services            ance. Background services repair damaged or lost chunks to
are orchestrated by the Client Library (§3.4); the library con-          maintain durability (§3.5).

USENIX Association                                                  19th USENIX Conference on File and Storage Technologies            219
Layer     Key                         Value                          Sharded by    Mapping
 Name      (dir_id, subdirname)        subdir_info, subdir_id         dir_id        dir ! list of subdirs (expanded)
           (dir_id, filename)          file_info, file_id             dir_id        dir ! list of files (expanded)
 File      (file_id, blk_id)           blk_info                       file_id       file ! list of blocks (expanded)
 Block     blk_id                      list                  blk_id        block ! list of disks (i.e., chunks)
           (disk_id, blk_id)           chunk_info                     blk_id        disk ! list of blocks (expanded)
Table 1: Tectonic’s layered metadata schema. dirname and filename are application-exposed strings. dir_id, file_id,
and block_id are internal object references. Most mappings are expanded for efficient updating.

3.3    Metadata Store: Naming Exabytes of Data                        modified without reading and then writing the entire list. In a
                                                                      filesystem where mappings can be very large, e.g., directories
Tectonic’s Metadata Store stores the filesystem hierarchy and
                                                                      may contain millions of files, expanding significantly reduces
the mapping of blocks to chunks. The Metadata Store uses
                                                                      the overhead of some metadata operations such as file creation
a fine-grained partitioning of filesystem metadata for opera-
                                                                      and deletion. The contents of a expanded key are listed by
tional simplicity and scalability. Filesystem metadata is first
                                                                      doing a prefix scan over keys.
disaggregated, meaning the naming, file, and block layers
are logically separated. Each layer is then hash partitioned
                                                                      Fine-grained metadata partitioning to avoid hotspots.
(Table 1). As we describe in this section, scalability and load
                                                                      In a filesystem, directory operations often cause hotspots
balancing come for free with this design. Careful handling of
                                                                      in metadata stores. This is particularly true for data ware-
metadata operations preserves filesystem consistency despite
                                                                      house workloads where related data is grouped into directo-
the fine-grained metadata partitioning.
                                                                      ries; many files from the same directory may be read in a
Storing metadata in a key-value store for scalability and             short time, resulting in repeated accesses to the directory.
operational simplicity. Tectonic delegates filesystem meta-              Tectonic’s layered metadata approach naturally avoids
data storage to ZippyDB [6], a linearizable, fault-tolerant,          hotspots in directories and other layers by separating search-
sharded key-value store. The key-value store manages data at          ing and listing directory contents (Name layer) from reading
the shard granularity: all operations are scoped to a shard, and      file data (File and Block layers). This is similar to ADLS’s
shards are the unit of replication. The key-value store nodes         separation of metadata layers [42]. However, ADLS range-
internally run RocksDB [23], a SSD-based single-node key-             partitions metadata layers whereas Tectonic hash-partitions
value store, to store shard replicas. Shards are replicated with      layers. Range partitioning tends to place related data on the
Paxos [30] for fault tolerance. Any replica can serve reads,          same shard, e.g., subtrees of the directory hierarchy, making
though reads that must be strongly consistent are served by           the metadata layer prone to hotspots if not carefully sharded.
the primary. The key-value store does not provide cross-shard
                                                                         We found that hash partitioning effectively load-balances
transactions, limiting certain filesystem metadata operations.
                                                                      metadata operations. For example, in the Name layer, the
   Shards are sized so that each metadata node can host several
                                                                      immediate directory listing of a single directory is always
shards. This allows shards to be redistributed in parallel to
                                                                      stored in a single shard. But listings of two subdirectories
new nodes in case a node fails, reducing recovery time. It
                                                                      of the same directory will likely be on separate shards. In
also allows granular load balancing; the key-value store will
                                                                      the Block layer, block locator information is hashed among
transparently move shards to control load on each node.
                                                                      shards, independent of the blocks’ directory or file. Around
Filesystem metadata layers. Table 1 shows the filesystem              two-thirds of metadata operations in Tectonic are served by
metadata layers, what they map, and how they are sharded.             the Block layer, but hash partitioning ensures this traffic is
The Name layer maps each directory to its sub-directories             evenly distributed among Block layer shards.
and/or files. The File layer maps file objects to a list of blocks.
The Block layer maps each block to a list of disk (i.e., chunk)       Caching sealed object metadata to reduce read load.
locations. The Block layer also contains the reverse index of         Metadata shards have limited available throughput, so to re-
disks to the blocks whose chunks are stored on that disk, used        duce read load, Tectonic allows blocks, files, and directories
for maintenance operations. Name, File, and Block layers are          to be sealed. Directory sealing does not apply recursively, it
hash-partitioned by directory, file, and block IDs, respectively.     only prevents adding objects in the immediate level of the
   As shown in Table 1, the Name and File layer and disk              directory. The contents of sealed filesystem objects cannot
to block list maps are expanded. A key mapped to a list is            change; their metadata can be cached at metadata nodes and
expanded by storing each item in the list as a key, prefixed          at clients without compromising consistency. The exception
by the true key. For example, if directory d1 contains files          is the block-to-chunk mapping; chunks can migrate among
foo and bar, we store two keys (d1, foo) and (d1, bar) in d1’s        disks, invalidating the Block layer cache. A stale Block layer
Name shard. Expanding allows the contents of a key to be              cache can be detected during reads, triggering a cache refresh.

220    19th USENIX Conference on File and Storage Technologies                                                 USENIX Association
Providing consistent metadata operations. Tectonic re-               performant way possible for applications, which might have
lies on the key-value store’s strongly-consistent opera-             different workloads or prefer different tradeoffs (§5).
tions and atomic read-modify-write in-shard transactions for            The Client Library replicates or RS-encodes data and writes
strongly-consistent same-directory operations. More specif-          chunks directly to the Chunk Store. It reads and reconstructs
ically, Tectonic guarantees read-after-write consistency for         chunks from the Chunk Store for the application. The Client
data operations (e.g., appends, reads), file and directory oper-     Library consults the Metadata Store to locate chunks, and
ations involving a single object (e.g., create, list), and move      updates the Metadata Store for filesystem operations.
operations where the source and destination are in the same
                                                                     Single-writer semantics for simple, optimizable writes.
parent directory. Files in a directory reside in the directory’s
                                                                     Tectonic simplifies the Client Library’s orchestration by allow-
shard (Table 1), so metadata operations like file create, delete,
                                                                     ing a single writer per file. Single-writer semantics avoids the
and moves within a parent directory are consistent.
                                                                     complexity of serializing writes to a file from multiple writers.
   The key-value store does not support consistent cross-shard
                                                                     The Client Library can instead write directly to storage nodes
transactions, so Tectonic provides non-atomic cross-directory
                                                                     in parallel, allowing it to replicate chunks in parallel and to
move operations. Moving a directory to another parent di-
                                                                     hedge writes (§5). Tenants needing multiple-writer semantics
rectory on a different shard is a two-phase process. First, we
                                                                     can build serialization semantics on top of Tectonic.
create a link from the new parent directory, and then delete
the link from the previous parent. The moved directory keeps            Tectonic enforces single-writer semantics with a write to-
a backpointer to its parent directory to detect pending moves.       ken for every file. Any time a writer wants to add a block to a
This ensures only one move operation is active for a direc-          file, it must include a matching token for the metadata write
tory at a time. Similarly, cross directory file moves involve        to succeed. A token is added in the file metadata when a pro-
copying the file and deleting it from the source directory. The      cess opens a file for appending, which subsequent writes must
copy step creates a new file object with the underlying blocks       include to update file metadata. If a second process attempts
of the source file, avoiding data movement.                          to open the file, it will generate a new token and overwrite the
                                                                     first process’s token, becoming the new, and only, writer for
   In the absence of cross-shard transactions, multi-shard
                                                                     the file. The new writer’s Client Library will seal any blocks
metadata operations on the same file must be carefully imple-
                                                                     opened by the previous writer in the open file call.
mented to avoid race conditions. An example of such a race
condition is when a file named f1 in directory d is renamed          3.5    Background Services
to f2. Concurrently, a new file with the same name is created,
where creates overwrite existing files with the same name.           Background services maintain consistency between metadata
The metadata layer and shard lookup key (shard(x)) are listed        layers, maintain durability by repairing lost data, rebalance
for each step in parentheses.                                        data across storage nodes, handle rack drains, and publish
   A file rename has the following steps:                            statistics about filesystem usage. Background services are
   R1: get file ID fid for f1 (Name, shard(d))                       layered similar to the Metadata Store, and they operate on one
   R2: add f2 as an owner of fid (File, shard(fid))                  shard at a time. Figure 2 lists important background services.
   R3: create the mapping f2 ! fid and delete f1 ! fid in               A garbage collector between each metadata layer cleans
an atomic transaction (Name, shard(d))                               up (acceptable) metadata inconsistencies. Metadata incon-
                                                                     sistencies can result from failed multi-step Client Library
   A file create with overwriting has the following steps:
                                                                     operations. Lazy object deletion, a real-time latency optimiza-
   C1: create new file ID fid_new (File, shard(fid_new))
                                                                     tion that marks deleted objects at delete time without actually
   C2: map f1 ! fid_new; delete f1 ! fid (Name, shard(d))
                                                                     removing them, also causes inconsistencies.
   Interleaving the steps in these transactions may leave the
                                                                        A rebalancer and a repair service work in tandem to relocate
filesystem in an inconsistent state. If steps C1 and C2 are
                                                                     or delete chunks. The rebalancer identifies chunks that need
executed after R1 but before R3, then R3 will erase the newly-
                                                                     to be moved in response to events like hardware failure, added
created mapping from the create operation. Rename step R3
                                                                     storage capacity, and rack drains. The repair service handles
uses a within-shard transaction to ensure that the file object
                                                                     the actual data movement by reconciling the chunk list to
pointed to by f1 has not been modified since R1.
                                                                     the disk-to-block map for every disk in the system. To scale
3.4    Client Library                                                horizontally, the repair service works on a per-Block layer
                                                                     shard, per-disk basis, enabled by the reverse index mapping
The Tectonic Client Library orchestrates the Chunk and Meta-
                                                                     disks to blocks (Table 1).
data Store services to expose a filesystem abstraction to appli-
cations, which gives applications per-operation control over         Copysets at scale. Copysets are combinations of disks that
how to configure reads and writes. Moreover, the Client Li-          provide redundancy for the same block (e.g., a copyset for an
brary executes reads and writes at the chunk granularity, the        RS(10,4)-encoded block consists of 14 disks) [20]. Having
finest granularity possible in Tectonic. This gives the Client       too many copysets risks data unavailability if there is an un-
Library nearly free reign to execute operations in the most          expected spike in disk failures. On the other hand, having too

USENIX Association                                              19th USENIX Conference on File and Storage Technologies          221
few copysets results in high reconstruction load to peer disks      need finer-grained real-time automated management to ensure
when one disk fails, since they share many chunks.                  they are shared fairly, tenants are isolated from one another,
   The Block Layer and the rebalancer service together at-          and resource utilization is high. For the rest of this section, we
tempt to maintain a fixed copyset count that balances unavail-      describe how Tectonic shares ephemeral resources effectively.
ability and reconstruction load. They each keep in memory           Distributing ephemeral resources among and within ten-
about one hundred consistent shuffles of all the disks in the       ants. Ephemeral resource sharing is challenging in Tectonic
cluster. The Block Layer forms copysets from contiguous             because not only are tenants diverse, but each tenant serves
disks in a shuffle. On a write, the Block Layer gives the Client    many applications with varied traffic patterns and perfor-
Library a copyset from the shuffle corresponding to that block      mance requirements. For example, blob storage includes pro-
ID. The rebalancer service tries to keep the block’s chunks         duction traffic from Facebook users and background garbage
in the copyset specified by that block’s shuffle. Copysets are      collection traffic. Managing ephemeral resources at the tenant
best-effort, since disk membership in the cluster changes con-      granularity would be too coarse to account for the varied work-
stantly.                                                            loads and performance requirements within a tenant. On the
4     Multitenancy                                                  other hand, because Tectonic serves hundreds of applications,
                                                                    managing resources at the application granularity would be
Providing comparable performance for tenants as they move           too complex and resource-intensive.
from individual, specialized storage systems to a consolidated         Ephemeral resources are therefore managed within each
filesystem presents two challenges. First, tenants must share       tenant at the granularity of groups of applications. These appli-
resources while giving each tenant its fair share, i.e., at least   cation groups, called TrafficGroups, reduce the cardinality of
the same resources it would have in a single-tenant system.         the resource sharing problem, reducing the overhead of man-
Second, tenants should be able to optimize performance as           aging multitenancy. Applications in the same TrafficGroup
in specialized systems. This section describes how Tectonic         have similar resource and latency requirements. For example,
supports resource sharing with a clean design that maintains        one TrafficGroup may be for applications generating back-
operational simplicity. Section 5 describes how Tectonic’s          ground traffic while another is for applications generating
tenant-specific optimizations allow tenants to get performance      production traffic. Tectonic supports around 50 TrafficGroups
comparable to specialized storage systems.                          per cluster. Each tenant may have a different number of Traffic-
4.1    Sharing Resources Effectively                                Groups. Tenants are responsible for choosing the appropriate
                                                                    TrafficGroup for each of their applications. Each TrafficGroup
As a shared filesystem for diverse tenants across Facebook,
                                                                    is in turn assigned a TrafficClass. A TrafficGroup’s Traffic-
Tectonic needs to manage resources effectively. In particular,
                                                                    Class indicates its latency requirements and decides which
Tectonic needs to provide approximate (weighted) fair shar-
                                                                    requests should get spare resources. The TrafficClasses are
ing of resources among tenants and performance isolation
                                                                    Gold, Silver, and Bronze, corresponding to latency-sensitive,
between tenants, while elastically shifting resources among
                                                                    normal, and background applications. Spare resources are
applications to maintain high resource utilization. Tectonic
                                                                    distributed according to TrafficClass priority within a tenant.
also needs to distinguish latency-sensitive requests to avoid
                                                                       Tectonic uses tenants and TrafficGroups along with the
blocking them behind large requests.
                                                                    notion of TrafficClass to ensure isolation and high resource
Types of resources. Tectonic distinguishes two types of             utilization. That is, tenants are allocated their fair share of
resources: non-ephemeral and ephemeral. Storage capacity            resources; within each tenant, resources are distributed by
is the non-ephemeral resource. It changes slowly and pre-           TrafficGroup and TrafficClass. Each tenant gets a guaranteed
dictably. Most importantly, once allocated to a tenant, it can-     quota of the cluster’s ephemeral resources, which is subdi-
not be given to another tenant. Storage capacity is managed         vided between a tenant’s TrafficGroups. Each TrafficGroup
at the tenant granularity. Each tenant gets a predefined ca-        gets its guaranteed resource quota, which provides isolation
pacity quota with strict isolation, i.e., there is no automatic     between tenants as well as isolation between TrafficGroups.
elasticity in the space allocated to different tenants. Recon-         Any ephemeral resource surplus within a tenant is shared
figuring storage capacity between tenants is done manually.         with its own TrafficGroups by descending TrafficClass. Any
Reconfiguration does not cause downtime, so in case of an           remaining surplus is given to TrafficGroups in other tenants
urgent capacity crunch, it can be done immediately. Tenants         by descending TrafficClass. This ensures spare resources are
are responsible for distributing and tracking storage capacity      used by TrafficGroups of the same tenant first before being
among their applications.                                           distributed to other tenants. When one TrafficGroup uses
   Ephemeral resources are those where demand changes               resources from another TrafficGroup, the resulting traffic gets
from moment to moment, and allocation of these resources            the minimum TrafficClass of the two TrafficGroups. This
can change in real time. Storage IOPS capacity and meta-            ensures the overall ratio of traffic of different classes does not
data query capacity are two ephemeral resources. Because            change based on resource allocation, which ensures the node
ephemeral resource demand changes quickly, these resources          can meet the latency profile of the TrafficClass.

222    19th USENIX Conference on File and Storage Technologies                                                 USENIX Association
Enforcing global resource sharing. The Client Library                    talks to each layer directly. Since access control is on path for
uses a rate limiter to achieve the aforementioned elastic-               every read and write, it must also be lightweight.
ity. The rate limiter uses high-performance, near-realtime                  Tectonic uses a token-based authorization mechanism that
distributed counters to track the demand for each tracked                includes which resources can be accessed with the token [31].
resource in each tenant and TrafficGroup in the last small               An authorization service authorizes top-level client requests
time window. The rate limiter implements a modified leaky                (e.g., opening a file), generating an authorization token for the
bucket algorithm. An incoming request increments the de-                 next layer in the filesystem; each subsequent layer likewise
mand counter for the bucket. The Client Library then checks              authorizes the next layer. The token’s payload describes the re-
for spare capacity in its own TrafficGroup, then other Traffic-          source to which access is given, enabling granular access con-
Groups in the same tenant, and finally other tenants, adhering           trol. Each layer verifies the token and the resource indicated
to TrafficClass priority. If the client finds spare capacity, the re-    in the payload entirely in memory; verification can be per-
quest is sent to the backend. Otherwise, the request is delayed          formed in tens of microseconds. Piggybacking token-passing
or rejected depending on the request’s timeout. Throttling               on existing protocols reduces the access control overhead.
requests at clients puts backpressure on clients before they
make a potentially wasted request.
                                                                         5     Tenant-Specific Optimizations
                                                                         Tectonic supports around ten tenants in the same shared
Enforcing local resource sharing. The client rate limiter
                                                                         filesystem, each with specific performance needs and work-
ensures approximate global fair sharing and isolation. Meta-
                                                                         load characteristics. Two mechanisms permit tenant-specific
data and storage nodes also need to manage resources to
                                                                         optimizations. First, clients have nearly full control over how
avoid local hotspots. Nodes provide fair sharing and isolation
                                                                         to configure an application’s interactions with Tectonic; the
with a weighted round-robin (WRR) scheduler that provision-
                                                                         Client Library manipulates data at the chunk level, the finest
ally skips a TrafficGroup’s turn if it will exceed its resource
                                                                         possible granularity (§3.4). This Client Library-driven de-
quota. In addition, storage nodes need to ensure that small IO
                                                                         sign enables Tectonic to execute operations according to the
requests (e.g., blob storage operations) do not see higher la-
                                                                         application’s performance needs.
tency from colocation with large, spiky IO requests (e.g., data
warehouse operations). Gold TrafficClass requests can miss                  Second, clients enforce configurations on a per-call basis.
their latency targets if they are blocked behind lower-priority          Many other filesystems bake configurations into the system or
requests on storage nodes.                                               apply them to entire files or namespaces. For example, HDFS
                                                                         configures durability per directory [7], whereas Tectonic con-
   Storage nodes use three optimizations to ensure low latency
                                                                         figures durability per block write. Per-call configuration is
for Gold TrafficClass requests. First, the WRR scheduler pro-
                                                                         enabled by the scalability of the Metadata Store: the Meta-
vides a greedy optimization where a request from a lower
                                                                         data Store can easily handle the increased metadata for this
TrafficClass may cede its turn to a higher TrafficClass if the
                                                                         approach. We next describe how data warehouse and blob
request will have enough time to complete after the higher-
                                                                         storage leverage per-call configurations for efficient writes.
TrafficClass request. This helps prevent higher-TrafficClass
requests from getting stuck behind a lower-priority request.             5.1    Data Warehouse Write Optimizations
Second, we limit how many non-Gold IOs may be in flight
                                                                         A common pattern in data warehouse workloads is to write
for every disk. Incoming non-Gold traffic is blocked from
                                                                         data once that will be read many times later. For these work-
scheduling if there are any pending Gold requests and the non-
                                                                         loads, the file is visible to readers only once the creator closes
Gold in-flight limit has been reached. This ensures the disk is
                                                                         the file. The file is then immutable for its lifetime. Because
not busy serving large data warehouse IOs while blob storage
                                                                         the file is only read after it is written completely, applications
requests are waiting. Third, the disk itself may re-arrange the
                                                                         prioritize lower file write time over lower append latency.
IO requests, i.e., serve a non-Gold request before an earlier
Gold request. To manage this, Tectonic stops scheduling non-             Full-block, RS-encoded asynchronous writes for space,
Gold requests to a disk if a Gold request has been pending               IO, and network efficiency. Tectonic uses the write-once-
on that disk for a threshold amount of time. These three tech-           read-many pattern to improve IO and network efficiency,
niques combined effectively maintain the latency profile of              while minimizing total file write time. The absence of partial
smaller IOs, even when outnumbered by larger IOs.                        file reads in this pattern allows applications to buffer writes
                                                                         up to the block size. Applications then RS-encode blocks in
4.2    Multitenant Access Control                                        memory and write the data chunks to storage nodes. Long-
Tectonic follows common security principles to ensure that               lived data is typically RS(9,6) encoded; short-lived data, e.g.,
all communications and dependencies are secure. Tectonic                 map-reduce shuffles, is typically RS(3,3)-encoded.
additionally provides coarse access control between tenants                 Writing RS-encoded full blocks saves storage space, net-
(to prevent one tenant from accessing another’s data) and fine-          work bandwidth, and disk IO over replication. Storage and
grained access control within a tenant. Access control must              bandwidth are lower because less total data is written. Disk
be enforced at each layer of Tectonic, since the Client Library          IO is lower because disks are more efficiently used. More

USENIX Association                                                  19th USENIX Conference on File and Storage Technologies           223
100                                                   100

                   95
                                                                        75                                                    75
      Percentile

                                                           Percentile

                                                                                                                 Percentile
                   90
                                                                        50                                                    50
                   85
                                                                        25                    Haystack                        25
                   80                    Hedging                                       Quorum append                                                   Tectonic
                                      No Hedging                                       Standard append                                                 Haystack
                   75                                                     0                                                    0
                     800   1000 1200 1400 1600 1800 2000                      0   50       100       150   200                      0   20       40      60       80   100
                           72MB block write latency (ms)                           Write Latency (ms)                                        Read Latency (ms)
      (a) Data warehouse full block writes                               (b) Blob storage write latency                        (c) Blob storage read latency
Figure 3: Tail latency optimizations in Tectonic. (a) shows the improvement in data warehouse tail latency from hedged
quorum writes (72MB blocks) in a test cluster with ~80% load. (b) and (c) show Tectonic blob storage write latency
(with and without quorum appends) and read latency compared to Haystack.

IOPS are needed to write chunks to 15 disks in RS(9,6), but                                 storage metadata by storing many blobs together into log-
each write is small and the total amount of data written is                                 structured files, where new blobs are appended at the end of a
much smaller than with replication. This results in more effi-                              file. Blobs are located with a map from blob ID to the location
cient disk IO because block sizes are large enough that disk                                of the blob in the file.
bandwidth, not IOPS, is the bottleneck for full-block writes.                                  Blob storage is also on path for many user requests, so low
   The write-once-read-many pattern also allows applications                                latency is desirable. Blobs are usually much smaller than Tec-
to write the blocks of a file asynchronously in parallel, which                             tonic blocks (§2.1). Blob storage therefore writes new blobs
decreases the file write latency significantly. Once the blocks                             as small, replicated partial block appends for low latency. The
of the file are written, the file metadata is updated all together.                         partial block appends need to be read-after-write consistent
There is no risk of inconsistency with this strategy because a                              so blobs can be read immediately after successful upload.
file is only visible once it is completely written.                                         However, replicated data uses more disk space than full-block
                                                                                            RS-encoded data.
Hedged quorum writes to improve tail latency. For full-
block writes, Tectonic uses a variant of quorum writing which                               Consistent partial block appends for low latency. Tec-
reduces tail latency without any additional IO. Instead of                                  tonic uses partial block quorum appends to enable durable,
sending the chunk write payload to extra nodes, Tectonic first                              low-latency, consistent blob writes. In a quorum append, the
sends reservation requests ahead of the data and then writes                                Client Library acknowledges a write after a subset of storage
the chunks to the first nodes to accept the reservation. The                                nodes has successfully written the data to disk, e.g., two nodes
reservation step is similar to hedging [22], but it avoids data                             for three-way replication. The temporary decrease of durabil-
transfers to nodes that would reject the request because of                                 ity from a quorum write is acceptable because the block will
lack of resources or because the requester has exceeded its                                 soon be reencoded and because blob storage writes a second
resource share on that node (§4).                                                           copy to another datacenter.
   As an example, to write a RS(9,6)-encoded block, the Client                                 The challenge with partial block quorum appends is that
Library sends a reservation request to 19 storage nodes in                                  straggler appends could leave replica chunks at different sizes.
different failure domains, four more than required for the                                  Tectonic maintains consistency by carefully controlling who
write. The Client Library writes the data and parity chunks                                 can append to a block and when appends are made visible.
to the first 15 storage nodes that respond to the reservation                               Blocks can only be appended to by the writer that created
request. It acknowledges the write to the client as soon as a                               the block. Once an append completes, Tectonic commits the
quorum of 14 out of 15 nodes return success. If the 15th write                              post-append block size and checksum to the block metadata
fails, the corresponding chunk is repaired offline.                                         before acknowledging the partial block quorum append.
   The hedging step is more effective when the cluster is                                      This ordering of operations with a single appender provides
highly loaded. Figure 3a shows ~20% improvement in 99th                                     consistency. If block metadata reports a block size of S, then
percentile latency for RS(9,6) encoded, 72 MB full-block                                    all preceeding bytes in the block were written to at least two
writes, in a test cluster with 80% throughput utilization.                                  storage nodes. Readers will be able to access data in the block
                                                                                            up to offset S. Similarly, any writes acknowledged to the ap-
5.2          Blob Storage Optimizations                                                     plication will have been updated in the block metadata and so
Blob storage is challenging for filesystems because of the                                  will be visible to future reads. Figures 3b and 3c demonstrate
quantity of objects that need to be indexed. Facebook stores                                that Tectonic’s blob storage read and write latency is compa-
tens of trillions of blobs. Tectonic manages the size of blob                               rable to Haystack, validating that Tectonic’s generality does

224       19th USENIX Conference on File and Storage Technologies                                                                                    USENIX Association
Capacity   Used bytes     Files     Blocks    Storage Nodes                   Warehouse      Blob storage    Combined
    1590 PB     1250 PB      10.7 B      15 B          4208             Supply      0.51            0.49           1.00
Table 2: Statistics from a multitenant Tectonic produc-                 Peak 1      0.60            0.12           0.72
tion cluster. File and block counts are in billions.                    Peak 2      0.54            0.14           0.68
                                                                        Peak 3      0.57            0.11           0.68
not have a significant performance cost.                              Table 3: Consolidating data warehouse and blob storage
                                                                      in Tectonic allows data warehouse to use what would oth-
Reencoding blocks for storage efficiency. Directly RS-
                                                                      erwise be stranded surplus disk time for blob storage to
encoding small partial-block appends would be IO-inefficient.
                                                                      handle large load spikes. This figure shows the normal-
Small disk writes are IOPS-bound and RS-encoding results
                                                                      ized disk time demand vs. supply in three daily peaks in
in many more IOs (e.g, 14 IOs with RS(10, 4) instead of 3).
                                                                      the representative cluster.
Instead of RS-encoding after each append, the Client Library
reencodes the block from replicated form to RS(10,4) en-
                                                                      spike. For example, if a disk does 10 IOs in one second with
coding once the block is sealed. Reencoding is IO-efficient
                                                                      each taking 50 ms (seek and fetch), then the disk was busy for
compared to RS-encoding at append time, requiring only a
                                                                      500 out of 1000 ms. We use disk time to fairly account for
single large IO on each of the 14 target storage nodes. This
                                                                      usage by different types of requests.
optimization, enabled by Tectonic’s Client Library-driven de-
                                                                         For the representative production cluster, Table 3 shows
sign, provides nearly the best of both worlds with fast and
                                                                      normalized disk time demand for data warehouse and blob
IO-efficient replication for small appends that are quickly
                                                                      storage for three daily peaks and the supply of disk time each
transitioned to the more space-efficient RS-encoding.
                                                                      would have if running on its own cluster. We normalize by
6     Tectonic in Production                                          total disktime corresponding to used space in the cluster. The
                                                                      daily peaks correspond to the same three days of traffic as
This section shows Tectonic operating at exabyte scale,
                                                                      in Figures 4a and 4b. Data warehouse’s demand exceeds its
demonstrates benefits of storage consolidation, and discusses
                                                                      supply in all three peaks and handling it on its own would
how Tectonic handles metadata hotspots. It also discusses
                                                                      require disk overprovisioning. To handle peak data warehouse
tradeoffs and lessons from designing Tectonic.
                                                                      demand over the three day period, the cluster would have
6.1     Exabyte-Scale Multitenant Clusters                            needed ~17% overprovisioning. Blob storage, on the other
                                                                      hand, has surplus disk time that would be stranded if it ran
Production Tectonic clusters run at exabyte scale. Table 2            in its own cluster. Consolidating these tenants into a single
gives statistics on a representative multitenant cluster. All         Tectonic cluster allows the blob storage’s surplus disk time to
results in this section are for this cluster. The 1250 PB of stor-    be used for data warehouse’s storage load spikes.
age, ~70% of the cluster capacity at the time of the snapshot,
consists of 10.7 billion files and 15 billion blocks.                 6.3    Metadata Hotspots
6.2     Efficiency from Storage Consolidation                         Load spikes to the Metadata Store may result in hotspots in
                                                                      metadata shards. The bottleneck resource in serving meta-
The cluster in Table 2 hosts two tenants, blob storage and            data operations is queries per second (QPS). Handling load
data warehouse. Blob storage uses ~49% of the used space in           spikes requires the Metadata Store to keep up with the QPS
this cluster and data warehouse uses ~51%. Figures 4a and 4b          demand on every shard. In production, each shard can serve a
show the cluster handling storage load over a three-day period.       maximum of 10 KQPS. This limit is imposed by the current
Figure 4a shows the cluster’s aggregate IOPS during that time,        isolation mechanism on the resources of the metadata nodes.
and Figure 4b shows its aggregate disk bandwidth. The data            Figure 4c shows the QPS across metadata shards in the cluster
warehouse workload has large, regular load spikes triggered           for the Name, File, and Block layers. All shards in the File
by very large jobs. Compared to the spiky data warehouse              and Block layers are below this limit.
workload, blob storage traffic is smooth and predictable.
                                                                         Over this three-day period, around 1% of Name layer shards
Sharing surplus IOPS capacity. The cluster handles                    hit the QPS limit because they hold very hot directories. The
spikes in storage load from data warehouse using the surplus          small unhandled fraction of metadata requests are retried
IOPS capacity unlocked by consolidation with blob storage.            after a backoff. The backoff allows the metadata nodes to
Blob storage requests are typically small and bound by IOPS           clear most of the initial spike and successfully serve retried
while data warehouse requests are typically large and bound           requests. This mechanism, combined with all other shards
by bandwidth. As a result, neither IOPS nor bandwidth can             being below their maximum, enables Tectonic to successfully
fairly account for disk IO usage. The bottleneck resource in          handle the large spikes in metadata load from data warehouse.
serving storage operations is disk time, which measures how              The distribution of load across shards varies between the
often a given disk is busy. Handling a storage load spike re-         Name, File, and Block layers. Each higher layer has a larger
quires Tectonic to have enough free disk time to serve the            distribution of QPS per shard because it colocates more of a

USENIX Association                                               19th USENIX Conference on File and Storage Technologies        225
2000                      Warehouse                                  4.5
                                                                                                                Warehouse
                  1800                     Blob storage                                  4                     Blob storage                                    100
                  1600                                                                 3.5

                                                                    Bandwidth (TB/s)

                                                                                                                                        Percentile of shards
                  1400                                                                   3                                                                     75
                  1200
       IOPS (K)

                                                                                       2.5                                                                                                    max load >
                  1000
                                                                                         2                                                                     50
                   800
                   600                                                                 1.5
                                                                                         1                                                                     25                           Block
                   400                                                                                                                                                                      Name
                   200                                                                 0.5                                                                                                    File
                     0                                                                   0                                                                       0
                         0   10   20     30 40 50         60   70                            0   10   20     30 40 50         60   70                                0   2   4          6        8     10
                                       Time (hours)                                                        Time (hours)                                                          KQPS

                    (a) Aggregate cluster IOPS                                 (b) Aggregate cluster bandwidth                                                 (c) Peak metadata load (CDF)
Figure 4: IO and metadata load on the representative production cluster over three days. (a) and (b) show the difference
in blob storage and data warehouse traffic patterns and show Tectonic successfully handling spikes in storage IOPS and
bandwidth over 3 days. Both tenants occupy nearly the same space in this cluster. (c) is a CDF of peak metadata load
over three days on this cluster’s metadata shards. The maximum load each shard can handle is 10 KQPS (grey line).
Tectonic can handle all metadata operations at the File and Block layers. It can immediately handle almost all Name
layer operations; the remaining operations are handled on retry.

tenant’s operations. For instance, all directory-to-file lookups                                                are distributed round-robin across storage nodes [51]. Be-
for a given directory are handled by one shard. An alternative                                                  cause Tectonic uses contiguous RS encoding and the majority
design that used range partitioning like ADLS [42] would                                                        of reads are smaller than a chunk size, reads are usually di-
colocate many more of a tenant’s operations together and                                                        rect: they do not require RS reconstruction and so consist
result in much larger load spikes. Data warehouse jobs often                                                    of a single disk IO. Reconstruction reads require 10⇥ more
read many similarly-named directories, which would lead to                                                      IOs than direct reads (for RS(10,4) encoding). Though com-
extreme load spikes if the directories were range-partitioned.                                                  mon, it is difficult to predict the fraction of reads that will
Data warehouse jobs also read many files in a directory, which                                                  be reconstructed, since reconstruction is triggered by hard-
causes load spikes in the Name layer. Range-partitioning the                                                    ware failures as well as node overload. We learned that such
File layer would colocate files in a directory on the same shard,                                               a wide variability in resource requirements, if not controlled,
resulting in a much larger load spike because each job does                                                     can cause cascading failures that affect system availability
many more File layer operations than Name layer operations.                                                     and performance.
Tectonic’s hash partitioning reduces this colocation, allowing                                                     If some storage nodes are overloaded, direct reads fail and
Tectonic to handle metadata load spikes using fewer nodes                                                       trigger reconstructed reads. This increases load to the rest
than would be necessary with range partitioning.                                                                of the system and triggers yet more reconstructed reads, and
   Tectonic also codesigns with data warehouse to reduce                                                        so forth. The cascade of reconstructions is called a recon-
metadata hotspots. For example, compute engines commonly                                                        struction storm. A simple solution would be to use striped
use an orchestrator to list files in a directory and distribute                                                 RS encoding where all reads are reconstructed. This avoids
the files to workers. The workers open and process the files                                                    reconstruction storms because the number of IOs for reads
in parallel. In Tectonic, this pattern sends a large number of                                                  does not change when there are failures. However, it makes
nearly-simultaneous file open requests to a single directory                                                    normal-case reads much more expensive. We instead prevent
shard (§3.3), causing a hotspot. To avoid this anti-pattern,                                                    reconstruction storms by restricting reconstructed reads to
Tectonic’s list-files API returns the file IDs along with the file                                              10% of all reads. This fraction of reconstructed reads is typ-
names in a directory. The compute engine orchestrator sends                                                     ically enough to handle disk, host, and rack failures in our
the file IDs and names to its workers, which can open the files                                                 production clusters. In exchange for some tuning complexity,
directly by file ID without querying the directory shard again.                                                 we avoid over-provisioning disk resources.
6.4    The simplicity-performance tradeoffs                                                                     Efficiently accessing data within and across datacenters.
                                                                                                                Tectonic allows clients to directly access storage nodes; an
Tectonic’s design generally prioritizes simplicity over effi-
                                                                                                                alternative design might use front-end proxies to mediate all
ciency. We discuss two instances where we opted for addi-
                                                                                                                client access to storage. Making the Client Library accessible
tional complexity in exchange for performance gains.
                                                                                                                to clients introduces complexity because bugs in the library
Managing reconstruction load. RS-encoded data may be                                                            become bugs in the application binary. However, direct client
stored contiguously, where a data block is divided into chunks                                                  access to storage nodes is vastly more network- and hard-
that are each written contiguously to storage nodes, or striped,                                                ware resource efficient than a proxy design, avoiding an extra
where a data block is divided into much smaller chunks that                                                     network hop for terabytes of data per second.

226   19th USENIX Conference on File and Storage Technologies                                                                                                                       USENIX Association
You can also read