HSS: A simple file storage system for web applications

Abstract

AOL Technologies has created a scalable object store for web applications. The goal of the object store was to eliminate the creation of a separate storage system for every application we produce while avoiding sending data to external storage services. AOL developers had been devoting a significant amount of time to creating backend storage systems to enable functionality in their applications. These storage systems were all very similar and many relied on difficult-to-scale technologies like network attached file systems. This paper describes our implementation of and operating experience with the storage system. The paper also presents a feature roadmap and our release of an open source version.

1. Introduction

All web applications require some type of durable storage system to hold the content assets used to present the look and feel of the application and, in some cases, the code that runs the application as well. Web scale applications require that the content be published from numerous sources and be delivered from numerous sources as well. There have been quite a few solutions to this problem over time, but many of these past solutions have presented scalability problems, availability challenges, or both. We have implemented storage systems based on traditional network file systems like NFS[1] and we have also implemented storage systems based on more scalable cluster file systems like IBRIX Fusion[2], Lustre[3], and OrangeFS/PVFS[4,5]. These solutions have drawbacks related to the tight coupling of the server and client, the restricted ability to add and remove capacity from the system, the recovery behaviors of the system, the ability of the system to allow non-disruptive upgrades of software, single points of failure built into critical parts of the system, and limited if any multi-tenant capabilities.

The server and client dependencies in clustered file systems create a difficult-to-manage issue around updates of the software. In general, mismatched client and server versions do not work together, so the entire system must be shut down to perform upgrades. The multi-tenant requirement to use the system as a service makes isolation within the clustered file systems difficult other than through traditional file system layout structures, which have limited flexibility. The interfaces that clustered file systems provide tend to look like traditional POSIX interfaces, but almost all of them have unusual behaviors that many applications cannot understand or work around without changes. The requirement to make application changes to accommodate the file system specifics makes clustered file systems unattractive to legacy applications.

The storage system we have created seeks to improve on the availability and operational characteristics of the storage systems we have used in the past while providing easily consumable front end semantics for the storage and retrieval of files by applications. We provide a minimal set of operations and rely on external components for any additional functionality that an application may need. We have built several mechanisms into the system that provide data durability and recovery. We have also added a number of load management features, through internal and external mechanisms, to reduce hotspots and allow for data migration within the system. The system is aware of its physical makeup to provide resilience to local failures. There are also autonomic behaviors in the system related to failure recovery.
All of these features are built and implemented using well-understood open source software and industry standard servers.

We describe here the design criteria, explain why the criteria were chosen, and present the current implementation of the system. We follow up with a discussion of the observed behaviors of the system under production load for the past 24 months. We then present the feature roadmap and our invitation to add on to the system by open sourcing the project as it is today.

2. Design

These are the design criteria that we have selected to implement in the system as required features. They are in our opinion of general use to all web applications. They are all fully implemented in the system today.

2.1 Awareness of physical system configuration

In order to be resilient to failures of physical components in the system as well as failures of the hosting site infrastructure, the system has awareness of its physical makeup. This awareness is fairly granular, with known components consisting of storage devices, servers which are containers of storage devices, and groups of servers referred to as a site.

Storage device awareness prevents storage of a file and its replicas on the same storage device, which would be unsafe in the event of a failure of that device. Servers and the storage devices they contain have the same merit in the system for data protection: none of the copies of a file should reside on the same disk or server.

The grouping of servers into a site allows for the creation of a disaster recovery option where an application that chooses to store files in multiple sites can achieve higher availability in the event of a failure of an entire site. There is no requirement that the sites be physically separate. The preferred configuration does include multiple sites that do not share physical infrastructure.

2.2 Data protection through file replication

Data protection can be accomplished through hardware, software, or a combination of both. The greatest flexibility comes from software data protection, which improves the availability of the system and its resilience to site failure. File replication is the chosen method in the system. While space-inefficient compared to traditional RAID algorithms, the availability considerations outweigh the cost of additional storage.

Software control of the replication allows tuning based on a particular application's availability requirements, independent of the hardware used and of other applications. In addition to tuning considerations, the software allows us the flexibility to specify how file writes are acknowledged. We can create as many copies as necessary before the write is acknowledged to the application. In general, the number of copies created synchronously is two, and then any additional copies, typically one more, are created by the system asynchronously. These additional asynchronous copies include any copies made to another site in the system.

2.3 Background data checking and recovery

Data corruption[6] is of concern for a system designed to store large numbers of objects for an indefinite amount of time. With the corruption problem in mind we have implemented several processes within the system that check and repair data both proactively and at access time.

The system runs a background process that constantly compares file locations and checksums to the metadata.
The background scrubber continuously reads files from the system and confirms that the correct number of copies of a file are available and that the files are not corrupt, by using a checksum. If the number of copies is incorrect, the number of file copies is adjusted by creating any missing copies or removing any additional copies. If a checksum fails on any file replica then the corrupt copy is deleted and the correct number of replicas is recreated by scheduling a copy process from one of the known good copies of the file.

In addition to the background process that constantly checks data, file problems can be repaired upon an access attempt of a missing or corrupt file. If the retrieval of a file fails due to a missing file or checksum error, the request is served by one of the other file replicas that is available and checksums properly. The correct number of replicas is then created by scheduling a copy process from one of the known good copies of the file to a new location. Metadata reliability is handled by the metadata system.

2.4 Internal file system layout

Each HSS server comprises multiple file systems, one for each physical drive; file systems do not span multiple drives. A single top level directory is the location of all the subdirectories that contain files. The directory structure of the file system is identical on all hosts, but file replica locations vary by host. This is necessitated because each host may have different capacities, but the desire is to balance utilization across all hosts and spindles, precluding placement in the same directory on all systems.

Each node will have some number of physical disks. Each disk is mounted on the top level directory as a subdirectory. Below the mount point of the directory are 4096 directories, named 000 to fff in hexadecimal notation. This structure is used to prevent potential problems with having a very large number of files in a directory. In the pathological case that all files on a given disk are 2KB with 1TB drives, we will not go over 125,000 files in any directory with even file distribution before filling any individual drive. In fact, the majority of files stored have sizes between 10KB and 100KB, which reduces the possible number of files in any one directory before a disk fills to somewhere between 2,400 and 24,000 files. This should be manageable for the system even as drives get larger. In the section on future work there is a discussion of how large numbers of files can be managed.

2.5 High performance database for system metadata

Metadata scalability and availability have direct effects on the system's throughput and its ability to serve data. A number of options exist for systems that we believe are reliable enough to serve as the metadata engine in the system. Depending on the performance requirements, a well-scaled RDBMS has been found to meet the needs of the system for a large variety of workloads. An additional benefit of using an RDBMS is that they are widely available and well understood.

We have implemented the system with MySQL as the primary metadata engine. The data structure within the system is rather simple, with almost no relation between the tables. Even though there are few relational aspects to the data, the use of a well understood RDBMS simplifies the metadata management aspects of the system and the development effort required to create the system. Most of the data is rarely changed and most of the tables are used for configuration of the system. This simplicity in the database can help maintain high throughput. There are only six tables in the database, as shown below in Figure 1.

Figure 1
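
The table definitions themselves are not reproduced here, but section 3.1 lists the per-file information a storage node records: the server, disk, and object ID of each replica plus data such as mime type and creation time. The sketch below pictures one such file-location record; the field names are illustrative assumptions and the actual six-table schema is the one shown in Figure 1.

    # Hypothetical sketch of a single file-location record (field names are
    # illustrative; the real schema is the six tables of Figure 1).
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class FileReplicaRecord:
        object_id: str       # system-generated or application-supplied ID
        application: str     # owning application, for multi-tenant accounting
        server: str          # storage node holding this replica
        disk: str            # drive (file system) on that node
        mime_type: str       # stored with the file record at write time
        created_at: datetime
        checksum: str        # compared by the background scrubber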

MySQL can be run in a partitioned or sharded fashion as the database grows, to address potential performance concerns due to database growth. Replicas of the database are also maintained to create resilience to failures in the metadata system. A number of additional technologies exist from various sources to improve MySQL performance and availability while maintaining compatibility with the MySQL system. We have implemented several of those to improve the overall system.

We have also developed a version of the metadata system based on the VoltDB[7] database. VoltDB provides clustering for availability, an in-memory database engine for performance, and snapshots to durable storage for recovery and durability. The schema is unchanged when using VoltDB, and the metadata engine is a configuration option for the storage nodes within the system.

2.6 Off the shelf open source components

The system is built from readily available open source software systems. The system is implemented on Linux servers running Apache Tomcat for the storage nodes and MySQL or VoltDB for the metadata system. These components were selected for their widespread adoption, functionality, transparency, and the deep knowledge about these systems prevalent in the technology community.

2.7 System scalability

The system has been designed to allow the addition or removal of capacity to increase both the storage available and the performance of the system. All of the changes to the system can be accomplished while the system remains online and completely available.

When additional storage nodes are added to the system, the new storage nodes are immediately available to take load from requesting applications. Additional write throughput from the new nodes is available immediately, and read throughput of the new nodes comes online as files are stored on the new nodes through new writes or redistribution of data.
Redistribution of stored data, in the entire system or a subset of the system, is accomplished in the background to ensure that capacity utilization is roughly equal on all storage nodes in the system. It is anticipated that many files in the system are not commonly accessed and therefore those files can be moved without impact to the applications that own them.

Nodes can be removed from the system to scale down or migrate data. The system can be configured to disable writes to nodes that are being removed, and then files are copied in an even distribution to all other remaining storage nodes. Once additional copies of the files have been made on the rest of the system, the files on the node being removed are deleted. This process continues until the node being removed has no files stored.

Metadata scalability tends to be more problematic but can still be accomplished within the constraints of the RDBMS being used for the metadata. In general, it is difficult to scale down the metadata engine. It is the expectation that the system will only increase metadata capacity; if the system is scaled down then the overhead in the metadata system will have to be tolerated.

2.8 Simple interface

The system is accessed through a very small number of web calls that, while somewhat restrictive, cover a large number of the use cases for web application storage systems. The system can write a file, read a file, delete a file, and replace a file. File object IDs are provided by the system in most cases, but there is also a capability to allow applications to define their own object IDs. The system does not allow the updating of files. Examples of the system operations are discussed in section 3. The external operations for the system are listed in Table 1.

Operation    Example
Read         GET /hss/storage/appname/fileID
Write        POST /hss/storage/appname
Delete       DELETE /hss/storage/appname/fileID

Table 1

All of the external operations are atomic and return only success or failure. The client application is not expected to have to perform any operation other than perhaps a retry if appropriate. While the capabilities of the system may appear restrictive, we have found that adoption has been quite good. There are a number of internal operations that are used to manage the system behaviors. The internal operations listed in Table 2 are only available to the storage nodes themselves. Administrative operations are an additional class of internal operations that are used for system configuration and batched file operations. The administrative operations are listed in Table 3.

Operation    Example                              Description
Copy         POST /hss/storage/internal_copy      Store a replica on another storage node
Move         GET /hss/storage/internal_move       Move a file from one storage node to another
Migrate      GET /hss/storage/internal_migrate    Migrate a file from one server to another during the healing process; the file is not removed from the old storage node

Table 2
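
As a usage illustration of the external operations in Table 1, the sketch below shows how an application might store, read, and delete a file over HTTP. The hostname, the application name, and the assumption that the write response body carries the assigned object ID are placeholders for illustration, and the Python requests library stands in for whatever HTTP client an application actually uses.

    # Illustrative client calls for the external operations in Table 1.
    # "hss.example.com" and "myapp" are placeholders.
    import requests

    BASE = "http://hss.example.com/hss/storage"
    APP = "myapp"

    # Write: POST the file body; the system assigns an object ID unless the
    # application supplies its own (see File Prefixes in section 3.5).
    with open("photo.jpg", "rb") as f:
        resp = requests.post(f"{BASE}/{APP}", data=f)
    resp.raise_for_status()
    object_id = resp.text.strip()  # assumed response format

    # Read: GET by application name and object ID. Small files are proxied by
    # the node that received the request; files above the redirect threshold
    # come back as an HTTP redirect, which the client follows automatically.
    content = requests.get(f"{BASE}/{APP}/{object_id}").content

    # Delete: atomic, returning only success or failure; a retry is the only
    # error handling the client is expected to perform.
    requests.delete(f"{BASE}/{APP}/{object_id}").raise_for_status()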

Operation                        Description
/hss/admin/addServer             Add a server to a site configuration
/hss/admin/addDrive              Add a drive to a server configuration
/hss/admin/changeServerStatus    Set the server status to one of read-write, read-only, maintenance, or offline
/hss/admin/changeDriveStatus     Set a drive status to read-write, read-only, or offline
/hss/admin/addApplication        Add an application name to the configuration for files to be stored
/hss/admin/updateApplication     Change application-specific parameters about the way files are managed
/hss/admin/addJob                Add a job to copy files to or from a device

Table 3
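
The administrative operations are also plain HTTP calls. As a hedged illustration of how they might be driven, the sketch below shows an external monitoring hook taking a failed drive offline (the automated response described in section 2.17) and an operator registering new capacity. The parameter names and the use of POST are assumptions; only the operation paths come from Table 3.

    # Hypothetical use of the administrative operations in Table 3.
    # Parameter names (server, drive, status) are illustrative assumptions.
    import requests

    ADMIN = "http://hss.example.com/hss/admin"

    def drive_failed(server: str, drive: str) -> None:
        """Monitoring hook: take a failed drive offline so no new reads or
        writes use it; background processes then re-protect at-risk data."""
        requests.post(f"{ADMIN}/changeDriveStatus",
                      data={"server": server, "drive": drive, "status": "offline"})

    def add_capacity(server: str, drives: list[str]) -> None:
        """Register a new storage node and its drives with the site configuration."""
        requests.post(f"{ADMIN}/addServer", data={"server": server})
        for drive in drives:
            requests.post(f"{ADMIN}/addDrive", data={"server": server, "drive": drive})
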
2.9 Network load balancer

The storage nodes in the system are accessed through a network load balancer to provide even distribution of incoming requests. Access to the metadata system is also provided by a network load balancer. This provides several benefits in addition to providing uniform request rates to all of the system components.

The load balancer implementation can remove unresponsive system nodes in a very short amount of time. The failure interval is also configurable, providing a flexible tolerance for component failures. Unavailable system components do not cause requests to slow down or fail in the system, except locally to a storage node for the very few requests that are in flight during the actual failure.

Load balancers insulate client applications from the particulars of the system and allow changes to be made without having to update any external part of the system. The same is true of access to the metadata engine. The metadata is accessed through load balancers that provide access to the metadata system for reads, writes, or administrative background jobs independently, in some cases making use of replica copies for workloads where the data may not need to be fully up to date.

2.10 ACLs via IP address

The system has very little in the way of access controls. The system is designed to provide service to applications, and the applications are expected to provide the fine-grained access controls and file security required. There are some basic controls available to restrict the system and assist applications with access control. The ACLs are provided not to prevent access to specific data or a subset of the data but to protect the system from bad external behaviors.

Read and write ACLs are defined for the system as a whole and on a per-application basis. ACLs are specified by IP address. The system will reject write requests that do not come from systems that are explicitly defined within the write ACL. The system will deny read requests from systems that are explicitly defined within the read ACL. The write ACL behavior is to only allow authorized systems to store files within the system. It is worth noting that only objects with file prefixes can be overwritten, so in the common use case for the system a file cannot be replaced and keep the same object ID.
It is also possible for any allowed system to write files with any valid application name. This can potentially become an accounting issue within the system, but it doesn't place data at risk of overwrite or corruption in general. The read ACL behavior is to restrict reads from any clients that are found to be using the system improperly. The system is susceptible to clients bypassing application security to request an object directly if the object ID can be discovered.

2.11 Files served by proxy or redirect based on size

Any node may receive a request for any file in the system, and not all files will reside locally on every system node. The system serves file reads based on GET requests and those are treated like standard HTTP GET requests. If the file is not stored locally, a node will either retrieve the file from the node where it is stored and return the file itself, or send an HTTP redirect to the requester identifying a node that stores a copy of the file. The system implements a tunable size threshold for the server redirect case. Files below the size threshold will be returned by the node that received the initial request. Files above the size threshold will be redirected. This behavior allows the system to be tuned based on load and the typical file size within an application.

2.12 Commonly served files cached

Typically we have seen that a small number of popular files are commonly requested. In order to avoid constant proxy or redirect requests, a cache of commonly accessed files is implemented on the nodes. A tunable number of files can be cached on every node. The additional space consumed by the cache is more than offset by the read performance trade-off of avoiding additional communication about file locations. The cached files also provide some insight into what files are popular and may be candidates for placement in an external CDN.

2.13 File expiration

The system has the ability to automatically delete files after a tunable interval has expired. The interval can be defined per application in the system. This enables the temporary storage of files for applications that make use of transient data. Deletion of files based on expiration offloads the need to keep track of file creation and lifetime from all applications that need this feature and implements it in the storage system consistently for all applications.

2.14 Application metadata stored with the file

The system provides a simple mechanism to store any additional metadata about a file along with the file record in the metadata system. This data is opaque to the storage system and simply exists as an assistive feature for applications. An application may store some metadata, perhaps a description, with the file, and that metadata will be returned when the file is retrieved. The expectation is that this data is used within the storing and requesting application to provide some additional functionality. At some point it is expected that the system would be enhanced to make use of the stored application metadata to make the system more responsive and useful to application consumers.

2.15 Applications are system consumers with local configurations

The system assumes that files will be stored and retrieved by applications. Applications that make use of the system are expected to implement any necessary controls on files they store. The nature of the system keeps file storage as simple as possible and moves responsibilities for much of traditional content management and fine-grained access control to the external entities that make up the applications.

2.16 File accounting

While the system does have the concept of application-to-file relationships for multi-tenancy concerns, it has no hierarchical namespace, so files stored are tracked externally to the storage system by the owning applications. This has the effect of allowing any file to be stored anywhere in the system, for improved scalability and availability, without concern for relationships between files. Multiple applications can store unique copies of the same file. While that duplication of content is space-inefficient, it prevents dependencies between applications within the storage system. The space overhead is preferred to potential conflicts between applications. It may be possible in a future update to the system to provide object level de-duplication to eliminate excessive copies of the same file.

The relationship between application owners and files allows the system to provide utilization reporting for capacity planning and usage metering. The system provides sufficient detail about files to report on capacity consumed per application, how much data is transferred per application, file specific errors, and several other usage based metrics. These metrics are used to predict when the system will require expansion or when access to files has become unbalanced for any reason. The metrics also enable the examination of usage trends within applications so that customers can be notified of observed changes to confirm that those changes are as expected.

2.17 Monitoring integrated with existing systems

There are several external mechanisms that are responsible for monitoring the system and its components. Based on events that occur in the system, a number of external systems can be used to manage and adjust the system using the administrative operations. External systems are used to monitor throughput, software errors, and hardware errors. The external systems used are widely used industry standard systems.

Nagios[8], for example, is used to monitor the hardware components of the system. In the event of a disk failure, a Nagios monitor detects the occurrence. When that happens, an alert is sent to notify the system administrators, and the drive status is changed from read-write to offline automatically. This prevents any new write requests and any new read requests from using the failed disk drive and causes any at-risk data to be re-protected by background processes.

3. Implementation

A high level diagram of the system and how the components interact is shown in Figure 2. A description of the external operations is provided to give a better feel for how the system functions from an application perspective. IP address access control checks are included in all operations.

Figure 2
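
Because the IP address access control checks apply to every operation, it is worth picturing the behavior described in section 2.10 before walking through the operations themselves: the write ACL acts as an allow list, while the read ACL acts as a deny list for misbehaving clients. The function below is an illustrative sketch of that check, not the storage node's actual code.

    # Illustrative sketch of the per-request ACL check from section 2.10.
    # write_acl: only listed addresses may store files (allow list).
    # read_acl: listed addresses are refused reads (deny list for bad actors).
    def request_allowed(client_ip: str, operation: str,
                        write_acl: set[str], read_acl: set[str]) -> bool:
        if operation == "write":
            return client_ip in write_acl
        if operation == "read":
            return client_ip not in read_acl
        return False
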
3.1 Write a file

When a POST request is sent to the system to write a file, the storage node that receives the request calculates the object ID and picks a local location to write the file. The storage node then inserts into the database the server, disk, and object ID of the file as well as some data about the file such as mime type and creation time. It then picks another node, within the same rack if possible based on server availability and free space, to write a second replica of the file. When the replica creation is successful, the database is updated with the details that pertain to the replica. The write is then acknowledged as successful to the client once the minimum number of copies required is created. The default minimum number of copies is two, but it is configurable per application. Based on the application configuration, any additional replicas of the file are created at this time using the POST /hss/storage/internal_copy method from the storage node that handled the original request. Once the minimum number of replicas is written, a file is considered to be safe.

In the event of a failure before the first file write completes, the system retries the write and the process restarts. If the retries are unsuccessful after a limit is reached, the system returns a failure and the application is expected to retry the write. If any additional writes of file replicas fail, the system will retry the writes until the minimum replicas requirement is achieved, and then any additional replicas will be created in the background.
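
Pulling together the directory layout of section 2.4 and the write path above, the sketch below shows one plausible shape for how a storage node might place the first copy and create the synchronous second replica before acknowledging the client. The helper names, the hash used to pick a directory, and the internal_copy parameters are assumptions for illustration; only the overall flow follows the description above.

    # Hypothetical sketch of the write path in section 3.1.
    # The 4096-way, three-hex-character directory fan-out follows section 2.4;
    # those directories are assumed to already exist under each mount point.
    import hashlib
    import requests

    def local_path(mount_point: str, object_id: str) -> str:
        """Spread files across the pre-created 000..fff directories of a drive."""
        subdir = hashlib.md5(object_id.encode()).hexdigest()[:3]
        return f"{mount_point}/{subdir}/{object_id}"

    def write_file(object_id: str, body: bytes, mount_point: str,
                   peer_nodes: list[str], min_copies: int = 2) -> bool:
        # 1. Write the first copy locally; the metadata insert of server, disk,
        #    object ID, mime type, and creation time would happen here.
        with open(local_path(mount_point, object_id), "wb") as f:
            f.write(body)
        copies = 1

        # 2. Create replicas on other nodes (same rack preferred) through the
        #    internal copy operation until the synchronous minimum is reached.
        for node in peer_nodes:
            if copies >= min_copies:
                break
            resp = requests.post(f"http://{node}/hss/storage/internal_copy",
                                 data=body)
            if resp.ok:
                copies += 1

        # 3. Acknowledge the client only once min_copies replicas exist; any
        #    further replicas (a third copy, other sites) are made asynchronously.
        return copies >= min_copies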

If background replica creations fail, either by making too few or too many replicas, then the expectation is that the system's background file check process will detect the improper number of replicas and take the appropriate action by making any additional replicas required or removing any replicas beyond what is required by the application configuration.

3.2 Read a file

A file read GET request specifies the application and object ID as part of the path sent to the system. The receiving storage node looks up the object ID and finds the location of all replicas. A replica is selected from the list of locations by preferring replicas that are local if there are any, then within the same rack, and then the same site. If the file is local it is simply served. If the file is not local, it is retrieved and served by the receiving storage node if it is smaller than the redirect threshold specified for the application. If the file is larger than the redirect size, a redirect is served to the client, which then requests the file from a server that has the file locally. In the case of a storage node serving a file it does not own, the file is placed in a local cache on the serving storage node, with the expectation that commonly accessed files will be requested continuously from nodes that do not have a local copy due to the load balanced access to the system. This cache will avoid future lookups and file retrievals from other nodes for heavily accessed files. It also has the effect of making many copies of the file throughout the system to enable better concurrency when the file is requested.

In the case of a lookup failure from the database, an HTTP 404 error is served to the requesting process. In order to avoid transient lookup failures, a number of retry options are available. In the simple case, a number of retries is attempted before returning the 404 error. In the case where a database replica is used to service some database lookups for read requests, a number of retries are attempted and then the lookup can be attempted on the master copy of the database if the application is configured to do so. The reasoning behind this second set of lookups is that replication lag between the master and the replica database may account for the lookup failure. If that is the case, the lookup in the master database should succeed when the replica database lookup fails. In general, the replication lag is less than 10 seconds within the current system, but some applications exhibit a usage pattern of an immediate read after a write that makes the lookup on the primary necessary. The metadata replication we use is asynchronous, so it is not possible to require metadata replication to complete before a write is acknowledged to the client. Alternative metadata storage systems that have synchronous replication would also solve the problem of the replica metadata database being slightly behind the primary metadata database.

Occasionally a replica copy that is selected by the system will fail to be available, either because the file is not found on the owning server or because the file fails the checksum when it is retrieved. If either of these cases occurs, the file will be served from one of the other replicas in the list returned by the lookup. The internal GET /hss/storage/internal_move method is then used to replace the missing or bad file replica.

3.3 Delete a file

The deletion of a file is the simplest of the external cases. The request to delete marks the files for deletion in the database and then deletes the replicas of the file and the records from the database. Any failures during the delete process are expected to be handled by the background checking processes of the system.
If files are deleted but the database entries remain for some reason, the records marked for deletion are removed by the background processes of the system. It is possible for the delete process to orphan files on storage nodes that are not referenced by the database. The process to recover from this occurrence requires looking up all files found on the storage nodes and comparing them to existing database entries. If the entries do not exist, the files can be safely deleted, as there is no reference to them and they cannot be served.

3.4 Recovery of errors

A number of errors, both well-understood and unknown, are expected to occur within the system. These errors may include things like file system or storage node failures destroying data and causing the number of replicas to fall below the desired number, database lookup failures, storage nodes unavailable for extended periods due to hardware maintenance, network failures, and many others. The system is designed to be as resilient as possible to errors. The system has implemented recovery processes for the errors that are known to occur, and updates are made to the system as new errors are detected and their root causes are discovered, to prevent those errors from recurring.

3.5 Configuration

Application specific settings are managed from the administrative interface for the system. Application settings control the way files are stored in a number of ways. The specific parameters that can be set for an application are:

Expire Time – Number of seconds a file will be kept before automatic deletion. Automatic deletion requires Expire Files to be set true.

Max Upload Size (MB) – The size limit of a file that may be written. There is no built-in limit by default.

Copies – The number of replicas of a file required.

Sites – The number of sites that file replicas must be stored in.

Use Cache – Enables caching of often requested files on nodes that do not normally have local copies.

Expire Files – Enables or disables automatic deletion of files.

Use Master on 404 – Enables or disables queries for files that fail on read replicas of the metadata database to be retried on the master metadata database before returning a failure to the application.

Redir Size (MB) – The threshold size for a file above which the request will be redirected instead of proxied by the storage node receiving the request.

File Prefixes – Object ID prefixes that are specified by an application. An application may specify the object ID of a file, and the system requires a consistent prefix for all of the files in that case. The prefix is usually the application name. This is useful for applications where the filename can be mapped directly to the file content, such as a geographic location used in maps.

4. Results and observation

The system as described above has been in production since August of 2010. In that time it has served billions of files for dozens of applications. The system has good scalability and has serviced over 210 million requests in a 24 hour period. As of this time the system manages just over 380 million files. The file count includes all replicas. An analysis of the usage data from the system reveals that about one third of the current files in the system are active. The feature set is in line with the intended use case. The feature set of the system also compares well with ongoing work on open storage systems designed for the same use case. Table 4 shows a feature comparison between the current and future features of the OpenStack object storage system[9] (Swift) and the current features of the system presented here.

Figure 3

Table 4

The load on the system is highly variable and dependent on external applications, but the behaviors of the system are highly consistent as the request load varies. Figure 3 shows the load over a thirty day period. Figure 4 shows the service time for all requests. In some cases days with fewer requests transfer more data than days with higher numbers of requests. This shows that not only is the request rate variable but file size is also quite variable.

Figure 4

The overall observed response time of the system is quite acceptable, with less than 0.5% of the requests taking more than 200ms and less than 2% of requests taking more than 100ms. This data includes all reads, writes, and errors, and we believe that the abnormally long response times are in actuality failures. The number of these abnormal events is very small and also quite acceptable.

Using only features of the system, several infrastructures have been built, and components have also been added and removed in existing infrastructures while the system has been under load and in production. The addition or removal of hardware components is simple and well-understood. The ability to update the system software components is more complicated but can also be done without taking the system offline. The existing systems have seen software updates, metadata database changes, and metadata database migrations without interruption in service.

5. Lessons learned

Many details of the system are implemented as conceived in the original design. There are still quite a few behaviors and functions of the system that can only be discovered through the use and maintenance of the system. The need to modify the way the system functions becomes apparent through its use by applications and administrators.

5.1 Data protection

There are several aspects of data protection that include both durability of the data and availability of the data. Data replication within the system protects availability by ensuring that a file in the system is stored on multiple redundant components, and it protects durability through that redundancy. There are some areas that require additional focus with file replication in practice.

The ability to rapidly recover large numbers of small files is limited due to several issues. Large numbers of small files in Linux file systems are difficult to access efficiently. Significant thought was given to creating a file system directory layout on the storage servers that would limit the impact of this problem, but it still exists when very high numbers of small files are present, even if they are efficiently stored in the file system. Additionally, small file transfers when files are replicated between storage servers are limited in the amount of bandwidth that can be utilized.

Another issue that occurs with the data protection and availability aspects of the system is the I/O behavior of the Tomcat servers when accessing a failed device or file system during the initial failure. There are a number of abnormal behaviors that prevent the system from returning the correct file to an application. Load spikes, crashes, and other oddities can occur when retrieving data. One response to the issue of failed physical devices can be to add local RAID data protection on a storage node to remove the possibility of a single device failure making a file system unavailable. This has the effect of increasing cost and complexity, which are both to be avoided.

5.2 Metadata replication

Metadata replication is a requirement for the system in disaster scenarios as well as maintenance scenarios. It is tempting to make use of the replicated metadata subsystem to assume some of the workload of the system. The benefits of this approach are that the replicated metadata subsystem is exercised to validate its proper functionality and that some load can be removed from the primary metadata subsystem.

It is attractive to attempt to increase the total read capacity of the system by performing reads from both the primary and the replicated metadata. The system is built with this capability in mind, but the initial implementation suffered from the lag between the two systems.
Applications that write a file and then immediately attempt a read of that data initially received data access errors if the read was requested from the replicated metadata prior to the completion of replication. A retry case was added to always check the primary before returning a failure to the requesting client.

It is preferable to run administrative and maintenance jobs on the data from the replicated metadata subsystem. This prevents internal housekeeping from impacting the system. Many of the administrative jobs are long running and can be strenuous, and having a secondary copy of the data isolates these jobs very effectively. Ad hoc jobs are also run against the replicated metadata subsystem.

5.3 Configuration tuning

The default configuration for the software components of the system is acceptable for integrating and bringing up the system. These configuration variables require adjusting to better manage the load and improve the overall experience with the system. There are many limits within the system based on the initial configuration that can safely be modified to increase the capabilities of the system.

Variables control memory utilization, thread limits, network throughput, and dozens of other functions within the system. These functions can be improved, to the overall benefit of the system, by examining the system and observing the limits. Many of the limits are imposed by the configuration and not by any physical or functional aspect of the system. Making informed changes to the configuration yields significant benefits.

6. Conclusion

To store and manage the large numbers of objects associated with web applications, the system is an extremely good fit. It offloads from application developers the need to manage where files are stored and how they will be protected. The system also provides a consistent interface so that application developers can avoid having to relearn how to store files for each new application that is developed.

From an operational point of view, the system requires very little immediate maintenance due to its significant automation and redundancy. Repairs can be effected to subcomponents of the system in a delayed fashion without impacting the applications that use the system. Modifications to the system can be made while maintaining availability and durability of the data. These aspects make the system very lightweight for operators to manage.

The work done to develop this system and its usage in production applications for an extended period of time show that it is possible to build a system that is simple, reliable, scalable, and extensible. The use of common industry standard hardware and software components makes the system economical to implement and easy to understand. There are certainly use cases where the system is not a good fit and should not be used, but it is an excellent choice for the use around which it was designed.

There is room for improvement in the feature set and efficiency of the system. With that in mind, there are some improvements that definitely should be investigated for integration into future versions of the system. A list of future planned improvements includes container files to address file management and performance concerns, lazy deletes to disconnect housekeeping operations from online operations, improved geographic awareness to improve access latency, and a policy engine to manage file placement and priority in the system.

References

[1] Network File System. http://tools.ietf.org/html/rfc1094, http://tools.ietf.org/html/rfc1813, http://tools.ietf.org/html/rfc3530
[2] IBRIX Fusion. http://en.wikipedia.org/wiki/IBRIX_Fusion
[3] Lustre. http://wiki.lustre.org/
[4] Orange File System Project. http://www.orangefs.org/
[5] PVFS. http://www.pvfs.org/
[6] Panzer-Steindel, Bernd. Data Integrity. CERN/IT, April 2007.
[7] VoltDB. http://community.voltdb.com/
[8] Nagios. http://www.nagios.org/
[9] OpenStack Object Storage (Swift). http://openstack.org/projects/storage/
