Usher: An Extensible Framework for Managing Clusters of Virtual Machines

               Marvin McNett, Diwaker Gupta, Amin Vahdat, and Geoffrey M. Voelker
                              – University of California, San Diego

                                                   ABSTRACT
Usher is a virtual machine management system designed to impose few constraints upon the computing environment under its management. Usher enables administrators to choose how their virtual machine environments will be configured and the policies under which they will be managed. The modular design of Usher allows for alternate implementations for authentication, authorization, infrastructure handling, logging, and virtual machine scheduling. The design philosophy of Usher is to provide an interface whereby users and administrators can request virtual machine operations while delegating administrative tasks for these operations to modular plugins. Usher's implementation allows for arbitrary action to be taken for nearly any event in the system. Since July 2006, Usher has been used to manage virtual clusters at two locations under very different settings, demonstrating the flexibility of Usher to meet different virtual machine management requirements.

Introduction

Usher is a cluster management system designed to substantially reduce the administrative burden of managing cluster resources while simultaneously improving the ability of users to request, control, and customize their resources and computing environment. System administrators of cluster computing environments face a number of imposing challenges. Different users within a cluster can have a wide range of computing demands, spanning general best-effort computing needs, batch scheduling systems, and complete control of dedicated resources. These resource demands vary substantially over time in response to changes in workload, user base, and failures. Furthermore, users often need to customize their operating system and application environments, substantially increasing configuration and maintenance tasks. Finally, clusters rarely operate in isolated administrative environments, and must be integrated into existing authentication, storage, network, and host address and name service infrastructure.

Usher balances these imposing requirements using a combination of abstraction and architecture. Usher provides a simple abstraction of a logical cluster of virtual machines, or virtual cluster. Usher users can create any number of virtual clusters (VCs) of arbitrary size, while Usher multiplexes individual virtual machines (VMs) on available physical machine hardware. By decoupling logical machine resources from physical machines, users can create and use machines according to their needs rather than according to assigned physical resources.

Architecturally, Usher is designed to impose few constraints upon the computing environment under its management. No two sites have identical hardware and software configurations, user and application requirements, or service infrastructures. To facilitate its use in a wide range of environments, Usher combines a core set of interfaces that implement basic mechanisms, clients for using these mechanisms, and a framework for expressing and customizing administrative policies in extensible modules, or plugins.

The Usher core implements basic virtual cluster and machine management mechanisms, such as creating, destroying, and migrating VMs. Usher clients use this core to manipulate virtual clusters. These clients serve as interfaces to the system for users as well as for use by higher-level cluster software. For example, an Usher client called ush provides an interactive command shell for users to interact with the system. We have also implemented an adapter for a high-level execution management system [6], which operates as an Usher client, that creates and manipulates virtual clusters on its own behalf.

Usher supports customizable modules for two important purposes. First, these modules enable Usher to interact with broader site infrastructure, such as authentication, storage, and host address and naming services. Usher implements default behavior for common situations, e.g., newly created VMs in Usher can use a site's DHCP service to obtain addresses and domain names. Additionally, sites can customize Usher to implement more specialized policies; at UCSD, an Usher VM identity module allocates IP address ranges to VMs within the same virtual cluster.


Second, pluggable modules enable system administrators to express site-specific policies for the placement, scheduling, and use of VMs. As a result, Usher allows administrators to decide how to configure their virtual machine environments and determine the appropriate management policies. For instance, to support a general-purpose computing environment, administrators can install an available Usher scheduling and placement plugin that performs round-robin placement of VMs across physical machines and simple rebalancing in response to the addition or removal of virtual and physical machines. With this plugin, users can dynamically add or remove VMs from VCs at any time without having to specify service level agreements (SLAs) [9, 17, 22], write configuration files [10], or obtain leases on resources [12, 14]. With live migration of VMs, Usher can dynamically and transparently adjust the mapping of virtual to physical machines to adapt to changes in load among active VMs or the working set of active VMs, exploit affinities among VMs (e.g., to enhance physical page sharing [20]), or add and remove hardware with little or no interruption.

Usher enables other powerful policies to be expressed, such as power management (reduce the number of active physical machines hosting virtual clusters), distribution (constrain virtual machines within a virtual cluster to run on separate nodes), and resource guarantees. Another installation of Usher uses its cluster to support scientific batch jobs running within virtual clusters, guarantees resources to those jobs when they run, and implements a load-balancing policy that migrates VMs in response to load spikes [13].

Usher is a fully functional system. It has been installed in cluster computing environments at UCSD and the Russian Research Center in Kurchatov, Russia. At UCSD, Usher has been in production use since January 2007. It has managed up to 142 virtual machines in 26 virtual clusters across 25 physical machines. The Usher implementation is sufficiently reliable that we are now migrating the remainder of our user base from dedicated physical machines to virtual clusters, and Usher will soon manage all 130 physical nodes in our cluster. In the rest of this paper we describe the design and implementation of Usher, as well as our experiences using it.

Related Work

Since the emergence of widespread cluster computing over a decade ago [8, 16], many cluster configuration and management systems have been developed to achieve a range of goals. These goals naturally influence individual approaches to cluster management. Early configuration and management systems, such as Galaxy [19], focus on expressive and scalable mechanisms for defining clusters for specific types of service, and physically partition cluster nodes among those types.

More recent systems target specific domains, such as Internet services, computational grids, and experimental testbeds, that have strict workload or resource allocation requirements. These systems support services that express explicit resource requirements, typically in some form of service level agreement (SLA). Services provide their requirements as input to the system, and the system allocates its resources among services while satisfying the constraints of the SLA requirements.

For example, Océano provides a computing utility for e-commerce [9]. Services formally state their workload performance requirements (e.g., response time), and Océano dynamically allocates physical servers in response to changing workload conditions to satisfy such requirements. Rocks and Rolls provide scalable and customizable configuration for computational grids [11, 18], and Cluster-on-Demand (COD) performs resource allocation for computing utilities and computational grid services [12]. COD implements a virtual cluster abstraction, where a virtual cluster is a disjoint set of physical servers specifically configured to the requirements of a particular service, such as a local site component of a larger wide-area computational grid. Services specify and request resources to a site manager and COD leases those resources to them. Finally, Emulab provides a shared network testbed in which users specify experiments [21]. An experiment specifies network topologies and characteristics as well as node software configurations, and Emulab dedicates, isolates, and configures testbed resources for the duration of the experiment.

The recent rise in virtual machine monitor (VMM) popularity has naturally led to systems for configuring and managing virtual machines. For computational grid systems, for example, Shirako extends Cluster-on-Demand by incorporating virtual machines to further improve system resource multiplexing while satisfying explicit service requirements [14], and VIOLIN supports both intra- and inter-domain migration to satisfy specified resource utilization limits [17]. Sandpiper develops policies for detecting and reacting to hotspots in virtual cluster systems while satisfying application SLAs [22], including determining when and where to migrate virtual machines, although again under the constraints of meeting the stringent SLA requirements of a data center.

On the other hand, Usher provides a framework that allows system administrators to express site-specific policies depending upon their needs and goals. By default, the Usher core provides, in essence, a general-purpose, best-effort computing environment. It imposes no restrictions on the number and kind of virtual clusters and machines, and performs simple load balancing across physical machines. We believe this usage model is important because it is widely applicable and natural to use. Requiring users to explicitly specify their resource requirements for their needs, for example, can be awkward and challenging since users often do not know when or for how long they will need resources. Further, allocating and reserving resources can limit resource utilization; guaranteed resources that go idle cannot be used for other purposes. However, sites can specify more elaborate policies in Usher for controlling the placement, scheduling, and migration of VMs if desired. Such policies can range from batch schedulers to allocation of dedicated physical resources.


In terms of configuration, Usher shares many of the motivations that inspired the Manage Large Networks (MLN) tool [10]. The goal of MLN is to enable administrators and users to take advantage of virtualization while easing administrator burden. Administrators can use MLN to configure and manage virtual machines and clusters (distributed projects), and it supports multiple virtualization platforms (Xen and User-Mode Linux). MLN, however, requires administrators to express a number of static configuration decisions through configuration files (e.g., physical host binding, number of virtual hosts), and supports only coarse granularity dynamic reallocation (manually by the administrator). Usher configuration is interactive and dynamic, enables users to create and manage their virtual clusters without administrative intervention, and enables a site to globally manage all VMs according to cluster-wide policies.

XenEnterprise [5] from XenSource and VirtualCenter [4] from VMware are commercial products for managing virtual machines on cluster hardware from the respective companies. XenEnterprise provides a graphical administration console, Virtual Data Center, for creating, managing, and monitoring Xen virtual machines. VirtualCenter monitors and manages VMware virtual machines on a cluster as a data center, supporting VM restart when nodes fail and dynamic load balancing through live VM migration. Both list interfaces for external control, although it is not clear whether administrators can implement arbitrary plugins and policies for integrating the systems into existing infrastructure, or controlling VMs in response to arbitrary events in the system. In this regard, VMware's Infrastructure Management SDK provides functionality similar to that provided by the Usher client API. However, this SDK does not provide the tight integration with VMware's centralized management system that plugins do for the Usher system. Also, of course, these are all tied to managing a single VM product, whereas Usher is designed to interface with any virtualization platform that exports a standard administrative interface.

System Architecture

This section describes the architecture of Usher. We start by briefly summarizing the goals guiding our design, and then present a high-level overview of the system. We then describe the purpose and operation of each of the various system components, and how they interact with each other to accomplish their tasks. We end with a discussion of how the Usher system accommodates software and hardware failures.

Design Goals

As mentioned, no two sites have identical hardware and software configurations, user and application requirements, or service infrastructures. As a result, we designed Usher as a flexible platform for constructing virtual machine management installations customized to the needs of a particular site.

To accomplish this goal, we had two design objectives for Usher. First, Usher maintains a clean separation between policy and mechanism. The Usher core provides a minimal set of mechanisms essential for virtual machine management. For instance, the Usher core has mechanisms for placing and migrating virtual machines, while administrators can install site-specific policy modules that govern where and when VMs are placed.

Second, Usher is designed for extensibility. The Usher core provides three ways to extend functionality, as illustrated in Figure 1. First, Usher provides a set of hooks to integrate with existing infrastructure. For instance, while Usher provides a reference implementation for use with the Xen VMM, it is straightforward to write stubs for other virtualization platforms. Second, developers can use a Plugin API to enhance Usher functionality. For example, plugins can provide database functionality for persistently storing system state using a file-backed database, or provide authentication backed by local UNIX passwords. Third, Usher provides a Client API for integrating with user interfaces and third-party tools, such as the Usher command-line shell and the Plush execution management system (discussed in the Applications subsection of the Implementation section).

Figure 1: Usher interfaces.

Usher Overview

A running Usher system consists of three main components: local node managers (LNMs), a centralized controller, and clients. A client consists of an application that utilizes the Usher client library to send virtual machine management requests to the controller. We have written a few applications that import the Usher client library for managing virtual machines (a shell and XML-RPC server) with more under development (a web frontend and command line suite).

Figure 2 depicts the core components of an Usher installation. One LNM runs on each physical node and interacts directly with the VMM to perform management operations such as creating, deleting, and migrating VMs on behalf of the controller. The local node managers also collect resource usage data from the VMMs and monitor local events. LNMs report resource usage updates and events back to the controller for use by plugins and clients.
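
To make this reporting path concrete, the following sketch shows how an LNM might push periodic usage updates to the controller using Twisted, which Usher builds on (see the Implementation section). The class, method, and attribute names here (UsageReporter, collect_stats, update_usage) are illustrative assumptions rather than Usher's actual API.

    from twisted.internet import task

    class UsageReporter(object):
        """Hypothetical periodic reporter running inside a local node manager."""

        def __init__(self, controller, vmm, interval=60):
            self.controller = controller   # remote reference to the controller
            self.vmm = vmm                 # wrapper around the local VMM's admin API
            self.interval = interval       # seconds between reports

        def start(self):
            # LoopingCall invokes report() on a timer without blocking the event loop.
            self.loop = task.LoopingCall(self.report)
            self.loop.start(self.interval)

        def report(self):
            # Gather CPU, network, disk, and memory figures from the VMM and send
            # them to the controller with an asynchronous remote call.
            stats = self.vmm.collect_stats()
            d = self.controller.callRemote("update_usage", stats)
            d.addErrback(lambda _failure: None)  # tolerate a briefly unreachable controller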


The controller is the central component of the Usher system. It receives authenticated requests from clients and issues authorized commands to the LNMs. It also communicates with the LNMs to collect usage data and manage virtual machines running on each physical node. The controller provides event notification to clients and plugins registered to receive notification for a particular event (e.g., a VM has started, been destroyed, or changed state). Plugin modules can perform a wide range of tasks, such as maintaining persistent system-wide state information, performing DDNS updates, or doing external environment preparation and cleanup.

The client library provides an API for applications to communicate with the Usher controller. Essentially, clients submit requests to the controller when they need to manipulate their VMs or request additional VMs. The controller can grant or deny these requests as its operational policy dictates. One purpose of clients is to serve as the user interface to the system, and users use clients to manage their VMs and monitor system state. More generally, arbitrary applications can use the client library to register callbacks for events of interest in the Usher system.

Typically, a few services also support a running Usher system. Depending upon the functionality desired and the infrastructure provided by a particular site, these services might include a combination of the following: a database server for maintaining state information or logging, a NAS server to serve VM filesystems, an authentication server to provide authentication for Usher and VMs created by Usher, a DHCP server to manage IP addresses, and a DNS server for name resolution of all Usher created VMs. Note that an administrator may configure Usher to use any set of support services desired, not necessarily the preceding list.

Usher Components

As noted earlier, Usher consists of three main components: local node managers on each node, a central controller, and Usher clients.

Local Node Managers

The local node managers (LNMs) operate closest to the hardware. As shown in Figure 2, LNMs run as servers on each physical node in the Usher system. The LNMs have three major duties: i) to provide a remote API to the controller for managing local VMs, ii) to collect and periodically upload local resource usage data to the controller, and iii) to report local events to the controller.

Each LNM presents a remote API to the controller for manipulating VMs on its node. Upon invoking an API method, the LNM translates the operation into the equivalent operation of the VM management API exposed by the VMM running on the node. Note that all LNM API methods are asynchronous so that the controller does not block waiting for the VMM operation to complete. We emphasize that this architecture abstracts VMM-specific implementations: the controller is oblivious to the specific VMMs running on the physical nodes as long as the LNM provides the remote API implementation. As a result, although our implementation currently uses the Xen VMM, Usher can target other virtualization platforms. Further, Usher is capable of managing VMs running any operating system supported by the VMMs under its management.

As the Usher system runs, VM and VMM resource usage fluctuates considerably. The local node manager on each node monitors these fluctuations and reports them back to the controller. It reports resource usage of CPU utilization, network receive and transmit loads, disk I/O activity, and memory usage in 1, 5, and 15-minute averages.

In addition to changes in resource usage, VM state changes sometimes occur unexpectedly. VMs can crash or even unexpectedly appear or disappear from the system. Detecting these and other related events requires both careful monitoring by the local node managers as well as VMM support for internal event notification. Administrators can set a tunable parameter for how often the LNM scans for missing VMs or unexpected VMs. The LNM will register callbacks with the VMM platform for other events, such as VM crashes; if the VMM does not support such callbacks, the LNM will periodically scan to detect these events.
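
A minimal sketch of such a scan follows. It assumes the LNM keeps the controller's cached VM objects in a dictionary keyed by name and can ask the VMM for the names of the domains actually running; these names are hypothetical, not Usher's own.

    def scan_for_vm_changes(cached_vms, vmm, notify):
        """Compare the VMs the controller believes are running against the VMM's view.

        cached_vms -- dict mapping VM name to its cached VM object
        vmm        -- local VMM wrapper exposing list_domain_names()
        notify     -- callback used to report events back to the controller
        """
        actual = set(vmm.list_domain_names())
        expected = set(cached_vms)

        for name in expected - actual:
            notify("vm_missing", name)      # VM disappeared unexpectedly
        for name in actual - expected:
            notify("vm_unexpected", name)   # VM appeared outside Usher's control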

Figure 2: Usher components.

Usher Controller

The controller is the center of the Usher system. It can either be bootstrapped into a VM running in the system, or run on a separate server. The controller provides the following:

  • User authentication
  • VM operation request API
  • Global state maintenance
  • Consolidation of LNM monitoring data
  • Event notification


User authentication: Usher uses SSL-encrypted user authentication. All users of the Usher system must authenticate before making requests of the system. Administrators are free to use any of the included authentication modules for use with various authentication backends (e.g., LDAP), or implement their own. An administrator can register multiple authentication modules, and Usher will query each in turn. This support is useful, for instance, to provide local password authentication if LDAP or NIS authentication fails. After receiving a user's credentials, the controller checks them against the active authentication module chain. If one succeeds before reaching the end of the chain, the user is authenticated. Otherwise, authentication fails and the user must retry.
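
The chain behaves like the hedged sketch below: each registered module is tried in turn, and the first success authenticates the user. The module interface shown (an authenticate method returning True or False) is an assumption made for illustration.

    def authenticate(modules, username, credentials):
        """Try each authentication module in registration order.

        modules -- ordered list of objects with an authenticate(username, credentials)
                   method, e.g., an LDAP module followed by a local-password module.
        """
        for module in modules:
            if module.authenticate(username, credentials):
                return True     # first success wins; later modules are not consulted
        return False            # every backend rejected the credentials; user must retry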

VM operation request API: A key component of the controller is the remote API for Usher clients. This API is the gateway into the system for VM management requests (via RPC) from connecting clients. Typically, the controller invokes an authorization plugin to verify that the authenticated user can perform the operation before proceeding. The controller may also invoke other plugins to do preprocessing such as checking resource availability and making placement decisions at this point. Usher calls any plugin modules registered to receive notifications for a particular request once the controller receives such a request.

Usher delegates authorization to plugin modules so that administrators are free to implement any policy or policies they wish and stack and swap modules as the system runs. In addition, an administrator can configure the monitoring infrastructure to automatically swap or add policies as the system runs based upon current system load, time of day, etc. In its simplest form, an authorization policy module openly allows users to create and manipulate their VMs as they desire or view the global state of the system. More restrictive policies may limit the number of VMs a user can start, prohibit or limit migration, or restrict what information the system returns upon user query.

Once a request has successfully traversed the authorization and preprocessing steps, the controller executes it by invoking asynchronous RPCs to each LNM involved. As described above, it is up to any plugin policy modules to authorize and check resource availability prior to this point. Depending upon the running policy, the authorization and preprocessing steps may alter a user request before the controller executes it. For example, the policy may be to simply "do the best I can" to honor a request when it arrives. If a user requests more VMs than allowed, this policy will simply start as many VMs as are allowed for this user, and report back to the client what action was taken. Finally, if insufficient resources are available to satisfy an authorized and preprocessed request, the controller will attempt to fulfill the request until resources are exhausted.
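
As an illustration of such a policy, the sketch below trims a start request down to the user's remaining VM allowance rather than rejecting it outright; the Request fields and the limit lookup are assumptions, not Usher's actual objects.

    def best_effort_start_policy(request, user_limits, running_count):
        """Approve as much of a start request as the user's VM limit allows.

        request       -- object with .user and .vm_names (the VMs the user asked for)
        user_limits   -- dict mapping user name to maximum allowed VMs
        running_count -- dict mapping user name to VMs currently running
        """
        allowed = user_limits.get(request.user, 0) - running_count.get(request.user, 0)
        if allowed <= 0:
            request.vm_names = []          # nothing can be started right now
        elif len(request.vm_names) > allowed:
            request.vm_names = request.vm_names[:allowed]   # approve with modification
        return request                     # the controller reports what was actually done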

Global state maintenance: The controller maintains a few lists which constitute the global state of the system. These lists link objects encapsulating state information for running VMs, running VMMs, and instantiated virtual clusters (VCs). A virtual cluster in Usher can contain an arbitrary set of VMs, and administrators are free to define VCs in any way suitable to their computing environment.

In addition to the above lists, the controller maintains three other lists of VMs: lost, missing, and unmanaged VMs. The subtle distinction between lost and missing is that lost VMs are a result of an LNM or VMM failure (the controller is unable to make this distinction), whereas a missing VM is a result of an unexpected VM disappearance as reported by the LNM where the VM was last seen running. A missing VM can be the result of an unexpected VMM error (e.g., we have encountered this case upon a VMM error on migration). Unmanaged VMs are typically a result of an administrator manually starting a VM on a VMM being managed by Usher; Usher is aware of the VM, but is not itself managing it. The list of unmanaged VMs aids resource usage reporting so that Usher has a complete picture of all VMs running on its nodes.

Having the controller maintain system state removes the need for it to query all LNMs in the system for every VM management operation and state query. However, the controller does have to synchronize with the rest of the system; we discuss synchronization further in the Component Interaction subsection below.

Consolidation of LNM monitoring data: Proper state maintenance relies upon system monitoring. The controller is responsible for consolidating monitoring data sent by the local node managers into a format accessible by the rest of the system. Clients use this data to describe the state of the system to users, and plugins use this data to make policy decisions. For example, plugin modules may use this data to restrict user resource requests based on the current system load, or make VM scheduling decisions to determine where VMs should run.

Event notification: Usher often needs to alert clients and plugin modules when various events in the system occur. Events typically fall into one of three categories:

  • VM operation requests
  • VM state changes
  • Errors and unexpected events

Clients automatically receive notices of state changes of their virtual machines. Clients are free to take any action desired upon notification, and can safely ignore them. Plugin modules, however, must explicitly register with the controller to receive event notifications. Plugins can register for any type of event in the system. For example, a plugin may wish to receive notice of VM operation requests for preprocessing, or error and VM state change events for reporting and cleanup.

Clients and the Client API

Applications use the Usher client API to interact with the Usher controller. This API provides methods for requesting or manipulating VMs and performing state queries. We refer to any application importing this API as a client.


The client API provides the mechanism for clients to securely authenticate and connect to the Usher controller. Once connected, an application may call any of the methods provided by the API. All methods are asynchronous, event-based calls to the controller (see the Implementation section below). As mentioned above, connected clients also receive notifications from the controller for state changes to any of their VMs. Client applications can have their own callbacks invoked for these notifications.

Component Interaction

Having described each of the Usher components individually, we now describe how they interact in more detail. We first discuss how the controller and LNMs interact, and then describe how the controller and clients interact. Note that clients never directly communicate with LNMs; in effect, the controller "proxies" all interactions between clients and LNMs.

Controller and LNM Interaction

When an LNM starts or receives a controller recovery notice, it connects to the controller specified in its configuration file. The controller authenticates all connections from LNMs, and encrypts the connection for privacy. Upon connection to the controller, the LNM passes a capability to the controller for access to its VM management API.

Using the capability returned by the LNM, the controller first requests information about the hardware configuration and a list of currently running virtual machines on the new node. The controller adds this information to its lists of running VMs and VMMs in the system. It then uses the capability to assume management of the VMs running on the LNM's node.

The controller also returns a capability back to the LNM. The LNM uses this capability for both event notification and periodic reporting of resource usage back to the controller.

When the controller discovers that a new node already has running VMs (e.g., because the node's LNM failed and restarted), it first determines if it should assume management of any of these newly discovered VMs. The controller makes this determination based solely upon the name of the VM. If the VM name ends with the domain name specified in the controller's configuration file, then the controller assumes it should manage this VM. Any VMs which it should not manage are placed on the unmanaged list discussed above. For any VMs which the controller should manage, the controller creates a VM object instance and places this object on its running VMs list. These instances are sent to the LNMs where the VMs are running and cached there. Whenever an LNM sees that a cached VM object is inconsistent with the corresponding VM running there (e.g., the state of the VM changed), it alerts the controller of this event. The controller then updates the cached object on the LNM. In this way, the update serves as an acknowledgment and the LNM knows that the controller received notice of the event.

Similarly, the controller sends VM object instances for newly created VMs to an LNM before the VM is actually started there. Upon successful return from a start command, the controller updates the VM's cached object state on the LNM. Subsequently, the LNM assumes the responsibility for monitoring and reporting any unexpected state changes back to the controller.

Controller and Client Interaction

Clients to the Usher system communicate with the controller. Before a client can make any requests, it must authenticate with the controller. If authentication succeeds, the controller returns a capability to the client for invoking its remote API methods. Clients use this API to manipulate VMs.

Similar to the local node managers, clients receive cached object instances corresponding to their VMs from the controller upon connection. If desired, clients can filter this list of VMs based upon virtual cluster grouping to limit network traffic. The purpose of the cached objects at the client is twofold. First, they provide a convenient mechanism by which clients can receive notification of events affecting their VMs, since the controller sends updates to each cached VM object when the actual VM is modified. Second, the cached VM objects provide state information to the clients when they request VM operations. With this organization, clients do not have to query the controller about the global state of the system before actually submitting a valid request. For example, a client should not request migration of a non-existent VM, or try to destroy a VM which it does not own. The client library is designed to check for these kinds of conditions before submitting a request. Note that the controller is capable of handling errant requests; this scheme simply offloads request filtering to the client.
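
A hedged sketch of that client-side check follows; the cached-object fields and the error type used here are illustrative assumptions.

    def validate_migrate_request(cached_vms, vm_name, user):
        """Reject obviously invalid requests before they reach the controller.

        cached_vms -- dict of cached VM objects the client received at connect time
        """
        vm = cached_vms.get(vm_name)
        if vm is None:
            raise ValueError("no such VM: %s" % vm_name)          # non-existent VM
        if vm.owner != user:
            raise ValueError("%s does not own %s" % (user, vm_name))
        return vm   # safe to hand the request to the controller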

The controller is the authority on the global state of the system. When the controller performs an action, it does so based on what it believes is the current global state. The cached state at the client reflects the controller's global view. For this reason, even if the controller is in error, its state is typically used by clients for making resource requests. The controller must be capable of recovering from errors due to inconsistencies between its own view of the global state of the system and the actual global state. These inconsistencies are typically transient (e.g., a late event notification from an LNM), in which case the controller may log an error and return an error message to the client.


Failures

As the Usher system runs, it is possible for the controller or any of the local node managers to become unavailable. This situation could be the result of hardware failure, operating system failure, or the server itself failing. Usher has been designed to handle these failures gracefully.

In the event of a controller failure, the LNMs will start a listening server for a recovery announcement sent by the controller. When the controller restarts, it sends a recovery message to all previously known LNMs. When the LNMs receive this announcement, they reconnect to the controller. As mentioned in the Controller and LNM Interaction section above, when an LNM connects to the controller, it passes information about its physical parameters and locally running VMs. With this information from all connecting LNMs, the controller recreates the global state of the system. With this design, Usher only requires persistent storage of the list of previously known LNMs rather than the entire state of the system to restore system state upon controller crash or failure.

Since the controller does not keep persistent information about which clients were known to be connected before a failure, it cannot notify clients when it restarts. Instead, clients connected to a controller which fails will attempt to reconnect with timeouts following an exponential backoff. Once reconnected, clients flush their list of cached VMs and receive a new list from the controller.
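
The reconnect loop amounts to the following sketch; the connect callable and the timeout bounds are assumptions made for illustration, and a Twisted-based client would schedule the retry on the reactor rather than sleeping, but the policy is the same.

    import random
    import time

    def reconnect_with_backoff(connect, base=1.0, cap=300.0):
        """Retry a failed controller connection with exponentially growing timeouts.

        connect -- callable that attempts the connection and raises on failure
        """
        delay = base
        while True:
            try:
                return connect()   # success: the caller then refreshes its cached VM list
            except Exception:
                time.sleep(delay + random.uniform(0, delay))   # jitter avoids synchronized retries
                delay = min(delay * 2, cap)                    # double the wait, up to a fixed cap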

The controller detects local node manager failures upon disconnect or TCP timeout. When this situation occurs, the controller changes the state of all VMs known to be running on the node with the failed LNM to lost. It makes no out-of-band attempts to determine if lost VMs are still running or if VMMs on which LNMs have failed are still running. The controller simply logs an error, and relies upon the Usher administrator or a recovery plugin to investigate the cause of the error.

Implementation

In this section we describe the implementation of Usher, including the interfaces that each component supports and the plugins and applications currently implemented for use with the system.

    Component               LoC
    LNM (w/ Xen hooks)       907
    Controller              1703
    Client API               750
    Utilities                633
    Ush                     1099

    Table 1: Code size of individual components.

The main Usher components are written in Python [2]. In addition, Usher makes use of the Twisted network programming framework [3]. Twisted provides convenient mechanisms for implementing event-based servers, asynchronous remote procedure calls, and remote object synchronization. Table 1 shows source code line counts for the main Usher components, for a total of 3993 lines of code. Also included is the line count for the ush application (over 400 of which is simply online documentation).

Local Node Managers

Local Node Managers export the remote API shown in Table 2 to the controller. This API is made available to the controller via a capability passed to the controller when an LNM connects to it.

    Method Name                  Description
    get_details(vm name)         get VM state information
    get_status(vm name)          get VM resource usage statistics
    receive(vm instance)         receive new cached VM object
    start(vm name)               start cached VM
    op_on(operation, vm name)    operate on existing VM
    migrate(vm name, lnm name)   migrate VM to LNM
    get_node_info()              get node physical characteristics
    get_node_status()            get node dynamic and resource usage info

    Table 2: Local node manager remote API.

This API includes methods to query for VM state information and VM resource usage details using the get_details and get_status methods, respectively. State information includes run state, memory allocation, IP and MAC addresses, the node on which the VM is running, VM owner, etc. Resource usage includes 1, 5, and 15-minute utilizations of the various hardware resources.

The receive method creates a cached copy of a VM object on an LNM. An LNM receives the cached copy when it connects to the controller. It compares the state of the VM object with the actual state of the virtual machine. If the states differ, the LNM notifies the controller, which updates the LNM's cached copy of the VM as an acknowledgment that it received the state change notice.

In addition, the cached copy of a VM at its LNM contains methods for manipulating the VM it represents. When a VM manipulation method exposed by the LNM's API is invoked (one of start, op_on, or migrate), the method calls the corresponding method of the cached VM object to perform the operation. This structure provides a convenient way to organize VM operations. To manipulate a VM, a developer simply calls the appropriate method of the cached VM object. Note that the controller must still update the state of its VM object as an acknowledgment that the controller knows the operation was successful.
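
The dispatch described above can be pictured as the following sketch, in which op_on simply forwards to the corresponding method of the cached VM object; the class shown is illustrative, not Usher's implementation.

    class LocalNodeManager(object):
        """Sketch of how op_on might dispatch to a cached VM object's methods."""

        def __init__(self):
            self.cached_vms = {}            # VM name -> cached VM object

        def receive(self, vm):
            self.cached_vms[vm.name] = vm   # cache a VM object sent by the controller

        def op_on(self, operation, vm_name):
            vm = self.cached_vms[vm_name]
            # 'operation' is one of the names in Table 3 (pause, resume, shutdown, ...);
            # the cached object's method of the same name performs the VMM call.
            return getattr(vm, operation)()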


Most operations on an existing VM are encapsulated in the op_on function, and have similar signatures. Table 3 shows the list of valid operations to the op_on method.

    Operation    Description
    pause        pause VM execution, keeping memory image resident
    resume       resume execution of a paused VM
    shutdown     nicely halt a VM
    reboot       shutdown and restart VM
    hibernate    save VM's memory image to persistent storage
    restore      restore hibernated VM to run state
    destroy      hard shutdown a VM
    cycle        destroy and restart a VM

    Table 3: Operations supported by the op_on method.

All VM operations invoke a corresponding operation in the VMM's administration API. Though Usher currently only manages Xen VMs, it is designed to be VMM-agnostic. An installation must provide an implementation of Usher's VMM interface to support new virtual machine managers.
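
The shape of such a backend can be sketched as an abstract base class like the one below; the method names mirror the operations discussed in this section, but the interface itself is an assumption rather than Usher's actual definition.

    import abc

    class VMMBackend(abc.ABC):
        """Hypothetical interface an installation would implement for its VMM platform."""

        @abc.abstractmethod
        def start(self, vm):
            """Create and boot the given VM on this node."""

        @abc.abstractmethod
        def op_on(self, operation, vm_name):
            """Apply one of the Table 3 operations (pause, resume, shutdown, ...)."""

        @abc.abstractmethod
        def migrate(self, vm_name, destination_node):
            """Live-migrate the named VM to another node."""

    class XenBackend(VMMBackend):
        """A Xen-specific implementation would translate each call into Xen's admin API."""

        def start(self, vm):
            raise NotImplementedError("call into the Xen management API here")

        def op_on(self, operation, vm_name):
            raise NotImplementedError("map the operation onto the Xen equivalent")

        def migrate(self, vm_name, destination_node):
            raise NotImplementedError("invoke Xen live migration here")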

The LNM's remote API exposes a few methods that do not operate on VMs. The get_node_info method returns hardware characteristics of the physical machine. The controller calls this method when an LNM connects. The get_node_status method is similar to the get_status method. Additionally, it reports the number of VMs running on the VMM and the amount of free memory on the node.

Usher Controller

The remote API exported by the controller to connecting clients closely resembles the interface exported by LNMs to the controller. Table 4 lists the methods exported by the controller to Usher clients. This API is made available to clients via a capability passed upon successful authentication with the controller.

    Method Name                    Description
    list(vm list, status)          list state and resource usage information for VMs
    list_lnms(lnm list, status)    list LNMs and resource usage information for VMMs
    start(vm list, lnm list)       start list of VMs on LNMs
    op_on(operation, vm list)      operate on existing VMs
    migrate(vm list, lnm list)     migrate VMs to LNMs

    Table 4: Controller remote API for use by Usher clients.

Note that most of these methods operate on lists of VMs, rather than the single VMs expected by the LNM API methods. Since Usher was designed to manage clusters, the common case is to invoke these methods on lists of VMs rather than on a single VM at a time. This convention saves significant call overhead when dealing with large lists of VMs.

The start and migrate methods both take a list of LNMs. For start, the list specifies the LNMs on which the VMs should be started. An empty list indicates that the VMs can be started anywhere. Recall that this parameter is simply a suggestion to the controller. Policies installed in the controller dictate whether or not the controller will honor the suggestion. Likewise, the LNM list passed to the migrate method is simply a suggestion to the controller as to where to migrate the VMs. The controller can choose to ignore this suggestion or ignore the migrate request altogether based upon the policies installed.

The operations supported by the op_on method in the controller API are the same as those to the op_on method of the remote LNM API (Table 3).

Client API

The client API closely mirrors that of the controller. An important difference between these two APIs, though, is that the client API signatures contain many additional parameters to aid in working with large sets of VMs. These additional parameters allow users to operate on arbitrary sets of VMs and virtual clusters in a single method call. The API supports specifying VM sets as regular expressions, explicit lists, or ranges (when VM names contain numbers). The client API also allows users to specify source and destination LNMs using regular expressions or explicit lists.

Another difference between the client and controller APIs is that the client API expands the op_on method into methods for each type of operation. Explicitly enumerating the operations as individual methods avoids confusing application writers unfamiliar with the op_on method. These methods simply wrap the call to the op_on method, which is still available for those wishing to call it directly.

Finally, the client API contains connect and reconnect methods. These methods contact and authenticate with the controller via SSL. They also start the client's event loop to handle cached object updates and results from asynchronous remote method calls. The reconnect method is merely a convenience method to avoid having to pass credentials to the API if a reconnect is required after having been successfully connected. This method can be used by a reconnecting application upon an unexpected disconnect.
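
Putting the pieces of this subsection together, a client application might look roughly like the sketch below. The module path, connect arguments, and operation method names are assumptions based on the API described here (connect, start, migrate, and regular-expression VM sets), not verbatim Usher code.

    # Hypothetical client script; all names are illustrative.
    from usher.client import UsherClient          # assumed module path

    client = UsherClient()
    client.connect("controller.example.org", username="alice")   # authenticates over SSL

    # Start four web-server VMs; an empty LNM list lets the controller place them.
    client.start(["web1", "web2", "web3", "web4"], [])

    # Operate on a set of VMs given as a regular expression.
    client.shutdown("web[0-9]+")

    # Suggest a destination for a migration; installed policies may override it.
    client.migrate(["web1"], ["lnm05"])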

Configuration Files

All Usher components are capable of reading configuration data from text files. All valid configuration parameters, their type, and default values are specified in the code for each component. When each component starts, it first parses its configuration files (if found). The configuration system tries to read in values from the following locations (in order): a default location in the host filesystem, a default location in the user's home directory, and finally a file indicated by an environment variable. This search ordering enables users to override default values easily. Values read in later configuration files replace values specified in a previously read file.
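
The layered lookup can be sketched as follows; the file paths, the environment variable name, and the ConfigParser format are assumptions chosen for illustration.

    import os
    from configparser import ConfigParser

    def load_config(component):
        """Read settings from the standard locations, later files overriding earlier ones."""
        config = ConfigParser()
        candidates = [
            "/etc/usher/%s.conf" % component,                    # default host-wide location
            os.path.expanduser("~/.usher/%s.conf" % component),  # per-user override
            os.environ.get("USHER_CONFIG", ""),                  # explicit override via env var
        ]
        for path in candidates:
            if path and os.path.exists(path):
                config.read(path)   # values read here replace any read earlier
        return config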

Plugins

Plugins are separate add-on modules which can be registered to receive notification of nearly any event in the system. Plugins live in a special directory (aptly named "plugins") of the Usher source tree. Usher also looks in a configurable location for third-party/user plugins. Any plugins found are automatically sourced and added to a list of available plugins. To register a plugin, the controller provides an additional API call register_plugin(plugin name, event, configuration file). Each plugin is required to provide a method named entry_point to be called when an event fires for which it is registered. It is possible to add a single plugin to multiple event handler chains. Note that the register_plugin method can be called from anywhere in the Usher code.

By default, plugins for each event are simply called in the order in which they are registered. Therefore careful consideration must be given to ordering while registering plugins. A plugin's configuration object can optionally take an order parameter that governs the order in which plugins are called on the event's callback list. The plugin API also provides a converse unregister_plugin call to change event handling at runtime.

Plugins can be as simple or complex as necessary. Since the controller invokes plugin callback chains asynchronously, complex plugins should not interfere with the responsiveness of the Usher system (i.e., the main controller event loop will not block waiting for a plugin to finish its task).
       Policies in an Usher installation are implemented                         Database       260
as plugins. As an example, an administrator may have                             LDAP           870
strict policies regarding startup and migration of virtu-                        Filesystem      54
al machines. To enforce these policies, a plugin (or                             DNS             90
plugins) is written to authorize start and migrate re-                           Placement      132
quests. This plugin gets registered for the start_request              Table 5: Code size of UCSD plugins.
and migrate_request events, either manually using the
controllers register_plugin command, or by specifying                When a request to start a list of VMs arrives, the
these registrations in the controller’s configuration          controller calls the modules registered for the ‘‘start
file. Once registered, subsequent start and migrate re-        request’’ event. The placement and database modules
quests are passed to the plugin (in the form of a Re-          are in the callback list for this event. The placement
quest object) for authorization. At this point, the            module first determines which VMs to start based on
plugin can approve, approve with modification, or              available resources and user limits, then determines
simply reject the request. Once this is done, the re-          where each of the allowed VMs will start. The data-
quest is passed on to any other plugins registered on          base module receives the modified request list, logs
the start_request or migrate_request event lists with a        the request, then reserves IP addresses for each of the
higher order attribute.                                        new VMs.
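To make the plugin interface described above concrete, the sketch below shows what a minimal round-robin placement plugin registered for the start_request event might look like. Only register_plugin, entry_point, the order parameter, and the Request object come from the text; the entry_point signature, the request attributes, and the node list are illustrative assumptions rather than Usher’s actual API.

    # rr_placement.py -- illustrative sketch, not Usher's actual plugin code.
    # Assumption: the controller passes the pending Request object to
    # entry_point, and the plugin may modify it in place ("approve with
    # modification") before returning it. The attribute names request.vms
    # and request.placement are hypothetical, as is the static node list.

    import itertools

    NODES = ["node01", "node02", "node03"]   # hypothetical pool of physical hosts
    _next_node = itertools.cycle(NODES)

    def entry_point(event, request):
        # Invoked when a start_request event fires; assigns VMs round-robin.
        for vm in request.vms:
            request.placement[vm] = next(_next_node)
        return request

Such a plugin would then be registered either in the controller’s configuration file or at runtime, for example with register_plugin("rr_placement", "start_request", "rr_placement.conf") (hypothetical argument values), and its configuration object could set an order value high enough that it runs after the authorization plugin.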


The controller generates a separate VM start command for each VM in the start list. Prior to invoking the start command, the controller triggers a “register VM” event. The database, DNS, and filesystem plugin modules are registered for this event. The database module adds the VM to a “VMs” table to reflect the fact that this VM is now included in the controller’s view of the global state. The DNS plugin simply sends a DDNS update to add an entry for this VM in our DNS server. The filesystem module prepares mount points on an NFS server for use by each VM.

Finally, upon return from each start command, a “start complete” event fires. The database module, which registers for this event, checks the result of the command, logs it to the database, and then marks the corresponding IP address as either used (upon success) or available (upon failure). Note that the database module does not change the state of the VM in the VMs table until it receives a “state changed” event for this VM (which originates at the LNM).

Sites can install or customize plugins as necessary. The Usher system supports taking arbitrary input and passing it to plugins on request events. For example, a request to start a VM may include information about which OS to boot, which filesystem to mount, and so on. Plugin authors can use this mechanism to completely customize VMs at startup.

Applications

We have written two applications using the client API, a shell named Ush and an XML-RPC server named plusher, and are developing two others: a Web interface and a command-line suite. The Web interface will provide a convenient point-and-click interface accessible from any computer with a web browser. The command-line suite will facilitate writing shell scripts to manage virtual machines.

Ush Client

The Usher shell Ush provides an interactive command-line interface to the API exported by the Usher controller. Ush provides persistent command-line history and comes with extensive online help for each command. If a configuration file is provided, Ush can read connection details and other startup parameters from it. Ush is currently the most mature and preferred interface for interacting with the Usher system.

As an example of using the Usher system, we describe a sample Ush session from the UCSD Usher installation, along with a step-by-step description of the actions taken by the core components to perform each request. In this example, user “mmcnett” requests ten VMs. Figure 3 contains a snapshot of Ush upon completion of the start command.

First, the user connects to the Usher controller by running the “connect” command. As part of connecting, the controller receives the user’s credentials and checks them against the LDAP database. Once authentication succeeds, the controller returns a capability for its remote API and all of user mmcnett’s VMs. The somewhat unusual output shown in Figure 3 reflects the fact that all client calls to the controller are asynchronous. When “connect” returns, Ush responds with the “Command 0 result:” message followed by the actual result, “Connected”.

Upon connecting, Ush saves the capability and the cached VM instances sent by the controller. Once connected, the user runs the “list” command to view his currently running VMs. Since the client already has cached instances of user mmcnett’s VMs, the list command does not invoke any remote procedures. Consequently, Ush responds immediately, indicating that user mmcnett already has two VMs running.

The user then requests the start of ten VMs in the “sneetch” cluster. In this case, the -n argument specifies the name of a cluster, and the -c argument specifies how many VMs to start in this cluster. When the controller receives this request, it first calls on the authorization and database modules to authorize the request and reserve IP addresses for the VMs to be started. Next, the controller calls the initial placement plugin to map where the authorized VMs should be started. The controller then calls the start method of the remote LNM API on each new VM’s LNM, and the LNMs call the corresponding method of the VMM administration API to start each VM. Upon successful return of all of these remote method calls, the controller responds to the client that the ten VMs were started in two seconds and provides information about where each VM was started. Once the VMs complete their boot sequence, user mmcnett can ssh into any of them by name.

Figure 3: Ush
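The event-driven sequence just described (authorization, placement, and IP reservation at “start request”; a “register VM” event before each LNM call; and a “start complete” event upon return) can be summarized in the schematic controller-side sketch below. The event names and their ordering follow the text above, but the dispatch helper, the request attributes, and the LNM proxy interface are simplified assumptions, not Usher’s actual implementation.

    # Schematic sketch of the controller-side start flow described above.
    # Only the event names (rendered here as identifiers) and their ordering
    # come from the text; the callback registry, request attributes, and
    # LNM proxy are assumptions made for illustration.

    EVENT_CALLBACKS = {
        "start_request": [],    # e.g., authorization, placement, database plugins
        "register_vm": [],      # e.g., database, DNS, filesystem plugins
        "start_complete": [],   # e.g., database plugin marks the IP used/available
    }

    def fire(event, payload):
        # Invoke every plugin registered for this event, in order.
        for callback in EVENT_CALLBACKS[event]:
            payload = callback(event, payload) or payload
        return payload

    def handle_start_request(request, lnm_for_vm):
        # lnm_for_vm maps each VM to a (hypothetical) proxy for its LNM,
        # whose start() asks the local VMM to create the VM.
        request = fire("start_request", request)        # authorize, place, reserve IPs
        for vm in request.vms:                          # hypothetical attribute
            fire("register_vm", vm)                     # DB row, DDNS update, NFS mount point
            result = lnm_for_vm[vm].start(vm)           # remote LNM -> VMM administration API
            fire("start_complete", (vm, result))        # log result, update IP state
        return request

Because the controller invokes these callback chains asynchronously, as noted earlier, a slow plugin on any of these lists delays only its own chain rather than the main controller event loop.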


Plusher

Plush [7] is an extensible execution management system for large-scale distributed systems, and plusher is an XML-RPC server that integrates Plush with Usher. Plush users describe batch experiments or computations in a domain-specific language. Plush uses this input to map resource requirements to physical resources, bind a set of matching physical resources to the experiment, set up the execution environment, and finally execute, monitor, and control the experiment.

Since Usher is essentially a service provider for the virtual machine “resource”, it was natural to integrate it with Plush. This integration enables users to request virtual machines (instead of physical machines) for running their experiments using a familiar interface.

Developing plusher was straightforward. Plush already exports a simple control interface through XML-RPC for integrating with resource providers, and it requires providers to implement a small number of up-calls and down-calls. Down-calls let Plush invoke the provider, while up-calls let the provider notify Plush of asynchronous events. For example, Plush uses a down-call to request resources asynchronously so that it does not have to wait for resource allocation to complete before continuing; when the provider finishes allocating resources, it notifies Plush with an up-call.

To integrate Plush and Usher in plusher, we only needed to implement stubs for this XML-RPC interface in Usher. The XML-RPC stub uses the Client API to talk to the Usher controller. The stub also acts as a proxy for authentication: it relays the authentication information (provided by users to Plush) to the controller before proceeding. When the requested virtual machines have been created, plusher returns a list of their IP addresses to Plush; if the request fails, it returns an appropriate error message.

Usher Deployments

Next we describe two deployments of Usher that are in production use at different sites. The first deployment serves the UCSD CSE Systems and Networking research group, and the second is at the Russian Research Center, Kurchatov Institute (RRC-KI). The two sites have very different usage models and computing environments. In describing these deployments, our goal is to illustrate the flexibility of Usher to meet different virtual machine management requirements and to concretely demonstrate how sites can extend Usher to achieve complex management goals. Usher does not force a site to set up or manage its infrastructure in the way either of these two installations does.

UCSD SysNet

The UCSD CSE Systems and Networking (SysNet) research group has been using Usher experimentally since June 2006 and in production since January 2007. The group consists of nine faculty, 50 graduate students, and a handful of research staff and undergraduate student researchers. The group has a strong focus on experimental networking and distributed systems research, and most projects require large numbers of machines. As a result, the demand for machines far exceeds the supply of physical machines, and juggling physical machine allocations never satisfies all parties. However, for most of their lifetimes, virtual machines can satisfy the needs of nearly all projects: resource utilization is bursty with very low averages (1% or less), an ideal situation for multiplexing; virtualization overhead is an acceptable trade-off for the benefits Usher provides; and users have complete control over their clusters of virtual machines and can fully customize their machine environments. Usher can also isolate machines, or even remove them from virtualization use, for particular circumstances (e.g., obtaining final experimental results for a paper deadline) and simply place them back under Usher management when the deadline passes.

At the time of this writing, we have staged 25 physical machines from our hardware cluster into Usher. On those machines, Usher has multiplexed up to 142 virtual machines in 26 virtual clusters, with an average of 63 VMs active at any given time. Our Usher controller runs on a Dell PowerEdge 1750 with a 2.8 GHz processor and 2 GB of physical memory, and this system easily handles our workload. Although load is mostly dictated by plugin complexity, using the plugins discussed below, the Usher controller consumes less than 1 percent CPU on average (managing 75 virtual machines) with a memory footprint of approximately 20 MB. The Usher implementation is sufficiently reliable that we are now migrating the remainder of our user base from dedicated physical machines to virtual clusters, and Usher will soon manage all 130 physical nodes in our cluster.

Usher Usage

The ability to easily create and destroy arbitrary numbers of virtual machines has proved very useful, and the SysNet group has used this capability in a variety of ways. As expected, it has greatly eased demand for physical machines within the research group: projects simply create VMs as necessary. Usher has also been used to create clusters of virtual machines for students in a networking course; each student can create a cluster on demand to experiment with a distributed protocol implementation. The group also previously reserved a set of physical machines for general login access (as opposed to reserved use by a specific research project). With Usher, a virtual cluster of convenience VMs now serves this purpose, and an alias with round-robin DNS provides a logical machine name for reference while distributing users among the VMs upon login. Even mundane tasks, such as experimenting with software installations or configurations, benefit because the cost of creating a new machine is negligible. Rather than having to undo mistakes, a user can simply destroy a VM with an aborted configuration and start from scratch with a new one.

The SysNet group currently uses a simple policy module in Usher to determine the scheduling and