Frangipani: A Scalable Distributed File System


C. A. Thekkath, T. Mann, and E. K. Lee
Systems Research Center, Digital Equipment Corporation

Presented by: Long Zhang
These slides combine material from a previous offering of this course with Frangipani's original SOSP '97 slides.
Motivation

• Large-scale distributed file systems are hard to administer.

• Administration is a problem because of
   - the size of the installation
   - the number of components
Outline
   Background
   Introduction
   System Structure
   Disk Layout
   Logging and Recovery
   The Lock Service
   Easy Administration
   Performance
   Conclusions
   Questions
Background (cont'd)
• Original slides: http://ftp.digital.com/pub/Digital/SRC/publications/thekkath/talk/frangipani-sosp.ppt
• This paper builds on two related papers:
   - Edward K. Lee and Chandramohan A. Thekkath. Petal: Distributed Virtual Disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 84-92, Cambridge, Massachusetts, October 1996.
   - Leslie Lamport. The Part-Time Parliament. Technical Report 49, Digital Equipment Corporation, Systems Research Center, 130 Lytton Ave., Palo Alto, CA 94301-1044, September 1989.
Related Work

NFS (Sandberg et al., '85, Sun)
VAXClusters (Kronenberg, Levy, and Strecker, '86, DEC)
AFS (Howard et al., '88, CMU)
Echo (Mann et al., '94, SRC)
xFS (Anderson et al., '95, Berkeley)
Calypso (Devarakonda, Kish, and Mohindra, '95, IBM)
Shillner and Felten ('96, Princeton)
Introduction
• Many distributed file systems already exist: the VMS Cluster file system, Echo, Calypso, etc.
• In general, large-scale distributed file systems are hard to manage: much of the file system administration work requires human intervention and must be done manually.
• The administration problem is caused by
   - growing computer installations, and
   - more disks attached to more machines (more components).
Introduction – Solution
• Frangipani
   - A new scalable distributed file system.
   - Two-layer model: built on top of Petal, a distributed storage system.
   - Can also be viewed as a cluster file system.
• It addresses the administration problem by
   - giving all users a consistent view of files;
   - letting Frangipani servers be added to an existing installation easily to improve performance;
   - adding users without manual configuration;
   - supporting dynamic (hot) backup;
   - tolerating machine, network, and disk failures.
Petal Prototype

[Figure: Petal prototype. Petal clients connect over a switched network to Petal servers; each server has locally attached disks, and together the servers export a single Petal virtual disk.]
Introduction – Layered structure
[Figure: layered structure. User programs sit above the Frangipani file servers, which rely on the distributed lock service and the Petal distributed virtual disk; Petal in turn runs on the physical disks.]
System Structure – Common configuration

[Figure: a set of workstations running Frangipani, all sharing one Petal virtual disk.]
System Structure – Components
• User programs access Frangipani through the standard operating system call interface (the Digital Unix vnode interface).
• The Frangipani file server module runs inside the OS kernel.
   - Changes to file contents are staged through the local kernel buffer pool and may remain volatile until the next fsync/sync system call.
   - Metadata changes are logged to Petal and are guaranteed to be non-volatile (via a write-ahead redo log, discussed later).
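To make the durability split concrete, here is a minimal user-space sketch of the semantics just described (the file path is hypothetical): file data written into the kernel buffer pool is only guaranteed durable after fsync, while metadata updates are made durable through Frangipani's redo log on Petal.

import os

# Minimal sketch of the durability semantics above, as seen by an
# application; the file path is hypothetical.
fd = os.open("/frangipani/example.txt", os.O_WRONLY | os.O_CREAT, 0o644)

# write() stages the data in the local kernel buffer pool; until the next
# fsync/sync it may still be lost if the Frangipani server crashes.
os.write(fd, b"hello, frangipani\n")

# fsync forces the file contents out to Petal, after which they are durable.
# Metadata (the inode, directory entries) is protected separately by
# Frangipani's write-ahead redo log on Petal.
os.fsync(fd)
os.close(fd)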
Components
• The Frangipani file server module reads and writes Petal virtual disks using the local Petal device driver.
   - It exploits Petal's large virtual address space.
   - More details are given in a separate paper.
• The lock service
   - Multiple-reader/single-writer locks
   - Locks with leases (discussed later)
Client/Server configuration
• Security issues:
   - Any Frangipani machine can read or write any block of the shared Petal virtual disk.
   - Eavesdropping is possible on the network interconnecting the Petal and Frangipani machines.
• Solution: run Frangipani, Petal, and the lock servers only on trusted networks, machines, and operating systems.
• Client/Server configuration:
   - All the servers are interconnected by a private network.
   - Remote, untrusted clients talk to Frangipani servers through a separate network and have no access to Petal.
   - Bonus: clients can use Frangipani without modifying their own machines.
Client/Server configuration

[Figure: client/server configuration. Untrusted clients reach the Frangipani servers over a separate network and have no direct access to Petal.]
System Structure – Design issues
• Why not use an old file system on Petal?
   - Petal does work with old file systems.
   - But traditional file systems such as UFS and AdvFS (the comparison target in the performance section) cannot share a block device.
   - The machine that runs the file system can become a bottleneck.
• Why choose a two-layer structure?
   - The two-layer structure is not unique, e.g. the Universal File Server.
   - Modularity: Frangipani machines can be added and deleted transparently.
   - Consistent backup without halting the system.
Design issues (cont'd)
• Three aspects of the Frangipani design can be problematic:
   - Duplicated logging: data is sometimes logged by both Petal and Frangipani.
   - Frangipani does not use disk location information when placing data.
   - Frangipani locks entire files and directories rather than individual blocks.
Disk Layout
• 2^64 bytes of address space provided by Petal
• Petal commits/decommits physical space in large chunks – 64 KB
• Six regions in the address space:
   - The 1st region stores shared configuration parameters and housekeeping information – 1 TB
   - The 2nd region stores logs; each Frangipani server has its own. 1 TB is reserved, partitioned into 256 logs.
   - The 3rd region holds allocation bitmaps, which describe which blocks in the remaining regions are free – 3 TB
   - The 4th region holds inodes: 1 TB of inode space, each inode 512 bytes.
Disk Layout (cont'd)
   - The 5th region holds small data blocks, each 4 KB in size – 7 TB allocated.
   - The remainder holds large data blocks, 1 TB per large block, giving a limit of 2^24 large files.
• Frangipani takes advantage of Petal's large, sparse disk address space to simplify its data structures.
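As a rough illustration of how this sparse layout simplifies address computation, the sketch below maps inode and block numbers to Petal offsets using the region sizes from the slides above. Laying the regions back-to-back from offset 0 is an assumption made for illustration; the slides give only the sizes, not the base offsets.

# Sketch of the region layout described above. Region sizes come from the
# slides; placing the regions back-to-back from offset 0 is an assumption.
TB = 1 << 40

CONFIG_BASE = 0 * TB    # region 1: shared parameters/housekeeping (1 TB)
LOG_BASE    = 1 * TB    # region 2: 1 TB of log space, split into 256 logs
BITMAP_BASE = 2 * TB    # region 3: allocation bitmaps (3 TB)
INODE_BASE  = 5 * TB    # region 4: inodes, 512 bytes each (1 TB)
SMALL_BASE  = 6 * TB    # region 5: 4 KB small data blocks (7 TB)
LARGE_BASE  = 13 * TB   # remainder: one 1 TB range per large data block

INODE_SIZE  = 512
SMALL_BLOCK = 4 * 1024
LARGE_BLOCK = 1 * TB

def log_offset(server_id: int) -> int:
    """Start of a server's log partition (1 TB divided into 256 partitions)."""
    return LOG_BASE + server_id * (1 * TB // 256)

def inode_offset(inode_number: int) -> int:
    """Petal address of an inode; 1 TB of 512-byte inodes allows 2**31 inodes."""
    return INODE_BASE + inode_number * INODE_SIZE

def small_block_offset(block_number: int) -> int:
    return SMALL_BASE + block_number * SMALL_BLOCK

def large_block_offset(file_index: int) -> int:
    """Each large file gets its own 1 TB range, hence the 2**24 large-file limit."""
    return LARGE_BASE + file_index * LARGE_BLOCK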
Logging and Recovery
• Frangipani uses a write-ahead redo log for metadata
   - Metadata: any on-disk data structure other than the contents of an ordinary file.
   - Log records are kept on Petal.
   - Logs are bounded in size – 128 KB.
• Data is written to Petal
   - on fsync/sync system calls, or every 30 seconds;
   - on lock revocation, or when the log wraps.
• Each Frangipani machine has a separate log
   - Reduces contention
   - Allows independent recovery
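The sketch below illustrates the kind of redo log record and replay loop this implies: each entry carries a monotonically increasing sequence number, and each block update carries a version number, so recovery can detect the end of the log and skip updates that are already on disk (this interplay is revisited in the discussion slides at the end). Field names and widths are illustrative assumptions, not Frangipani's actual on-disk format.

from dataclasses import dataclass
from typing import List

@dataclass
class BlockUpdate:
    block_addr: int    # which metadata block this update touches
    new_version: int   # per-block version number carried in the log record
    data: bytes        # redo data to (re)apply

@dataclass
class LogEntry:
    sequence: int                # monotonically increasing per-server sequence number
    updates: List[BlockUpdate]   # metadata updates described by this entry

def replay(entries: List[LogEntry], block_versions: dict, apply) -> None:
    """Replay a server's log during recovery.

    An update is redone only if its version is newer than the version
    currently on disk, so updates that already reached Petal are skipped.
    """
    last_seq = None
    for entry in entries:
        # A non-increasing sequence number marks the logical end of the
        # (bounded, wrapping) log.
        if last_seq is not None and entry.sequence <= last_seq:
            break
        last_seq = entry.sequence
        for upd in entry.updates:
            if upd.new_version > block_versions.get(upd.block_addr, -1):
                apply(upd)
                block_versions[upd.block_addr] = upd.new_version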
Logging and Recovery (cont'd)
• Frangipani server crashes can be detected in two ways:
   - by a client of the failed server, or
   - when the lock service asks the failed server to return a lock it is holding and gets no response.
• In general, recovery is initiated by the lock service.
   - A recovery daemon takes ownership of the failed server's log and locks.
   - After recovery, it releases all the locks and frees the log.
Lock Services
• Multiple-reader/single-writer lock mechanism (sketched after this slide)
   - A read lock allows a server to read the data and cache it.
   - A write lock allows a server to read or write the data.
   - When a write lock is downgraded or released, the server must flush its dirty data to disk.

• Locks are moderately coarse-grained
   - One lock per logical segment
      - Each file, directory, or symbolic link is one segment.
   - A lock protects an entire file or directory.
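A minimal sketch of these lock semantics follows; the class and callback names are hypothetical. The key invariant is that dirty cached data is flushed before a write lock is downgraded or released.

from enum import Enum

class LockMode(Enum):
    NONE = 0
    READ = 1    # may read and cache the protected data
    WRITE = 2   # may read, write, and cache dirty data

class CachedSegment:
    """Sketch of one locked file/directory as seen by a Frangipani server.

    flush_dirty_data stands in for writing the segment's dirty blocks (and
    logging its metadata changes) back to Petal; the name is hypothetical.
    """

    def __init__(self, flush_dirty_data):
        self.mode = LockMode.NONE
        self.flush_dirty_data = flush_dirty_data

    def change_mode(self, new_mode: LockMode) -> None:
        # Invariant from the slide: before a write lock is downgraded or
        # released, all dirty data it protects must be flushed to disk.
        if self.mode == LockMode.WRITE and new_mode != LockMode.WRITE:
            self.flush_dirty_data()
        self.mode = new_mode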
Lock Services (cont'd)
• Deadlock is avoided by globally ordering the locks

• and by acquiring them in two phases (sketched below):
   - Phase one: a server determines which locks it needs – which files or directories, and whether each needs a read or a write lock.
   - Phase two: the server sorts the locks by inode address and acquires each lock in turn.
      - It then checks whether any objects identified in phase one were modified while their locks were released. If so, the server releases the locks and loops back to phase one.
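The following is a minimal sketch of that two-phase, deadlock-avoiding acquisition loop; determine_needed_locks, lock_service, and was_modified are hypothetical stand-ins for Frangipani internals.

def acquire_locks_for_operation(determine_needed_locks, lock_service, was_modified):
    """Acquire all locks an operation needs, avoiding deadlock.

    determine_needed_locks() returns a list of (inode_addr, mode) pairs;
    was_modified(inode_addr) reports whether an object changed while its
    lock was not held. Both are hypothetical stand-ins.
    """
    while True:
        # Phase one: decide which files/directories are involved and whether
        # each needs a read or a write lock.
        needed = determine_needed_locks()

        # Phase two: sort by inode address (a global order) and acquire each
        # lock in turn; the fixed order prevents deadlock between servers.
        acquired = []
        for inode_addr, mode in sorted(needed, key=lambda pair: pair[0]):
            acquired.append(lock_service.acquire(inode_addr, mode))

        # If any object identified in phase one was modified while its lock
        # was not held, release everything and loop back to phase one.
        if any(was_modified(inode_addr) for inode_addr, _ in needed):
            for lock in acquired:
                lock_service.release(lock)
            continue
        return acquired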
Lock Services (cont'd)
• The lock service deals with failure of its clients (the Frangipani servers) using leases (see the lease sketch after the implementation discussion below).
   - A client obtains a lease together with each lock. The client must renew the lease before it expires; otherwise the lock becomes invalid.
• Three different implementations were built (key problem: where to store the lock state?):
   - 1st: A single, centralized server. All lock state is kept in the server's volatile memory.
   - 2nd: Primary/backup servers. The lock state is stored on a Petal virtual disk, so it can be recovered after a server crash. Poor performance.
Lock Services (cont'd)
   - 3rd and final: A set of mutually cooperating lock servers, plus a clerk module linked into each Frangipani server. Result: fully distributed for fault tolerance and scalable performance.
• Highlights of the final implementation:
   - The lock servers maintain a lock table for each Frangipani server; the clerk module handles the communication (via asynchronous messages).
   - A small amount of global state is replicated across all lock servers using Lamport's Paxos algorithm. (Paxos is also used in Google's Chubby lock service, http://labs.google.com/papers/chubby.html.)
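To make the lease mechanism concrete, here is a minimal sketch of lease tracking on the lock holder's side. The lease term constant and the renew() interface are illustrative assumptions; the point is only that a lock becomes unusable once its lease expires and must be renewed before then.

import time

class LeasedLock:
    """Sketch of a lock held under a lease, as described above."""

    LEASE_TERM = 30.0  # seconds; an illustrative value, not taken from the slides

    def __init__(self, lock_service, lock_id):
        self.lock_service = lock_service
        self.lock_id = lock_id
        self.expires_at = time.monotonic() + self.LEASE_TERM

    def is_valid(self) -> bool:
        return time.monotonic() < self.expires_at

    def renew(self) -> None:
        # The holder must renew before expiry; once the lease has expired the
        # lock service may treat the holder as failed and reassign the lock.
        if not self.is_valid():
            raise RuntimeError("lease expired; the lock is no longer valid")
        self.lock_service.renew(self.lock_id)
        self.expires_at = time.monotonic() + self.LEASE_TERM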
Easy Administration
(adding/removing servers)
• Adding another Frangipani server requires a minimal amount of administrative work: telling it
   - which Petal virtual disk to use, and
   - where to find the lock service (see the configuration sketch below).

• Removing a Frangipani server is even easier:
   - Simply shut the server off. The lock servers invalidate the locks held by that server after its lease expires and initiate the recovery service to run its redo log.
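As an illustration, the entire per-server configuration could be as small as the sketch below; the key names and values are hypothetical, not Frangipani's actual configuration format.

# Hypothetical configuration for a newly added Frangipani server: per the
# slide above, the only administrative inputs are the Petal virtual disk to
# use and where to find the lock service.
new_server_config = {
    "petal_virtual_disk": "/dev/petal/shared-fs",              # illustrative name
    "lock_service": ["lockserver1:1234", "lockserver2:1234"],  # illustrative addresses
}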
Easy Administration – backup
• Petal's snapshot feature provides a convenient way to make a consistent full dump of a Frangipani file system.
   - It uses copy-on-write techniques.
   - It is crash-consistent: a snapshot reflects a coherent state.

• To back up a Frangipani file system:
   - take a Petal snapshot, and
   - copy it to tape.
Performance – Experimental setup
• Non-volatile memory (NVRAM)
   - Solves the Frangipani server latency problem.
   - Placed between the physical disks and the Petal servers.
• Ideal testbed:
   - 100 Petal nodes (small array controllers)
   - 50 Frangipani servers (typical workstations)
• Reality:
   - Seven 333 MHz DEC Alpha 500 5/333 machines as Petal servers.
   - Each has 9 DIGITAL RZ29 disks, 4.3 GB each.
   - Connected by a 24-port ATM switch with 155 Mbit/s links.
Single Machine Performance
• Why compare against AdvFS?
   - It is significantly faster than the BSD-derived UFS file system.
   - It can stripe files across multiple disks.
   - It uses a write-ahead log, like Frangipani.
• The Frangipani file system does not use local disks, while AdvFS uses locally attached disks.
• For the MAB, the file system is unmounted at the end of each phase, for the same reason as in tests performed for log-based file systems.
Single Machine Performance

[Table 1: Modified Andrew Benchmark with unmount operations]

[Table 2: Frangipani throughput and CPU utilization]
Scaling
[Figure: Frangipani scaling on the Modified Andrew Benchmark. Elapsed time (seconds) for the Copy, Stat, Scan, Create, and Compile phases versus the number of Frangipani machines (1-6).]
Scaling (cont'd)
[Figure: Frangipani scaling on uncached read. Aggregate read throughput (MB/s) versus the number of Frangipani machines (1-6).]
Scaling (cont'd)
[Figure: Frangipani scaling on write. Aggregate write throughput (MB/s) versus the number of Frangipani machines (1-6).]
Discussion
• I am a bit worried about its locking granularity. What if we could lock individual blocks rather than whole files or directories? How would that affect the overall performance of the system?
• Petal uses data replication for high availability. Maintaining consistency among several copies in a distributed system is inherently difficult, so how does Petal deal with this issue?

Conclusions
• Frangipani is feasible to build because of its two-layer structure.
   - All shared state is on a Petal disk, so it is easy to add, delete, and recover servers.
   - Frangipani servers do not communicate with each other, which makes the system simple to design, implement, debug, and test.
• Frangipani's performance is comparable to a production DIGITAL Unix file system (AdvFS).
• It is still an early prototype; more experience is needed to improve scalability, add finer-grained locking, etc.
• Applications:
   - The design of Compaq's VersaStore products.
   - It predates many of the storage and NAS appliances in the industry today.
Discussions
• During logging and recovery, each entry in the log is given a monotonically increasing sequence number, and each log record carries a version number for the block it updates. These are used to detect the end of the log (if the next entry's sequence number is not greater than the current one) and to skip stale updates (if the version number in the record is less than the on-disk version number). However, these numbers have to be implemented as fixed-width integers. How is overflow handled? I realise that reaching overflow would take an unusually high number of writes, but couldn't it be an issue otherwise?

Discussions
• Petal optionally replicates data for high availability. How does this affect locking and synchronization? When a file is updated, are its inodes and data blocks locked and updated simultaneously on all the Petal servers that hold them? Also, since Petal can continue functioning as long as a single disk containing the data is available, isn't it possible to end up with inconsistent versions of a file if any of the servers holding replicated data is unavailable for a while? How are the copies reconciled in such situations?

• What is the benefit of using Petal to build Frangipani? And what is the benefit of using a "so-called" virtual disk to provide a large address space?
• Do you think implementing a cluster file system on top of a disk-based storage layer like Petal is better than implementing it directly on top of an operating system's file systems?
• The bottlenecks of such a system seem to be the network bandwidth, the Petal server throughput, and the disk access time. So why implement Frangipani as an operating system module, which reduces both reliability and portability? Implementing it at user level seems sufficient.

• 4) Do you think it is a balanced architecture? The Frangipani server handles client requests and its disk only acts as a cache, so it seems the Frangipani server needs a small disk but a fast CPU and network interface, while the Petal server needs large disks and an even faster network interface.
• 5) The system still needs manual administration when adding or removing either a Frangipani server or a Petal server. Do you think it scales well?

 "Only metadata is logged, not user data, so a user has no
  guarantee that the file system state is consistent from his
  point of view after a failure.” Is it acceptable for the users’
  data to be inconsistent after a failure and any existing
  distributed file system solve this problem well?

The chunk size in Petal virtual disk is 64kb, yet in the
filesytem, Frangipani, there are 4kb block and 512b inode,
that means some file operation will wait for others, right?
