Clustered LIO Using RBD

Mike Christie 
Red Hat
Oct 28, 2014
Agenda

    ●   State of HA SCSI Target Support in Linux.
    ●   Difficulties adding Active/Active Support.
    ●   Future Work.

2
State of Linux Open Source HA SCSI Targets

●   Active/Passive.
●   Pacemaker support for IET, TGT, SCST and LIO.
     ●   Node level failover when target node goes down.
●   Relies on virtual ports/portals (IP takeover for iSCSI,
    NPIV for FC) or implicit ALUA failover.
●   Missing final pieces of support for distributed SCSI
    Persistent Reservations.

3
iSCSI Active/Passive With Virtual IPs

[Diagram: Server1 connects through Switch A to two gateway nodes,
(Active) iqn.2003-04.com.test and (Passive) iqn.2003-04.com.test,
reached via Virtual IP 192.168.56.22 over eth1/eth3; eth2 and eth4
form the cluster interconnect.]

●   Server1 accesses the two targets/GWs one at a time through one or
    more Virtual IPs.
●   eth2 and eth4 are used by Corosync/Pacemaker for cluster membership
    and cluster-aware devices like DRBD.
●   If the active target goes down, Corosync/Pacemaker will activate
    the passive target.
●   Server1's TCP/IP layer and/or iSCSI/multipath layer will detect the
    disruption and perform recovery like packet retransmission,
    iSCSI/SCSI command retry, or relogin.

4
Active/Active HA LIO Support

    ●   Benefits:
         ●   Simple initiator support.
              ●   Boot, failover, failback, setup.
         ●   Support for all SCSI transports in a common
             implementation.
         ●   Possible performance improvement.
    ●   Drawbacks:
         ●   Complex target implementation.
              ●   Distributed error handling, setup, and command execution.

5
iSCSI HA Active/Active

[Diagram: Server1 connects through Switch A to two active targets, both
iqn.2003-04.com.test, at 192.168.100.22 (eth1) and 192.168.1.23 (eth3);
both nodes share the same RBD device, and eth2/eth4 form the cluster
interconnect.]

●   Server1 accesses the two targets/GWs through two paths:
    192.168.100.22 and 192.168.1.23.
●   Both targets access the same RBD devices at the same time.
●   eth2 and eth4 are used by Corosync/Pacemaker for DLM/CPG and
    cluster membership.
●   If a node, or the paths to a node, become unreachable, Server1's
    multipath layer will mark those paths as unusable until they come
    back online.

  6
Implementation Challenges

    ●   Request execution.
    ●   Synchronizing error recovery across nodes.
    ●   Distributing setup information.

7
Distributed Request Execution

    ●   COMPARE AND WRITE
        ●   Atomically read, compare, and if matching, write N bytes
            of data.
        ●   Used by ESXi (known as ATS) for fine-grained locking.
        ●   If multiple nodes are executing this request at the same
            time then locking is needed (see the sketch after this
            list).
             ●   Patches posted upstream to push the execution to the backing
                 device.
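
To make the atomicity requirement concrete, here is a minimal userspace
sketch of COMPARE AND WRITE semantics. The bs_* helpers are hypothetical
stand-ins for a backing-store API (they are not LIO or RBD calls); the
point is that the read, compare, and conditional write must run under a
cluster-wide lock so the sequence appears atomic to every initiator.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical backing-store helpers (illustrative names only). */
    int bs_read(uint64_t lba, void *buf, size_t len);
    int bs_write(uint64_t lba, const void *buf, size_t len);
    void bs_lock_range(uint64_t lba, size_t len);   /* cluster-wide lock */
    void bs_unlock_range(uint64_t lba, size_t len);

    /*
     * COMPARE AND WRITE: read len bytes at lba, compare them to verify,
     * and only if they match, write data. The range lock makes the
     * read-compare-write sequence atomic across all target nodes.
     */
    int compare_and_write(uint64_t lba, const void *verify,
                          const void *data, size_t len)
    {
            unsigned char *buf = malloc(len);
            int rc = -1;

            if (!buf)
                    return rc;

            bs_lock_range(lba, len);
            if (bs_read(lba, buf, len))
                    goto out;
            if (memcmp(buf, verify, len)) {
                    rc = 1;   /* miscompare -> report MISCOMPARE status */
                    goto out;
            }
            rc = bs_write(lba, data, len);   /* 0 on success */
    out:
            bs_unlock_range(lba, len);
            free(buf);
            return rc;
    }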

8
Persistent Reservation (PR) Support

    ●   PRs are a set of commands used to control access to a
        device.
    ●   Used by clustering software like Windows Clustering and
        Red Hat Cluster Suite to prevent failed or excluded client
        nodes from accessing the device.
    ●   Initiator sends PR requests to the target which inform it
        what set of I_T Nexuses (SCSI ports) can access the
        device, and what type of access they have.
         ●   This info must be copied across the cluster (a sketch of
             the replicated state follows below).
         ●   Ports can be added/removed and access restrictions can be
             changed at any time.
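
A rough sketch of the state that would have to be replicated; the field
names are assumptions for illustration, not LIO's actual data
structures:

    #include <stdint.h>

    /* One registration: an I_T nexus and the key it registered. */
    struct pr_registration {
            char     initiator_port[288];   /* e.g. iSCSI name + ISID */
            char     target_port[288];
            uint64_t res_key;               /* 8-byte reservation key */
    };

    /* Per-device PR state; every node must hold an identical copy. */
    struct pr_state {
            uint32_t  pr_generation;        /* bumped on register/unregister */
            uint8_t   pr_type;              /* e.g. Write Exclusive */
            int       holder;               /* index into regs[], -1 = none */
            uint32_t  num_regs;
            struct pr_registration regs[];  /* all current registrations */
    };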

9
HA Active/Active PR example

[Diagram: Server1 connects through Switch A to Node 1 (eth1) and
Node 2 (eth3), which both export the same RBD device; the numbered
arrows correspond to the steps below.]

1) Server1 sends a PR register command to register Server1 and Node1's
   ports to allow access to LUN $N (an initiator-side sketch of this
   command follows below).
2) Node1 stores the PR info locally.
3) Node1 copies the data to Node2.
4) Node1 returns a successful status to Server1.
5) The process is now repeated for Server1 and Node2's ports (the
   remote copy and return of status are skipped in this example).
6) Server1 sends a PR reserve command to establish the reservation.
   This prevents other server nodes from being able to access LUN $N
   (this info will also be copied to Node2, and Node1 will return a
   status code to Server1).
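
For reference, step 1 as seen from Server1 could look like the
following initiator-side sketch, which issues a PERSISTENT RESERVE
OUT / REGISTER through the Linux SG_IO ioctl. The device path and key
are examples, and sense-buffer handling is omitted:

    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <scsi/sg.h>

    /* PERSISTENT RESERVE OUT (0x5F), service action REGISTER (0x00);
     * the parameter list length (24) sits in CDB bytes 5-8. */
    int pr_register(const char *dev, uint64_t new_key)
    {
            uint8_t cdb[10] = { 0x5f, 0x00, 0, 0, 0, 0, 0, 0, 24, 0 };
            uint8_t param[24] = { 0 };   /* PR OUT parameter list */
            struct sg_io_hdr hdr;
            int i, rc, fd = open(dev, O_RDWR);

            if (fd < 0)
                    return -1;

            /* SERVICE ACTION RESERVATION KEY: bytes 8-15, big-endian. */
            for (i = 0; i < 8; i++)
                    param[8 + i] = new_key >> (56 - 8 * i);

            memset(&hdr, 0, sizeof(hdr));
            hdr.interface_id = 'S';
            hdr.cmd_len = sizeof(cdb);
            hdr.cmdp = cdb;
            hdr.dxfer_direction = SG_DXFER_TO_DEV;
            hdr.dxferp = param;
            hdr.dxfer_len = sizeof(param);
            hdr.timeout = 10000;   /* milliseconds */

            rc = ioctl(fd, SG_IO, &hdr);
            close(fd);
            return rc;
    }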

10
Persistent Reservation Implementation

 ●   Possible solutions:
      ●   Use Corosync/Pacemaker and DLM to distribute PR info across
          nodes.
      ●   Pass the PR execution to userspace and use the Corosync cpg
          library messaging to send the PR info to the nodes in the
          cluster (see the sketch after this list).
      ●   Use a cluster FS/device to store the PR info in.
      ●   Add callbacks to the LIO kernel modules or pass PR execution to
          userspace, so devices like RBD can utilize their own
          locking/messaging.
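
A minimal sketch of the cpg approach, assuming a hypothetical
apply_pr_update() helper that pushes received PR info into the local
target; error handling is trimmed and the group name is an example:

    #include <string.h>
    #include <sys/uio.h>
    #include <corosync/cpg.h>

    /* Hypothetical: apply a received PR update to local LIO state,
     * e.g. through an extended configfs interface. */
    extern void apply_pr_update(const void *buf, size_t len);

    /* Deliver callback: every group member, including the sender,
     * receives each update. */
    static void pr_deliver(cpg_handle_t handle,
                           const struct cpg_name *group,
                           uint32_t nodeid, uint32_t pid,
                           void *msg, size_t msg_len)
    {
            apply_pr_update(msg, msg_len);
    }

    static cpg_callbacks_t pr_callbacks = {
            .cpg_deliver_fn = pr_deliver,
    };

    int broadcast_pr_update(const void *pr_info, size_t len)
    {
            cpg_handle_t handle;
            struct cpg_name group;
            struct iovec iov = {
                    .iov_base = (void *)pr_info,
                    .iov_len  = len,
            };

            if (cpg_initialize(&handle, &pr_callbacks) != CS_OK)
                    return -1;

            strcpy(group.value, "lio_pr_group");   /* example name */
            group.length = strlen(group.value);
            if (cpg_join(handle, &group) != CS_OK)
                    return -1;

            /* CPG_TYPE_AGREED: totally ordered delivery, so all nodes
             * apply PR updates in the same order. */
            if (cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1) != CS_OK)
                    return -1;
            return 0;
    }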

11
Distributed Task Management

 ●   When a command times out, the OS will send SCSI
     Task Management requests (TMFs) to abort
     commands and reset devices.
 ●   The SCSI device reset request is called LOGICAL
     UNIT RESET (LUN RESET).
      ●   SCSI spec defines the required behavior.
           ●   Abort all commands.
           ●   Terminate other task management functions.
            ●   Send an event (a SCSI Unit Attention) through all paths
                indicating the device was reset.

12
HA Active/Active LUN RESET Example

[Diagram: Server1 connects through Switch A to Node1 (eth1) and
Node2 (eth3), which both export the same RBD device; the numbered
arrows correspond to the steps below.]

1) Server1 cannot determine the state of a command. To get the device
   into a known state, it sends a LUN RESET.
2) Node1 begins processing the reset by internally blocking new
   commands and aborting running commands.
3) Node1 sends a message to Node2 instructing it to execute the
   distributed reset process.
4) After all the reset steps, like command cleanup, have been completed
   on both nodes, Node1 returns a success status to Server1.
5) and 6) Node1 and Node2 send Unit Attention notifications through all
   paths that are accessing the device that was reset.

     13
LUN RESET Handling

     ●   Experimenting with passing part of the TMF handling to
         userspace.
          ●   Use cpg to interact with LIO on all nodes.
          ●   Extend LIO configfs interface, so userspace can block devices and
              perform the required reset operations.
     ●   Possible future work/alternative:
          ●   Add a Linux kernel block layer interface to abort commands and
              reset devices:
               ●   request_queue->rq_abort_fn(struct request *)
               ●   request_queue->reset_q_fn(struct request_queue *)
               ●   New BLK_QUEUE_RESET notifier_block event.
          ●   LIO would use this to allow the backing device to do the heavy
              lifting (see the sketch below).
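
A rough sketch of how LIO might consume the proposed notifier event.
The callbacks and BLK_QUEUE_RESET are the proposals above, not mainline
interfaces, and the event value and lio_queue_ua_for_queue() helper are
assumptions:

    #include <linux/blkdev.h>
    #include <linux/notifier.h>

    /* Proposed event (value is a placeholder): fired by a driver such
     * as rbd when the shared device was reset by another node. */
    #define BLK_QUEUE_RESET 0x01

    /* Hypothetical LIO helper: queue Unit Attentions for every I_T
     * nexus accessing LUNs backed by this request_queue. */
    extern void lio_queue_ua_for_queue(struct request_queue *q);

    static int lio_blk_reset_notify(struct notifier_block *nb,
                                    unsigned long event, void *data)
    {
            struct request_queue *q = data;

            if (event == BLK_QUEUE_RESET)
                    lio_queue_ua_for_queue(q);
            return NOTIFY_OK;
    }

    /* Registered with the (proposed) block layer reset notifier chain. */
    static struct notifier_block lio_reset_nb = {
            .notifier_call = lio_blk_reset_notify,
    };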

14
Offloaded Task Management

[Diagram: Server1 connects through Switch A to Node1 (eth1) and
Node2 (eth3), which both export the same RBD device; the numbered
arrows correspond to the steps below.]

1) Server1 sends a LUN RESET to Node1.
2) Node1 calls RBD's request_queue->reset_q_fn(); RBD translates that
   to a new rbd/rados reset operation.
3) RBD/rados aborts commands and notifies the other clients accessing
   the device that their commands were aborted due to a reset.
4) The RBD client on Node2 handles the rados reset notification by
   firing the new BLK_QUEUE_RESET event.
5) LIO handles the BLK_QUEUE_RESET event by sending SCSI UAs on paths
   accessing the LUN through that node.
6) The RBD client on Node1 notifies the reset_q_fn caller that the
   reset was successful. LIO then returns a success status and sends
   UAs as needed.

15
Management

     ●   Have only just begun to look into this.
     ●   Must support VMware VASA, Oracle Storage Connect,
         Red Hat libStorageMgmt, etc.
     ●   Must keep setup info like UUIDs, inquiry data, and SCSI
         settings synced on all nodes.
     ●   Prefer to integrate with existing projects.
          ●   Extend the LIO target library, rtslib, and lio-utils to
              support clustering?
          ●   Extend existing remote configuration daemons like
              targetd (https://github.com/agrover/targetd)?

16
Questions?

     ●   I can be reached at mchristi@redhat.com.

17