Fabric Resiliency Best Practices - DATA CENTER

Page created by Jeremy Hodges
 
Provides best practices for deploying and utilizing advanced Brocade Fabric OS
(FOS) features to identify, monitor, and protect Fibre Channel (FC) SANs from
problematic device and media behavior.
DATA CENTER BEST PRACTICES GUIDE

CONTENTS
Introduction

Fabric Resiliency

Maintaining an Optimal FC SAN Environment
         Bottleneck Detection
         Fabric Watch/Port Fencing
         Edge Hold Time Configuration

Device Latencies
         Assessing Device Latency Severity
                  Moderate Device Latencies
                  Severe Device Latencies
         Latency Detection
                  Timeout Notification on F_Ports
         Latency Mitigation Action
                  Mitigation Action Based on Bottleneck Detection
                  Action Based on Timeouts
                  Fabric Configuration

Faulty Media
         Automatically Detecting and Mitigating Faulty Media
                  Fabric Watch
                  Bottleneck Detection

Summary of Recommendations

Appendix A: Configuring Bottleneck Detection
         Enabling/Disabling Bottleneck Detection
         Displaying List of Ports with Bottleneck Detection Enabled
         Changing Bottleneck Detection Settings on a Port
         Displaying the History of Bottlenecks on a Port
         Bottleneck Alerts

Appendix B: Configuring Bottleneck Detection Mitigation Behavior

Appendix C: Configuring Port Fencing
         Port Fencing Threshold Recommendations

Appendix D: Configuring Edge Hold Time


INTRODUCTION
Faulty or improperly configured devices, misbehaving hosts, and faulty or substandard Fibre Channel (FC)
media can significantly impact the performance of FC fabrics and the applications they support. And in
most real-world scenarios, these issues cannot be corrected or completely mitigated within the fabric itself—
the behavior must be addressed directly. However, with the proper knowledge and capabilities, the fabric
can often identify and in some cases mitigate or protect against the effects of these “misbehaving”
components to provide better fabric resiliency. This document provides a high-level description of the most
commonly experienced detrimental device behaviors and how to use Brocade® products and features to
protect your data center.

Brocade has a long history of developing industry-leading hardware and software. Brocade also dedicates a
significant amount of engineering development and resources to providing features and functionality to
assist the SAN administrator in monitoring the fabric and troubleshooting problems. Capabilities such as
Advanced Performance Monitoring and Fabric Watch focus specifically on these areas. As FC fabrics have
evolved in size, capabilities, and complexity, Brocade has responded with more advanced monitoring and
troubleshooting features. In Fabric OS (FOS®) 6.1, Brocade introduced Port Fencing as part of the optional
Fabric Watch offering. In FOS 6.3, Brocade added a new set of base features referred to as Bottleneck
Detection. This was extended in FOS 6.4 with broader monitoring, improved configuration, and detection
capabilities for additional types of bottlenecks.

Although certain aspects of today’s data centers are common to most environments, no two data centers are
exactly alike, and no “one size fits all” set of configuration parameters applies universally to all
environments. Brocade has worked directly with customers to develop recommendations and guidelines
for most environments. However, you should always validate all recommendations for your particular
needs.

Brocade also offers extensive Professional Services to assist you with tuning and optimizing all of the
features discussed in this document customized for your data center. For details, visit:
http://www.brocade.com/services-support/professional-services/index.page.

Further details on all of the features described in this document can be found in the following product
documents:

•    Fabric OS 6.4 Administrator’s Guide (53-1001763-01)

•    Fabric OS 6.4 Command Reference Manual (53-1001764-01)

•    Fabric OS 6.4 Fabric Watch Administrator’s Guide (53-1001770-01)

It is assumed that readers of this document are already familiar with basic functionality and how to use
features such as Port Fencing, Bottleneck Detection, and Fabric Watch.


FABRIC RESILIENCY
Two primary aspects of fabric resiliency are captured in this document:

1.   Detecting “abnormal behavior” in external components (typically servers/hosts or storage devices) that can
     negatively impact the fabric so that you can identify and fix the faulty device.

2.   Providing mechanisms that protect the fabric from adverse effects caused by a faulty component, including
     one or more actions that can be invoked automatically by a switch when faulty behavior is detected to
     contain and isolate the impact of the misbehaving component in the fabric. This should be considered a
     temporary measure: ultimately, the faulty or improperly configured component must be addressed to
     resolve the problem completely and permanently.

There are two common classes of abnormal behavior originating from fabric components:

•    Misbehaving high-latency end devices (hosts or storage): End devices that do not respond as quickly
     as expected and cause the fabric to hold frames for excessive periods of time. This can result in
     application performance degradation or in extreme cases even I/O failure. Common examples of
     moderate device latency include disk arrays that are overloaded and hosts that cannot process data as
     fast as they request it. Severe latencies are caused by badly misbehaving devices that stop receiving,
     accepting, or acknowledging frames for excessive periods of time.

•    Faulty media (fiber optic cables and SFPs/optics): Faulty media can cause frame loss due to excessive
     CRC errors, invalid transmission words, and other conditions. This may result in I/O failure and
     application performance degradation.

Note that FC switches cannot correct bad node behavior or faulty media—they can only attempt to alert and
compensate for it. Ultimately the problems must be addressed in the host or target devices or media where
they actually occur.

MAINTAINING AN OPTIMAL FC SAN ENVIRONMENT
While there are many features available in FOS to assist with monitoring, protecting, and troubleshooting
fabrics, several recent enhancements have been implemented that focus exclusively on this area. This
document focuses specifically on those newer features and related capabilities that help provide optimum
fabric resiliency. Most are available and supported on the majority of 4 Gigabit per second (Gbps) and
8 Gbps platforms, provided that the most recent FOS releases are used. (Visit http://my.brocade.com or
consult with your vendor for the latest supported FOS releases.) Some features may require optional
licensing. This section discusses these features, minimum release levels, licensing requirements, and
platform limitations. Brocade highly recommends that you review the additional documentation listed
earlier to understand all of the tools available for maintaining a FC SAN environment. Be sure also to read
the FOS Release Notes.

NOTE: In order to use all of the capabilities described in this document, switches need to be running
FOS 6.4.0 or later.

Bottleneck Detection
Bottleneck Detection was introduced in FOS 6.3.0 with monitoring for device latency conditions and then
enhanced in FOS 6.4.0 with added support for congestion detection on both E_Ports and F_Ports. FOS 6.4
also added improved reporting options and simplified configuration capabilities. The FOS 6.3.1b release
(and later) included enhancement in the algorithm for detecting device latency, making it more accurate.
Bottleneck Detection does not require a license and is supported on both 4 and 8 Gbps platforms.


Fabric Watch/Port Fencing
Fabric Watch is an optional (licensed) feature that was enhanced in FOS 6.1.0 with the addition of Port
Fencing. This capability allows a switch to monitor specific behaviors and protect a switch by blocking a
port when specified thresholds have been reached.

Edge Hold Time Configuration
Edge Hold Time configuration is a new capability added in the FOS 6.3.1b release. However, it is not
documented in the FOS 6.3 or FOS 6.4 Command Reference. See Appendix D for details on its use.

There is no license required to configure the Edge Hold Time setting.

DEVICE LATENCIES
A device experiencing latencies responds more slowly than expected. The device does not return buffer
credits (through R_RDY primitives) to the transmitting switch fast enough to support the offered load, even
though the offered load is less than the maximum physical capacity of the link connected to the device, as
shown in Figure 1.

Figure 1. Buffer backup on ingress port 6 on B1 causes congestion upstream on S1, port 3

Once it exhausts all available credits, the switch port connected to the device must hold additional
outbound frames until the device returns a buffer credit. When a device does not respond in a timely
fashion, the transmitting switch is forced to hold frames for longer periods of time, resulting in high buffer
occupancy. This in turn lowers the rate at which the switch returns buffer credits to other transmitting
switches. The effect propagates upstream through those switches (potentially through every switch with
devices attempting to send frames to devices attached to the switch with the high-latency device) and
ultimately impacts the fabric.
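
The credit behavior described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not a model of any Brocade implementation: time advances in abstract ticks, one frame arrives for the device per tick, the port sends at most one frame per tick, and the device returns one credit every return_every ticks. The function name and all parameters are hypothetical.

```python
def simulate_port(credits, return_every, ticks):
    """Return the queue depth per tick for one switch port feeding one device.

    Assumptions (illustrative only): one frame arrives per tick, the port
    sends at most one frame per tick, and the device returns one buffer
    credit every `return_every` ticks while credits are outstanding.
    """
    available = credits      # credits the port may still spend
    outstanding = 0          # frames sent whose credit has not come back
    queued = 0               # frames the switch is holding for this port
    depth = []
    for tick in range(1, ticks + 1):
        queued += 1                              # offered load: one frame per tick
        if outstanding and tick % return_every == 0:
            outstanding -= 1                     # device returns a credit (R_RDY)
            available += 1
        if queued and available:
            queued -= 1                          # send one frame, spend a credit
            available -= 1
            outstanding += 1
        depth.append(queued)
    return depth


fast = simulate_port(credits=4, return_every=1, ticks=20)
slow = simulate_port(credits=4, return_every=4, ticks=20)
print("responsive device, final queue depth:", fast[-1])
print("slow device, final queue depth:", slow[-1])
```

With a responsive device the queue stays empty. With the slow device the credits run out after a few ticks and the held-frame count keeps growing, which is the buffer occupancy that then slows credit returns to upstream switches.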

[Figure 2 diagram: hosts, Switch A, Switch B, and storage arrays, annotated with the sequence
(1) buffer credits exhausted; (2) backflow of credit exhaustion depleting credits on the ISL port on
Switch A; (3) continuing backflow of credit exhaustion to the ISL port on Switch B; (4) all servers using
the ISL impacted; (5) connection to second storage device now at risk.]

Figure 2. Latency on a switch can propagate through the fabric


NOTE: The impact to the fabric (and other traffic flows) varies based on the severity of the latency exhibited
by the device. The longer the delay caused by the device in returning credits to the switch, the more severe
the problem.

Assessing Device Latency Severity
Moderate Device Latencies
Moderate device latencies are defined as those not severe enough to cause frame loss. If the time
between successive credit returns by the device is between a few hundred microseconds and tens of
milliseconds, then the device exhibits moderate latencies, since this delay is typically not enough to cause
frame loss (frame loss typically occurs above 100 ms). This causes a drop in performance of traffic flows
using the fabric but typically does not cause frame drops or I/O failures.

When a device exhibits mild to moderate latency behavior, applications may see a drop in performance but
typically not I/O failure. However, the higher the latency, the greater the chance that an end user will
experience degraded performance.

Severe Device Latencies
Severe device latencies result in frame loss, which triggers the host SCSI stack to detect failures and to
retry I/Os. This process can take tens of seconds (possibly as long as 30 – 60 seconds), which can cause a
very noticeable application delay and potentially result in application errors. If the time between successive
credit returns by the device is in excess of 100 milliseconds, then the device is exhibiting severe latency.
When a device exhibits severe latency, the switch is forced to hold frames for excessively long periods of
time (in the order of hundreds of milliseconds). When this time becomes greater than the established
timeout threshold, the switch drops the frame (per Fibre Channel standards). Frame loss in switches is also
known as C3 (Class 3) discards or timeouts.
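
The severity bands described in this section can be sketched as a small classifier. This is an illustrative reading of the text, not FOS logic: the 100 ms boundary comes from the text, while the 0.3 ms lower bound for "a few hundred microseconds" and the treatment of the range between tens of milliseconds and 100 ms as moderate are assumptions. The function and constant names are hypothetical.

```python
SEVERE_THRESHOLD_MS = 100.0   # from the text: severe latency is in excess of 100 ms
MODERATE_FLOOR_MS = 0.3       # assumption standing in for "a few hundred microseconds"


def classify_device_latency(credit_return_interval_ms):
    """Classify the time between successive credit returns from a device."""
    if credit_return_interval_ms < 0:
        raise ValueError("interval must be non-negative")
    if credit_return_interval_ms > SEVERE_THRESHOLD_MS:
        return "severe"       # frame loss (C3 discards) becomes likely
    if credit_return_interval_ms >= MODERATE_FLOOR_MS:
        return "moderate"     # performance drop, typically no frame loss
    return "normal"


for interval in (0.05, 5.0, 250.0):
    print(interval, "ms ->", classify_device_latency(interval))
```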

Since the effect of device latencies often spreads through the fabric, frames can be dropped due to
timeouts not just on the F_Port to which the misbehaving device is connected but also on E_Ports carrying
traffic to the F_Port. Dropped frames typically cause I/O errors that result in a host retry and can result in
significant decreases in application performance. The impact is compounded because frame drops on the
affected F_Port cause I/O failures to the misbehaving device (which would be expected), while frame drops
on E_Ports may also cause I/O failures for unrelated traffic flows involving other hosts (which would not
typically be expected).

Latency Detection
Bottleneck Detection on F_Ports. It is a recommended best practice to use the Bottleneck Detection
feature to detect devices that exhibit latency. Bottleneck Detection is a comprehensive feature that can be
used to detect a wide range of device latencies from mild to severe. See “Appendix A: Configuring
Bottleneck Detection” for details on how to enable Bottleneck Detection.

Once Bottleneck Detection is enabled, the switch monitors F_Ports for latency symptoms. Specifically, it
looks for conditions in which the time delay between successive buffer credit returns from a device is higher
than expected. When the condition is detected, Bottleneck Detection reports latency bottlenecks at F_Ports
based on configurable thresholds. These reports can then be leveraged to:

•    Determine the severity and duration of the latency behavior

•    Determine the specific device port on which device latencies are occurring

•    Determine the actual device latency in the range of 100 microseconds to hundreds of milliseconds


Timeout Notification on F_Ports
It is a recommended best practice to use Fabric Watch to detect frame timeouts, that is, frames that have
been dropped because of severe latency conditions. The Fabric Watch “C3TX_TO” area is available in
FOS 6.3 for 8 Gbps ports and in FOS 6.3.1b/6.4.0 and later for 4 Gbps ports. If the number of timed-out
frames on an F_Port exceeds the currently effective threshold settings, Fabric Watch can notify the user
through one of the following mechanisms:

•    Send an SNMP trap

•    Log a RASlog message

•    Send an e-mail alert

•    Log a SYSlog message

Latency Mitigation Action
Mitigation Action Based on Bottleneck Detection
FOS 6.4.0 and later includes an enhancement to Bottleneck Detection that allows the switch to provide
some fabric-level mitigation when device latency is detected but Port Fencing thresholds have not yet been
reached. When latency is detected on a port, frames held in the transmit port connected to the
misbehaving device are dropped for a short period of time. This allows the switch to return credits to other
transmitting switches, thus allowing other traffic flows to move at a faster rate. This action protects other
flows from a severe performance drop resulting from a single misbehaving device. If a misbehaving device
continues to exhibit latencies for several seconds, the port is disabled via Port Fencing (if Port Fencing has
been enabled and configured).

See “Appendix B: Configuring Bottleneck Detection Mitigation Behavior” for information on configuring the
Bottleneck Detection mitigation capability. Note that this capability is not documented in any existing FOS
product documentation.

Action Based on Timeouts
Quarantining a misbehaving high-latency device based on timeouts: When a misbehaving device exhibits
extremely high latencies causing frame timeouts, it is likely also causing a severe fabric impact and should
be removed from the fabric. Port Fencing based on timeouts is another optional feature that can be used to
quarantine a high latency device and mitigate the impact on the fabric (8 Gbps platform support available in
FOS 6.3 and later, 4 Gbps platform support available in FOS 6.3.1b and later). Brocade recommends
enabling Port Fencing for transmit timeouts on F_Ports.

See “Appendix C: Configuring Port Fencing” for details on how to enable Fabric Watch Port Fencing. When
the number of frames dropped due to timeouts on an F_Port reaches a user-configured threshold, the port
is fenced (blocked). This disables the port, requiring user intervention to bring it back online. Once the
F_Port of the offending device is fenced, no further actions are required. The default/recommended
threshold settings noted in Appendix C can safely disable the misbehaving device, preventing an impact to
the fabric without causing a false trigger (fencing a port when there isn’t a high-latency device).

Reducing timeouts on unrelated flows: Per Fibre Channel standards, frames are dropped in switches if
they’ve been held in the switch buffers for longer than the established Hold Time, a value calculated from
several configurable fabric parameters. Unless any of these fabric parameters (R_A_TOV, E_D_TOV,
WAN_TOV, or MAX_HOPs) have been changed from their defaults, the Hold Time is calculated to be 500 ms.

In most environments, fabric parameters on all switches in a fabric should match, and thus the Hold Time
should be consistent throughout a fabric. When congestion conditions cause frames to drop in the core of
the fabric, where there tend to be more flows/traffic, there will be more disruption. In order to reduce frame
drops on E_Ports on core switches, the edge switches that host the end server/storage devices can be
configured to have a shorter Hold Time compared to the core switches by using the Edge Hold Time feature
(available in FOS 6.3.1b and later). This setting lowers the Hold Time on the edge of the network, which
reduces the likelihood of frame loss on the core of the network, effectively mitigating the impact of the
misbehaving device. It is a recommended best practice to enable the Edge Hold Time feature.

See “Appendix D: Configuring Edge Hold Time” for details on how to enable the Edge Hold Time feature.
Note that enabling and configuring the Edge Hold Time is a non-disruptive operation.

Fabric Configuration
Fabrics can be architected to mitigate some impacts of device latency. Isolating the device flows
(host/storage pair) that exhibit high latencies by either putting them in their own fabric or on their own
blade/switch will contain the impact of the latencies to the fabric or blade/switch containing the high-
latency device flows. Features such as Integrated Routing (Fibre Channel Routing) and local switching
provide architectural-level solutions that limit the need for more complex monitoring and mitigation
capabilities. However, using fabric design as a protection mechanism does require some knowledge of
which devices are likely to exhibit latency.

FAULTY MEDIA
In addition to high-latency devices causing disruptions to data centers, fabric problems are often the result
of faulty media. Faulty media can include bad cables, SFPs, extension equipment, receptacles, patch
panels, improper connections, and so on. Media can fault on any port type (E_Port or F_Port) and fail, often
unpredictably and intermittently, making it even harder to diagnose. Faulty media involving F_Ports results
in an impact to the end device attached to the F_Port and to devices communicating with this device.
Failures on E_Ports can have an even greater impact. Many flows (host/target pairs) can simultaneously
traverse a single E_Port. In large fabrics, this can be hundreds or even thousands of flows. In the event of
a media failure involving one of these links, it is possible to disrupt some or all of the flows utilizing the
path.

Severe cases of faulty media, such as a disconnected cable, can result in a complete failure of the media,
which effectively brings a port offline. This is typically easy to detect and identify. When this occurs on an
F_Port, the impact is specific to flows involving the F_Port. E_Ports are typically redundant, so severe
failures on E_Ports typically result in only a minor drop in bandwidth as the fabric automatically utilizes
redundant paths. Error reporting built into FOS readily identifies the failed link and port, allowing for
simple corrective action and repair.

With moderate cases of faulty media, failures occur but the port can remain online or transition between
online and offline. This can cause repeated errors, which can occur indefinitely or until the media fails
completely. When these types of failures occur on E_Ports, the result can be devastating, as there can be
repeated errors that impact many flows. This can result in significant impacts to applications that last for
prolonged durations. Signatures of these types of failures include the following:

•    CRC errors on frames

•    Invalid Words (includes encoder out errors)

•    State Changes (Ports going offline/online repeatedly)

•    Credit loss: Complete loss of credit on a VC on an E_Port prevents traffic from flowing on that VC, which
     results in frame loss and I/O failures for devices utilizing the VC


Automatically Detecting and Mitigating Faulty Media
Fabric Watch
It is a recommended best practice to enable Fabric Watch to monitor for CRC errors, Invalid Words and
State Changes. It is recommended to configure for alerts on reaching low thresholds and fence (disable) a
port when reaching high thresholds. See “Appendix C: Configuring Port Fencing” for details on how to
enable and configure Fabric Watch Port Fencing.
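
The two-level policy described above (alert at a low threshold, fence at a high threshold) can be sketched as follows. This is a hypothetical illustration of the decision only, not Fabric Watch code or configuration syntax. The function name and the example threshold values are assumptions; the recommended values are given in Appendix C.

```python
def fabric_watch_action(error_count, low_threshold, high_threshold):
    """Pick an action for one monitored counter on one port.

    The counter could be CRC errors, Invalid Words, or State Changes
    observed during one monitoring interval.
    """
    if low_threshold > high_threshold:
        raise ValueError("low threshold must not exceed high threshold")
    if error_count > high_threshold:
        return "fence"    # disable the port until an administrator re-enables it
    if error_count > low_threshold:
        return "alert"    # notify only (for example, RASlog, SNMP trap, e-mail)
    return "none"


# Hypothetical thresholds, chosen only for the example.
for count in (2, 8, 40):
    print(count, "->", fabric_watch_action(count, low_threshold=5, high_threshold=20))
```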

Fabric Watch monitoring. Fabric Watch monitors can be enabled to automatically detect most of the faulty
media conditions previously noted. For example, Fabric Watch can monitor CRC errors (available in FOS
6.1.x), Invalid Words (available in FOS 6.1.x), and State Changes (ports transitioning between offline and
online, available in FOS 6.3). Fabric Watch generates alerts based on user-defined thresholds for these
conditions. The most common cause of credit loss is corruption to credit return messages (VC_RDY or
R_RDY) due to faulty media. Credit corruption is tracked by an encoder out error which is an Invalid Word
error. Monitoring and mitigating Invalid Word issues protects against credit loss.

Fabric Watch quarantine. Fabric Watch also provides a mechanism that quarantines the badly behaving
component with the optional action of Port Fencing. Port Fencing is available for each of the previously
noted conditions and is recommended to automatically protect the fabric from these error conditions. The
recommended thresholds specified in “Appendix C: Configuring Port Fencing” have been tested and tuned
to quarantine components that are misbehaving to the point at which they are likely to cause a fabric-wide
impact. They do not falsely trigger on normally behaving components.

Bottleneck Detection
The Bottleneck Detection feature can detect different types of congestion in a fabric. Lost buffer credits
can result in extreme congestion by slowing the aggregate throughput of a connection. Bottleneck
Detection can detect ports that are blocked due to lost credits and generate special “stuck VC” and “lost
credit” alerts for the E_Port with the lost credits (available in FOS 6.3.1b and later). Bottleneck Detection
can also generate alerts on downstream E_Ports that are also blocked due to an upstream congestion
condition such as an E_Port with lost credits or a high-latency device.

See “Appendix A: Configuring Bottleneck Detection” for best practice recommendations on configuring and
using this feature.

SUMMARY OF RECOMMENDATIONS
The following are recommended features and capabilities to improve the overall resiliency of Brocade
FOS-based FC fabric environments:

•    Enable Fabric Watch to detect frame timeouts (the Fabric Watch “C3TX_TO” area).

•    Enable Port Fencing for transmit timeouts on F_Ports.

•    Enable the Edge Hold Time feature in core/edge configurations.

•    Enable Fabric Watch to monitor (alert) for CRC errors, Invalid Words, and State Changes and fence on
     extreme behavior.

•    Enable Bottleneck Detection for congestion conditions.


APPENDIX A: CONFIGURING BOTTLENECK DETECTION
Enabling/Disabling Bottleneck Detection
When Bottleneck Detection is enabled, RASlog alerts can also be sent when the bottleneck conditions at a
port exceed a specified threshold.

On the switch with target port connections, log in with admin level privileges.

Enter bottleneckmon --enable to enable Bottleneck Detection on an F_Port or FL_Port:
bottleneckmon --enable [-alert] [-thresh threshold] [-time window] [-qtime quiet_time]
[slot/]portlist [[slot/]portlist]...
If the alert parameter is not specified, alerts are not sent, but a history of bottleneck conditions for the port
can be viewed. The thresh, time, and qtime parameters are also ignored if the alert parameter is not
specified.

It is a recommended best practice to use the default values for the thresh (0.1), time (300), and qtime
(300) parameters. If the time parameter is changed, it should be set to 300 or higher.

Example of enabling bottleneck detection (preferred use case): The following example enables bottleneck
detection on all F_ and FL_Ports in the switch with RASlog alerts using default values for threshold and
time. Alerts are logged when a port is experiencing a bottleneck condition for 10% of the time (default
value) over any period of 300 seconds (default value) with a minimum of 300 seconds (default value)
between alerts.
switch:admin> bottleneckmon --enable -alert *
The following example enables bottleneck detection on ports 3–7 using default values for threshold and
time. No alerts will be delivered to report bottleneck conditions, but the bottleneck history can be viewed
using the CLI.
switch:admin> bottleneckmon --enable 3-7
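
The interaction of the threshold, time window, and quiet time described above can be illustrated with a short Python sketch. This is only a model of the arithmetic, not how FOS implements Bottleneck Detection. The function name alert_times, the per-second list of 0/1 flags, and the inclusive comparison against the threshold are all assumptions made for illustration.

```python
def alert_times(affected, thresh=0.1, window=300, qtime=300):
    """Return the seconds at which an alert would be raised in this model.

    affected: list of 0/1 flags, one per second (1 = bottleneck observed).
    thresh:   fraction of the window that must be affected (0.1 = 10%).
    window:   length of the sliding time window, in seconds.
    qtime:    minimum number of seconds between two alerts.
    """
    alerts = []
    last_alert = None
    # Evaluate the window ending at each second once a full window exists.
    for t in range(window, len(affected) + 1):
        fraction = sum(affected[t - window:t]) / window
        # Assumption: the comparison is inclusive (>=).
        over_threshold = fraction >= thresh
        quiet_elapsed = last_alert is None or t - last_alert >= qtime
        if over_threshold and quiet_elapsed:
            alerts.append(t)
            last_alert = t
    return alerts


# A port affected for its first 60 seconds, then clean for 14 minutes.
burst = [1] * 60 + [0] * 840
# A port affected continuously for 15 minutes.
sustained = [1] * 900

print(alert_times(burst))
print(alert_times(sustained))
```

Under this model and the default values, the short burst produces a single alert, while the sustained condition produces one alert per 300-second quiet period.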
To disable bottleneck detection:

1.   Connect to the switch to which the target port belongs and log in as admin.

2.   Enter bottleneckmon --disable to disable bottleneck detection on a port.

Example of disabling bottleneck detection on port 3:
switch:admin> bottleneckmon --disable 3


Displaying List of Ports with Bottleneck Detection Enabled
1.   Connect to the switch to which the target ports belong and log in as admin.

2.   Enter bottleneckmon --status to display the ports on which Bottleneck Detection is enabled.

NOTE: When using Virtual Fabrics, the output displays ports that do not belong to the logical switch if the
ports were moved out of the logical switch after Bottleneck Detection was enabled on them.

switch:admin> bottleneckmon --status
Port    Alerts?    Threshold    Time (s)    Quiet Time (s)
==========================================================
3       N          --           --          --
4       Y          0.100        300         300
5       Y          0.100        300         300
6       N          --           --          --
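
When many ports are monitored, it can help to post-process this output in a script. The following Python sketch parses the table into records. It assumes the five-column layout shown above with plain port numbers (no slot/port notation); the function name parse_status and the record keys are hypothetical.

```python
SAMPLE = """\
Port    Alerts?    Threshold    Time (s)    Quiet Time (s)
==========================================================
3       N          --           --          --
4       Y          0.100        300         300
5       Y          0.100        300         300
6       N          --           --          --
"""


def parse_status(text):
    """Parse 'bottleneckmon --status' style output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly five fields and start with a port number;
        # the header and separator lines do not and are skipped.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        port, alerts, thresh, time_s, qtime_s = fields
        alerts_on = alerts == "Y"
        rows.append({
            "port": int(port),
            "alerts": alerts_on,
            # Ports without alerts show "--" for the remaining columns.
            "threshold": float(thresh) if alerts_on else None,
            "time_s": int(time_s) if alerts_on else None,
            "quiet_time_s": int(qtime_s) if alerts_on else None,
        })
    return rows


rows = parse_status(SAMPLE)
no_alert_ports = [r["port"] for r in rows if not r["alerts"]]
print(no_alert_ports)
```

A list such as no_alert_ports makes it easy to spot ports where Bottleneck Detection is enabled but alerting is not.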

Changing Bottleneck Detection Settings on a Port
The default settings for Bottleneck Detection are the recommended settings. The settings are configurable
in the event that a user has specific reasons for modifying them, but in most cases, the default settings
should not be changed. Some examples of reasons to change the defaults could include transient events
that cause moderate congestion that are considered normal. Increasing the time or threshold may
accommodate such events. Using the procedure described below, RASlog alerts can be enabled or
disabled along with configuration of the following settings:

•    Threshold: the percentage of 1-second intervals that must be affected to generate an alert, expressed
     as a fraction (for example, 0.1 for 10%)

•    Time: the time window in seconds in which bottleneck conditions are monitored and compared against
     the threshold

•    Quiet Time: the minimum time in seconds between alerts

Bottleneck Detection must be disabled on a port before any of the settings can be modified. To change
settings on a port:

1.   Connect to the switch to which the target port belongs and log in as admin.

2.   Enter bottleneckmon --disable to disable Bottleneck Detection on the port.

3.   Enter bottleneckmon --enable to enable Bottleneck Detection, specify the new threshold values, and set the
     alert option.

The following example changes the Bottleneck Detection settings on port 4. In this example, the
bottleneckmon --status commands show the before and after settings.

switch:admin> bottleneckmon --status
Port    Alerts?    Threshold    Time (s)    Quiet Time (s)
==========================================================
4       Y          0.800        300         300

switch:admin> bottleneckmon --disable 4
switch:admin> bottleneckmon --enable -alert -thresh 0.6 -time 420 4
switch:admin> bottleneckmon --status
Port    Alerts?    Threshold    Time (s)    Quiet Time (s)
==========================================================
4       Y          0.600        420         300
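
The disable-then-enable sequence can be generated by a small helper when the same change must be applied to several ports. The Python sketch below only builds the command strings from the syntax given in this appendix; it does not connect to a switch. The function name reconfigure_commands is hypothetical. The -alert option is always included because, as noted earlier, the thresh, time, and qtime parameters are ignored without it.

```python
def reconfigure_commands(port, thresh=0.1, time=300, qtime=300):
    """Build the CLI commands that change Bottleneck Detection settings.

    Settings can only be modified while Bottleneck Detection is disabled
    on the port, so the sequence is always disable followed by enable.
    """
    # This guide recommends a time window of 300 seconds or higher.
    if time < 300:
        raise ValueError("time window should be 300 seconds or higher")
    return [
        f"bottleneckmon --disable {port}",
        f"bottleneckmon --enable -alert -thresh {thresh} "
        f"-time {time} -qtime {qtime} {port}",
    ]


for command in reconfigure_commands(4, thresh=0.6, time=420):
    print(command)
```

The generated strings can be reviewed before being entered on the switch, which keeps the change auditable.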


Displaying the History of Bottlenecks on a Port
Use bottleneckmon --show to display a 3-hour history of bottleneck conditions for an individual port:

1.   Connect to the switch to which the target port belongs and log in as admin.

2.   Enter the bottleneckmon --show command to display a history of the bottleneck severity for a specific port.

Example of displaying the bottleneck history for port 3 in 5-second windows over a period of
30 seconds:

fcr_saturn1:root> bottleneckmon --show -interval 5 -span 30 3
=============================================================
       Mon Jun 15 18:54:35 UTC 2010
=============================================================

From                          To               affected secs
=============================================================
Jun 15 18:54:30          Jun 15 18:54:35            80.00%
Jun 15 18:54:25          Jun 15 18:54:30            40.00%
Jun 15 18:54:20          Jun 15 18:54:25             0.00%
Jun 15 18:54:15          Jun 15 18:54:20             0.00%
Jun 15 18:54:10          Jun 15 18:54:15            20.00%
Jun 15 18:54:05          Jun 15 18:54:10            80.00%
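
The percentages in this output can be converted back into affected seconds to judge the overall severity across the span. The Python sketch below works through that arithmetic for the six 5-second windows shown above; the function name affected_seconds is hypothetical.

```python
def affected_seconds(percentages, interval):
    """Convert per-window 'affected secs' percentages into seconds."""
    return [p * interval / 100 for p in percentages]


# Percentages from the example output, newest window first.
HISTORY = [80.0, 40.0, 0.0, 0.0, 20.0, 80.0]
INTERVAL = 5  # seconds per window (-interval 5)

per_window = affected_seconds(HISTORY, INTERVAL)
total = sum(per_window)
span = INTERVAL * len(HISTORY)  # 30 seconds (-span 30)
overall_percent = total / span * 100

print(f"{total:.0f} of {span} seconds affected ({overall_percent:.1f}%)")
```

For this example the port was affected for 11 of the 30 seconds, or roughly 37% of the span.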

Bottleneck Alerts
Example of a bottleneck detection alert on an F_Port:

2010/03/16-03:40:47, [AN-1003], 21760, FID 128, WARNING, sw0, Latency
bottleneck at slot 0, port 38. 100.00 percent of last 300 seconds were
affected. Avg. time b/w transmits 80407.3975 us.

Example of a “stuck VC” alert on an E_Port:

2010/03/16-03:40:48, [AN-1010], 21761, FID 128, WARNING, sw0, Severe
latency bottleneck detected at slot 0 port 38.

Example of a lost credit alert on an E_Port:

2010/03/15-22:50:41:187781, [C2-5021], 2/0, SLOT 4 | CHASSIS, WARNING, ?,
S1,P0(16): Link Timeout, ftx=83276 tov=2000 (>1000), OID:0x43128010,
c2_ops.c, line: 4058, comp:insmod, ltime:2010/03/15-22:50:41:187778.

NOTE: This is an internal RASlog message.
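
Alerts such as the F_Port latency example above can be extracted from collected logs with a script. The Python sketch below pulls the slot, port, affected percentage, and window length out of an AN-1003 message. The regular expression is derived only from the single sample shown in this appendix, so the message layout is an assumption and may differ between FOS releases; the names AN_1003 and parse_latency_alert are hypothetical.

```python
import re

# Pattern based solely on the sample AN-1003 message in this appendix.
AN_1003 = re.compile(
    r"\[AN-1003\].*?slot (\d+), port (\d+)\. "
    r"([\d.]+) percent of last (\d+) seconds"
)

SAMPLE = """2010/03/16-03:40:47, [AN-1003], 21760, FID 128, WARNING, sw0, Latency
bottleneck at slot 0, port 38. 100.00 percent of last 300 seconds were
affected. Avg. time b/w transmits 80407.3975 us."""


def parse_latency_alert(message):
    """Return the key fields of an AN-1003 alert, or None if no match."""
    # Log lines may be wrapped, so collapse all whitespace first.
    flat = " ".join(message.split())
    match = AN_1003.search(flat)
    if match is None:
        return None
    slot, port, percent, window = match.groups()
    return {
        "slot": int(slot),
        "port": int(port),
        "percent": float(percent),
        "window_s": int(window),
    }


print(parse_latency_alert(SAMPLE))
```

Messages that do not match the pattern return None, so other RASlog entries in the same file are simply skipped.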


APPENDIX B: CONFIGURING BOTTLENECK DETECTION MITIGATION BEHAVIOR
To enable Bottleneck Detection-based mitigation action on all F_Ports in a switch, use:

bottleneckmon --enable -act

To enable or disable mitigation action after enabling Bottleneck Detection, use:

bottleneckmon --config -act
OR
bottleneckmon --config -noact

To enable or disable mitigation action for a specific port after enabling Bottleneck Detection, use:

bottleneckmon --config -act [slot/]port
OR
bottleneckmon --config -noact [slot/]port

All F_Ports with Bottleneck Detection enabled and the -act flag set are subject to mitigation action.
Ports excluded from Bottleneck Detection (using the --exclude operation) are also excluded from
mitigation action.


APPENDIX C: CONFIGURING PORT FENCING
Use the portFencing CLI command to enable error reporting for the Fabric Watch Port Fencing feature on all
ports of a specified type and to configure the ports to report errors for a specific area. Supported port types
include E_Ports, F_Ports, and physical ports. A specified port type can be configured to report errors for one
or more areas.

Port Fencing monitors ports for erratic behavior and disables a port if specified error conditions are
met. The portFencing CLI command enables or disables the Port Fencing feature for an area of a
class. You can customize or tune the threshold of an area using the portThConfig CLI command.

Use portFencing to configure Port Fencing for C3TX_TO. For example:

portfencing --enable fop-port -area C3TX_TO

The same command can be used to configure Port Fencing on link reset. For example:

portfencing --enable fop-port -area LR

Use portThConfig to customize Port Fencing thresholds:

switch:admin> portthconfig --set port -area crc -highthreshold -value 2 -trigger above -action email

switch:admin> portthconfig --set port -area crc -highthreshold -trigger below -action email

switch:admin> portthconfig --set port -ar crc -lowthreshold -value 1 -trigger above -action email

switch:admin> portthconfig --set port -ar crc -lowthreshold -trigger below -action email

To apply the new custom settings so they become effective:

switch:admin> portthconfig --apply port -area crc -action cust -thresh_level custom

To display the port threshold configuration for all port types and areas:

switch:admin> portthconfig --show

Port Fencing Threshold Recommendations
Port Fencing threshold recommendations for three areas are shown in the following table.

 Area             Recommended Threshold

 Link Reset       5
 State Change     7
 TX_TO            5


Threshold recommendations for CRC errors and Invalid Words. CRC errors and Invalid Words can occur on
normal links. They have also been known to occur during certain transitions such as server reboots. When
these errors occur more frequently, they can cause a severe impact. While most systems can tolerate
infrequent CRC errors or Invalid Words, other environments can be sensitive to even infrequent instances.
The overall quality of the fabric interconnects is also a factor.

When establishing thresholds for CRC errors and Invalid Words, consider the following. In general,
“cleaner” interconnects can have lower thresholds as they should be less likely to introduce errors on the
links. Moderate (recommended), conservative and aggressive threshold recommendations are provided in
the table below. After selecting the type of thresholds for an environment, set the low threshold with an
action of ALERT (RASlog, e-mail, SNMP trap). The alert is triggered whenever the low threshold is
exceeded. Set the high threshold with an action of Fence. The port is fenced (disabled) whenever the
high threshold is exceeded. Aggressive threshold suggestions do not include a low setting; they have
only a high value, which triggers the fencing action.

 Area            Moderate/Recommended    Aggressive    Conservative
                 Threshold               Threshold     Threshold

 CRC             Low 5, High 20          High 2        Low 5, High 40
 Invalid Word    Low 25, High 40         High 25       Low 25, High 80
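
The low/high threshold scheme can be summarized as a small decision function. The Python sketch below encodes the table above and returns the action that a given error count would trigger. It is an illustration of the policy only, not of Fabric Watch internals: the names PROFILES and action_for are hypothetical, and the counts are assumed to be measured over whatever interval the thresholds are configured for.

```python
# (low, high) thresholds per area; None means no low threshold is set.
PROFILES = {
    "moderate": {"CRC": (5, 20), "Invalid Word": (25, 40)},
    "aggressive": {"CRC": (None, 2), "Invalid Word": (None, 25)},
    "conservative": {"CRC": (5, 40), "Invalid Word": (25, 80)},
}


def action_for(area, count, profile="moderate"):
    """Return 'fence', 'alert', or 'none' for an observed error count."""
    low, high = PROFILES[profile][area]
    # The high threshold carries the Fence action.
    if count > high:
        return "fence"
    # The low threshold, when present, carries the ALERT action.
    if low is not None and count > low:
        return "alert"
    return "none"


print(action_for("CRC", 6))                 # above low, below high
print(action_for("CRC", 21))                # above high
print(action_for("CRC", 3, "aggressive"))   # aggressive has only a high value
```

With the moderate profile, a CRC count between the two thresholds raises an alert, while the aggressive profile goes straight to fencing once its single threshold is exceeded.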


APPENDIX D: CONFIGURING EDGE HOLD TIME
A user can configure the Edge Hold Time using the following commands. The switch does not need to be
disabled to modify the hold time. Use the Configure edge hold time option to turn this feature on or off.

configure

Not all options will be available on an enabled switch.
To disable the switch, use the "switchDisable" command.

Configure...

Fabric parameters (yes, y, no, n): [no] yes

 Configure edge hold time (yes, y, no, n): [yes]
 Edge hold time: (100..500) [100]

The edge_hold_time value is persistently stored in the configuration file. All configuration file operations
such as upload and download are supported for this feature.

NOTE: This setting is available only in FOS 6.3.1b and later.

© 2010 Brocade Communications Systems, Inc. All Rights Reserved. 06/10 GA-BP-300-00

Brocade, the B-wing symbol, BigIron, DCFM, DCX, Fabric OS, FastIron, IronView, NetIron, SAN Health, ServerIron, TurboIron, and
Wingspan are registered trademarks, and Brocade Assurance, Brocade NET Health, Brocade One, Extraordinary Networks,
MyBrocade, and VCS are trademarks of Brocade Communications Systems, Inc., in the United States and/or in other countries.
Other brands, products, or service names mentioned are or may be trademarks or service marks of their respective owners.

Notice: This document is for informational purposes only and does not set forth any warranty, expressed or implied, concerning
any equipment, equipment feature, or service offered or to be offered by Brocade. Brocade reserves the right to make changes
to this document at any time, without notice, and assumes no responsibility for its use. This informational document describes
features that may not be currently available. Contact a Brocade sales office for information on feature and product availability.
Export of technical data contained in this document may require an export license from the United States government.
