Zero Overhead Monitoring for Cloud-native Infrastructure using RDMA

Zhe Wang, Shanghai Jiao Tong University; Teng Ma, Alibaba Group; Linghe Kong, Shanghai Jiao Tong University; Zhenzao Wen, Jingxuan Li, Zhuo Song, Yang Lu, Yong Yang, and Tao Ma, Alibaba Group; Guihai Chen, Shanghai Jiao Tong University; Wei Cao, Alibaba Group

https://www.usenix.org/conference/atc22/presentation/wang-zhe

This paper is included in the Proceedings of the 2022 USENIX Annual Technical Conference, July 11–13, 2022, Carlsbad, CA, USA. ISBN 978-1-939133-29-8.
Zero Overhead Monitoring for Cloud-native Infrastructure using RDMA

Zhe Wang1, Teng Ma2, Linghe Kong1, Zhenzao Wen2, Jingxuan Li2, Zhuo Song2, Yang Lu2, Yong Yang2, Tao Ma2, Guihai Chen1, Wei Cao2
1 Shanghai Jiao Tong University    2 Alibaba Group

Abstract

Cloud services have recently undergone a major shift from monolithic designs to microservices running on the cloud-native infrastructure, where monitoring systems are widely deployed to ensure the service level agreement (SLA). Nevertheless, the traditional monitoring system no longer fulfills the demands of cloud-native monitoring, as observed from practical experience in Alibaba cloud. Specifically, the monitor occupies resources (e.g., CPU) of the monitored infrastructure, disturbing services running on it. For example, enabling the monitor causes jitters/declines of online services in Alibaba's "double eleven" shopping festival with high loads. On the other hand, the quality of service (QoS) of monitoring itself, which is vital to track and ensure SLA, is not guaranteed on a highly loaded system.

In this paper, we design and implement a novel monitoring system, named ZERO, for cloud-native monitoring. First, ZERO achieves zero overhead to collect raw metrics from the monitored hosts using one-sided remote direct memory access (RDMA) operations, thus avoiding any interference to cloud services. Second, ZERO adopts a receiver-driven model to collect monitoring metrics with high QoS, where credit-based flow control and a hybrid I/O model are proposed to mitigate network congestion/interference and CPU bottlenecks. ZERO has been deployed and evaluated in Alibaba cloud. Deployment results show that ZERO achieves no CPU occupation at the monitored host and supports 1∼10k hosts with 0.1∼1s sampling intervals using a single thread for network I/O.

1   Introduction

Recent shifts in the production cloud environment from monolithic designs to microservice-based architecture [33, 34] have made cloud-native infrastructure the cornerstone of cloud computing services. Cloud-native applications consist of thousands of single-concern, loosely-coupled microservices running on containerized platforms [76]. The underlying systems are treated as disposable and immutable, finally enabling highly available, flexible and scalable cloud services.

In order to ensure the service level agreement (SLA) [52], the whole infrastructure is monitored with not only the upper-layer application metrics, but also the fundamental system metrics [84]. The novel cloud-native infrastructure, however, brings new challenges/demands to cloud-native monitoring, along with two major issues in commercial deployments.

First, traditional monitoring systems [10, 11, 65] occupy host (physical/virtual machine, PM/VM) resources to collect, process and upload metrics (Figure 1), which inevitably causes resource contention with cloud services; enabling monitors causes jitters/declines of online services in Alibaba's "double eleven" shopping festival (Figure 3). To ensure service SLA under resource constraints, the monitor deployed at the host should have no resource occupation.

Second, the quality of service (QoS) of monitoring itself is not guaranteed, which fails to support massive metrics with rapid variations in cloud-native monitoring. The latency/throughput of monitoring jitters severely due to high system loads or the small CPU quota set by the cloud provider (Figure 4). However, a monitoring system with high QoS is vital to track and ensure the SLA of monitored services [80, 84].

To resolve the limitations of the traditional monitoring system and fulfill the demands of cloud-native monitoring, we design and implement the novel ZERO monitoring system in this paper. ZERO proposes a receiver-driven model, which collects raw metrics from the monitored host via one-sided RDMA operations, i.e., RDMA read. Based on the ZERO framework, the monitoring system is expected to achieve no CPU occupation at the monitored host, low latency and high throughput, finally avoiding any interference to services and fulfilling the QoS requirements of large-scale distributed monitoring.

However, there still exist several challenges to achieve the above goals. As shown in Figure 1, the traditional monitor collects and processes raw metrics from the monitored processes, then uploads metrics to the remote host, which inevitably causes CPU occupation. How to manage memory regions of system/application metrics and expose them to the remote host, finally achieving zero-overhead monitoring via RDMA read, is challenging.

Figure 1: Traditional monitoring system.

Figure 2: ZERO monitoring system.
On the other hand, the remote monitoring host becomes a bottleneck in the receiver-driven model, as a tradeoff of offloading local monitoring overhead to the remote host. The remote host performs RDMA read on many monitored hosts, resulting in the incast problem [85] (i.e., multiple senders transfer data to one receiver simultaneously). The remote host not only collects metrics, but also processes raw metrics for further operations, all of which are CPU intensive. How to enable large-scale monitoring under these network/CPU bottlenecks is challenging as well.

To access raw metrics with no CPU occupation, ZERO proposes a novel control plane and data plane. For ease of clarity, we separate ZERO into a local agent and a remote controller (Figure 2). To achieve high scalability in reliable connection (RC) mode [26, 82], system/application metrics are managed by one agent and share one queue pair (QP) connection. In the control plane, the ZERO agent provides universal interfaces for systems/applications to register the memory regions of their metrics at the RDMA NIC (RNIC). The metadata of these metrics are recorded at the control region. The ZERO controller can thus acquire the metadata of metrics from the control region as the prerequisite to access raw metrics. All metrics only need to be registered once if the metadata is not updated, after which the ZERO agent enters blocking mode. In the data plane, the memory regions of metrics (data region) are exposed to the agent process via shared memory, and finally to the remote controller. The ZERO controller can thus perform RDMA read on the data region directly without involving memory copies and CPU usage at the monitored host. As a result, ZERO achieves disposable overhead in the control plane and zero overhead in the data plane.

To deal with the network/CPU bottlenecks at the controller, ZERO proposes credit-based flow control (Credit-FC) and a hybrid I/O model. We observe that the receiver-driven model is superior to the traditional sender-driven model (i.e., the agent actively uploads metrics per heartbeat): i) it collects metrics on demand to ensure the QoS of monitoring on latency; ii) it limits the total in-flight data of concurrent flows to avoid network congestion/interference. Accordingly, ZERO proposes Credit-FC to mitigate the incast problem while fulfilling the latency/throughput requirements of monitoring. On the other hand, ZERO introduces a hybrid I/O model (a combination of event-driven and busy-polling mechanisms) and adopts thread dispatching to remedy the CPU bottlenecks of collecting and processing metrics, respectively.

As case studies, we integrate application (Redis [12]) metrics, system (kernel/containers [27, 76]) metrics and eBPF [3] metrics into the ZERO framework to demonstrate its generality and flexibility. We also share our experience of building a large-scale monitoring system using RDMA.

The major contributions of this paper are summarized as follows:
• We propose the first zero-overhead monitoring system, ZERO, to resolve limitations of the traditional monitoring system in cloud-native monitoring.
• We tackle several challenges of zero-overhead monitoring, including a data plane with no CPU involvement, network congestion/interference caused by monitoring traffic, and CPU bottlenecks at the controller.
• We have deployed and evaluated ZERO in Alibaba cloud. ZERO achieves no CPU occupation at the monitored host and supports 1∼10k hosts with 0.1∼1s sampling intervals. We also share our experience with ZERO.

The paper is organized as follows. Section 2 introduces the background and motivation. Section 3 proposes zero-overhead monitoring. Section 4 designs and implements the ZERO framework. Section 5 presents case studies. Section 6 evaluates the proposed design. Section 7 introduces the experience and future work. Section 8 discusses related works and Section 9 concludes this paper.

Figure 3: Monitor interfering with services.

Figure 4: Monitoring jitters with high system loads in Netdata (upper) and Prometheus (bottom).
2   Background and Motivation

In this section, we further elaborate cloud-native monitoring and the inherent limitations of the traditional monitoring system.

2.1   Cloud-Native Monitoring

The monitoring system deployed at the cloud-native infrastructure, namely cloud-native monitoring, is indispensable to ensure the SLA of cloud services. The monitor collects the bottom-layer system metrics, such as the utilization of physical resources (CPU, memory, etc.). Based on system metrics, the monitoring system performs health checks on the underlying system, makes early alerts and provides suggestions to administrators [18]. Furthermore, by analyzing the historical resource consumption and performance variations, cloud providers improve system utilization and lower operational expenses (OpEx) [27, 28, 41]. On the other hand, the upper-layer application metrics, e.g., requests per second of a key-value service [12, 45], directly reflect user activities and the functional state of applications. The monolithic applications are decoupled into thousands of microservices [33], all of which are monitored to track and ensure the SLA.

The novel cloud-native applications together with the essential infrastructure bring new challenges/demands to cloud-native monitoring, along with two major issues in commercial deployment.

How to avoid the monitor interfering with services? Microservices have much stricter QoS requirements than typical applications [33]. However, the cloud-native environment is highly resource constrained. For example, Alibaba cloud adopts mixed deployment of CPU-intensive online services [42] and I/O-intensive batch jobs [93] at the same host, to maximize resource utilization [43, 78] and reduce long-term capital expenses (CapEx). Microsoft Azure also reports that 80% of VMs only have 1∼2 vCPU cores [27]. The monitor should have no CPU occupation to avoid contention with cloud services.

How to ensure the QoS of monitoring? Cloud-native monitoring needs to support massive metrics with rapid variations. Cloud providers, such as Alibaba, Netflix and Uber, need to monitor millions of metrics with hundreds to thousands of microservices [80]. Besides, these application metrics (financial transactions, social network and e-commerce [67, 94]) and system metrics (CPU, memory, network [6, 54]) have rapid variations on a time scale of seconds/milliseconds. To track and ensure the SLA of services, monitoring requires high QoS in terms of latency and throughput.

2.2   Traditional Monitoring System

We next introduce traditional distributed monitoring systems [10, 51, 65] and elaborate their limitations as the motivation of ZERO. As shown in Figure 1, multiple collectors acquire application/system metrics via specific interfaces, meanwhile raw metrics are processed into the final outcomes (Section 5). After that, the collected metrics are uploaded to the remote controller for further analysis and visualization. Each step occupies host CPUs for memory copy, calculation, network transmission, etc., which inevitably interferes with services running on the host. On the other hand, the traditional monitor relies heavily on the kernel's TCP/IP network stack to transmit metrics. However, kernel data processing overhead has become the main bottleneck of end-to-end latency/throughput [90]. State-of-the-art works thus offload network functionality from the kernel to a user-space network stack [39, 53, 64] or to hardware [35, 44, 70].

We next elaborate the limitations of traditional monitors in our real deployment.

We consider two representative open-source monitors, i.e., Netdata [10] and Prometheus [11]. We observed that enabling monitors causes jitters of online services in Alibaba's "double eleven" shopping festival with high loads. We use Netdata to monitor a highly loaded host running Redis services with a 1s sampling interval. As shown in Figure 3, the monitor process causes jitters of the Redis service, i.e., the throughput declines by 6.25% while tail latency increases by 2× periodically, due to the CPU occupation in each monitoring cycle. The reasons for such "interference spikes" are twofold. First, the deployment of service and monitor processes may adopt default CPU scheduling or specific CPU bonding [48], where contention happens when service/monitor processes are scheduled/bonded to the same CPU core. Second, the CPU utilization stays at a high water level and exhibits a bursty nature, especially during sales promotions with high loads. Thus, the duty-cycled monitoring process with slight CPU occupation (1∼5%) already causes severe interference (Figure 3). The CPU breakdown of the monitor shows that the uploading phase occupies 5∼10% of the total CPU utilization while the collecting phase occupies the majority. On the other hand, the QoS of monitoring is highly affected by the system load or the CPU quota set by the cloud operator. The latency of monitoring increases by more than 10× when the host CPU is saturated or the CPU quota is reached (Figure 4). Assigning dedicated cores for monitoring may avoid these problems, but causes a large waste of resources and high CapEx. Besides, the monitoring process may be blocked when requesting metrics via service interfaces (Section 5).

3   Zero-overhead Monitoring

To avoid resource contention with services and ensure the QoS of monitoring, we propose the zero-overhead monitoring system, namely ZERO, for cloud-native monitoring. ZERO exploits features of monitoring metrics and RDMA capability in modern data centers. The basic idea is that most raw metrics are counters updated at fixed memory regions, so the remote controller can obtain these metrics by performing RDMA read on these memory regions without any CPU involvement at the monitored side. We next elaborate the possibilities and challenges of realizing ZERO.

Metric features. ZERO is based on two features of monitoring metrics. First, most of the monitoring metrics are counters. Systems/applications update these metrics at fixed memory regions after initialization. For instance, the number of stored/evicted key-value pairs in Redis and the number of sent/received packets recorded by the kernel stack are all counters. Second, processing of raw metrics is simple algebraic calculation and can be offloaded to the controller. For example, Redis exports statistical data computed on the raw metrics, and per-CPU counters are summed to get the final kernel metrics. These features are general to system/application metrics, and ZERO can thus support various metrics as a universal framework (Section 5).

RDMA support. ZERO then leverages one-sided RDMA operations to read raw metrics/counters, to achieve no CPU/kernel involvement at the monitored host. RDMA has been widely used in data centers and provides new characteristics of low latency (as low as 1 µs), high bandwidth (more than 100Gbps) and kernel/CPU bypass. RDMA supports both one-sided and two-sided operations. The one-sided operations directly operate on the remote memory via read and write without involving the remote server's CPU. To perform one-sided RDMA operations, one needs to register the memory region (MR) at the RNIC of the remote host and acquire the generated remote protection key (rkey). The two-sided operations, i.e., send and recv, communicate via an interface similar to a socket. In the following, we refer to RDMA read, write, send and recv as READ, WRITE, SEND and RECV, respectively. RDMA hosts create queue pairs (QP), each consisting of a send queue and a receive queue, then post RDMA operations on the send/receive queue to communicate with the remote host. RDMA transport supports reliable or unreliable connection (RC/UC) and unreliable datagram (UD). One-to-one connections between QPs are required in RC/UC mode, whereas one-to-many communication is supported in UD mode. Different transport types support different subsets of RDMA operations, and the READ operation is only supported in RC mode.

Challenges. While ZERO is expected to achieve zero-overhead monitoring, there still exist two challenges to make the idea practical. First, we observe that most CPU time of the traditional monitor is spent on collecting metrics from system/application processes (Section 2.2). ZERO also needs to eliminate such overheads in its data plane, besides the transmission overheads offloaded to the RNIC. Second, the controller is bottlenecked on both network and CPU with a large number of monitored hosts, as all monitoring overheads are offloaded to the remote controller. With all these challenges, the key innovation of ZERO lies in effectively exploiting one-sided RDMA and designing the separate control/data plane to realize zero-overhead monitoring. ZERO further incorporates several designs to resolve practical issues (network congestion/interference, scalability) in distributed monitoring.

4   Design and Implementation

In this section, we present the overview of the ZERO framework. We then introduce the design and implementation of ZERO in detail.

4.1   Overview

As shown in Figure 2, ZERO proposes a novel control plane and data plane to collect raw metrics without CPU involvement at the monitored host. ZERO adopts a receiver-driven model to collect metrics from a large number of hosts, and deals with the network and CPU bottlenecks at the controller.
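To make the one-sided data path concrete, the following is a minimal sketch (not ZERO's released code) of how a controller could issue a READ with the standard libibverbs API, assuming an already connected RC queue pair, a locally registered buffer, and the remote address/rkey learned from the control region:

/* Minimal sketch: fetch a remote data region via one-sided READ.      */
#include <infiniband/verbs.h>
#include <string.h>

int read_remote_metrics(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_mr *mr, void *buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,        /* local destination buffer   */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided READ     */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* from control region  */
    wr.wr.rdma.rkey        = rkey;               /* from control region  */

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll one completion here; the hybrid I/O model in Section 4.3
     * would instead use a completion channel for large batches.        */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}

No code runs on the monitored host in this path; only the RNIC serves the READ, which is the property the data plane relies on.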

(a) Inefficient READ.   (b) Efficient READ.
Figure 5: READ w/o (left) or w/ (right) memory management.

The scalability of an RDMA-based system is constrained by the on-chip memory (SRAM) of the RNIC [26, 82]. To achieve high scalability at the controller, all metrics at a host are managed by one agent and share one QP connection. In the control plane, the ZERO agent provides an interface for systems and applications to register the memory regions of their metrics (data region) at the RNIC. The metadata of metrics (e.g., address, rkey) are written into the control region. The ZERO controller can thus obtain metadata by reading the control region and access raw metrics by reading the data region. The agent process is blocked unless systems/applications need to register/update metrics. In the data plane, the memory regions of metrics are exposed to the agent process via shared memory, and finally to the remote controller. The controller performs READ on the data region to acquire metrics, which achieves zero copy and no CPU involvement at the monitored host. In the real deployment, the control-plane overhead is usually disposable: the metrics only need to be registered once, then are kept in use or updated after a long time. The data plane has no CPU occupation, as expected.

ZERO supports large-scale monitoring via a single controller. ZERO adopts thread dispatching, where only 1∼2 threads are used to collect metrics and the rest of the cores are used to process metrics in parallel for further operations, e.g., visualization and persistence. The ZERO controller uses the receiver-driven model, i.e., issuing READ requests on the monitored hosts to collect raw metrics, which is superior to the traditional sender-driven model. ZERO achieves high monitoring QoS by posting READ requests on demand and avoids network congestion/interference by controlling the total in-flight data. Accordingly, credit-based flow control (Credit-FC) and a hybrid I/O model are proposed to remedy the incast problem and CPU bottleneck, respectively.

4.2   ZERO Control Plane and Data Plane

We introduce the ZERO control plane and data plane, together with the ZERO framework usage and interface, in this subsection.

Control plane. The ZERO agent deals with registration requests from systems/applications, using UNIX domain sockets for inter-process communication (IPC). To handle a request, the MRs of monitoring metrics are registered at the RNIC and the metadata of the MRs (e.g., address, type, size, rkey) are recorded at the control region (Figure 5a). Note that an MR is pinned after its metrics are registered and will be released only after the metrics expire. The ZERO agent registers the control region at the RNIC and builds a QP connection with the remote controller in advance. The controller can thus acquire the metadata of metrics by reading and parsing the control region, then access raw metrics.

The control plane has disposable overhead. The ZERO agent inevitably occupies host resources to deal with registration requests. However, we observe that the control-plane overhead is disposable for most metrics. We divide metrics into three types according to the practice in Alibaba cloud. First, metrics of the underlying systems and persistent services, e.g., database and storage services [56, 91], are usually immobile once registered. Second, some services may be dynamically created/destroyed. For example, e-commerce services are periodically expanded/shrunk according to the number of users shopping online [60]. Third, some user requests are served by ephemeral serverless functions to mitigate the cost of long-lived services with intermittent activities [25, 33].

Generally, the ZERO agent enters blocking mode with no CPU occupation (Figure 2). When services change, the ZERO agent resumes to (de)register metrics and update the corresponding control region. For the first two cases, the ZERO agent handles registration requests and updates control regions infrequently with negligible overhead. The ZERO controller also only reads control regions once in a long period. In the serverless case, the ZERO agent may frequently (de)register metrics of serverless functions. The ZERO agent further uses WRITE to write the updated metadata into the remote control region (Figure 2). The ZERO controller can thus obtain raw metrics by only reading the data region within one RTT. Note that the overhead of ZERO using WRITE or SEND/RECV is still much lower than that of the traditional monitor (Section 6.2).

Data plane. To eliminate the overhead of collecting metrics from multiple processes and avoid frequent memory copies, ZERO exclusively adopts shared memory in its data plane. Specifically, the ZERO agent uses mmap operations, which are invoked when registering metrics, to expose the MRs of metrics (data region) to the agent process, and finally to the remote controller. The mmap operation takes a 4KB page as the basic unit. However, metrics are not necessarily located at the page header (Figure 5). The ZERO agent calculates the page header of metrics to mmap their pages. As shown in Figure 5, the ZERO agent records the page address and page offset in the control region. To protect the data region from being modified by the local agent or a malicious remote host, ZERO sets read-only access to the data region via mmap flags and uses the rkey mechanism inherently supported by RDMA. The ZERO controller obtains raw metrics by performing READ on the data region.

The data plane has zero overhead. The data plane achieves zero CPU occupation, zero copy, and no extra memory footprint at the monitored host, via the shared memory design.
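As an illustration of this shared-memory data plane, the sketch below is a simplified, hypothetical variant: instead of remapping a service's existing metric pages in place (as ZERO does), it keeps the counters in a named POSIX shared-memory segment that the agent maps read-only and registers for remote READ. The segment name and struct layout are illustrative only:

/* Hedged sketch of the data plane; not ZERO's actual interface.        */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <infiniband/verbs.h>

#define METRIC_PAGE "/zero_redis_metrics"   /* hypothetical shm name    */
#define PAGE_SIZE   4096

/* Service side: counters live directly in the shared page and are
 * updated in place, with no copies.                                    */
struct redis_metrics {                       /* illustrative layout     */
    volatile unsigned long stored_keys;
    volatile unsigned long evicted_keys;
};

struct redis_metrics *service_create_metrics(void)
{
    int fd = shm_open(METRIC_PAGE, O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, PAGE_SIZE) < 0)
        return NULL;
    return mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}

/* Agent side: map the same page read-only and register it so the
 * controller can READ it; the returned rkey goes into the control
 * region along with the address and offset.                            */
struct ibv_mr *agent_expose_metrics(struct ibv_pd *pd)
{
    int fd = shm_open(METRIC_PAGE, O_RDONLY, 0);
    void *page = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    if (page == MAP_FAILED)
        return NULL;
    return ibv_reg_mr(pd, page, PAGE_SIZE, IBV_ACCESS_REMOTE_READ);
}

The read-only mapping and remote-read-only access flag mirror the protection described above: neither the agent nor the remote controller can modify the metric page.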

// Type one: specifying attributes of variables
struct disk my_disk __attribute__((section(".zero_init"))) = {
    .disk = "sda",
    .hash = 0x000f3456, ...
};
// Type two: using the allocator
struct disk *my_disk = zero_malloc(sizeof(struct disk));

Figure 6: Management interface for two types of metrics.
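How a service hands such a region to the agent is not spelled out in the figure; the call below is a hypothetical stand-in (zero_register() does not appear in the paper) for the UNIX-domain-socket registration request described in Section 4.2:

/* Hedged illustration; zero_register() and its arguments are
 * hypothetical. zero_malloc() is the type-two allocator of Figure 6.   */
struct disk *d = zero_malloc(sizeof(struct disk));
d->hash = 0x000f3456;
/* The agent then mmaps the backing pages, registers them at the RNIC,
 * and records address/offset/size/rkey in the control region.          */
zero_register("redis-disk-metrics", d, sizeof(struct disk));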
An alternative solution of copying metrics when updating would cause frequent CPU occupation for memory copies and extra memory footprints. Besides, the ZERO data plane ensures read-write consistency between remote and local memory. Most application/system metrics are defined as atomic variables, which are updated atomically in the shared memory. An atomic update only needs 1∼3ns on the Intel Haswell architecture [74], three orders of magnitude lower than an RDMA operation (1µs), so memory consistency is guaranteed between update and READ. For non-atomic variables, ZERO uses bit flags to indicate the updating state. FaRM [31] and Pilaf [68] use a similar method to ensure data consistency with READ. The ZERO controller checks whether metrics are read correctly via the bit flags and retries in the next cycle. We eliminate all synchronous locks for zero overhead, which may cause inconsistency under rare race conditions while ensuring accuracy for most cases.

Memory management is indispensable. While the local host achieves zero overhead in the ZERO data plane, we observe that simply performing READ on massive metrics cannot achieve desirable performance. Because metrics are distributed across the process/kernel space with discrete memory addresses, the ZERO controller needs to read a large number of entries in the control region as well as metrics in the data region (Figure 5a). However, the bandwidth of READ falls rapidly and the latency is nearly doubled when the number of MRs increases from 100 to 10k, due to evictions in the RNIC SRAM [26, 82]. ZERO introduces memory management to reduce the number of MRs and READ requests required to collect massive metrics. Specifically, ZERO proposes two memory-management mechanisms for two types of metrics (Figure 6). First, many metrics are global variables or data structures. One can mark these metrics by specifying attributes of variables [13]; the compiler then places these metrics in the same data segment. Second, metrics may be defined as pointers to variables. ZERO provides a memory allocator for these metrics. Specifically, the MRs of metrics are allocated with continuous space via the allocation API. The core idea of both methods is concatenating metrics into the same MR to support massive metrics. Besides, the data region is aligned as a struct and recorded at the control region for ease of memory parsing at the controller. As shown in Figure 5b, the ZERO controller only needs to post one READ request on the same MR to get a list of metrics.

Framework usage and interface. ZERO can be easily deployed at hosts (PM/VM) with RDMA support. The ZERO agent and controller need to be initiated at the local and remote host, respectively. Systems/applications then invoke the agent API to manage and (de)register their metrics. The agent accomplishes all control-plane operations, e.g., mmap and updates of the control region, when handling (de)register requests.

4.3   Scaling-out Monitoring

We next present how to support large-scale monitoring with a single ZERO controller. ZERO proposes credit-based flow control (Credit-FC) and a hybrid I/O model, to avoid network congestion/interference and remedy CPU bottlenecks, respectively. The ZERO controller needs to collect and process metrics from a large number of hosts, while fulfilling the monitoring QoS in latency and throughput. To achieve this goal, the controller adopts thread dispatching to collect and process metrics in parallel with individual threads.

Collecting metrics. The controller only assigns 1∼2 threads to collect metrics. ZERO achieves highly efficient network I/O with a single thread by posting READ requests and then polling completions on multiple QPs in batches. This is feasible because both post_send and poll_cq are fast non-blocking operations. According to our experiment, the batch operations only add negligible latency (tens of µs).

The receiver-driven model is superior to the sender-driven model. Issuing READ on the data region turns out to be a receiver-driven model for collecting metrics from multiple hosts. The receiver-driven model has two benefits compared with the traditional sender-driven model. The controller posts READ requests on demand to meet the target latency or updating frequency in monitoring. Besides, it also facilitates avoiding network congestion/interference by limiting the total in-flight data of concurrent flows. We next intuitively formulate the scale-out ability of such a receiver-driven model. The monitored hosts have different requirements in terms of updating interval U and data size S, i.e., the controller needs to collect S bytes every U seconds for a specific host. Assuming the bandwidth B of the receiver is fully utilized, the maximum number of supported hosts is n = B × U / S. Our deployment shows that ZERO supports at least 1k hosts with 128KB of metrics and a 100ms sampling interval.

Credit-based flow control. Concurrent READ requests generate burst network traffic, which is transmitted from multiple hosts to the controller simultaneously, resulting in severe incast problems. ZERO introduces Credit-FC to remedy the incast problem, which works as follows.

Large-sized READ requests are segmented into fixed-size fragments of 4KB page size. ZERO chooses such a moderate size due to three considerations: i) a page is the basic unit of shared memory, which can accommodate 1k 32-bit metrics and fulfill the demands of most services; ii) the number of READ requests is bounded, as the total size of metrics in a single host is generally hundreds of KBs; iii) the small size facilitates congestion control under severe incast.

Subsequently, credits are used to limit the total in-flight data of concurrent flows (each identified by a QP). Posting READ requests or polling completion events consumes or regains credits for the target flow. The state-of-the-art works [44, 70] adopt bandwidth-delay product based flow control (BDP-FC), which bounds the in-flight data per flow by the BDP of the network. However, with a large number of concurrent flows, BDP is large enough to cause congestion and trigger explicit congestion notification (ECN) packets or priority-based flow control (PFC) pause frames [95] (Figure 11c). ZERO proposes Credit-FC to limit the total in-flight data. Specifically, the credit of each flow C_f is set to T/n, where T is the total credit and n is the number of concurrent flows. Finally, Credit-FC effectively avoids triggering ECN/PFC (Section 6.3) and network interference with service traffic (Section 7).

Hybrid I/O model. To avoid saturating the thread performing network I/O, ZERO proposes a hybrid I/O model incorporating event-driven and polling mechanisms, similar to NAPI in Linux [73]. The event-driven I/O effectively avoids CPU occupation for busy polling. Each QP is associated with an event channel that notifies a new (first) completion event. The I/O multiplexing interface, e.g., epoll, is used to listen on the fds of multiple event channels. The collecting thread blocks until completion events are notified from one or multiple QPs, then polls the in-flight requests of the corresponding QPs. We observe that the event-driven model effectively reduces CPU utilization with a large number of in-flight requests (Figure 12b). However, when the monitored data size is too small, with only several requests after segmentation, the epoll syscall and thread blocking incur high variations in READ latency (Figure 12a). The controller thus uses busy polling for hosts with a small number of requests. In the hybrid I/O model, ZERO assigns two threads to perform event-driven polling and busy polling, respectively. Note that both threads share Credit-FC.

Processing metrics. The controller dispatches multiple threads to process raw metrics in parallel. Specifically, the MRs of the collected metrics are placed into the appropriate queue, where each MR can be handled by one of multiple processing threads. Each MR, which concatenates a list of metrics, is parsed as a struct directly according to the metadata recorded at the control region. The parsed metrics are then processed by reproducing the same calculations that were originally performed by the monitored host. Finally, the controller imports processed metrics into InfluxDB [7] for persistence and uses Grafana [5] for visualization.

5   Case Study

In this section, we present how to integrate application/system metrics into the ZERO framework using three typical cases: Redis [12], the Linux kernel [81] and eBPF [3].

Redis Case. Redis [14] has been widely deployed in Alibaba cloud as a database, cache, and message broker, providing low-latency in-memory data structure storage services. The traditional monitor acquires Redis metrics by requesting the INFO interface of the Redis server, and thus occupies the resources of the Redis server and the host to obtain metrics. As a comparison, ZERO only needs to register metrics and requires no resource occupation for collecting metrics. There exist more than two hundred metrics in each Redis service instance. The naive implementation is to perform READ on these metrics one-by-one (see Figure 5a), resulting in high latency and CPU utilization. To resolve this problem, we use the allocation API (type two) to allocate and structure Redis metrics with continuous memory. As shown in Figure 5b, ZERO only needs to register and perform READ once for each Redis instance in our implementation. The ZERO controller then parses metrics as a struct according to the memory structure.

Linux Kernel Case. The Linux kernel exports system metrics to user space via the proc interface, which creates files under the /proc directory and binds corresponding kernel functions. The traditional monitor usually needs to read hundreds of proc files to get all system metrics, which incurs extra overhead for user/kernel-space processing. With the ZERO framework, kernel metrics are registered at the ZERO agent and then exported to the ZERO controller directly. We use ZERO to monitor metrics managed by container namespaces for container-based services [1]. Kernel metrics are usually implemented as lock-free per-CPU counters to avoid locking overhead. The ZERO controller needs to READ all replicas of the metrics on each CPU core, located at separate pages. The production cloud environment adopts fine-grained resource assignment and isolation, in which more than 90% of hosts only have 1∼4 CPU cores [27]. ZERO thus only needs 1∼4 requests to obtain kernel metrics.

eBPF Case. The extended Berkeley Packet Filter (eBPF) [3, 66] is an evolving technology that can dynamically attach programs to the running kernel for tracing, instrumenting, and even controlling the kernel code path. eBPF has been widely used in cloud computing for monitoring [6, 54], networking [38, 83], virtualization [21, 22] and security [29, 30]. We use eBPF to monitor traffic and retransmissions of a large number of TCP connections in the MaxCompute [32, 93] service. eBPF provides an in-kernel data structure, called a map, to enable control and data message delivery within the kernel or between kernel and user space. eBPF attaches probes to kernel/application functions at runtime and exports metrics, events and histograms to eBPF maps. A user process reads the entries of an eBPF map via syscalls. However, reading a large number of entries incurs large overhead due to frequent syscalls. To integrate eBPF into the ZERO framework, the eBPF array map is adopted, which supports mmap operations (Linux kernel 5.5+) and can thus export its memory to the ZERO agent directly.

Name     | Nodes    | Hosts             | OS kernel  | Intel Xeon CPU            | Mellanox NIC            | Protocol | ECN | PFC
Cluster1 | 65 × PMs | 1024 × VMs        | Linux 5.5  | E5-2682 (64 cores)        | 2 × 25GbE ConnectX-4 Lx | RoCEv1/2 | ✗   | ✓
Cluster2 | 9 × PMs  | 1024 × Containers | Linux 3.10 | Platinum 8369B (64 cores) | 200GbE ConnectX-6 Dx    | RoCEv1/2 | ✓   | ✓

Table 1: Deployment environment.

6      Evaluation                                                            Metric         Monitor        Redis          Kernel         eBPF
                                                                              Total         Baseline     0.7 ∼ 19.3     0.5 ∼ 1.6     0.8 ∼ 12.5
                                                                             Latency       Z ERO RPC    0.08 ∼ 0.18    0.14 ∼ 0.36    0.10 ∼ 1.02
6.1    Evaluation Setup

In our evaluation, we adopt a multi-phase deployment with two typical clusters, as summarized in Table 1. Initially, we deploy Z ERO in a test environment (Cluster1) with a rational scale that mimics the production environment for demonstration. We next deployed Z ERO in a public-cloud environment (Cluster2), which covers common cloud services in production. The deployment scale has continued to grow according to the feedback of canary testing and the actual demands of services. In Cluster1, services operate in guest VMs with 4 vCPU cores. The RNICs are virtualized and assigned to the VMs via passthrough [87]. In Cluster2, services are deployed in containers running on bare-metal servers [92]. Each VM/container is monitored independently to evaluate the scalability of Z ERO. Note that both configurations are typical in the Alibaba cloud-native platform [15]. We use Z ERO to monitor typical services, e.g., Redis [14], container [1] and MaxCompute [32, 93].

We evaluate Z ERO performance from three aspects:

• CPU Utilization: The CPU utilization is defined as the occupied CPU time per second. We use the perf tool to measure the CPU utilization of both the Z ERO agent and the controller. We verify the CPU involvement of the monitor in the control plane and the data plane. We also examine the CPU utilization of the Z ERO controller for scalability.

• Latency: The latency of Z ERO is the time used to READ all metrics from the monitored host. For a traditional monitor, the latency consists of the time used to collect/process all metrics plus the time used to upload them. We use latency to verify whether a monitor can keep up with the update frequency of the metrics.

• Throughput: The throughput is the number of bytes collected per second during each monitoring period. We use throughput to verify whether a monitor can support massive metrics.

We test the impact of the following parameters on monitoring performance:

• Sampling Interval: The required sampling interval is determined by the update frequency of the metrics. We evaluate Z ERO with 10 ∼ 1000ms sampling intervals and use 1000ms by default.

• Number of Instances/Connections/Requests: All three parameters impact the CPU utilization and the data size of monitoring. The number of service/container instances varies from 10 to 40, with a default value of 10. The number of TCP connections varies from 1k to 16k, with a default value of 1k. The number of work requests varies from 8 to 128 for a host, with a default value of 32.

• Number of Hosts: One Z ERO controller is deployed to monitor multiple hosts running the Z ERO agent. We increase the number of hosts (from 64 to 1024) to evaluate the scalability of the Z ERO framework.

We use the state-of-the-art monitoring system Netdata [10], the existing NX monitor in Alibaba cloud, and Z ERO RPC as comparison benchmarks:

• Netdata: Netdata is widely used by cloud providers, e.g., Amazon Web Services (AWS) and Microsoft Azure [8]. Netdata integrates application/system metrics in one agent by requesting application interfaces or reading proc files, respectively. The agent uploads metrics to the controller per heartbeat, or the controller acquires metrics by sending RPC requests to the agent.

• NX: NX is a network monitoring tool deployed in Alibaba cloud. NX exports network metrics as logs, e.g., info of TCP connections, then uploads the collected metrics via a log service. We have integrated the Z ERO framework into the NX monitor to improve its performance.

• Z ERO RPC: Z ERO RPC adopts the same data plane to access raw metrics. However, Z ERO RPC is implemented via SEND/RECV instead of READ. The controller issues RPC requests to the agent, after which the agent returns the metrics as the response. We explore design alternatives of two-sided RDMA via Z ERO RPC (a simplified sketch contrasting the two data planes follows this list).
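To make the contrast between the two data planes concrete, the following is a minimal libibverbs sketch, not Z ERO's actual code: a one-sided READ fetches the registered metrics region without involving the agent CPU, whereas the RPC alternative issues a SEND that the agent has to receive, handle and answer. The helper names, and the assumption that remote_addr/rkey were exchanged out of band during connection setup, are ours.

    /* Sketch only: one-sided READ vs. two-sided SEND for fetching metrics. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* One-sided READ: the agent CPU is not involved in the data path. */
    static int post_metrics_read(struct ibv_qp *qp, struct ibv_mr *mr,
                                 void *buf, uint32_t len,
                                 uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                               .lkey = mr->lkey };
        struct ibv_send_wr wr, *bad;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_READ;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;  /* agent's registered metrics region */
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad);
    }

    /* Two-sided (Z ERO RPC style): the controller SENDs a request; the agent
     * must have a RECV posted, parse the request, and SEND the metrics back,
     * which costs agent CPU and extra round trips (Section 6.2). */
    static int post_rpc_request(struct ibv_qp *qp, struct ibv_mr *mr,
                                void *req, uint32_t req_len)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)req, .length = req_len,
                               .lkey = mr->lkey };
        struct ibv_send_wr wr, *bad;
        memset(&wr, 0, sizeof(wr));
        wr.opcode     = IBV_WR_SEND;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;
        return ibv_post_send(qp, &wr, &bad);
    }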
6.2    Z ERO Overhead

We evaluate the CPU utilization and latency of Z ERO in monitoring Redis, container and MaxCompute services. We present the overall performance and summarize two key observations, followed by detailed micro-benchmarks for each case.

    Latency (ms)         Z ERO           0.05 ∼ 0.14   0.07 ∼ 0.23   0.08 ∼ 0.87
                         Baseline        0.5 ∼ 45      0.01 ∼ 4      0.08 ∼ 6
                         Z ERO RPC       0.01 ∼ 0.55   0.08 ∼ 0.9    0.05 ∼ 0.68
    Agent CPU            Control plane   0.05 ∼ 0.07   0.8 ∼ 1.5     0.04 ∼ 0.05
    Utilization (%)      Data plane      0             0             0

                      Table 2: Summary of Z ERO overhead.
[Figure 7: Monitor performance with 10 − 40 Redis instances and 10 − 1000ms sampling interval. Panels: (a) Agent CPU vs. Instances; (b) Latency vs. Instances; (c) Agent CPU vs. Intervals; (d) Latency vs. Intervals.]

[Figure 8: Monitor performance with 10 − 40 container instances and 10 − 1000ms sampling interval. Panels: (a) Agent CPU vs. Instances; (b) Latency vs. Instances; (c) Agent CPU vs. Intervals; (d) Latency vs. Intervals.]

[Figure 9: Monitor performance with 1 − 16k TCP connections and 10 − 1000ms sampling interval. Panels: (a) Agent CPU vs. Entries; (b) Latency vs. Entries; (c) Agent CPU vs. Intervals; (d) Latency vs. Intervals.]

Overall. Table 2 summarizes the overhead of the Z ERO monitor. We focus on the monitoring latency and the CPU utilization of the Z ERO agent. Both Netdata and NX are referred to as baselines. First, Z ERO reduces latency by one to two orders of magnitude compared with the baselines. The following breakdown of the total latency reveals that the baseline methods spend most of the time collecting metrics from system/application processes. The TCP-based baselines actively upload metrics and achieve similar latency in the uploading phase as Z ERO READ. Z ERO RPC has higher latency than Z ERO READ because each RPC requires at least two RTTs [45, 68]. Second, the Z ERO agent achieves disposable overhead in the control plane and zero overhead in the data plane. The CPU utilization of the Z ERO control plane only increases slightly when registering more MRs, and is not affected by the sampling interval. The CPU utilization of the Z ERO data plane is always zero, as expected. In contrast, the CPU utilization of the baselines reaches very high values with lower sampling intervals. Z ERO RPC eliminates the overhead of collecting metrics; however, the CPU utilization for posting SEND/RECV requests and polling completions still increases with lower sampling intervals. In summary, the benefit of Z ERO to cloud services is enabling a higher SLA of the infrastructure, since it effectively avoids performance jitters caused by CPU interference. Z ERO also improves monitoring performance, reducing latency by 1 ∼ 2 orders of magnitude and increasing throughput by 3 ∼ 6× (Section 6.3).

Redis Case. The monitoring performance of the Redis case is shown in Figure 7. Netdata uses single or multiple threads to collect metrics from multiple Redis instances. As the number of service instances increases, the CPU utilization of single- and multi-thread Netdata increases from 0.4 ∼ 0.5% to 1.5 ∼ 2%, which already severely interferes with Redis services (Figure 3).
[Figure 10: Monitor performance with 64 − 1024 hosts × 128KB data in Cluster1. Panels: (a) Average latency vs. Hosts; (b) Total latency vs. Hosts; (c) Throughput vs. Hosts; (d) Controller CPU vs. Hosts.]

[Figure 11: Monitor performance with 64 − 1024 hosts × 128KB data in Cluster2. Panels: (a) Average latency vs. Hosts; (b) Throughput vs. Hosts; (c) Single run with 256 Hosts; (d) Controller CPU vs. Hosts.]

With a 10× lower sampling interval, the Netdata CPU increases 10×, and the gap in CPU utilization between single- and multi-thread becomes larger. However, multi-thread Netdata achieves 0.7 ∼ 1ms latency (Figures 7b and 7d). Netdata thus trades off CPU utilization against latency. On the contrary, Z ERO has negligible CPU utilization (< 0.1%) in its control plane, which is a one-off expense independent of sampling intervals, as shown in Figure 7c. The Z ERO data plane has zero CPU overhead, denoted by the tiny pillar in Figures 7a and 7c. Z ERO reduces latency by one to two orders of magnitude compared with Netdata, as shown in Figures 7b and 7d.

Linux kernel case. Figure 8 illustrates the performance of the Linux kernel case, where Z ERO monitors multiple container instances. As shown in Figures 8a and 8c, the CPU utilization of the Z ERO data plane is zero. The CPU utilization of the Z ERO control plane is high because of the frequent invocations of mmap when registering per-CPU kernel metrics. However, the total CPU time of the Z ERO control plane is fixed, i.e., 3.4 ∼ 14.7 ms to register 10 ∼ 40 instances. As shown in Figure 8c, the overhead of the Z ERO control plane is only 0.8%, independent of sampling intervals, while the Netdata CPU increases to 4% with a 10ms sampling interval. Z ERO achieves 0.07 ∼ 0.1ms latency, which is an order of magnitude lower than that of Netdata (0.5 ∼ 1.6ms). Z ERO may have higher READ latency than the uploading latency of Netdata because kernel metrics are per-CPU counters: the Z ERO controller needs n requests to obtain the n per-CPU copies of the counters.

eBPF case. Figure 9 shows the performance of monitoring TCP connections with NX, NX-eBPF and Z ERO. The NX monitor has the highest CPU utilization (1.6 ∼ 59%) and latency (16 ∼ 250ms), due to its outdated implementation, which traverses all TCP connections using the kernel tcp_diag interface. NX-eBPF introduces eBPF to lower the overhead of getting TCP info, while Z ERO further eliminates the syscall and memory copy overhead of reading the eBPF map. As shown in Figures 9b and 9d, Z ERO achieves 40% ∼ 80% lower latency than NX. As in the Redis and Linux kernel cases, the CPU utilization of the Z ERO data plane is still zero.
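For reference, the per-entry read path that an eBPF-based monitor takes, and that Z ERO's data plane sidesteps, looks roughly like the sketch below (illustrative only; the pinned map path and the value layout are hypothetical). Each iteration is a bpf() syscall plus a copy into user space:

    #include <bpf/bpf.h>
    #include <stdint.h>

    struct tcp_stat { uint64_t bytes_in, bytes_out, retrans; };  /* hypothetical layout */

    static void dump_tcp_stats(void)
    {
        int fd = bpf_obj_get("/sys/fs/bpf/tcp_stats");  /* hypothetical pin path */
        uint64_t key, next_key;
        struct tcp_stat val;
        void *prev = NULL;

        /* each get_next_key/lookup_elem pair is a syscall plus a copy */
        while (bpf_map_get_next_key(fd, prev, &next_key) == 0) {
            if (bpf_map_lookup_elem(fd, &next_key, &val) == 0) {
                /* export or aggregate val here */
            }
            key = next_key;
            prev = &key;
        }
    }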
6.3    Z ERO Scalability

We then evaluate the scalability of Z ERO. The controller adopts a receiver-driven model to obtain raw metrics, issuing TCP-based RPC or READ requests to the agent. The collecting phase at the monitored host is omitted to compare the raw performance. The controller adopts busy polling by default. The ideal result is obtained via ib_read_bw [16]. Our three key observations are summarized as follows.

READ achieves better performance. The TCP-based baseline achieves much higher average latency than Z ERO (Figure 10). The total elapsed time of collecting metrics from all hosts is usually 2 ∼ 3× the average latency in TCP, resulting in low throughput. We find that TCP suffers from inefficient congestion control (CC) [20, 37] and the processing overhead of the kernel stack [24]. Both factors incur large delays for concurrent RPC requests, resulting in variations of the starting/ending time. On the contrary, Z ERO eliminates the overhead of the kernel stack via RDMA. The average latency and elapsed time of Z ERO are nearly equivalent (Figures 10a and 10b). Z ERO also achieves lower CPU utilization than the baseline, as shown in Figure 10d.
[Figure 12: Monitor performance with 8 − 128 WRs × 4KB. Panels: (a) Average latency vs. Hosts; (b) Controller CPU vs. Hosts.]

Credit-FC avoids triggering ECN/PFC. In Cluster1, we evaluate the Z ERO performance w/ Credit-FC and BDP-FC, and w/o any flow control (FC). For the tests w/o FC, the controller posts all requests at the beginning. BDP-FC adopts a fixed credit of one BDP (16KB with 25GbE bandwidth and 5 ∼ 6µs RTT). Interestingly, PFC pause is not triggered until 256 QPs for Z ERO w/o FC, due to the large dynamic buffer of switches [17] and the high threshold of the PFC pause action (XOFF) [95]. Besides, the RNIC limits the maximum number of outstanding READ requests to 16. Accordingly, the total credit is set to 8MB = 128 × 64KB. As shown in Figure 10a, both Credit-FC and BDP-FC effectively avoid PFC. The gap between Z ERO and the ideal originates from the extra overhead in monitoring (Figure 10c), e.g., virtualization, initializing and posting work requests (WRs) for each VM. The 25GbE bandwidth is still saturated with C_f ≥ BDP.

In Cluster2, the network has a much higher BDP (96KB with 200GbE bandwidth and 4µs RTT) and ECN is enabled by default. We observe a large fraction of ECN-marked packets and high latency with BDP-FC (Figure 11a), because the ECN minimum threshold [95] is set to ∼ 1000KB. Credit-FC still works with 512/1024 QPs due to the posting and arrival delay of READ requests with a single thread (Section 4.3), where the build-up buffer would otherwise reach 2/4MB with the smallest quota of 4KB. We observe a similar phenomenon with 1024 QPs in Cluster1. As shown in Figure 11b, Credit-FC only utilizes half of the 200GbE bandwidth because the credit is kept much smaller than the BDP of the network to avoid congestion/ECN (4 ∼ 16KB vs. 96KB); the bottleneck lies in the switch buffer/threshold instead of the end host. The throughput also degrades rapidly because a large number of QPs share the RNICs at both the agent and the controller.

We next zoom into the difference between Cluster1 and Cluster2. Cluster1 sets a much higher PFC threshold than the ECN threshold in Cluster2, resulting in a larger credit to saturate the bandwidth. However, the tradeoff is that the pause duration occupies ∼ 99% (Figure 10b) when PFC is triggered. As a comparison, ECN reacts to congestion and recovers traffic rapidly (Figure 11c). Another benefit of Credit-FC is reducing the CPU utilization of busy polling by avoiding network congestion, as shown in Figures 10d and 11d.
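The credit accounting behind Credit-FC can be sketched as follows (our own minimal illustration of the mechanism described above, not the deployed code; TOTAL_CREDIT, struct read_req and the posting/completion hooks are placeholders). The controller consumes credit when it posts a READ and returns the credit when the READ completes, so the in-flight monitoring bytes never exceed the total credit T.

    #include <stddef.h>

    #define TOTAL_CREDIT (256 * 1024)   /* T, e.g., 256KB in our deployment */

    struct read_req { size_t len; /* plus QP, remote addr, rkey, ... */ };

    static size_t inflight_bytes;       /* credit currently consumed */

    /* Returns 1 if the READ was posted, 0 if it must wait for credit. */
    static int try_post_read(struct read_req *r)
    {
        if (inflight_bytes + r->len > TOTAL_CREDIT)
            return 0;                   /* would exceed the credit: hold back */
        inflight_bytes += r->len;       /* consume credit */
        /* post the one-sided READ for this host/region here */
        return 1;
    }

    /* Called from the completion path for every finished READ. */
    static void on_read_completion(struct read_req *r)
    {
        inflight_bytes -= r->len;       /* return credit; retry held-back READs */
    }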
Hybrid-I/O model is highly CPU efficient. We then evaluate the hybrid-I/O model in Cluster1. We post all requests at the beginning, as the maximum number of outstanding READ requests is bounded to 16 by the RNIC. As shown in Figure 12b, the CPU utilization of both epoll and polling increases with larger data sizes, while epoll achieves lower CPU utilization. To highlight the gap, we set lower sampling intervals to monitor 128 hosts with 256KB data. The results show that the CPU utilization of epoll and polling is 13.5/40.9% and 20.2/72.6% respectively with 100/10ms intervals. However, the event-driven epoll has high performance variations with a small number of requests (Figure 12a), due to the epoll syscall and thread blocking. With large-sized data, such overhead is amortized across many in-flight requests. We thus adopt epoll for the general case (32 ∼ 128 requests) and polling for hosts with small-sized data (< 32 requests).
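To make the two completion modes concrete, here is a minimal sketch using standard libibverbs calls (illustrative only, not the deployed code; the completion channel's fd is assumed to be registered in the epoll instance, and error handling is omitted). Busy polling spins on the CQ for the lowest latency, while the event-driven mode sleeps in epoll_wait until the RNIC signals the completion channel and then drains the CQ.

    #include <infiniband/verbs.h>
    #include <sys/epoll.h>

    /* Busy polling: lowest latency, but keeps one CPU core spinning. */
    static int poll_completions(struct ibv_cq *cq, int expected)
    {
        struct ibv_wc wc;
        int done = 0;
        while (done < expected)
            if (ibv_poll_cq(cq, 1, &wc) > 0 && wc.status == IBV_WC_SUCCESS)
                done++;
        return done;
    }

    /* Event-driven: sleep until the completion channel fires, then drain. */
    static int epoll_completions(struct ibv_comp_channel *ch, struct ibv_cq *cq,
                                 int epfd, int expected)
    {
        struct epoll_event ev;
        struct ibv_wc wc;
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        int done = 0;

        ibv_req_notify_cq(cq, 0);                  /* arm the CQ */
        while (done < expected) {
            if (epoll_wait(epfd, &ev, 1, -1) < 1)  /* ch->fd registered in epfd */
                continue;
            ibv_get_cq_event(ch, &ev_cq, &ev_ctx);
            ibv_ack_cq_events(ev_cq, 1);
            ibv_req_notify_cq(ev_cq, 0);           /* re-arm before draining */
            while (ibv_poll_cq(ev_cq, 1, &wc) > 0)
                done++;
        }
        return done;
    }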
7    Experience and Future Work

In this section, we share our experience of building a large-scale monitoring system using RDMA.

Achieving high scalability and availability. The scalability of RDMA-based distributed systems is limited by the number of QPs [26, 62, 82], which are cached in the limited SRAM of the RNIC. Even though several works [46, 47] adopt UD to reduce the number of QPs, Z ERO uses READ to bypass the monitored host, which is only supported in RC mode. Z ERO adopts QP sharing and grouping to remedy the QP constraint. First, the Z ERO agent manages all system/application metrics in a host sharing one QP connection. Second, the Z ERO controller monitors a group of QPs/hosts in each period, e.g., 64 hosts with a 1 ∼ 10ms period, then switches to another group of QPs/hosts to avoid frequent QP evictions. To achieve high availability, each agent builds QP connections with at least three controllers in the practical deployment. Similar to most distributed systems, all these controllers run a consensus-based coordination service [40] to detect failures, and Z ERO can switch to a standby controller seamlessly when the active controller is down.
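As a rough illustration of the grouping (a sketch under our own assumptions, not the deployed code; GROUP_SIZE, struct host and monitor_group are placeholder names), the controller walks the host list one group at a time, so only the active group's QPs need to stay resident in RNIC SRAM:

    #include <stddef.h>

    #define GROUP_SIZE 64                 /* e.g., 64 hosts/QPs per group */

    struct host { int id; /* QP handle, remote addr/rkey, ... */ };

    static void monitor_group(struct host *grp, size_t n)
    {
        /* post READs for all hosts in grp[0..n), then reap completions
         * (credit-gated as in the flow-control sketch above) */
        (void)grp; (void)n;
    }

    static void monitoring_round(struct host *hosts, size_t total)
    {
        for (size_t off = 0; off < total; off += GROUP_SIZE) {
            size_t n = (total - off < GROUP_SIZE) ? (total - off) : GROUP_SIZE;
            /* the next group's QPs can be warmed up here */
            monitor_group(&hosts[off], n);
        }
    }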
Avoiding network interference. In the practical deployment, monitoring traffic co-exists with service traffic and inevitably impacts the network performance of services, due to contention at both the agent and the switches. Note that the controller has no such concern, since it runs on a dedicated server for monitoring. Before Z ERO, the existing deployment adopted several mechanisms for traffic isolation, which inevitably bring other side effects. For example, a thorough solution is physically isolating service and monitoring traffic with independent NICs and links [2]. However, physical isolation incurs large CapEx and is only suitable for high-priority services necessitating a high SLA. Another solution is assigning a separate and lower-priority queue to monitoring traffic [35].
The persistent high load of services may cause starvation of the monitoring traffic and loss of data in consecutive monitoring periods. Besides, Z ERO, built on RDMA, is sensitive to such timeouts, which may cause QP state machine errors [72]. We thus abandon the traditional methods of traffic isolation and resort to receiver-driven CC to avoid network interference.

Z ERO provides a new perspective to mitigate network interference, i.e., limiting the credit of monitoring traffic when it co-exists with service traffic. Specifically, the controller adopts group switching with 64 QPs/hosts in each group. The QPs of the next group are pre-fetched to RNIC SRAM for warming up. The total credit T in Credit-FC is set to 256KB, which adds at most ∼ 20/10µs queuing delay with 100/200GbE bandwidth. Note that the maximum build-up queue is much less than 256KB due to the posting and arrival delay (Section 6.3). As a comparison, the traditional send-driven model easily causes network jitters with bursty traffic. Besides, such settings have a negligible impact on monitoring QoS, and a single controller supports 1 ∼ 10k hosts with 0.1 ∼ 1s intervals. The agent only occupies 0.01 ∼ 0.1% of the bandwidth of the monitored host.
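As a back-of-the-envelope check of this bound (our own arithmetic, assuming the whole credit T queues behind a single congested link of bandwidth B):

    T / B = (256KB × 8) / 100Gbps ≈ 21µs,    (256KB × 8) / 200Gbps ≈ 10.5µs,

which matches the ∼ 20/10µs figure above.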
Receiver-driven CC. Compared with existing sender-driven CC, receiver-driven CC achieves several benefits in monitoring. Existing CC mechanisms, e.g., DCQCN [95] and TIMELY [69], react to congestion only after the switch buffer/queue reaches a threshold. Besides, they aim to achieve equal bandwidth sharing across multiple flows, and cannot avoid interference between service traffic and monitoring traffic. As a comparison, receiver-driven CC avoids network congestion and interference in advance by limiting the total in-flight data of monitoring. On the other hand, existing CC mechanisms, e.g., DCQCN and HPCC [57], are complex to deploy and require ECN or in-network telemetry (INT) support from switches. However, ECN or INT capabilities are not always available, e.g., in Cluster1. The Credit-FC in our deployment is simple and effective in avoiding triggering PFC/ECN.

We also observe several limitations of the current Credit-FC. It only adopts credit without pacing [50] and cannot support massive concurrent flows with a 4KB transmission fragment. Besides, it is not a universal CC mechanism for data-center traffic; it is dedicated to avoiding the network congestion and interference caused by monitoring traffic in the Z ERO framework. In our future work, we will try to resolve these limitations from two aspects. First, we will consider both the host bandwidth and the ECN threshold [88] with a combination of credit- and pacing-based CC, to achieve full bandwidth utilization while avoiding network congestion. Second, we will explore a universal receiver-driven model in cloud networks, which has the benefits of CPU offloading via RDMA and more convenient CC [77].

8    Related Work

One-sided RDMA. In the system area, it is a trend to leverage one-sided RDMA operations to bypass the server CPUs. As pioneering work, Pilaf [68] enables clients to directly read data from the server memory via RDMA read. Clients use CRC64 to detect inconsistencies of data caused by possible read-write races on the server. RFP [79, 86] explores one-sided RDMA to provide another alternative solution for RPC, which uses RDMA read to fetch the response result. On the other hand, several works [46, 61, 86] explore how to optimize the raw performance of one-sided RDMA. To the best of our knowledge, Z ERO is the first work to leverage one-sided RDMA for distributed monitoring.

Monitoring system. There are plenty of works targeting the design of monitoring systems [65]. Yet, these works focus on data analytics [51, 80], tracing bugs [59, 63] and visualization [71]. Distinct from these works, where monitors are tightly coupled with the monitored applications and hosts, Z ERO decouples the monitor from the monitored infrastructure and eliminates the monitoring overhead completely.

Cloud-native monitoring. Netdata [10] enables users to quickly identify and troubleshoot issues, and to make data-driven decisions based on pre-built visual dashboards. Prometheus [11] is an open-source monitoring system with a complete ecosystem to extract time-series data from cloud-native applications, and it focuses on collecting metrics via a powerful query language called PromQL. Stackdriver [4] is the logging and monitoring solution of Google, which is tightly integrated into Google Cloud. Likewise, Z ERO is deeply integrated into the cloud-native ecosystem of Alibaba cloud.

9    Conclusion

We propose the Z ERO monitoring system framework, exploiting one-sided RDMA read for remote monitoring. Z ERO achieves zero-overhead monitoring via its novel control plane and data plane. Z ERO supports large-scale distributed monitoring via credit-based FC and a hybrid I/O model. Z ERO thus paves the way for integrating RDMA into monitoring systems, which desire the high performance of RDMA while avoiding its poor scalability. We deploy Z ERO in the Alibaba cloud-native platform to evaluate its performance. The deployment results show that Z ERO resolves the interference problem of traditional monitors and easily fulfills both the latency and throughput requirements of cloud-native monitoring.

10    Acknowledgment

We sincerely thank the anonymous shepherd and reviewers for their insightful comments and feedback. This work was supported in part by NSFC grants 62141220, 61972253, U1908212, 72061127001, 62172276, 61972254, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning, and the Alibaba Innovative Research (AIR) Program. Corresponding author: Linghe Kong (linghe.kong@sjtu.edu.cn).