Bi-direction Context Propagation Network for Real-time Semantic Segmentation


Shijie Hao1    Yuan Zhou1,2    Yanrong Guo1    Richang Hong1
1 Hefei University of Technology
2 2018110971@mail.hfut.edu.cn

Abstract

Spatial details and context correlations are two types of important information for semantic segmentation. Generally, shallow layers tend to contain more spatial details, while deep layers are rich in context correlations. Aiming to keep both advantages, most current methods choose to forward-propagate the spatial details from shallow layers to deep layers, which is computationally expensive and substantially lowers the model's execution speed. To address this problem, we propose the Bi-direction Context Propagation Network (BCPNet), which leverages both spatial and context information. Different from previous methods, our BCPNet builds bi-directional paths in its network architecture, allowing backward context propagation and forward spatial detail propagation simultaneously. Moreover, all components of the network are kept lightweight. Extensive experiments show that our BCPNet achieves a good balance between accuracy and speed. For accuracy, our BCPNet achieves 68.4 % mIoU on the Cityscapes test set and 67.8 % mIoU on the CamVid test set. For speed, our BCPNet reaches 585.9 FPS (or 1.7 ms runtime per image) on 360 × 640 inputs with a GeForce GTX TITAN X GPU card.

Figure 1. Illustrations of the previous forward spatial detail propagation and our backward context propagation mechanism.

1. Introduction

Semantic segmentation is one of the most challenging tasks in computer vision. It aims to partition an image into several non-overlapping regions according to the category of each pixel. Given its unique role in visual processing, many real-world applications rely on this technology, such as self-driving vehicles [17, 20], medical image analysis [7, 6], and 3D scene recognition [28, 19]. Some of these applications require high segmentation accuracy and fast execution speed simultaneously, which makes the segmentation task even more challenging. Despite recent progress, the balance between segmentation accuracy and speed remains far from satisfactory.

Generally, there are two key points for obtaining a satisfying segmentation: 1) maintaining spatial details, and 2) aggregating context information. Most current methods realize these two key points by 1) keeping high-resolution feature maps in the network pipeline, and 2) using the dilated convolution [23], respectively [11, 24, 8, 3, 26]. For a typical semantic segmentation network, it is well known that spatial details mostly exist in the shallow layers, while context information is mainly found in the deep layers. Therefore, to enhance segmentation accuracy by appropriately combining low-level spatial details and high-level context information, these methods propagate spatial details from shallow layers to deep layers, as shown in Fig. 1 (a).

However, keeping feature maps of relatively high resolution throughout the network tends to bring higher computational costs, and thus substantially lowers the execution speed. To quantify the influence of feature resolution on FLOPs and FPS, we conduct an experiment on the well-known ResNet [9] framework, with the last fully connected layer removed for a fair comparison. As shown in Fig. 2, as the feature resolution increases, the model's FLOPs grow substantially, and the model's execution speed drops accordingly.

Figure 2. Influence of feature resolution on FLOPs and speed. "Normalized feature resolution" denotes the ratio between the feature map size and the input image size. "res-50" denotes the ResNet-50 network and "res-101" the ResNet-101 network.
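As an illustration of this kind of measurement, the following PyTorch sketch times a ResNet-50 trunk (with the final fully connected layer removed, as in our setting) at several input resolutions. It is not the exact script behind Fig. 2: the chosen resolutions, iteration counts, and the use of torchvision's resnet50 are assumptions for illustration only, and the absolute numbers depend on the GPU and library versions.

import time
import torch
from torchvision.models import resnet50

# Time a ResNet-50 trunk (classification head removed) at growing input
# resolutions; higher-resolution feature maps mean more FLOPs and lower FPS.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = resnet50(weights=None)           # weights=None requires torchvision >= 0.13
model.fc = torch.nn.Identity()           # drop the fully connected layer
model.eval().to(device)

for h, w in [(180, 320), (360, 640), (720, 1280), (1080, 1920)]:
    x = torch.randn(1, 3, h, w, device=device)
    with torch.no_grad():
        for _ in range(10):              # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(50):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    print(f"{h}x{w}: {50 / (time.perf_counter() - t0):.1f} FPS")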
To achieve a good balance between accuracy and speed, we propose a new method called the Bi-direction Context Propagation Network (BCPNet), as shown in Fig. 3. Different from previous methods that only aim to preserve spatial details, our BCPNet is designed to effectively backward-propagate context information within the network pipeline, as shown in Fig. 1 (b). By propagating the context information aggregated in the deep layers backward to the shallow layers, the shallow-layer features become more context-aware, as exemplified in Fig. 4. Therefore, the shallow-layer features can be directly used for the final prediction. The key component of our method is the Bi-direction Context Propagation (BCP) module (Fig. 3 (b)), which efficiently enhances the context information of shallow layers by building bi-directed paths, i.e., top-down and bottom-up paths. In this way, the network no longer needs to keep high-resolution feature maps all along its pipeline, which not only leads to a design with fewer FLOPs and a faster execution speed, but also maintains relatively high segmentation accuracy. Specifically, the resulting BCPNet is lightweight, containing only about 0.61 M parameters in total for a typical segmentation task. Extensive experiments validate that our BCPNet achieves a good balance between segmentation accuracy and execution speed. In terms of speed, our BCPNet achieves 585.9 FPS on 360 × 640 input images, and still reaches 55 FPS even on 1024 × 2048 input images. In terms of accuracy, our BCPNet obtains 68.4 % mIoU on the Cityscapes [5] test set and 67.8 % mIoU on the CamVid [2] test set.

The contributions of this paper can be summarized as follows:

   • First, different from previous methods that only forward-propagate spatial details, we introduce the backward context propagation mechanism, which ensures that the feature maps of shallow layers are aware of semantics.

   • Second, we propose the Bi-direction Context Propagation Network (BCPNet), which realizes the backward context propagation with high efficiency by combining top-down and bottom-up paths.

   • Third, as a tiny-size semantic segmentation model, our BCPNet achieves a state-of-the-art balance between accuracy and speed.

The rest of this paper is organized as follows. First, we introduce the related work in Section 2. Then, we provide the details of our method in Section 3, followed by our experiments in Section 4. Finally, we conclude the paper in Section 5.

2. Related work

In this section, we review the semantic segmentation methods that are closely related to our research. First, we introduce the methods based on forward propagation of spatial details within the network pipeline. Then, we introduce the methods focusing on segmentation speed.

2.1. Methods of forward-propagating spatial details

DilatedNet [23] is a pioneering work based on forward spatial detail propagation. It introduces the dilated convolution to enlarge the receptive field by inserting holes into convolution kernels. As a result, some downsampling operations (such as pooling) can be removed from the network, which avoids the decrease of feature resolution and the loss of spatial details. Following [23], a large number of methods have been proposed, such as [24, 8, 3, 26]. Most of these methods pay attention to improving the context representations of the last convolution layer. For example, [26] introduces the Pyramid Pooling Module (PPM), in which multiple parallel average pooling branches are applied to the last convolution layer to aggregate more context correlations. [3] extends PPM into the Atrous Spatial Pyramid Pooling (ASPP) module by further introducing the dilated convolution [23]. Using dictionary learning, [24] proposes EncNet to learn a global context embedding. Recently, He et al. [8] proposed the Adaptive Pyramid Context Network (APCNet), in which the Adaptive Context Module (ACM) is built to aggregate pyramid context information.

Discussion. Aiming to leverage spatial details and context correlations, the above methods all propagate spatial details from shallow layers to deep layers, as shown in Fig. 1 (a). Nevertheless, the forward spatial detail propagation within the network pipeline is computationally expensive and substantially lowers the model's execution speed (as shown in Fig. 2). Therefore, although these methods improve segmentation accuracy, the balance between accuracy and speed is still far from satisfactory. Taking DeepLab [3] as an example, a 512 × 1024 input image requires 457.8 G FLOPs in total, at a speed of only 0.25 FPS, as shown in Table 1 and Table 2.

2.2. Methods focusing on speed

Many real-world applications, such as vision-based self-driving systems, require both segmentation accuracy and efficiency. In this context, large research efforts have been devoted to speeding up semantic segmentation while maintaining its accuracy.
For example, Badrinarayanan et al. [1] propose SegNet by scaling the network into a small one. SegNet contains 29.5 M parameters, achieves 14.6 FPS on 360 × 640 input images, and obtains 56.1 % mIoU on the Cityscapes test set. Paszke et al. [14] propose ENet, which employs the strategy of early downsampling. ENet contains 0.4 M parameters, achieves 135.4 FPS on 360 × 640 images, and obtains 58.3 % mIoU on the Cityscapes [5] test set. Zhao et al. [25] propose ICNet, which uses multi-scale inputs and a cascaded framework to construct a lightweight network. ICNet contains 26.5 M parameters, achieves 30.3 FPS on 1024 × 2048 input images, and obtains 67.1 % mIoU on the CamVid [2] test set. Yu et al. [22] propose BiSeNet by independently constructing a spatial path and a context path. The context path extracts high-level context information, while the spatial path maintains low-level spatial details. BiSeNet achieves 68.4 % mIoU at 72.3 FPS on 768 × 1536 images of the Cityscapes test set. Recently, [10] proposed DFANet based on the strategy of feature reuse. DFANet achieves 71.3 % mIoU at 100 FPS on 1024 × 1024 images of the Cityscapes test set. Lo et al. [12] propose EDANet by employing an asymmetric convolution structure and dense connections, which only includes 0.68 M parameters. EDANet achieves 81.3 FPS on 512 × 1024 input images and still obtains 67.3 % mIoU on the Cityscapes test set.

Discussion. Most real-time methods concentrate on scaling the network into a small one, such as [1, 14, 25, 10]. However, they largely ignore the significant role of context information. Although BiSeNet [22] constructs a spatial path and a context path to learn spatial details and context correlations respectively, it is still computationally expensive. For example, BiSeNet-1 contains 5.8 M parameters and BiSeNet-2 contains 49 M parameters; for 768 × 1536 input images, BiSeNet-1 involves 14.8 G FLOPs and BiSeNet-2 involves 55.3 G FLOPs. Different from the above methods, we propose to use the context propagation mechanism to construct a real-time semantic segmentation model.

Figure 3. Overview of our BCPNet. "w-sum" represents the weighted sum. "s-conv" represents the separable convolution.

3. Proposed method

3.1. Architecture of BCPNet

As shown in Fig. 3, our BCPNet is a variant of the encoder-decoder framework. For the encoder, we choose the lightweight MobileNet [16] as our backbone network (Fig. 3 (a)). Aiming at an efficient implementation, we do not use any dilated convolutions to keep the resolution of the feature maps in the backbone network. Instead, the feature map is quickly downsampled to 1/32 of the input resolution, which keeps the network computationally modest. Since a lightweight network is less capable of learning an accurate context representation, our method further aggregates context, as in [26], to preserve segmentation accuracy. However, this process differs fundamentally from [26] in two aspects. First, instead of using a parallel structure, we aggregate context by sequentially applying two max-pooling operations. Second, instead of average pooling, we use max-pooling to aggregate context information, which is empirically useful in our experiments, as shown in Table 5. Specifically, we set the kernel size of the max-pooling to 3 × 3 and its stride to 2. Based on the encoder, as shown in Fig. 3, a hierarchy of feature maps is obtained, whose resolution is finally condensed to 1/128 of the input and thus provides a global perception of the image semantics.
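A minimal PyTorch sketch of this encoder-side context aggregation is given below. It is not the released implementation: torchvision's mobilenet_v2 feature extractor stands in for the MobileNet [16] backbone, and the variable names are ours.

import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

# Backbone with output stride 32 (no dilated convolutions), followed by two
# sequential 3x3, stride-2 max-pooling layers that condense the feature map
# to 1/64 and then 1/128 of the input resolution.
backbone = mobilenet_v2(weights=None).features     # stand-in for MobileNet [16]
pool_64 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
pool_128 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 512, 1024)
f_32 = backbone(x)        # 1/32 resolution: (1, 1280, 16, 32)
f_64 = pool_64(f_32)      # 1/64 resolution: (1, 1280, 8, 16)
f_128 = pool_128(f_64)    # 1/128 resolution: (1, 1280, 4, 8)
print(f_32.shape, f_64.shape, f_128.shape)

Swapping nn.MaxPool2d for nn.AvgPool2d, or enlarging the kernel to 5 × 5, corresponds to the pooling variants examined in the ablation study of Table 5.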
The decoder contains two parts, i.e., the BCP module (Fig. 3 (b)) and the classifier (Fig. 3 (c)). For the proposed BCP module, we design a bi-directional structure, which enables context information to be efficiently backward-propagated to the shallow-layer features, thus making them sufficiently aware of the semantics. Finally, the enhanced features at the Layer-3 level are sent into the 1 × 1 convolutional classifier to obtain the output of the network, i.e., the pixel-wise label prediction of an image.

3.2. Details of BCP module

The key component of our BCPNet is the BCP module. Aiming to improve segmentation accuracy, we build the BCP module from two top-down paths and one bottom-up path. As shown in Fig. 3 (b), a top-down path is first employed to backward-propagate the context information to shallow layers. Then, a bottom-up path is built to forward-propagate the context and spatial information aggregated in the shallow layers to the deep layers again. Finally, another top-down path is employed to backward-propagate the final context information. In the BCP module, the top-down paths are designed to backward-propagate context information to shallow layers, while the bottom-up path is designed to forward-propagate spatial information to the deep layers. This bi-directional design caters to the key concern of a semantic segmentation task, i.e., spatial- and contextual-awareness. Both types of paths share the same network structure. Specifically, each layer is composed of two lightweight operations, i.e., the separable convolution [4] and the weighted sum. The former is made up of a point-wise convolution and a depth-wise convolution. As for the latter, the weighted sum summarizes the information of neighboring layers by introducing learnable scalar weights, as shown in Eq. 1:

    F = Θ_l · S^l + σ_{l+1} · C^{l+1}                                (1)

where S^l represents the features from the l-th layer (containing more spatial details), and C^{l+1} represents the features from the (l+1)-th layer (containing more context correlations). The scalars Θ_l and σ_{l+1} are learnable weights for S^l and C^{l+1}, respectively.

Because of these lightweight components, our BCP module only contains 0.18 M parameters for a typical segmentation task, yet its effectiveness is satisfying. For example, we visualize two versions of Layer-3 in Fig. 4, processed with and without the BCP module. The feature map that is not refined by the BCP module (Fig. 4 (b)) contains rich details, such as boundaries and textures. However, it is not helpful for the final segmentation because it is not semantics-aware: this layer contains too much noise and lacks sufficient context information. On the contrary, after being refined by the BCP module, the feature maps (Fig. 4 (c)) clearly become aware of the regional semantics, such as person, car, and tree. Moreover, the refined feature map is still aware of object boundaries and contains much less noise. The ablation study in Section 4.4 also validates the effectiveness of our BCP module.
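To make the two operations concrete, the sketch below implements one fusion step of such a path in PyTorch: a separable convolution applied to the scalar-weighted sum of Eq. 1. It is a sketch under our own assumptions (equal channel widths across levels, bilinear upsampling of the deeper feature, and the standard depth-wise-then-point-wise ordering of [4]) rather than the authors' code; the full BCP module would chain such steps along the two top-down paths and the bottom-up path.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableConv(nn.Module):
    """Separable convolution [4]: depth-wise 3x3 followed by point-wise 1x1."""
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class TopDownFuse(nn.Module):
    """One fusion step of a top-down path: Eq. (1) with learnable scalar
    weights, followed by a separable convolution."""
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Parameter(torch.ones(1))   # weight for the shallower feature S^l
        self.sigma = nn.Parameter(torch.ones(1))   # weight for the deeper feature C^{l+1}
        self.conv = SeparableConv(channels)

    def forward(self, s_l, c_next):
        # Bring the deeper, lower-resolution feature to the resolution of S^l.
        # Bilinear upsampling is our assumption; the text does not specify it.
        c_next = F.interpolate(c_next, size=s_l.shape[-2:], mode="bilinear",
                               align_corners=False)
        return self.conv(self.theta * s_l + self.sigma * c_next)

# Usage: fuse two neighbouring feature levels of the same channel width.
fuse = TopDownFuse(channels=64)
s = torch.randn(1, 64, 64, 128)   # shallower feature S^l
c = torch.randn(1, 64, 32, 64)    # deeper feature C^{l+1}
out = fuse(s, c)                  # same spatial size as S^l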
4. Experiments

In this section, we conduct extensive experiments to validate the effectiveness of our method. First, we compare our method with other methods in terms of model parameters and FLOPs, execution speed, and segmentation accuracy. Then, we provide an ablation study of our method.

4.1. Comparison on parameters and FLOPs

The number of parameters and FLOPs are important evaluation metrics for real-time semantic segmentation. Therefore, in this section, we provide an extensive parameter and FLOPs analysis of our method, summarized in Table 1. For a clear comparison, we classify the current related methods into four categories, i.e., models of large, medium, small, and tiny size. 1) A large-size model has more than 200 M parameters and more than 300 G FLOPs. 2) A medium-size model has between 100 G and 300 G FLOPs. 3) A small-size model has between 1 M and 100 M parameters, or between 10 G and 100 G FLOPs. 4) A tiny-size model has less than 1 M parameters and less than 10 G FLOPs. According to this taxonomy, our BCPNet clearly belongs to the tiny-size category, as it contains only 0.61 M parameters in total and involves only 4.5 G FLOPs even for a 1024 × 2048 input image.

Compared with the small-size models, our BCPNet is consistently more efficient. For example, TwoColumn [21] has about 50 times more FLOPs than ours. As for BiSeNet [22], our BCPNet has only about 1.2 % of the parameters and 4.6 % of the FLOPs of BiSeNet-2, and about 11 % of the parameters and 17 % of the FLOPs of BiSeNet-1. As for ICNet [25], our BCPNet has only 2 % of its parameters and 16 % of its FLOPs. As for the two versions of DFANet [10], our FLOPs are comparable to theirs, but we have far fewer parameters: our BCPNet has only about 13 % of the parameters of DFANet-B and 8 % of the parameters of DFANet-A.

Compared with the other tiny models, our BCPNet still performs favorably. Although it has more parameters than ENet [14] (by 0.2 M), it has only about 13 % of ENet's FLOPs. As for EDANet [12], our parameter count is comparable, but our BCPNet has only about 13 % of EDANet's FLOPs.

4.2. Comparison on speed

In this section, we compare execution speed, which is summarized in Table 2.
Figure 4. Visualization of the feature maps of Layer-3 before and after being processed by our BCP module. In (b) and (c), red (blue) indicates a higher (lower) response at a pixel.

Table 1. FLOPs analysis.

Method          | Params  | 360×640 | 713×713 | 512×1024 | 768×1536 | 1024×1024 | 1024×2048
Large size (params > 200 M and FLOPs > 300 G)
DeepLab [3]     | 262.1 M | -       | -       | 457.8 G  | -        | -         | -
PSPNet [26]     | 250.8 M | -       | 412.2 G | -        | -        | -         | -
Medium size (100 G < FLOPs < 300 G)
SQ [18]         | -       | -       | -       | -        | -        | -         | 270 G
FRRN [15]       | -       | -       | -       | 235 G    | -        | -         | -
FCN-8S [13]     | -       | -       | -       | 136.2 G  | -        | -         | -
Small size (1 M < params < 100 M or 10 G < FLOPs < 100 G)
TwoColumn [21]  | -       | -       | -       | 57.2 G   | -        | -         | -
BiSeNet-2 [22]  | 49 M    | -       | -       | -        | 55.3 G   | -         | -
SegNet [1]      | 29.5 M  | 286 G   | -       | -        | -        | -         | -
ICNet [25]      | 26.5 M  | -       | -       | -        | -        | -         | 28.3 G
DFANet-A [10]   | 7.8 M   | -       | -       | 1.7 G    | -        | 3.4 G     | -
BiSeNet-1 [22]  | 5.8 M   | -       | -       | -        | 14.8 G   | -         | -
DFANet-B [10]   | 4.8 M   | -       | -       | -        | -        | 2.1 G     | -
Tiny size (params < 1 M and FLOPs < 10 G)
EDANet [12]     | 0.68 M  | -       | -       | 8.97 G   | -        | -         | -
ENet [14]       | 0.4 M   | 3.8 G   | -       | -        | -        | -         | -
Our BCPNet      | 0.61 M  | 0.51 G  | 1.12 G  | 1.13 G   | 2.53 G   | 2.25 G    | 4.50 G
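For reference, parameter and FLOP counts of the kind listed in Table 1 can be reproduced in PyTorch as sketched below. The snippet uses a public MobileNetV2 backbone as a stand-in, since BCPNet itself is not released with this text, and FLOP-counting conventions differ between tools, so figures are only comparable when measured consistently.

import torch
from torchvision.models import mobilenet_v2
from fvcore.nn import FlopCountAnalysis  # pip install fvcore

def count_params_m(model: torch.nn.Module) -> float:
    """Trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Illustration with a public backbone rather than the full BCPNet.
net = mobilenet_v2(weights=None).features
print(f"params: {count_params_m(net):.2f} M")
flops = FlopCountAnalysis(net, torch.randn(1, 3, 512, 1024)).total()
print(f"FLOPs @ 512x1024: {flops / 1e9:.2f} G")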

As for the small-size models, our BCPNet presents a consistently faster execution speed. For example, our BCPNet is 235.7 FPS faster than TwoColumn [21] on 512 × 1024 input images. Compared with the two versions of BiSeNet [22], our BCPNet has a clear advantage: on 360 × 640 input images, our BCPNet is 456.5 FPS faster than BiSeNet-2 and 382.4 FPS faster than BiSeNet-1; when the resolution increases to 720 × 1280, our BCPNet is 86.9 FPS faster than BiSeNet-2 and 52.5 FPS faster than BiSeNet-1. Compared with ICNet [25], which achieves 30.3 FPS on 1024 × 2048 input images, our BCPNet is still 25 FPS faster. Compared with the two versions of DFANet [10], our BCPNet is 60 FPS faster than DFANet-A and 20 FPS faster than DFANet-B on 720 × 960 input images. For 1024 × 1024 input images, our BCPNet and DFANet-B are comparable, while ours is faster than DFANet-A by 16 FPS.

As for the tiny-size models, our method still yields better performance. For example, compared with ENet, our BCPNet is 450 FPS faster on 360 × 640 input images and 88 FPS faster on 720 × 1280 input images. Compared with EDANet on 512 × 1024 input images, our BCPNet is faster by 169 FPS.

Of note, the results of our model are obtained with a GeForce TITAN X GPU card, while the results of the other methods are all obtained with an NVIDIA TITAN X GPU card, which is slightly faster than ours.

4.3. Comparison on accuracy

In this section, we investigate the effectiveness of our method in terms of accuracy. We first introduce the implementation details of our experiments. Then, we compare the accuracy of our method with the current state-of-the-art methods on the Cityscapes [5] and CamVid [2] datasets.

4.3.1 Implementation details

Our code is implemented on the PyTorch platform (https://pytorch.org). Following [8, 3], we use the "poly" learning rate strategy, i.e., lr = init_lr × (1 − iter / total_iter)^power. For all experiments, we set the initial learning rate to 0.1 and the power to 0.9. To reduce the risk of over-fitting, we adopt data augmentation: the input image is randomly flipped and scaled by a factor between 0.5 and 2. We choose Stochastic Gradient Descent (SGD) as the optimizer, with the momentum set to 0.9 and the weight decay set to 0.00001. For the CamVid dataset, we set the crop size to 720 × 720 and the mini-batch size to 48. For the Cityscapes dataset, we set the crop size to 1024 × 1024; due to limited GPU resources, we set the mini-batch size to 36. For all experiments, the training process ends after 200 epochs.
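The following PyTorch sketch shows one way to set up this training configuration (SGD with momentum, the poly learning-rate decay, and random flip/rescale augmentation). It is an illustrative reconstruction from the description above, not the released training code; model, total_iter, and the tensor layouts are placeholders.

import random
import torch
from torchvision.transforms import InterpolationMode
import torchvision.transforms.functional as TF

init_lr, power, total_iter = 0.1, 0.9, 100_000       # total_iter is a placeholder
model = torch.nn.Conv2d(3, 19, 1)                     # stand-in for BCPNet

optimizer = torch.optim.SGD(model.parameters(), lr=init_lr,
                            momentum=0.9, weight_decay=1e-5)
# "poly" decay: lr = init_lr * (1 - iter / total_iter) ** power,
# applied by calling scheduler.step() once per training iteration.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1.0 - it / total_iter) ** power)

def augment(img, mask):
    """Random horizontal flip and random rescaling by a factor in [0.5, 2].
    img: float tensor [3, H, W]; mask: integer label tensor [1, H, W]."""
    if random.random() < 0.5:
        img, mask = TF.hflip(img), TF.hflip(mask)
    s = random.uniform(0.5, 2.0)
    h, w = img.shape[-2:]
    size = [int(h * s), int(w * s)]
    img = TF.resize(img, size)
    mask = TF.resize(mask, size, interpolation=InterpolationMode.NEAREST)
    return img, mask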
4.3.2 Cityscapes

The Cityscapes [5] dataset is composed of 5000 finely annotated images and 20000 coarsely annotated images. In our experiments, we only use the finely annotated subset. The dataset includes 30 semantic classes in total; following [27, 26], we only use 19 of them. The finely annotated subset contains 2975 images for training, 500 images for validation, and 1525 images for testing. The results on the Cityscapes dataset are summarized in Table 3.

Performance. Compared with the medium-size models, the accuracy of our method is slightly worse, but it still reaches an acceptable level. For example, compared with SQ [18], whose FLOPs are about 60 times larger than ours, our BCPNet obtains 8.6 % higher accuracy. As for FRRN [15], although it achieves 3.4 % higher accuracy than ours, our BCPNet has only 0.5 % of its FLOPs; moreover, on 512 × 1024 input images, our BCPNet is faster than FRRN by 248 FPS. Compared with FCN-8s [13], our BCPNet has only about 1/120 of its FLOPs, while our accuracy is still 5.3 % higher.

Compared with the small-size models, our BCPNet achieves accuracy comparable to ICNet [25], DFANet-B [10], and BiSeNet-1 [22], but with far fewer parameters: ICNet has 43 times more parameters than ours, DFANet-B has 7.9 times more, and BiSeNet-1 has 9.5 times more. As for TwoColumn [21], whose FLOPs are about 50 times larger than ours, the accuracy of our BCPNet is only 4.5 % lower. Compared with BiSeNet-2 [22], whose parameter count is about 80 times larger than ours, our BCPNet is 6.3 % less accurate. Despite the lower accuracy, our BCPNet is much faster than TwoColumn and BiSeNet-2: on 512 × 1024 input images, our method is 235 FPS faster than TwoColumn, and on 360 × 640 input images it is 456.5 FPS faster than BiSeNet-2.

Compared with the tiny-size models, the BCPNet has a clear advantage in accuracy. For example, our method is 10.1 % more accurate than ENet [14] and 1.1 % more accurate than EDANet [12].

Finally, we visualize some segmentation results of our method in Fig. 5.

Figure 5. Visualized segmentation results of our method on the Cityscapes dataset.

4.3.3 CamVid

The CamVid dataset [2] is collected from high-resolution video sequences of road scenes. It contains 367 images for training, 101 images for validation, and 233 images for testing. The dataset includes 32 semantic classes in total; following [25, 10], only 11 of them are used in our experiments. The results of our method on the CamVid dataset are summarized in Table 4.

Performance. Although our BCPNet is lightweight in parameters and FLOPs, its accuracy is higher than that of most current state-of-the-art methods. Compared with the large-size DeepLab [3], whose parameters and FLOPs are about 429 and 408 times larger than ours, the accuracy of our BCPNet is still 6.2 % higher. As for the small-size models, our method still has a clear advantage in the speed-accuracy balance.
Table 2. Speed analysis for our BCPNet. Each cell reports runtime / frame rate (ms / FPS). † indicates that the speed is measured on a GeForce TITAN X GPU card; the speed of the remaining methods is measured on an NVIDIA TITAN X GPU card, which is generally faster than the GeForce TITAN X.

Method          | Params  | 360×640     | 512×1024    | 720×960   | 720×1280    | 768×1536    | 1080×1920 | 1024×1024   | 1024×2048
Large size (params > 200 M and FLOPs > 300 G)
DeepLab [3]     | 262.1 M | -           | 4000 / 0.25 | -         | -           | -           | -         | -           | -
Medium size (100 G < FLOPs < 300 G)
SQ [18]         | -       | -           | -           | -         | -           | -           | -         | -           | 60 / 16.7
FCN-8S [13]     | -       | -           | 500 / 2     | -         | -           | -           | -         | -           | -
FRRN [15]       | -       | -           | 469 / 2.1   | -         | -           | -           | -         | -           | -
Small size (1 M < params < 100 M or 10 G < FLOPs < 100 G)
TwoColumn [21]  | -       | -           | 68 / 14.7   | -         | -           | -           | -         | -           | -
BiSeNet-2 [22]  | 49 M    | 8 / 129.4   | -           | -         | 21 / 47.9   | 21 / 45.7   | 43 / 23   | -           | -
SegNet [1]      | 29.5 M  | 69 / 14.6   | -           | -         | 289 / 3.5   | -           | 637 / 1.6 | -           | -
ICNet [25]      | 26.5 M  | -           | -           | 36 / 27.8 | -           | -           | -         | -           | 33 / 30.3
DFANet-A [10]   | 7.8 M   | -           | 6 / 160     | 8 / 120   | -           | -           | -         | 10 / 100    | -
BiSeNet-1 [22]  | 5.8 M   | 5 / 203.5   | -           | -         | 12 / 82.3   | 13 / 72.3   | 24 / 41.4 | -           | -
DFANet-B [10]   | 4.8 M   | -           | -           | 6 / 160   | -           | -           | -         | 8 / 120     | -
Tiny size (params < 1 M and FLOPs < 10 G)
EDANet [12]     | 0.68 M  | -           | 12.3 / 81.3 | -         | -           | -           | -         | -           | -
ENet [14]       | 0.4 M   | 7 / 135.4   | -           | -         | 21 / 46.8   | -           | 46 / 21.6 | -           | -
Our BCPNet†     | 0.61 M  | 1.7 / 585.9 | 4 / 250.4   | 5.5 / 181 | 7.4 / 134.8 | 9.8 / 102.6 | 18.2 / 55 | 8.6 / 116.2 | 18.2 / 55

Compared with SegNet [1], whose parameters and FLOPs are about 48 and 560 times larger than ours, our BCPNet's accuracy is higher by a large margin (about 21.1 %). Compared with ICNet [25], whose parameters and FLOPs are about 43 and 6.2 times larger than ours, our BCPNet's accuracy is higher by 0.7 %. Compared with the two versions of DFANet [10], our BCPNet presents a consistently better performance: DFANet-A has about 12.7 times more parameters than ours, yet our method's accuracy is higher by about 3.1 %; DFANet-B has about 7.8 times more parameters, yet our accuracy is higher by about 8.5 %. Compared with the two tiny-size models, our method achieves 1.4 % higher accuracy than EDANet [12] and 16.5 % higher accuracy than ENet [14]. Of note, although our method's accuracy is lower than that of BiSeNet-2 [22] by 0.9 %, our BCPNet has only about 1.2 % of its parameters. Moreover, compared with the lightweight BiSeNet-1, our BCPNet is more accurate by 2.2 % and has only about 10 % of its parameters.

4.4. Ablation study

In this section, we conduct an ablation study to investigate the influence of the components of BCPNet in terms of accuracy. As shown in Table 5, without our backward context propagation mechanism, the backbone network (containing 0.43 M parameters) only achieves 58.891 % mIoU on the Cityscapes validation set. By introducing the BCP module, the model's accuracy increases to 67.842 % mIoU (about 9 % higher) with only 0.18 M additional parameters, which firmly demonstrates the effectiveness of our BCP module. We further investigate the influence of different pooling operations used in context aggregation, and find that 3 × 3 max-pooling yields the best performance. When we replace the 3 × 3 max-pooling with 3 × 3 average pooling, the performance decreases to 67.311 % mIoU (about 0.5 % lower). When we replace the 3 × 3 max-pooling with 5 × 5 max-pooling, the performance decreases to 65.763 % mIoU (about 2 % lower). This may be caused by the mismatch between the (relatively) large kernel size and the (relatively) small feature resolution. The crop size also plays an important role in the final accuracy, as noted by [8]: a larger crop size yields better performance. For example, when we adopt a 1024 × 1024 crop size during training, the performance improves to 68.626 % mIoU.
Table 3. Results of our method on the Cityscapes test set.

Method          | Params  | mIoU
Large size
DeepLab [3]     | 262.1 M | 63.1
Medium size
SQ [18]         | -       | 59.8
FRRN [15]       | -       | 71.8
FCN-8S [13]     | -       | 63.1
Small size
TwoColumn [21]  | -       | 72.9
BiSeNet-2 [22]  | 49 M    | 74.7
SegNet [1]      | 29.5 M  | 56.1
ICNet [25]      | 26.5 M  | 69.5
DFANet-A [10]   | 7.8 M   | 71.3
BiSeNet-1 [22]  | 5.8 M   | 68.4
DFANet-B [10]   | 4.8 M   | 67.1
Tiny size
EDANet [12]     | 0.68 M  | 67.3
ENet [14]       | 0.4 M   | 58.3
Our BCPNet      | 0.61 M  | 68.4

Table 4. Results of our method on the CamVid test set.

Method          | Params  | mIoU
Large size
DeepLab [3]     | 262.1 M | 61.6
Small size
BiSeNet-2 [22]  | 49 M    | 68.7
SegNet [1]      | 29.5 M  | 46.4
ICNet [25]      | 26.5 M  | 67.1
DFANet-A [10]   | 7.8 M   | 64.7
BiSeNet-1 [22]  | 5.8 M   | 65.6
DFANet-B [10]   | 4.8 M   | 59.3
Tiny size
EDANet [12]     | 0.68 M  | 66.4
ENet [14]       | 0.4 M   | 51.3
Our BCPNet      | 0.61 M  | 67.8

Table 5. An ablation study of our method on the Cityscapes validation set.

Backbone | Crop size   | Pooling   | BCP | Params | mIoU
✓        | 768 × 768   | ✗         | ✗   | 0.43 M | 58.891
✓        | 768 × 768   | 3 × 3 max | ✓   | 0.61 M | 67.842
✓        | 768 × 768   | 3 × 3 avg | ✓   | 0.61 M | 67.311
✓        | 768 × 768   | 5 × 5 max | ✓   | 0.61 M | 65.763
✓        | 1024 × 1024 | 3 × 3 max | ✓   | 0.61 M | 68.626

5. Conclusion

To achieve a good balance between accuracy and speed in semantic segmentation, we propose a new Bi-direction Context Propagation Network (BCPNet). To preserve accuracy, we introduce the backward context propagation mechanism in addition to the forward spatial detail propagation; both types of propagation are enabled in the constructed BCP module. To enhance efficiency, BCPNet does not keep high-resolution feature maps all along the pipeline, and the entire network is constructed with lightweight components. Extensive experiments validate that our BCPNet achieves a new state-of-the-art balance between accuracy and speed. Owing to its tiny size, the network can be deployed on mobile devices and can facilitate vision-based applications that require real-time segmentation, such as self-driving and surgery assistance.

References

[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[2] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[4] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[6] Y. Guo, P. Dong, S. Hao, L. Wang, G. Wu, and D. Shen. Automatic segmentation of hippocampus for longitudinal infant brain MR image sequence by spatial-temporal hypergraph learning. In International Workshop on Patch-based Techniques in Medical Imaging, pages 1–8. Springer, 2016.
[7] Y. Guo, Y. Gao, and D. Shen. Deformable MR prostate segmentation via deep feature learning and sparse patch matching. IEEE Transactions on Medical Imaging, 35(4):1077–1089, 2015.
[8] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao. Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7519–7528, 2019.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[10] H. Li, P. Xiong, H. Fan, and J. Sun. DFANet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9522–9531, 2019.
[11] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
[12] S.-Y. Lo, H.-M. Hang, S.-W. Chan, and J.-J. Lin. Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In Proceedings of ACM Multimedia Asia, pages 1–6, 2019.
[13] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[14] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
[15] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4151–4160, 2017.
[16] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[17] M. Siam, S. Elkerdawy, M. Jagersand, and S. Yogamani. Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pages 1–8. IEEE, 2017.
[18] M. Treml, J. Arjona-Medina, T. Unterthiner, R. Durgesh, F. Friedmann, P. Schuberth, A. Mayr, M. Heusel, M. Hofmarcher, M. Widrich, et al. Speeding up semantic segmentation for autonomous driving. In MLITS, NIPS Workshop, volume 2, page 7, 2016.
[19] Y. Wang, Y.-T. Huang, and J.-N. Hwang. Monocular visual object 3D localization in road scenes. In Proceedings of the 27th ACM International Conference on Multimedia, pages 917–925, 2019.
[20] J. Wu, J. Jiao, Q. Yang, Z.-J. Zha, and X. Chen. Ground-aware point cloud semantic segmentation for autonomous driving. In Proceedings of the 27th ACM International Conference on Multimedia, pages 971–979, 2019.
[21] Z. Wu, C. Shen, and A. van den Hengel. Real-time semantic image segmentation via spatial sparsity. arXiv preprint arXiv:1712.00213, 2017.
[22] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 325–341, 2018.
[23] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[24] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7151–7160, 2018.
[25] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 405–420, 2018.
[26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[27] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia. PSANet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 267–283, 2018.
[28] N. Zhao. End2End semantic segmentation for 3D indoor scenes. In Proceedings of the 26th ACM International Conference on Multimedia, pages 810–814, 2018.