Quality-Aware Network for Human Parsing


Lu Yang1, Qing Song*1, Zhihui Wang1, Zhiwei Liu2, Songcen Xu3 and Zhihao Li3
1 Beijing University of Posts and Telecommunications
2 Institute of Automation, Chinese Academy of Sciences
3 Noah’s Ark Lab, Huawei Technologies
{soeaver, priv, wangzh}@bupt.edu.cn, zhiwei.liu@nlpr.ia.ac.cn, {xusongcen, zhihao.li}@huawei.com

* The corresponding author is Qing Song.

Abstract

How to estimate the quality of the network output is an important issue, and currently there is no effective solution in the field of human parsing. To solve this problem, this work proposes a statistical method based on the output probability map to calculate the pixel quality information, which is called the pixel score. In addition, the Quality-Aware Module (QAM) is proposed to fuse different kinds of quality information in order to estimate the quality of human parsing results. We combine QAM with a concise and effective network design to propose the Quality-Aware Network (QANet) for human parsing. Benefiting from the superiority of QAM and QANet, we achieve the best performance on three multiple and one single human parsing benchmarks, including CIHP, MHP-v2, Pascal-Person-Part and LIP. Without increasing the training and inference time, QAM improves the APr criterion by more than 10 points in the multiple human parsing task. QAM can also be extended to other tasks that require quality estimation, e.g., instance segmentation; specifically, QAM improves Mask R-CNN by ∼1% mAP on the COCO and LVISv1.0 datasets. Based on the proposed QAM and QANet, our overall system wins 1st place in the CVPR2019 COCO DensePose Challenge, and 1st place in Track 1 & 2 of the CVPR2020 LIP Challenge. Code and models are available at https://github.com/soeaver/QANet.

Figure 1. Examples of category pixel score on CIHP val set. Category pixel score can well reflect the quality of parsing results in different human parts.

1. Introduction

Predicting the semantic category of each pixel on the human body is a fundamental task in multimedia and computer vision, usually known as human parsing [29, 60, 13, 53, 44]. According to the number of humans in the image, it can be divided into single human parsing [29, 43, 44] and multiple human parsing [13, 53, 52]. Due to diverse human appearance, the semantic ambiguity of different human parts, and complex backgrounds, the practical application of human parsing remains difficult.

In recent years, many efforts have been made to improve the performance of human parsing. Structured hierarchies [43, 44, 21], edge and pose supervision [36, 40, 58], attention mechanisms [53, 52, 55] and self-correction [27] have made remarkable progress. However, one important aspect has not received enough attention: the quality estimation of human parsing results. The purpose of quality estimation is to score the prediction results of a system, thereby ranking outputs [2, 28], filtering out low-quality results [42] or eliminating data noise [27, 3]. Quality estimation has great potential in performance analysis and practical applications of human parsing. There are a large number of network outputs that have a low score with high quality, or a high score with low quality. In practice, the parsing scores of most methods can only reflect the quality of detection results, but cannot reflect the quality of parsing results [60, 53, 40, 38], which makes it difficult to
effectively filter out low-quality results by thresholding. Moreover, some works output parsing results in the form of a structured hierarchy [12, 43, 44, 21]. In these methods, the score of each human part is the same and does not reflect the quality of the semantic segmentation, so it is difficult to select the optimal level as the output.

The purpose of this work is to accurately estimate the quality of human parsing results. We use the box score predicted by the detector as the discriminant quality information. We extract the high-confidence region from the probability map of the network output, and take the average confidence of this region as the pixel quality information, which is recorded as the pixel score (Figure 1). Following [20] and [52], we also construct a lighter FCN to regress the IoU between the network output and the ground truth, which expresses the task quality information of the network output and is recorded as the IoU score. On this basis, we present the Quality-Aware Module (QAM), which builds a comprehensive and concise quality estimation for human parsing. QAM fuses the different quality information into a single quality score for each human parsing result using a flexible exponential weighting scheme. QAM is a post-processing mechanism independent of the network structure, so it can be combined with various methods. In this paper, we follow a simple and effective network design concept [49] and combine it with QAM to propose a network suitable for both single and multiple human parsing, called the Quality-Aware Network (QANet). QANet takes ResNet-FPN [19, 30] or HRNet [41] as the backbone, generates high-resolution semantic features through a semantic FPN [23], and predicts the parsing result and the IoU score with two different branches. Finally, QANet uses the quality score to express the quality of its human parsing results.

We extensively evaluate our approach on three multiple and one single human parsing benchmarks, including CIHP [13], MHP-v2 [60], Pascal-Person-Part [48] and LIP [29]. QANet achieves state-of-the-art performance on all four benchmarks. Especially in terms of the APr criterion, QANet is ahead of [53, 40, 52, 21] by more than 10 points on the CIHP val set. In addition, QAM is a post-processing quality estimation mechanism that can be combined with other advanced methods. Experiments show that QAM can be directly integrated into region-based multiple human parsing methods such as Parsing R-CNN [53] and RP R-CNN [52], and significantly improves the performance of both. QAM also provides good quality estimation for the instance segmentation task, improving Mask R-CNN by ∼1% mAP on the COCO [31] and LVISv1.0 [16] datasets. The performance improvement of QAM is equivalent to that of [20], but QAM does not require any additional training or inference time. These results show that QAM is a simple, modularized, low-cost and easily extensible module, and we have reason to believe that it can be used in more tasks [46, 24, 11, 51].

Benefiting from the superiority of QANet, our overall system wins 1st place in the CVPR2019 COCO DensePose Challenge (https://cocodataset.org/), and 1st place in Track 1 & 2 of the CVPR2020 LIP Challenge (https://vuhcs.github.io/).

2. Related Work

Human Parsing. Thanks to the progress of convolutional neural networks [25, 19, 39, 34] and the open sourcing of large-scale human parsing datasets [29, 60, 13], human parsing has made great progress. For single human parsing, most of the work is devoted to introducing human body structure priors and improving the network design. Wang et al. [43, 44] use deep graph networks and hierarchical human structures to exploit the human representational capacity. [40, 58] enhance the semantic information of the human body by introducing edge or pose supervision. Li et al. [27] introduce a purification strategy to tackle the problem of learning with label noise, promoting the reliability of the supervised labels as well as the learned models. For multiple human parsing, many effective top-down frameworks have been proposed. According to whether the human detector [39, 61, 64] and the human parser are trained end-to-end, top-down methods can be divided into one-stage and two-stage. One-stage top-down is a multi-task learning system based on R-CNN [10]. Yang et al. present Parsing R-CNN [53] and RP R-CNN [52], which introduce multi-scale features [5, 7], attention mechanisms [45, 54] and global semantic information to enhance the human visual perception ability of convolutional neural networks, making breakthroughs in multiple human parsing and dense pose estimation [15]. Two-stage top-down is a more flexible solution and currently the best performing one. Generally speaking, a single human parsing network can be used as the parser in a two-stage top-down method [40, 27].

Quality Estimation. The purpose of quality estimation (or quality assessment) is to score or rank the prediction results of a system. The cross-entropy loss function can be regarded as an intuitive quality estimation, and most discriminative models use it to express the confidence of their results, for example in image classification [25, 19], object detection [39, 32] and semantic segmentation [34]. However, more and more systems need customized quality estimation methods that are closer to their problems in order to accurately score or rank the outputs. In research on recommendation systems, [28, 37] use a list-wise loss function to compare the relationships between objects and rank the results. In order to build a high-performance face recognition system, [42] proposes an unsupervised estimation method to filter out low-quality face images. Huang et al. [20] use a mask IoU head to learn the quality of predicted instance masks. Similarly, Yang et al. [52] present a parsing re-scoring network to sense the quality of human parsing maps and give appropriate scores.
However, for human parsing, the existing methods can neither accurately express the quality of the results nor effectively estimate the quality of each human part independently.

3. Methodology

3.1. Motivation

The quality estimation of human parsing is an important but long neglected issue. Inappropriately scoring or ranking the network outputs results in a large number of low-score, high-quality or high-score, low-quality outputs. Meanwhile, it is difficult to filter out low-quality results by thresholding. Therefore, our goal is to propose a simple, low-cost solution that can effectively score the results of human parsing, and can also give an independent score for each human part.

3.2. Estimating the Quality with Probability Map

Formally, the output of the human parsing network is a feature map f ∈ R^(N×C×H×W), where N denotes the number of human instances, C denotes the number of parsing categories, and H and W are the height and width of the feature map. For the feature map f, we take the largest probability among the C categories at each spatial position, and record it as the probability map p ∈ R^(N×H×W). Each position in p expresses the classification confidence of the corresponding pixel. Therefore, the probability map can be regarded as a quality estimation of each pixel. By making full use of the pixel quality information in the probability map, we can estimate the quality of the whole human body or of a single part.

A naive solution is to average the confidence of all positions in the probability map to represent the quality of the parsing result. However, this scheme is not very accurate. Because the mechanism of neural networks is still difficult to explain, the output for some difficult samples is always full of randomness [62, 63]. As shown in Figure 2, some low-confidence regions of the probability map often correspond to the boundary of the instance or to easily confused areas, but this does not mean that the prediction is wrong. For example, for a human instance with many boundaries or a human part with a small area, the average of the probability map will be low no matter whether the prediction is accurate or not. Therefore, if these areas are included in the average calculation, they introduce a serious bias into the quality estimation.

Figure 2. Probability map and high confidence masks with different thresholds. (a) input image, (b) probability map, (c) parsing result, (d) high confidence mask with T = 0.4, (e) T = 0.6, (f) T = 0.8. For (d) - (f), confidences of the black positions are lower than the threshold T.

To solve the above issue, we adopt a simple and effective scheme. We only average over the regions whose confidence is greater than a threshold T, with T = 0.2 by default. More specifically, we compute a high confidence mask (hcm ∈ R^(H×W)) on the probability map according to the threshold T. Then the average confidence within the hcm is calculated to express the pixel quality information of the human parsing network output, which is recorded as the instance pixel score. In addition, we use the same method to calculate the pixel quality information of each predicted human part, which is recorded as the category pixel score. Algorithm 1 provides PyTorch-style pseudo-code for calculating the instance pixel score and the category pixel score. This calculation is model independent and has little computational overhead.

Algorithm 1 Pseudocode of Pixel Score in a PyTorch-like style.

    import torch

    def pixel_score(probs, T=0.2):
        # probs: input probability map (N, C, H, W), softmax over the C categories
        # T: confidence threshold, default 0.2
        # returns: inst_ps (N,) instance pixel score,
        #          cate_ps (N, C - 1) category pixel score
        N, C, H, W = probs.shape
        # predicted category and its probability at each pixel
        inst_max, inst_idx = torch.max(probs, dim=1)
        # high confidence mask (hcm)
        inst_hcm = inst_max >= T
        # high confidence value (hcv) and number of hcm pixels
        inst_hcv = torch.sum(inst_max * inst_hcm, dim=[1, 2])
        inst_hcm_num = torch.sum(inst_hcm, dim=[1, 2])
        # instance pixel score, (N,)
        inst_ps = inst_hcv / inst_hcm_num.clamp(min=1)
        cate_ps = torch.ones((N, C - 1))
        for c in range(1, C):
            # probability of each predicted (foreground) category
            cate_max = inst_max * (inst_idx == c)
            cate_hcm = cate_max >= T
            cate_hcv = torch.sum(cate_max * cate_hcm, dim=[1, 2])
            cate_hcm_num = torch.sum(cate_hcm, dim=[1, 2])
            # category pixel score, (N, C - 1); 0 if the category is not predicted
            cate_ps[:, c - 1] = cate_hcv / cate_hcm_num.clamp(min=1)
        return inst_ps, cate_ps
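As a quick sanity check (not from the paper), the function above can be exercised on synthetic probability maps; the shapes and threshold follow Algorithm 1, everything else is illustrative. A confident (near one-hot) map receives a high pixel score, while a flat, uncertain map does not.

    import torch

    logits = torch.randn(2, 20, 128, 96)                 # 2 instances, 20 CIHP categories
    sharp = torch.softmax(logits * 10.0, dim=1)          # nearly one-hot per pixel
    flat = torch.softmax(logits * 0.1, dim=1)            # close to uniform per pixel
    inst_ps, cate_ps = pixel_score(sharp, T=0.2)
    print(inst_ps)          # typically close to 1.0
    print(cate_ps.shape)    # torch.Size([2, 19])
    print(pixel_score(flat, T=0.2)[0])   # much lower (no pixel passes the threshold)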
[Figure 3 diagram: network input (instances cropped with GT/predicted boxes) -> backbone network (ResNet-FPN or HRNet, features P2-P5) -> Semantic FPN -> network output: parsing result (softmax + Lovász losses), IoU score (L2 loss), pixel score and box score, fused by QAM.]
Figure 3. QANet pipeline. The whole image is cropped into several human instances according to the ground-truth boxes (during training) or the predicted boxes (during inference), which are sent to the backbone network. The backbone outputs multi-scale features through ResNet-FPN or HRNet. Semantic FPN integrates the multi-scale features into a high-resolution feature, which is then used to predict the parsing result and its IoU score by two branches. During inference, we use QAM to estimate the quality of the predicted results for each human instance and each human part.

3.3. Quality-Aware Module

In addition to the pixel quality information of the probability map, other information can be used to estimate the quality of human parsing results. The detector not only outputs the location of each human instance, but also gives a classification confidence, which we call the discriminant quality information. Mean intersection over union (mIoU) [34] is the basic index for measuring parsing performance. Following [20] and [52], we can regress the mIoU between the predicted parsing result and the ground truth with a lightweight FCN, which represents the task quality information. Discriminant quality information, task quality information and pixel quality information each express the output quality of the convolutional neural network in some aspect, and they are complementary to each other.

We propose the Quality-Aware Module (QAM), which fuses the different quality information to estimate the output quality. In the human parsing task, the box score Sb represents the discriminant quality information, the IoU score Si represents the task quality information, and the pixel score Sp represents the pixel quality information. We use exponential weighting to calculate the final quality score Sq:

    Sq = (Sb^α · Si^β · Sp^γ)^(1 / (α + β + γ)),    (1)

where α, β and γ adjust the weight of the different quality information, with α = β = γ = 1.0 by default. Even with the default weights, QAM significantly improves the quality estimation of human parsing, and better results can be obtained by tuning the weights. The idea of QAM is to flexibly fuse the different quality information into a more comprehensive quality score, which is a simple, low-cost and easily extensible method. QAM can be used not only for human parsing, but also for instance segmentation [18, 50] and dense pose estimation [53] after simple adaptation.
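To make Eq. (1) concrete, the following is a minimal sketch of the fusion step; the function name and the tensor shapes are illustrative, only the formula and the default weights come from the text above.

    import torch

    def qam_fuse(box_score, iou_score, pixel_score, alpha=1.0, beta=1.0, gamma=1.0):
        # box_score: (N,) detector confidence Sb
        # iou_score: (N,) predicted IoU Si from the IoU branch
        # pixel_score: (N,) instance pixel score Sp
        # Eq. (1): weighted geometric-mean-style fusion
        fused = (box_score ** alpha) * (iou_score ** beta) * (pixel_score ** gamma)
        return fused ** (1.0 / (alpha + beta + gamma))

Per-part quality scores can be obtained the same way by broadcasting the box and IoU scores against the category pixel scores, e.g. qam_fuse(box.unsqueeze(1), iou.unsqueeze(1), cate_ps); this is our reading of how the category pixel score of Sec. 3.2 enters the fusion.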
3.4. Architecture

In order to verify the effect of QAM, we propose a two-stage top-down human parsing network, called the Quality-Aware Network (QANet). Our design principle starts from SimpleNet [49], a strong and concise pose estimation baseline. As illustrated in Figure 3, we adopt a ResNet [19] with FPN [30] or an HRNet [41] as the backbone, which produces multi-scale features P2 - P5. Each scale is upsampled by convolutions and bilinear upsampling until it reaches 1/4 scale, and these outputs are integrated into a high-resolution feature by element-wise summation [23]. Two branches are attached to the high-resolution feature to predict the parsing result and the IoU score, respectively. The parsing loss is a combination of the cross-entropy loss (Lc) and the Lovász loss (Lz) [1]. The IoU branch is a lightweight FCN and uses an MSE loss (Lm) to regress the mIoU between the predicted parsing result and the ground truth. Thus, the loss function of the whole network is:

    L = Lc + Lz + Lm.    (2)

Finally, we use QAM to calculate the quality scores of the whole human instance and of each human part, respectively.
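The Semantic-FPN fusion, the two branches and Eq. (2) can be summarized in a compact sketch. Channel widths, kernel sizes and the IoU-head layout below are illustrative assumptions (the paper does not specify them here), and lovasz_fn stands for an external Lovász-softmax implementation [1].

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QANetHead(nn.Module):
        # minimal sketch of Sec. 3.4, not the released QANet configuration
        def __init__(self, in_channels=256, num_classes=20):
            super().__init__()
            # one 3x3 conv per pyramid level before resizing to 1/4 scale
            self.convs = nn.ModuleList(
                [nn.Conv2d(in_channels, in_channels, 3, padding=1) for _ in range(4)])  # P2-P5
            self.parsing_branch = nn.Conv2d(in_channels, num_classes, 1)
            # lightweight FCN regressing a single mIoU value per instance
            self.iou_branch = nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_channels, 1), nn.Sigmoid())

        def forward(self, feats):
            # feats: [P2, P3, P4, P5]; P2 is already at 1/4 input scale
            target = feats[0].shape[-2:]
            fused = 0
            for conv, f in zip(self.convs, feats):
                x = conv(f)
                if x.shape[-2:] != target:
                    x = F.interpolate(x, size=target, mode="bilinear", align_corners=False)
                fused = fused + x  # element-wise sum into one high-resolution feature
            parsing_logits = self.parsing_branch(fused)    # (N, C, H/4, W/4)
            iou_score = self.iou_branch(fused).squeeze(1)  # (N,)
            return parsing_logits, iou_score

    def qanet_loss(parsing_logits, iou_pred, target_parsing, target_iou, lovasz_fn):
        # Eq. (2): L = Lc + Lz + Lm
        l_c = F.cross_entropy(parsing_logits, target_parsing)
        l_z = lovasz_fn(F.softmax(parsing_logits, dim=1), target_parsing)
        l_m = F.mse_loss(iou_pred, target_iou)
        return l_c + l_z + l_m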
4. Experiments

4.1. Experimental Settings

Datasets. Three multiple [13, 60, 48] and one single [29] human parsing benchmarks are used for performance evaluation. CIHP [13] is a popular multiple human parsing dataset with 20 categories that has 28,280 images for training, 5,000 images for validation and 5,000 images for testing. MHP-v2 [60] includes 25,403 elaborately annotated images with 58 fine-grained semantic category labels; the validation set and test set have 5,000 images each, and the remaining 15,403 form the training set. PASCAL-Person-Part [48] contains 3,535 annotated images, 1,717 for training and 1,818 for testing, among which only 7 categories are labeled. LIP [29] is a single human parsing dataset containing 50,462 images collected from realistic scenarios, divided into 30,462 images for training, 10,000 for validation and 10,000 for testing.

Evaluation Metrics. The standard mIoU criterion is adopted for evaluating human part segmentation, for both single and multiple human parsing (quality estimation is not reflected in the mIoU criterion, so we use QANet without QAM for single human parsing). We also use Average Precision based on part (APp), Average Precision based on region (APr) and Percentage of Correctly parsed semantic Parts (PCP50) to measure the performance of multiple human parsing, following [13, 60, 53].

Training. We implement QAM and QANet in PyTorch on a server with 8 NVIDIA Titan RTX GPUs. We use ResNet50, ResNet101 and HRNet-W48 as backbones to compare different methods fairly; in particular, we only conduct HRNet-W48 experiments on LIP. Following [49], the ground-truth human box is extended in height or width to a fixed aspect ratio (4:3). The default input size of QANet is 512×384. Data augmentation includes scaling [-30%, +30%], rotation [-40°, +40°] and horizontal flipping. For optimization, we use the ADAM [22] solver with a batch size of 256. All models are trained for 140 epochs with a 2e-3 base learning rate, which drops to 2e-4 at 90 epochs and 2e-5 at 120 epochs. No other training techniques are used, such as warm-up [14], syncBN [4] or learning rate annealing [5, 59].
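The 4:3 box adjustment mentioned above can be sketched as follows; the helper name and the (x1, y1, x2, y2) box format are assumptions for illustration.

    def fix_aspect_ratio(box, aspect=4.0 / 3.0):
        # box: (x1, y1, x2, y2); enlarge the shorter side about the box center
        # so that height / width == 4 / 3 (the crop is later resized to 512x384)
        x1, y1, x2, y2 = box
        w, h = x2 - x1, y2 - y1
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        if h / max(w, 1e-6) > aspect:   # too tall: widen
            w = h / aspect
        else:                           # too wide: heighten
            h = w * aspect
        return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)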
Inference. For multiple human parsing, a two-stage top-down paradigm is applied [8, 49]. By default we use FCOS [61] with ResNet50 as the detector. The QANet result is the combination of the original input and the flipped image. For evaluation on LIP, we also use multi-scale test augmentation, following [43, 44].
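A minimal sketch of the flip-test combination mentioned above; the model call returning logits and the flipped_pairs list (the left/right label swaps needed for parsing categories) are assumptions, only the idea of averaging the original and horizontally flipped predictions comes from the text.

    import torch

    def flip_test(model, image):
        # image: (1, 3, H, W) cropped human instance
        probs = torch.softmax(model(image), dim=1)
        probs_f = torch.softmax(model(torch.flip(image, dims=[3])), dim=1)
        probs_f = torch.flip(probs_f, dims=[3])          # flip back to the original frame
        # swap left/right categories (e.g. left-arm <-> right-arm); dataset dependent
        flipped_pairs = [(14, 15), (16, 17), (18, 19)]   # hypothetical index pairs
        for a, b in flipped_pairs:
            probs_f[:, [a, b]] = probs_f[:, [b, a]]
        return (probs + probs_f) / 2.0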
4.2. Ablation Studies

In this sub-section, we assess the effects of different settings of QAM and QANet through detailed ablation studies.

Threshold of Pixel Score. The threshold T is an important hyper-parameter for calculating the pixel score (see Algorithm 1). The table below shows QANet-ResNet50 with different thresholds:

threshold T | 0.0       | 0.2       | 0.4       | 0.6       | 0.8
APp / APr   | 47.5/42.2 | 60.1/56.3 | 60.1/56.2 | 60.0/56.0 | 59.9/55.1

When T = 0.0, the effect of the pixel score is negative. Increasing T to filter out low-confidence regions significantly improves the effect, so we choose T = 0.2. As illustrated in Figure 1, the category pixel score can well reflect the quality of parsing results in different human parts.

Quality Weights. The quality weights adjust the proportion of the box score, IoU score and pixel score in the quality estimation. Table 1 shows the ablation study of the quality weights on the CIHP val set; all experiments are based on ResNet50.

Quality Weights (α, β, γ) | APp        | APp50      | APr         | APr50
(a) (1.0, 0.0, 0.0)       | 55.6       | 67.7       | 40.1        | 45.0
(b) (0.0, 1.0, 0.0)       | 56.9(+1.3) | 70.7(+3.0) | 44.3(+4.2)  | 50.0(+5.0)
(c) (0.0, 0.0, 1.0)       | 49.0(−6.6) | 60.8(−6.9) | 49.7(+9.6)  | 55.4(+10.4)
(d) (1.0, 1.0, 0.0)       | 60.2(+4.6) | 74.5(+6.8) | 45.6(+5.5)  | 51.4(+6.4)
(e) (1.0, 0.0, 1.0)       | 57.0(+1.4) | 70.2(+2.5) | 55.0(+14.9) | 62.3(+17.3)
(f) (0.0, 1.0, 1.0)       | 57.0(+1.4) | 70.9(+3.2) | 52.5(+12.4) | 59.6(+14.6)
(g) (1.0, 1.0, 1.0)       | 60.2(+4.6) | 74.5(+6.8) | 54.0(+13.9) | 61.2(+16.2)
(h) (1.0, 0.5, 3.0)       | 60.1(+4.5) | 74.3(+6.6) | 56.2(+16.1) | 63.5(+18.5)
Table 1. Ablation study of quality weights. All models are trained on the CIHP train set and evaluated on the CIHP val set.

Rows (a), (b) and (c) show the performance when only the box score, IoU score or pixel score is used. The IoU score improves both APp and APr, whereas the pixel score improves APr but degrades APp. Rows (d), (e) and (f) show that any combination of two scores improves both APp and APr; when the pixel score is included, the improvement in APr is especially significant, which shows that the pixel score is better at representing local (human part) quality. When the box score, IoU score and pixel score are used with equal weights (row (g)), APp and APr increase by 4.6 and 13.9 points, respectively. This fully demonstrates that the pixel score and QAM proposed in this work are effective. After a grid search, we use (1.0, 0.5, 3.0) as the quality weights to obtain the best performance on CIHP, and we also use these weights on MHP-v2 and PASCAL-Person-Part. In addition, we replace the predicted boxes with ground-truth boxes (GT-box); the experiments in Table 2 show that QAM still significantly improves the performance of human parsing when GT-boxes are used.

Methods   | GT-box | QAM | mIoU       | APp        | APp50      | PCP50      | APr         | APr50
QANet-R50 |        |     | 62.9       | 55.6       | 67.7       | 68.9       | 40.1        | 45.0
QANet-R50 | X      |     | 65.0(+2.1) | 53.6(−2.0) | 61.9(−5.8) | 69.0(+0.1) | 38.0(−2.1)  | 42.8(−2.2)
QANet-R50 |        | X   | 62.9(+0.0) | 60.1(+4.5) | 74.3(+6.6) | 68.9(+0.0) | 56.2(+16.1) | 63.5(+18.5)
QANet-R50 | X      | X   | 65.0(+2.1) | 62.4(+6.8) | 76.3(+8.6) | 69.0(+0.1) | 58.3(+18.2) | 66.0(+21.0)
Table 2. The effect of QAM with GT-box or predicted box. All models are trained on the CIHP train set and evaluated on the CIHP val set.
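Because QAM is pure post-processing, the grid search over (α, β, γ) mentioned above can be run on cached predictions without retraining. The sketch below assumes a hypothetical evaluate_apr() callback that scores one weight setting on the val set; the grid values are illustrative.

    import itertools

    def search_quality_weights(evaluate_apr, grid=(0.0, 0.5, 1.0, 2.0, 3.0)):
        # evaluate_apr(alpha, beta, gamma) -> APr on the val set (hypothetical callback)
        best = None
        for alpha, beta, gamma in itertools.product(grid, repeat=3):
            if alpha == beta == gamma == 0.0:
                continue  # all-zero weights are undefined in Eq. (1)
            apr = evaluate_apr(alpha, beta, gamma)
            if best is None or apr > best[0]:
                best = (apr, (alpha, beta, gamma))
        return best  # the paper reports (1.0, 0.5, 3.0) as best on CIHP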
From SimpleNet to QANet. We adopt SimpleNet as our baseline, which uses ResNet50 as the backbone followed by three deconvolutional layers [49]. By default, the input size of SimpleNet is 256×192. As shown in Table 3, SimpleNet achieves 54.6% mIoU, 51.8% APp and 32.5% APr on the CIHP val set. Adding Semantic FPN, the Lovász loss, the IoU score and the 512×384 input size outperforms the baseline by 8.3% mIoU, 8.5% APp and 13.1% APr. This performance already exceeds the best published methods on CIHP [58, 52, 21], which shows that our network architecture is concise but effective. Furthermore, our proposed QANet (with QAM) achieves 62.9% mIoU, 60.1% APp and 56.2% APr; QAM alone brings more than 10 points of APr improvement.

Methods   | S-FPN | Lovasz | IoU | Large-Res | QAM | mIoU       | APp        | APp50       | PCP50      | APr         | APr50
SimpleNet |       |        |     |           |     | 54.6       | 51.8       | 60.1        | 59.1       | 32.5        | 36.1
SimpleNet | X     |        |     |           |     | 56.2(+1.6) | 53.6(+1.8) | 63.8(+3.7)  | 62.0(+2.9) | 34.7(+2.2)  | 38.4(+2.3)
SimpleNet | X     | X      |     |           |     | 58.9(+4.3) | 53.9(+2.1) | 64.3(+3.2)  | 65.9(+6.8) | 37.2(+4.7)  | 41.9(+5.8)
SimpleNet | X     | X      | X   |           |     | 59.3(+4.7) | 58.3(+6.5) | 71.9(+11.8) | 66.0(+6.9) | 41.8(+9.3)  | 47.1(+11.0)
SimpleNet | X     | X      | X   | X         |     | 62.9(+8.3) | 60.3(+8.5) | 74.5(+14.4) | 68.9(+9.8) | 45.6(+13.1) | 51.4(+15.3)
QANet     | X     | X      | X   | X         | X   | 62.9(+8.3) | 60.1(+8.3) | 74.3(+14.2) | 68.9(+9.8) | 56.2(+23.7) | 63.5(+27.4)
Table 3. Ablation study of QANet on the CIHP dataset. 'S-FPN' is Semantic FPN, 'Lovasz' is the Lovász loss, 'IoU' denotes a lightweight FCN predicting the IoU score, and 'Large-Res' denotes 512×384 input resolution.

4.3. Quantitative and Qualitative Results

LIP [29]. LIP is a standard benchmark for single human parsing. Table 4 compares QANet (HRNet-W48) with other state-of-the-art methods on the LIP val set. To keep the number of input pixels roughly the same, we carried out the experiments with a 512×384 input size, which is smaller than that of the other methods. In terms of mean Acc. and mIoU, our QANet surpasses the best performing method HHP† [44] by 0.59 and 0.25 points, respectively.

Methods       | Input Size | pix Acc. | mean Acc. | mIoU
FCN-8s [34]   | –          | 76.06    | 36.75     | 28.29
Attention [6] | –          | 83.43    | 54.39     | 42.92
MMAN [35]     | 256×256    | 85.24    | 57.60     | 46.93
JPPNet [29]   | 384×384    | –        | –         | 51.37
CE2P [40]     | 473×473    | 87.37    | 63.20     | 53.10
BraidNet [33] | 384×384    | 87.60    | 66.09     | 54.42
CorrPM [58]   | 384×384    | –        | –         | 55.33
OCR [55]      | 473×473    | –        | –         | 56.65
PCNet [57]    | 473×473    | –        | –         | 57.03
CNIF [43]     | 473×473    | 88.03    | 68.80     | 57.74
HHP [44]      | 473×473    | 89.05    | 70.58     | 59.25
SCHP [27]     | 473×473    | –        | –         | 59.36
QANet (ours)  | 512×384    | 88.92    | 71.17     | 59.61
Table 4. Comparison of pixel accuracy, mean accuracy and mIoU on the LIP val set. Our best single model achieves state-of-the-art results with a smaller input size (512 × 384 < 473 × 473).

CIHP [13]. CIHP is the most popular benchmark for multiple human parsing. As shown in Table 5, QANet with ResNet50 is better than the current best methods [52, 21] in terms of mIoU, APp and APr. In terms of APr, QANet outperforms RP R-CNN [52] by 13.9 points and SemaTree [21] by 12.2 points. Such a gain is particularly impressive considering how challenging it is to improve on the multiple human parsing task. Combined with the larger-capacity HRNet-W48 backbone, QANet achieves state-of-the-art results on the CIHP val set: 66.1% mIoU, 64.5% APp, 75.7% PCP50 and 60.8% APr.

MHP-v2 [60]. The middle part of Table 5 reports QANet performance on the MHP-v2 val set. Since the previous methods do not provide the APr criterion, we only compare mIoU and APp. QANet with ResNet50 outperforms RP R-CNN [52] by 3.8 points in mIoU. With ResNet101, QANet outperforms CE2P [40] by 2.0 points in mIoU and 6.5 points in APp.

PASCAL-Person-Part [48]. In the lower part of Table 5, we compare our method against recent methods on the PASCAL-Person-Part test set. In terms of APp and APr, our improvement is remarkable: QANet surpasses PGN [13], Parsing R-CNN [53] and RP R-CNN [52] by more than 10.0 points in APp and 15.0 points in APr.

Qualitative Results. Some qualitative comparison results on the CIHP val set are depicted in Figure 4. The qualitative comparison mainly shows the semantic segmentation ability of the different methods. Compared with Parsing R-CNN [53] (first row) and RP R-CNN [52] (second row), QANet (third row) performs better under occlusion, along human contours, in complex backgrounds and on confusing categories.
Methods              | Backbones | Epochs | mIoU | APp  | APp50 | PCP50 | APr  | APr50

CIHP [13]
Bottom-Up
PGN† [13]            | ResNet101 | ∼80    | 55.8 | 39.0 | 34.0  | 61.0  | 33.6 | 35.8
Graphonomy [12]      | Xception  | 100    | 58.6 | –    | –     | –     | –    | –
GPM [17]             | Xception  | 100    | 60.3 | –    | –     | –     | –    | –
Grapy-ML [17]        | Xception  | 200    | 60.6 | –    | –     | –     | –    | –
CorrPM [58]          | ResNet101 | 150    | 60.2 | –    | –     | –     | –    | –
One-Stage Top-Down
Parsing R-CNN [53]   | ResNet50  | 75     | 56.3 | 53.9 | 63.7  | 60.1  | 36.5 | 40.9
Unified [38]         | ResNet101 | ∼37    | 55.2 | 48.0 | 51.0  | –     | 38.6 | 44.0
RP R-CNN [52]        | ResNet50  | 150    | 60.2 | 59.5 | 74.1  | 64.9  | 42.3 | 48.2
Two-Stage Top-Down
CE2P [40]            | ResNet101 | 150    | 59.5 | –    | –     | –     | 42.8 | 48.7
BraidNet [33]        | ResNet101 | 150    | 60.6 | –    | –     | –     | 43.6 | 49.9
SemaTree [21]        | ResNet101 | 200    | 60.9 | –    | –     | –     | 44.0 | 49.3
QANet (ours)         | ResNet50  | 140    | 62.9 | 60.1 | 74.3  | 68.9  | 56.2 | 63.5
QANet (ours)         | ResNet101 | 140    | 63.8 | 61.7 | 77.1  | 72.0  | 57.3 | 64.8
QANet (ours)         | HRNet-W48 | 140    | 66.1 | 64.5 | 81.3  | 75.7  | 60.8 | 68.8

MHP-v2 [60]
Bottom-Up
MH-Parser [26]       | ResNet101 | –      | –    | 36.0 | 17.9  | 26.9  | –    | –
NAN [60]             | –         | ∼80    | –    | 41.7 | 25.1  | 32.2  | –    | –
One-Stage Top-Down
Parsing R-CNN [53]   | ResNet50  | 75     | 36.2 | 39.5 | 24.5  | 37.2  | –    | –
RP R-CNN [52]        | ResNet50  | 150    | 38.6 | 46.8 | 45.3  | 43.8  | –    | –
Two-Stage Top-Down
CE2P [40]            | ResNet101 | 150    | 41.1 | 42.7 | 34.5  | 43.8  | –    | –
SemaTree [21]        | ResNet101 | 200    | –    | 42.5 | 34.4  | 43.5  | –    | –
QANet (ours)         | ResNet50  | 140    | 42.4 | 47.2 | 44.0  | 46.5  | –    | –
QANet (ours)         | ResNet101 | 140    | 43.1 | 49.2 | 48.8  | 50.8  | –    | –
QANet (ours)         | HRNet-W48 | 140    | 44.4 | 51.0 | 54.0  | 55.8  | –    | –

PPP [48]
Bottom-Up
PGN† [13]            | ResNet101 | ∼80    | 68.4 | –    | –     | –     | 39.2 | 39.6
Graphonomy [12]      | Xception  | 100    | 71.1 | –    | –     | –     | –    | –
GPM [17]             | Xception  | 100    | 69.5 | –    | –     | –     | –    | –
Grapy-ML [17]        | Xception  | 200    | 71.6 | –    | –     | –     | –    | –
One-Stage Top-Down
Parsing R-CNN [53]   | ResNet50  | 75     | 62.7 | 49.8 | 58.2  | 48.7  | 40.4 | 43.7
RP R-CNN [52]        | ResNet50  | 75     | 63.3 | 50.1 | 58.9  | 49.1  | 40.9 | 44.1
Two-Stage Top-Down
CNIF† [43]           | ResNet101 | 150    | 70.8 | –    | –     | –     | –    | –
HHP† [44]            | ResNet101 | 150    | 73.1 | –    | –     | –     | –    | –
QANet (ours)         | ResNet101 | 140    | 69.5 | 60.1 | 74.6  | 62.9  | 54.0 | 62.6
QANet (ours)†        | HRNet-W48 | 140    | 72.6 | 63.1 | 78.2  | 67.2  | 58.7 | 67.8

Table 5. Comparison with previous methods for multiple human parsing on the CIHP, MHP-v2 and PASCAL-Person-Part (denoted as PPP) datasets. Bold numbers are state-of-the-art on each dataset, and gray numbers are based on our runs of the official open-source implementations [53, 52]. † denotes multi-scale test augmentation.

4.4. Extensibility

Region-based Human Parsing. QAM can be regarded as a post-processing module, so it can easily be combined with other methods. Table 6 shows the performance of QAM combined with two region-based multiple human parsing methods on the CIHP val set. QAM improves Parsing R-CNN [53] by about 2.9 points in APp and 10.7 points in APr. The IoU score is already used in RP R-CNN [52], so QAM improves APr by 6.0 points while achieving similar APp, which is consistent with the experimental results in rows (d) and (e) of Table 1.

Methods            | QAM | APp        | APp50      | APr         | APr50
Parsing R-CNN [53] |     | 53.9       | 63.7       | 36.5        | 40.9
Parsing R-CNN [53] | X   | 55.8(+2.9) | 67.7(+4.0) | 47.2(+10.7) | 53.8(+12.9)
RP R-CNN [52]      |     | 59.5       | 74.1       | 42.3        | 48.2
RP R-CNN [52]      | X   | 59.5(+0.0) | 74.2(+0.1) | 48.3(+6.0)  | 55.5(+7.3)
Table 6. Comparison of region-based multiple human parsing on the CIHP val set with or without QAM. The quality weights of Parsing R-CNN [53] are (1.0, 0.0, 1.0), and those of RP R-CNN [52] are (3.0, 1.0, 1.0).
Figure 4. Visual comparison on CIHP val set. The images in the first row are the results of Parsing R-CNN [53], the second row are the
results of RP R-CNN [52], and the third row are the results of QANet. All methods are based on ResNet50.

Instance Segmentation. Instance segmentation and multiple human parsing are highly similar in processing flow, so we believe that QAM (especially the pixel score) is also beneficial to the instance segmentation task. Table 7 shows the results of several methods on COCO [31] and LVISv1.0 [16] with and without QAM (all experiments are based on Detectron2 [47]). On the COCO val set, Mask R-CNN [18] with QAM is better than the original counterpart by up to +1.0 APm, and QAM also slightly improves SOLOv2 [50]. On the LVISv1.0 val set, we observe steady improvements as well.

Dataset  | Methods         | QAM | APm        | APm50      | APm75
COCO     | Mask R-CNN [18] |     | 35.2       | 56.2       | 37.5
COCO     | Mask R-CNN [18] | X   | 36.2(+1.0) | 55.5(−0.7) | 39.0(+1.5)
COCO     | SOLOv2 [50]     |     | 34.8       | 54.0       | 37.1
COCO     | SOLOv2 [50]     | X   | 35.2(+0.4) | 53.5(−0.5) | 37.7(+0.6)
LVISv1.0 | Mask R-CNN [18] |     | 19.1       | 30.2       | 20.3
LVISv1.0 | Mask R-CNN [18] | X   | 20.1(+1.0) | 31.2(+1.0) | 21.4(+1.1)
Table 7. Comparison of instance segmentation on the COCO / LVISv1.0 val sets with or without QAM. The quality weights are (1.0, 0.0, 1.0).
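Since QAM is used here as pure post-processing with weights (1.0, 0.0, 1.0), i.e. box score and pixel score only, a rescoring step for a generic instance segmentation model can be sketched as below. This is a simplified binary-mask variant of Algorithm 1; the argument names are illustrative and this is not the Detectron2 API.

    import torch

    def rescore_instances(box_scores, mask_probs, T=0.2, alpha=1.0, gamma=1.0):
        # box_scores: (N,) classification confidences from the detector
        # mask_probs: (N, H, W) per-instance foreground mask probabilities
        # pixel score: mean probability over the high-confidence region (cf. Algorithm 1)
        hcm = mask_probs >= T
        pixel_scores = (mask_probs * hcm).sum(dim=[1, 2]) / hcm.sum(dim=[1, 2]).clamp(min=1)
        # Eq. (1) with beta = 0 (no IoU branch), i.e. weights (1.0, 0.0, 1.0)
        fused = (box_scores ** alpha) * (pixel_scores ** gamma)
        return fused ** (1.0 / (alpha + gamma))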
Challenges. Benefiting from the superiority of QANet, we win 1st place in Track 1 & 2 of the CVPR2020 LIP Challenge (see Table 8) with a better human detector [61, 56] and test-time techniques. By replacing the parsing losses with dense pose losses, QANet achieves an equally impressive result in dense pose estimation: in the CVPR2019 COCO DensePose Challenge, QANet wins 1st place and is significantly ahead of the other competitors in all five metrics (see Table 9).

(a) Track 1: Multi-Person Human Parsing Challenge
Methods        | mIoU  | APr   | Avg.  | Rank
BJTU UIUC [40] | 63.77 | 45.31 | 54.54 | 2018 1st
PAT CV HUMAN   | 67.83 | 55.43 | 61.63 |
BAIDU-UTS [27] | 69.10 | 55.01 | 62.05 | 2019 1st
tfzhou.ai      | 68.69 | 56.40 | 62.55 |
QANet (ours)   | 68.76 | 64.63 | 66.69 | 2020 1st

(b) Track 2: Video Multi-Person Human Parsing Challenge
Methods        | mIoU  | APh   | APr   | Avg.  | Rank
BAIDU-UTS [27] | 60.14 | 76.52 | 52.97 | 63.21 | 2019 1st
keetysky       | 60.17 | 84.71 | 55.19 | 66.69 |
PAT CV HUMAN   | 60.58 | 84.93 | 55.56 | 67.02 |
QANet (ours)   | 62.67 | 89.59 | 59.34 | 70.53 | 2020 1st
Table 8. QANet wins 1st place in Track 1 & 2 of the CVPR2020 LIP Challenge. Our results ensemble four QANet models.

Methods              | APdp | APdp50 | APdp75 | APdpM | APdpL | Rank
DensePose R-CNN [15] | 58   | 90     | 70     | 51    | 61    |
Parsing R-CNN [53]   | 65   | 93     | 78     | 57    | 68    | 2018 1st
Detectron2 [47]      | 67   | 92     | 80     | 60    | 70    |
QANet (ours)         | 72   | 94     | 85     | 65    | 74    | 2019 1st
Table 9. QANet wins 1st place in the CVPR2019 COCO DensePose Challenge. Our result is obtained without ensembling.

5. Conclusion

The quality estimation of human parsing is an important but long neglected issue. In this work, we propose a network that can generate high-quality parsing results together with a simple, low-cost quality estimation method, named QANet. Its core components are the pixel score based on the probability map and the Quality-Aware Module (QAM), which integrates different quality information. QANet has verified its effectiveness and advancement on several challenging benchmarks, and wins 1st place in three popular challenges, including CVPR2020 LIP Track 1 & Track 2 and CVPR2019 COCO DensePose. Beyond the human parsing task, it is possible to adopt QANet and QAM for other instance-level tasks, such as instance segmentation and dense pose estimation. We hope to promote a new breakthrough in the form of quality estimation and help more visual tasks.
References

[1] Maxim Berman, Amal Rannen Triki, and Matthew B. Blaschko. The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In ICCV, 2018. 4
[2] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007. 1
[3] Jie Chang, Zhonghao Lan, Changmao Cheng, and Yichen Wei. Data uncertainty learning in face recognition. In CVPR, 2020. 1
[4] Peng Chao, Xiao Tete, Li Zeming, Jiang Yuning, Zhang Xiangyu, Jia Kai, Yu Gang, and Sun Jian. Megdet: A large mini-batch object detector. In CVPR, 2018. 5
[5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2016. 2, 5
[6] L. Chen, Y. Yang, J. Wang, W. Xu, and A. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016. 6
[7] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. 2
[8] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018. 5
[9] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 2
[11] Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh r-cnn. In ICCV, 2019. 2
[12] Ke Gong, Yiming Gao, Xiaodan Liang, Xiaohui Shen, and Liang Lin. Graphonomy: Universal human parsing via graph transfer learning. In CVPR, 2019. 2, 7
[13] K. Gong, X. Liang, Y. Li, Y. Chen, and L. Lin. Instance-level human parsing via part grouping network. In ECCV, 2018. 1, 2, 5, 6, 7
[14] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv:1706.02677, 2017. 5
[15] R. Guler, N. Neverova, and I. Kokkinos. Densepose: Dense human pose estimation in the wild. In CVPR, 2018. 2, 8
[16] Agrim Gupta, Piotr Dollár, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019. 2, 8
[17] Haoyu He, Jing Zhang, Qiming Zhang, and Dacheng Tao. Grapy-ml: Graph pyramid mutual learning for cross-dataset human parsing. In AAAI, 2020. 7
[18] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017. 4, 8
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 2, 4
[20] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring r-cnn. In CVPR, 2019. 2, 3, 4
[21] Ruyi Ji, Dawei Du, Libo Zhang, Longyin Wen, Yanjun Wu, Chen Zhao, Feiyue Huang, and Siwei Lyu. Learning semantic neural tree for human parsing. In ECCV, 2020. 1, 2, 6, 7
[22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 5
[23] A. Kirillov, R. Girshick, K. He, and P. Dollár. Panoptic feature pyramid networks. In CVPR, 2019. 2, 4
[24] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. Panoptic segmentation. In CVPR, 2019. 2
[25] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 2
[26] J. Li, J. Zhao, Y. Wei, C. Lang, Y. Li, T. Sim, S. Yan, and J. Feng. Multi-human parsing in the wild. arXiv:1705.07206, 2017. 7
[27] Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. Self-correction for human parsing. arXiv:1910.09777, 2019. 1, 2, 6, 8
[28] Pang Liang, Jun Xu, Ai Qingyao, Lan Yanyan, Cheng Xueqi, and Wen Jirong. Setrank: Learning a permutation-invariant ranking model for information retrieval. In ACM SIGIR, 2020. 1, 2
[29] X. Liang, K. Gong, X. Shen, and L. Lin. Look into person: Joint human parsing and pose estimation network and a new benchmark. TPAMI, 2018. 1, 2, 5, 6
[30] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 2, 4
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 2, 8
[32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. Ssd: Single shot multibox detector. In ECCV, 2016. 2
[33] Xinchen Liu, Meng Zhang, Wu Liu, Jingkuan Song, and Tao Mei. Braidnet: Braiding semantics and details for accurate human parsing. In ACM MM, 2019. 6, 7
[34] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 2, 4, 6
[35] Y. Luo, Z. Zheng, L. Zheng, T. Guan, J. Yu, and Y. Yang. Macro-micro adversarial network for human parsing. In ECCV, 2018. 6
[36] X. Nie, J. Feng, and S. Yan. Mutual learning to adapt for joint human parsing and pose estimation. In ECCV, 2018. 1
[37] Changhua Pei, Yi Zhang, Yongfeng Zhang, Fei Sun, and Dan Pei. Personalized re-ranking for recommendation. In ACM RS, 2019. 2
[38] Haifang Qin, Weixiang Hong, Wei-Chih Hung, Yi-Hsuan Tsai, and Ming-Hsuan Yang. A top-down unified framework for instance-level human parsing. In BMVC, 2019. 1, 7
[39] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015. 2

[40] Tao Ruan, Ting Liu, Zilong Huang, Yunchao Wei, Shikui Wei, Yao Zhao, and Thomas Huang. Devil in the details: Towards accurate single and multiple human parsing. In AAAI, 2019. 1, 2, 6, 7, 8
[41] K. Sun, B. Xiao, D. Liu, and J. Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019. 2, 4
[42] Philipp Terhörst, Jan Niklas Kolf, Naser Damer, Florian Kirchbuchner, and Arjan Kuijper. Ser-fiq: Unsupervised estimation of face image quality based on stochastic embedding robustness. In CVPR, 2020. 1, 2
[43] Wenguan Wang, Zhijie Zhang, Siyuan Qi, Jianbing Shen, Yanwei Pang, and Ling Shao. Learning compositional neural information fusion for human parsing. In ICCV, 2019. 1, 2, 5, 6, 7
[44] Wenguan Wang, Hailong Zhu, Jifeng Dai, Yanwei Pang, Jianbing Shen, and Ling Shao. Hierarchical human parsing with typed part-relation reasoning. In CVPR, 2020. 1, 2, 5, 6, 7
[45] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018. 2
[46] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016. 2
[47] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. 8
[48] Fangting Xia, Peng Wang, Xianjie Chen, and Alan L. Yuille. Joint multi-person pose estimation and semantic part segmentation. In CVPR, 2017. 2, 5, 6, 7
[49] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018. 2, 4, 5, 6
[50] Wang Xinlong, Zhang Rufeng, Kong Tao, Li Lei, and Shen Chunhua. Solov2: Dynamic, faster and stronger. In NIPS, 2020. 4, 8
[51] Lu Yang, Qing Song, Zhihui Wang, Mengjie Hu, and Chun Liu. Hier r-cnn: Instance-level human parts detection and a new benchmark. TIP, 2020. 2
[52] Lu Yang, Qing Song, Zhihui Wang, Mengjie Hu, Chun Liu, Xueshi Xin, Wenhe Jia, and Songcen Xu. Renovating parsing r-cnn for accurate multiple human parsing. In ECCV, 2020. 1, 2, 3, 4, 6, 7, 8
[53] Lu Yang, Qing Song, Zhihui Wang, and Ming Jiang. Parsing r-cnn for instance-level human analysis. In CVPR, 2019. 1, 2, 4, 5, 6, 7, 8
[54] Lu Yang, Qing Song, Yingqi Wu, and Mengjie Hu. Attention inspiring receptive-fields network for learning invariant representations. TNNLS, 2018. 2
[55] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020. 1, 6
[56] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. Resnest: Split-attention networks. arXiv:2004.08955, 2020. 8
[57] Xiaomei Zhang, Yingying Chen, Bingke Zhu, Jinqiao Wang, and Ming Tang. Part-aware context network for human parsing. In CVPR, 2020. 6
[58] Ziwei Zhang, Chi Su, Liang Zheng, and Xiaodong Xie. Correlating edge, pose with parsing. In CVPR, 2020. 1, 2, 6, 7
[59] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017. 5
[60] J. Zhao, J. Li, Y. Cheng, and J. Feng. Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In ACM MM, 2018. 1, 2, 5, 6, 7
[61] Tian Zhi, Shen Chunhua, Chen Hao, and He Tong. Fcos: Fully convolutional one-stage object detection. In ICCV, 2019. 2, 5, 8
[62] Bolei Zhou, David Bau, Aude Oliva, and Antonio Torralba. Interpreting deep visual representations via network dissection. TPAMI, 2018. 3
[63] Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. Interpretable basis decomposition for visual explanation. In ECCV, 2018. 3
[64] Bin Zhu, Qing Song, Lu Yang, Zhihui Wang, Chun Liu, and Mengjie Hu. Cpm r-cnn: Calibrating point-guided misalignment in object detection. In WACV, 2021. 2
[65] X. Zhu, H. Hu, S. Lin, and J. Dai. Deformable convnets v2: More deformable, better results. In CVPR, 2018.
