Look into Person: Joint Body Parsing & Pose Estimation Network and A New Benchmark

Xiaodan Liang, Ke Gong, Xiaohui Shen, and Liang Lin

arXiv:1804.01984v1 [cs.CV] 5 Apr 2018

X. Liang, K. Gong and L. Lin are with the School of Data and Computer Science, Sun Yat-sen University, China. X. Shen is with Adobe Research. X. Liang and K. Gong contribute equally to this paper. Corresponding author: Liang Lin (E-mail: linliang@ieee.org).

Abstract—Human parsing and pose estimation have recently received considerable interest due to their substantial application potentials. However, the existing datasets have limited numbers of images and annotations and lack a variety of human appearances and coverage of challenging cases in unconstrained environments. In this paper, we introduce a new benchmark named “Look into Person (LIP)” that provides a significant advancement in terms of scalability, diversity, and difficulty, which are crucial for future developments in human-centric analysis. This comprehensive dataset contains over 50,000 elaborately annotated images with 19 semantic part labels and 16 body joints, which are captured from a broad range of viewpoints, occlusions, and background complexities. Using these rich annotations, we perform detailed analyses of the leading human parsing and pose estimation approaches, thereby obtaining insights into the successes and failures of these methods. To further explore and take advantage of the semantic correlation of these two tasks, we propose a novel joint human parsing and pose estimation network to explore efficient context modeling, which can simultaneously predict parsing and pose with extremely high quality. Furthermore, we simplify the network to solve human parsing by exploring a novel self-supervised structure-sensitive learning approach, which imposes human pose structures into the parsing results without resorting to extra supervision. The datasets, code and models are available at http://www.sysu-hcp.net/lip/.

Index Terms—Human Parsing, Pose Estimation, Context Modeling, Convolutional Neural Networks.
1 INTRODUCTION

Comprehensive human visual understanding of scenarios in the wild, which is regarded as one of the most fundamental problems in computer vision, could have a crucial impact in many higher-level application domains, such as person re-identification [51], video surveillance [42], human behavior analysis [16], [22] and automatic product recommendation [21]. Human parsing (also named semantic part segmentation) aims to segment a human image into multiple parts with fine-grained semantics (e.g., body parts and clothing) and provides a more detailed understanding of image contents, whereas human pose estimation focuses on determining the precise locations of important body joints. Human parsing and pose estimation are two critical and correlated tasks in analyzing images of humans, providing both pixel-wise understanding and high-level joint structures.

Fig. 1. This example shows that human body structural information is helpful for human parsing. (a) The original image. (b) The parsing results by attention-to-scale [8], where the left arm is incorrectly labeled as the right arm. (c) Our parsing results successfully incorporate the structure information to generate reasonable outputs.

Recently, convolutional neural networks (CNNs) have achieved exciting success in human parsing [23], [25], [26] and pose estimation [33], [44]. Nevertheless, as demonstrated in many other problems, such as object detection [24] and semantic segmentation [52], the performance of such CNN-based approaches heavily relies on the availability of annotated images for training. To train a human parsing or pose network with practical value in real-world applications, it is highly desirable to have a large-scale dataset composed of representative instances with varied clothing appearances, strong articulation, partial (self-)occlusions, truncation at image borders, diverse viewpoints and background clutter. Although training sets exist for special scenarios, such as fashion pictures [14], [23], [26], [47] and people in constrained situations (e.g., upright) [9], these datasets are limited in their coverage and scalability, as shown in Fig. 2. The largest public human parsing dataset [26] to date contains only 17,000 fashion images, while others include only thousands of images. The MPII Human Pose dataset [1] is the most popular benchmark for evaluating articulated human pose estimation methods, and it includes approximately 25K images that contain over 40K people with annotated body joints. However, all these datasets focus on different aspects of human analysis by defining discrepant annotation sets; no unified dataset with both human parsing and pose annotations for holistic human understanding was available until our work filled this gap.

Furthermore, to the best of our knowledge, no attempt has been made to establish a standard representative benchmark that aims to cover a wide range of challenges for the two human-centric tasks. The existing datasets do not provide an evaluation server with a secret test set to avoid potential dataset over-fitting, which hinders further development on this topic.
Fig. 2. Annotation examples for our “Look into Person (LIP)” dataset and existing datasets. (a) The images in the MPII dataset [1] (left) and LSP dataset [19] (right) with only body joint annotations. (b) The images in the ATR dataset [26] (left) are fixed in size and only contain upright person instances outdoors. The images in the PASCAL-Person-Part dataset [9] (right) also have lower scalability and only contain 6 coarse labels. (c) The images in our LIP dataset have high appearance variability and complexity, and they are annotated with both human parsing and pose labels.

With the new benchmark named “Look into Person (LIP)”, we provide a public server for automatically reporting evaluation results. Our benchmark significantly advances the state of the art in terms of appearance variability and complexity, and it includes 50,462 human images with pixel-wise annotations of 19 semantic parts and 16 body joints.

The recent progress in human parsing [8], [14], [26], [27], [29], [37], [45], [47] has been achieved by improving the feature representations using CNNs and recurrent neural networks. To capture rich structure information, these approaches combine CNNs and graphical models (e.g., conditional random fields (CRFs)), similar to the general object segmentation approaches [6], [43], [52]. However, when evaluated on the new LIP dataset, the results of some existing methods [3], [6], [8], [31] are unsatisfactory. Without imposing human body structure priors, these general approaches based on bottom-up appearance information occasionally produce unreasonable results (e.g., a right arm connected to a left shoulder), as shown in Fig. 1. Human body structural information has previously been well explored in human pose estimation [10], [49], where dense joint annotations are provided. However, since human parsing requires more extensive and detailed predictions than pose estimation, it is difficult to directly utilize joint-based pose estimation models in pixel-wise predictions to incorporate the complex structure constraints. We demonstrate that the human joint structure can facilitate pixel-wise parsing prediction by incorporating higher-level semantic correlations between human parts.

For pose estimation, increasing research efforts [4], [11], [12], [35] have been devoted to learning the relationships between human body parts and joints. Some studies [4], [11] explored encoding part constraints and contextual information to guide the network to focus on informative regions (human parts) and predict more precise locations of the body joints, achieving state-of-the-art performance. Conversely, some pose-guided human parsing methods [13], [46] also exploited the peculiarities and relationships of these two correlated tasks. However, previous works generally solve these two problems separately or alternately.

In this work, we aim to seamlessly integrate human parsing and pose estimation under a unified framework. We use a shared deep residual network for feature extraction, after which two distinct small networks encode and predict the contextual information and results.
TABLE 1
Overview of the publicly available datasets for human parsing and pose estimation. For each dataset, we report the number of images in the training, validation and test sets; parsing categories, including background; and body joints.

Dataset                            #Total    #Training   #Validation    #Test   Parsing Categories   Body Joints
Fashionista [48] (Parsing)            685          456             -      229                   56             -
PASCAL-Person-Part [9] (Parsing)    3,533        1,716             -    1,817                    7             -
ATR [26] (Parsing)                 17,700       16,000           700    1,000                   18             -
LSP [19] (Pose)                     2,000        1,000             -    1,000                    -            14
MPII [1] (Pose)                    24,987       18,079             -    6,908                    -            16
J-HMDB [18] (Pose)                 31,838            -             -        -                    2            13
LIP                                50,462       30,462        10,000   10,000                   20            16

Then, a simple yet efficient refinement network tailored for both parsing and pose prediction is proposed to explore efficient context modeling, which makes human parsing and pose estimation mutually beneficial. In our unified framework, we propose a scheme to incorporate multi-scale feature combinations and iterative location refinement together, which are often posed as two different coarse-to-fine strategies that are widely investigated for human parsing and pose estimation separately. To highlight the merit of unifying the two highly correlated and complementary tasks within an end-to-end framework, we name our framework the joint human parsing and pose estimation network (JPPNet).
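To make the overall layout described above concrete, the following is a minimal PyTorch-style sketch, not the configuration used in JPPNet: a toy feature extractor stands in for the shared deep residual backbone, two small heads produce coarse parsing and pose maps, and a refinement module consumes the shared features together with both coarse predictions. All layer sizes, module names and the toy backbone are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ToyJointNet(nn.Module):
        """Illustrative layout only: shared features -> parsing & pose branches -> joint refinement."""

        def __init__(self, num_parts=20, num_joints=16, feat_dim=256):
            super().__init__()
            # Shared feature extractor (stand-in for a deep residual network).
            self.backbone = nn.Sequential(
                nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
            # Two small task-specific heads producing coarse predictions.
            self.parsing_head = nn.Conv2d(feat_dim, num_parts, 1)
            self.pose_head = nn.Conv2d(feat_dim, num_joints, 1)
            # Refinement stage that sees the shared features plus both coarse predictions.
            self.refine = nn.Sequential(
                nn.Conv2d(feat_dim + num_parts + num_joints, feat_dim, 3, padding=1),
                nn.ReLU(inplace=True),
            )
            self.parsing_refined = nn.Conv2d(feat_dim, num_parts, 1)
            self.pose_refined = nn.Conv2d(feat_dim, num_joints, 1)

        def forward(self, images):
            feats = self.backbone(images)
            parsing, pose = self.parsing_head(feats), self.pose_head(feats)
            refined = self.refine(torch.cat([feats, parsing, pose], dim=1))
            return self.parsing_refined(refined), self.pose_refined(refined)

    # parsing_maps, pose_maps = ToyJointNet()(torch.randn(1, 3, 256, 256))

The point of the sketch is only the data flow: both tasks read the same features, and the refinement stage observes both intermediate outputs, which is what allows the two predictions to inform each other.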
However, note that annotating both pixel-wise labeling maps and pose joints was not realized in previous human-centric datasets. Therefore, in this work, we also design a simplified network suited to general human parsing datasets and networks that need no pose annotations. To explicitly enforce the produced parsing results to be semantically consistent with the human pose and joint structures, we propose a novel structure-sensitive learning approach for human parsing. In addition to using the traditional pixel-wise part annotations as the supervision, we introduce a structure-sensitive loss to evaluate the quality of the predicted parsing results from a joint structure perspective. This means that a satisfactory parsing result should preserve a reasonable joint structure (e.g., the spatial layouts of human parts). We generate approximated human joints directly from the parsing annotations and use them as the supervision signal for the structure-sensitive loss. This self-supervised structure-sensitive network is a simplified version of our JPPNet, denoted as SS-JPPNet, which is appropriate for general human parsing datasets without pose annotations.
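As a rough illustration of this self-supervision, the sketch below approximates each joint as the center of mass of a group of part masks, renders the centers as Gaussian heatmaps, and turns the disagreement between the heatmaps derived from the predicted and ground-truth parsing maps into a weight on the parsing loss. The label grouping, the Gaussian rendering and the weighting formula are illustrative assumptions rather than the exact SS-JPPNet recipe.

    import numpy as np

    def part_center(parsing, labels):
        """Center of mass of the union of the given part labels; None if those parts are absent."""
        ys, xs = np.nonzero(np.isin(parsing, labels))
        return None if len(xs) == 0 else (xs.mean(), ys.mean())

    def joint_heatmaps(parsing, joint_defs, shape, sigma=6.0):
        """Render one Gaussian heatmap per approximated joint derived from a parsing map."""
        h, w = shape
        yy, xx = np.mgrid[0:h, 0:w]
        maps = np.zeros((len(joint_defs), h, w), dtype=np.float32)
        for k, labels in enumerate(joint_defs):
            c = part_center(parsing, labels)
            if c is not None:
                maps[k] = np.exp(-((xx - c[0]) ** 2 + (yy - c[1]) ** 2) / (2 * sigma ** 2))
        return maps

    # Hypothetical grouping of LIP part-label indices into approximate joints
    # (head-related labels, arms, legs, shoes); the real grouping may differ.
    JOINT_DEFS = [[1, 2, 4, 13], [14], [15], [16], [17], [18], [19]]

    def structure_sensitive_weight(pred_parsing, gt_parsing):
        """Scalar weight: larger when the joint structures implied by the two parsing maps disagree."""
        p = joint_heatmaps(pred_parsing, JOINT_DEFS, pred_parsing.shape)
        g = joint_heatmaps(gt_parsing, JOINT_DEFS, gt_parsing.shape)
        return 1.0 + float(np.sqrt(((p - g) ** 2).mean()))  # used to scale the pixel-wise parsing loss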
Our contributions are summarized in the following three aspects. 1) We propose a new large-scale benchmark and an evaluation server to advance human parsing and pose estimation research, in which 50,462 images with pixel-wise annotations of 19 semantic part labels and 16 body joints are provided. 2) By experimenting on our benchmark, we present detailed analyses of the existing human parsing and pose estimation approaches to obtain some insights into the successes and failures of these approaches and thoroughly explore the relationship between the two human-centric tasks. 3) We propose a novel joint human parsing and pose estimation network, which incorporates the multi-scale feature connections and iterative location refinement in an end-to-end framework to investigate efficient context modeling and make the parsing and pose tasks mutually beneficial. This unified framework achieves state-of-the-art performance for both human parsing and pose estimation tasks. The simplified network for the human parsing task with self-supervised structure-sensitive learning also significantly surpasses the previous methods on both the existing PASCAL-Person-Part dataset [9] and our new LIP dataset.

2 RELATED WORK

Human parsing and pose datasets: The commonly used publicly available datasets for human parsing and pose estimation are summarized in Table 1. For human parsing, the previous datasets were labeled with a limited number of images or categories. The largest dataset [26] to date only contains 17,000 fashion images with mostly upright fashion models. These small datasets are unsuitable for training models with complex appearance representations and multiple components [12], [20], [39], which could perform better. For human pose, the LSP dataset [19] only contains sports people, and it fails to cover real-life challenges. The MPII dataset [1] has more images and a wider coverage of activities that cover real-life challenges, such as truncation, occlusions, and variability of imaging conditions. However, this dataset only provides 2D pose annotations. J-HMDB [18] provides densely annotated image sequences and a larger number of videos for 21 human actions. Although the puppet mask and human pose are annotated in all 31,838 frames, detailed part segmentations are not labeled.

Our proposed LIP benchmark dataset is the first effort that focuses on the two human-centric tasks. Containing 50,462 images annotated with 20 parsing categories and 16 body joints, our LIP dataset is the largest and most comprehensive human parsing and pose dataset to date. Some other datasets in the vision community were dedicated to the tasks of clothes recognition, retrieval [30] and fashion modeling [38], whereas our LIP dataset focuses on human parsing and pose estimation.

Human parsing approaches: Recently, many research efforts have been devoted to human parsing [8], [26], [29], [37], [45], [47], [48]. For example, Liang et al. [26] proposed a novel Co-CNN architecture that integrates multiple levels of image contexts into a unified network. In addition to human parsing, there has also been increasing research interest in the part segmentation of other objects, such as animals or cars [32], [41], [43].
Fig. 3. The data distribution on the 19 semantic part labels in the LIP dataset.

Fig. 4. The numbers of images that show diverse visibilities in the LIP dataset, including occlusion, full body, upper body, lower body, head missing and back view.

To capture the rich structure information based on advanced CNN architectures, common solutions include the combination of CNNs and CRFs [7], [52] and the adoption of multi-scale feature representations [7], [8], [45]. Chen et al. [8] proposed an attention mechanism that learns to weight the multi-scale features at each pixel location.

Pose estimation approaches: Articulated human poses are generally modeled using a combination of a unary term and pictorial structures [2] or a graph model, e.g., a mixture of body parts [10], [34], [50]. With the introduction of DeepPose [40], which formulates pose estimation as a regression problem using a standard convolutional architecture, research on human pose estimation began to shift from classic approaches to deep networks. For example, Wei et al. [44] incorporated the inference of spatial correlations among body parts within ConvNets. Newell et al. proposed a stacked hourglass network [33] that uses repeated pooling and upsampling to learn the spatial distribution.

Some previous works [13], [46] explored human pose information to guide human parsing by generating “pose-guided” part segment proposals. Additionally, some works [4], [11] generated attention maps of body parts to guide pose estimation. To further utilize the advantages of parsing and pose and their relationships, our focus is a joint human parsing and pose estimation network, which can simultaneously predict parsing and pose with extremely high quality. Additionally, to leverage the human joint structure more effortlessly and efficiently, we simplify the network and propose a self-supervised structure-sensitive learning approach.

The rest of this paper is organized as follows. We present the analysis of existing human parsing and pose estimation datasets and then introduce our new LIP benchmark in Section 3. In Section 4, we present the empirical study of current methods based on our LIP benchmark and discuss the limitations of these methods. Then, to address the challenges raised by LIP, we propose a unified framework for simultaneous human parsing and pose estimation in Section 5. Finally, more detailed comparisons between our approach and state-of-the-art methods are exhibited in Section 6.

3 LOOK INTO PERSON BENCHMARK

In this section, we introduce our new “Look into Person (LIP)” dataset, a large-scale dataset that focuses on semantic understanding of human bodies and has several appealing properties. First, with 50,462 annotated images, LIP is an order of magnitude larger and more challenging than previous similar datasets [9], [26], [48]. Second, LIP is annotated with elaborate pixel-wise annotations with 19 semantic human part labels and one background label for human parsing, and with 16 body joint locations for pose estimation. Third, the images are collected from real-world scenarios and contain people appearing with challenging poses and viewpoints, heavy occlusions, various appearances and a wide range of resolutions. Furthermore, the background of the images in the LIP dataset is also more complex and diverse than that in previous datasets. Some examples are shown in Fig. 2. With the LIP dataset, we propose a new benchmark suite for human parsing and pose estimation together with a standard evaluation server, where the test set is kept secret to avoid overfitting.

3.1 Image Annotation

The images in the LIP dataset are cropped person instances from the Microsoft COCO [28] training and validation sets. We defined 19 human part or clothing labels for annotation, which are hat, hair, sunglasses, upper clothes, dress, coat, socks, pants, gloves, scarf, skirt, jumpsuit, face, right arm, left arm, right leg, left leg, right shoe, and left shoe, as well as a background label. Similarly, we provide rich annotations for human poses, where the positions and visibility of 16 main body joints are annotated. Following [1], we annotate joints in a “person-centric” manner, which means that the left/right joints refer to the left/right limbs of the person. At test time, this requires pose estimation to produce both a correct localization of the limbs of a person and the correct match to the left/right limb.

We implemented an annotation tool that generates multi-scale superpixels of images to speed up the annotation. More than 100 students were trained to carry out the annotation work, which lasted five months. We supervised the whole annotation process and checked the results periodically to control the annotation quality. Finally, we conducted a second-round check for each annotated image and
strictly and carefully selected 50,000 usable and well-annotated images from over 60,000 submitted images.
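The multi-scale superpixels mentioned above can be produced, for example, with the SLIC algorithm from scikit-image; the sketch below is a minimal illustration under that assumption. The segment counts, compactness value and file name are hypothetical, and this is not the authors' annotation tool.

    from skimage import io
    from skimage.segmentation import slic

    def multiscale_superpixels(image_path, segment_counts=(100, 400, 1600)):
        """Return SLIC superpixel label maps at several granularities for one image."""
        image = io.imread(image_path)
        # Coarse-to-fine partitions: an annotator can pick the scale that best fits a region.
        return [slic(image, n_segments=n, compactness=10) for n in segment_counts]

    # label_maps = multiscale_superpixels("person_0001.jpg")  # hypothetical file name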
3.2 Dataset Splits

In total, there are 50,462 images in the LIP dataset, including 19,081 full-body images, 13,672 upper-body images, 403 lower-body images, 3,386 head-missing images, 2,778 back-view images and 21,028 images with occlusions. We split the images into separate training, validation and test sets. Following random selection, we arrive at a unique split that consists of 30,462 training and 10,000 validation images with publicly available annotations, as well as 10,000 test images with annotations withheld for benchmarking purposes.

Furthermore, to stimulate multiple-human parsing research, we collect the images with multiple person instances in the LIP dataset to establish the first standard and comprehensive benchmark for multiple-human parsing and pose estimation. Our LIP multiple-human parsing and pose dataset contains 4,192 training, 497 validation and 458 test images, in which there are 5,147 multiple-person images in total.
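A seeded random split with the sizes listed above could be produced as in the following minimal sketch; the benchmark itself fixes the assignment through the released image lists, so this only illustrates the procedure.

    import random

    def split_dataset(image_ids, n_train=30462, n_val=10000, n_test=10000, seed=0):
        """Shuffle once with a fixed seed and carve out train/val/test id lists."""
        assert len(image_ids) == n_train + n_val + n_test
        ids = list(image_ids)
        random.Random(seed).shuffle(ids)
        return (ids[:n_train],
                ids[n_train:n_train + n_val],
                ids[n_train + n_val:])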
3.3 Dataset Statistics

In this section, we analyze the images and categories in the LIP dataset in detail. In general, face, arms, and legs are the most remarkable parts of a human body. However, human parsing aims to analyze every detailed region of a person, including different body parts and different categories of clothes. We therefore define 6 body parts and 13 clothes categories. Among these 6 body parts, we divide arms and legs into the left side and the right side for a more precise analysis, which also increases the difficulty of the task. For clothes classes, we include not only common clothes such as upper clothes, pants, and shoes but also infrequent categories, such as skirts and jumpsuits. Furthermore, small-scale accessories such as sunglasses, gloves, and socks are also taken into account. The numbers of images for each semantic part label are presented in Fig. 3.

In contrast to other human image datasets, the images in the LIP dataset contain diverse human appearances, viewpoints, and occlusions. Additionally, more than half of the images suffer from occlusions of different degrees. An occlusion is considered to have occurred if any of the semantic parts or body joints appear in the image but are occluded or invisible. In more challenging cases, the images contain person instances in a back view, which gives rise to more ambiguity in the left and right spatial layouts. The numbers of images of different appearances (i.e., occlusion, full body, upper body, head missing, back view and lower body) are summarized in Fig. 4.

4 EMPIRICAL STUDY OF STATE-OF-THE-ART

In this section, we analyze the performance of leading human parsing (or semantic object segmentation) and pose estimation approaches on our benchmark. We take advantage of our rich annotations and conduct a detailed analysis of the various factors that influence the results, such as appearance, foreshortening, and viewpoints. The goal of this analysis is to evaluate the robustness of the current approaches under the various challenges of human parsing and pose estimation and to identify the existing limitations in order to stimulate further research advances.

4.1 Human Parsing

In our analysis, we consider fully convolutional networks [31] (FCN-8s), a deep convolutional encoder-decoder architecture [3] (SegNet), deep convolutional nets with atrous convolution and multi-scale processing [7] (DeepLab (VGG-16), DeepLab (ResNet-101)) and an attention mechanism [8] (Attention), all of which have achieved excellent performance on semantic image segmentation in different ways and have publicly available code. For a fair comparison, we train each method on our LIP training set until the validation performance saturates, and we perform evaluation on the validation set and the test set. For the DeepLab methods, we remove the dense-CRF post-processing. Following [8], [45], we use the standard intersection over union (IoU) criterion and pixel-wise accuracy for evaluation.
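A minimal sketch of these standard metrics, as they appear in Tables 2 and 3 (overall pixel accuracy, mean per-class accuracy, and mean IoU), computed from a confusion matrix; this mirrors the common segmentation-evaluation recipe rather than the benchmark server's exact code.

    import numpy as np

    def confusion_matrix(pred, gt, num_classes=20):
        """Accumulate a num_classes x num_classes confusion matrix from flat label arrays."""
        mask = (gt >= 0) & (gt < num_classes)  # assumes predictions also lie in [0, num_classes)
        idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
        return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

    def parsing_metrics(conf):
        """Overall pixel accuracy, mean per-class accuracy, and mean IoU from a confusion matrix."""
        tp = np.diag(conf)
        overall_acc = tp.sum() / conf.sum()
        mean_acc = np.nanmean(tp / conf.sum(axis=1))            # recall per ground-truth class
        iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)   # intersection over union per class
        return float(overall_acc), float(mean_acc), float(np.nanmean(iou))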
4.1.1 Overall Performance Evaluation

We begin our analysis by reporting the overall human parsing performance of each approach, and the results are summarized in Table 2 and Table 3. On the LIP validation set, among the five approaches, DeepLab (ResNet-101) [7], with the deepest network, achieves the best result of 44.80% mean IoU. Benefiting from the attention model that softly weights the multi-scale features, Attention [8] also performs well with 42.92% mean IoU, whereas both FCN-8s [31] (28.29%) and SegNet [3] (18.17%) perform significantly worse. Similar performance is observed on the LIP test set. The interesting outcome of this comparison is that the achieved performance is substantially lower than the current best results on other segmentation benchmarks, such as PASCAL VOC [15]. This result suggests that detailed human parsing, with its small parts and diverse fine-grained labels, is more challenging than object-level segmentation and deserves more attention in the future.

TABLE 2
Comparison of human parsing performance with five state-of-the-art methods on the LIP validation set.

Method                     Overall accuracy   Mean accuracy   Mean IoU
SegNet [3]                            69.04           24.00      18.17
FCN-8s [31]                           76.06           36.75      28.29
DeepLab (VGG-16) [7]                  82.66           51.64      41.64
Attention [8]                         83.43           54.39      42.92
DeepLab (ResNet-101) [7]              84.09           55.62      44.80
JPPNet (with pose info)               86.39           62.32      51.37
SS-JPPNet                             84.36           54.94      44.73

TABLE 3
Comparison of human parsing performance with five state-of-the-art methods on the LIP test set.

Method                     Overall accuracy   Mean accuracy   Mean IoU
SegNet [3]                            69.10           24.26      18.37
FCN-8s [31]                           76.28           37.18      28.69
DeepLab (VGG-16) [7]                  82.89           51.53      41.56
Attention [8]                         83.56           54.28      42.97
DeepLab (ResNet-101) [7]              84.25           55.64      44.96
JPPNet (with pose info)               86.48           62.25      51.36
SS-JPPNet                             84.53           54.81      44.59
Fig. 5. Human parsing performance comparison evaluated on the LIP validation set with different appearances, including occlusion, full body, upper body, head missing and back view.

4.1.2 Performance Evaluation under Different Challenges

We further analyze the performance of each approach with respect to the following five challenging factors: occlusion, full body, upper body, head missing and back view (see Fig. 5). We evaluate the above five approaches on the LIP validation set, which contains 4,277 images with occlusions, 5,452 full-body images, 793 upper-body images, 112 head-missing images and 661 back-view images. As expected, the performance varies when the approaches are affected by different factors. Back view is clearly the most challenging case. For example, the IoU of Attention [8] decreases from 42.92% to 33.50%. The second most influential factor is the appearance of the head: the scores of all approaches are considerably lower on head-missing images than the average score on the entire set. The performance also greatly suffers from occlusions. The results on full-body images are the closest to the average level. By contrast, upper body is relatively the easiest case, where fewer semantic parts are present and the part regions are generally larger. From these results, we can conclude that the head (or face) is an important cue for the existing human parsing approaches. The probability of ambiguous results increases if the head part disappears from the image or the person is seen from the back. Moreover, the parts or clothes on the lower body are more difficult than those on the upper body because of the existence of small labels, such as shoes or socks. In this case, the body joint structure can play an effective role in guiding human parsing.

4.1.3 Per-class Performance Evaluation

To discuss and analyze each of the 20 labels in the LIP dataset in more detail, we further report the per-class IoU on the LIP validation set, as shown in Table 4. We observe that the results for labels with larger regions, such as face, upper clothes, coats, and pants, are considerably better than those for small-region labels, such as sunglasses, scarf, and skirt. DeepLab (ResNet-101) [7] and Attention [8] perform better on small labels thanks to the utilization of deep networks and multi-scale features.

4.1.4 Visualization Comparison

The qualitative comparisons of the five approaches on our LIP validation set are visualized in Fig. 6. We present example parsing results for the five challenging factor scenarios. For the upper-body image (a) with slight occlusion, the five approaches perform well with few errors. For the back-view image (b), all five methods mistakenly label the right arm as the left arm. The worst results occur for the head-missing image (c): SegNet [3] and FCN-8s [31] fail to recognize arms and legs, whereas DeepLab (VGG-16) [7] and Attention [8] make errors on the right and left arms, legs and shoes. Furthermore, severe occlusion (d) also greatly affects the performance. Moreover, as observed from (c) and (d), some of the results are unreasonable from the perspective of human body configuration (e.g., two shoes on one foot) because the existing approaches lack the consideration of body structures. In summary, human parsing is more difficult than general object segmentation. In particular, human body structures should receive more attention to strengthen the ability to predict human parts and clothes with more reasonable configurations. Consequently, we consider connecting human parsing results and the body joint structure to determine a better approach for human parsing.

4.2 Pose Estimation

Similarly, we consider three state-of-the-art methods for pose estimation: a sequential convolutional architecture [44] (CPM), a repeated bottom-up, top-down network [33] (Hourglass), and ResNet-101 with atrous convolutions [7], for which we retain the entire network and change the output layer to generate pose heatmaps. These approaches achieve top performance on the MPII [1] and LSP [19] datasets and can be trained on our LIP dataset using publicly available code. Again, we train each method on our LIP training set and evaluate on the validation set and the test set. Following MPII [1], the evaluation metric we use is the percentage of correct keypoints with respect to head size (PCKh). PCKh considers a candidate keypoint to be localized correctly if it falls within a matching threshold of 50% of the head segment length.
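The PCKh criterion just described can be written down in a few lines; the sketch below assumes simple NumPy array layouts for predictions, ground truth, per-person head-segment lengths and visibility flags, and only illustrates the metric rather than reproducing the benchmark server's implementation.

    import numpy as np

    def pckh(pred, gt, head_lengths, visible, thresh=0.5):
        """PCKh over a dataset.

        pred, gt:     (N, K, 2) arrays of predicted / ground-truth joint coordinates.
        head_lengths: (N,) head-segment length per person (e.g., from the head bounding box).
        visible:      (N, K) boolean mask of annotated joints.
        """
        dist = np.linalg.norm(pred - gt, axis=2)              # (N, K) pixel distances
        correct = dist <= thresh * head_lengths[:, None]      # within 50% of the head length
        return float(correct[visible].mean())                 # fraction of correctly localized joints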
4.2.1 Overall Performance Evaluation

We again begin our analysis by reporting the overall pose estimation performance of each approach, and the results are summarized in Table 5 and Table 6. On the LIP validation set, Hourglass [33] achieves the best result of 77.5% total PCKh, benefiting from its multiple hourglass modules and intermediate supervision. With a sequential composition of convolutional architectures to learn implicit spatial models, CPM [44] also obtains comparable performance. Interestingly, the achieved performance is substantially lower than the current best results on other pose estimation benchmarks, such as MPII [1]. This wide gap reflects the higher complexity and variability of our LIP dataset and the significant development potential of pose estimation research. Similar performance on the LIP test set is again consistent with our analysis.
Fig. 6. Visualized comparison of human parsing results on the LIP validation set. (a) The upper-body images. (b) The back-view images. (c) The head-missing images. (d) The images with occlusion. (e) The full-body images.

                                                                     TABLE 4
                     Performance comparison in terms of per-class IoU with five state-of-the-art methods on the LIP validation set.
        Method              hat    hair    gloves sunglasses u-clothes dress coat socks pants jumpsuit scarf skirt face l-arm r-arm l-leg r-leg   l-shoe   r-shoe    Bkg    Avg
       SegNet [3]          26.60   44.01     0.01    0.00      34.46 0.00 15.97 3.59 33.56      0.01   0.00 0.00 52.38 15.30 24.23 13.82 13.17      9.26     6.47   70.62   18.17
      FCN-8s [31]          39.79   58.96     5.32    3.08      49.08 12.36 26.82 15.66 49.41    6.48   0.00 2.16 62.65 29.78 36.63 28.12 26.05     17.76    17.70   78.02   28.29
  DeepLab (VGG-16) [7]     57.94   66.11    28.50   18.40      60.94 23.17 47.03 34.51 64.00 22.38 14.29 18.74 69.70 49.44 51.66 37.49 34.60       28.22    22.41   83.25   41.64
      Attention [8]        58.87   66.78    23.32   19.48      63.20 29.63 49.70 35.23 66.04 24.73 12.84 20.41 70.58 50.17 54.03 38.35 37.70       26.20    27.09   84.00   42.92
DeepLab (ResNet-101) [7]   59.76   66.22    28.76   23.91      64.95 33.68 52.86 37.67 68.05 26.15 17.44 25.23 70.00 50.42 53.89 39.36 38.27       26.95    28.36   84.09   44.80
 JPPNet (with pose info)   63.55   70.20    36.16   23.48      68.15 31.42 55.65 44.56 72.19 28.39 18.76 25.14 73.36 61.97 63.88 58.21 57.99       44.02    44.09   86.26   51.37
       SS-JPPNet           59.75   67.25    28.95   21.57      65.30 29.49 51.92 38.52 68.02 24.48 14.92 24.32 71.01 52.64 55.79 40.23 38.80       28.08    29.03   84.56   44.73

                                                                   TABLE 5
                         Comparison of human pose estimation performance with state-of-the-art methods on the LIP test set.

                         Method                               Head       Shoulder        Elbow        Wrist     Hip       Knee     Ankle          Total
                     ResNet-101 [7]                           91.2         84.4           78.5        75.7      62.8      70.1      70.6          76.8
                        CPM [44]                              90.8         85.1           78.7        76.1      64.7      70.5      71.2          77.3
                     Hourglass [33]                           91.1         85.3           78.9        76.2      65.0      70.2      72.2          77.6
                JPPNet (with parsing info)                    93.3         89.3           84.4        82.5      70.0      78.3      77.7          82.7

                                                                   TABLE 6
                      Comparison of human pose estimation performance with state-of-the-art methods on the LIP validation set.

                         Method                               Head       Shoulder        Elbow        Wrist     Hip       Knee     Ankle          Total
                     ResNet-101 [7]                           91.2         84.3           78.0        74.9      62.3      69.5      71.1          76.5
                        CPM [44]                              91.1         85.1           78.7        75.0      63.7      69.6      71.7          77.0
                     Hourglass [33]                           91.2         85.7           78.7        75.5      64.8      70.5      72.1          77.5
                JPPNet (with parsing info)                    93.2         89.3           84.6        82.2      69.9      78.0      77.3          82.5

the performance decreases as the complexity increases. However, there are interesting differences. The back-view factor influences the performance of all approaches the most: their scores drop by nearly 10% compared to the average score on the entire set. The second most influential factor is occlusion; for example, the PCKh of Hourglass [33] is 4.60% lower. These two factors are related to the visibility and orientation of heads in the images, which indicates that, similar to human parsing, the existing pose estimation methods strongly depend on the contextual information of the head or face. In this case, exploring and leveraging the correlation and complementation of human parsing and pose estimation is advantageous for reducing this type of dependency.

[Figure 7: PCKh(%) of ResNet-101, CPM, Hourglass and JPPNet on the Occlusion, Full body, Upper body and Back view subsets of the LIP validation set.]
Fig. 7. Pose estimation performance comparison evaluated on the LIP validation set with different appearances.

4.2.3 Visualization Comparison
The qualitative comparisons of the pose estimation results on our LIP validation set are visualized in Fig. 8. We select some challenging images with unusual appearances, truncations and occlusions to analyze the failure cases and obtain some inspiration.

First, for persons standing or sitting sideways, the existing approaches typically fail to predict their occluded body joints, such as the right arm in Col 1, the right leg in Col 2, and the right leg in Col 7. Second, for persons viewed from the back or with missing heads, the left and right arms (legs) are often improperly located, as in Cols 3 and 4. Moreover, for some images with unusual appearances where some limbs of the person are very close together (Cols 5 and 6), these methods generate ambiguous and irrational results. In particular, the performance on gray images (Col 8) is also far from satisfactory. Learning from these failure cases, we believe that pose estimation should rely on more instructive contextual information, such as the guidance from human parts with reasonable configurations.

Summarizing the analysis of human parsing and pose estimation, it is clear that, given the strong connection between these two human-centric tasks, exploiting their intrinsic consistency will benefit both. Consequently, we present a unified framework for simultaneous human parsing and pose estimation to explore this intrinsic correlation.

5 METHODS

5.1 Overview
In this section, we first summarize some insights about the limitations of existing approaches and then illustrate our joint human parsing and pose estimation network in detail. From the above detailed analysis, we obtain some insights into the human parsing and pose estimation tasks. 1) A major limitation of the existing human parsing approaches is the lack of consideration of human body configuration, which is mainly investigated in the human pose estimation problem. Meanwhile, the part maps produced by the detection network contain rich contextual information and structural part relationships, which can effectively guide the regression network to predict the locations of the body joints. Human parsing and pose estimation aim to label each image at different granularities, that is, pixel-wise semantic labeling versus joint-wise structure prediction. Pixel-wise labeling captures more detailed information, whereas the joint-wise structure provides more high-level guidance, which means that the two tasks are complementary. 2) As learned from the existing approaches, the coarse-to-fine scheme is widely used in both parsing and pose networks to improve accuracy. However, coarse-to-fine has two different meanings for the parsing and pose tasks. For parsing or segmentation, it means using the multi-scale segmentation or attention-to-zoom scheme [8] for more precise pixel-wise classification. Conversely, for the pose task, it indicates iterative displacement refinement, which is widely used in pose estimation [44]. It is reasonable to incorporate these two distinct coarse-to-fine schemes together in a unified network to further improve the parsing and pose results.

5.2 Joint Human Parsing and Pose Estimation Network
To utilize the coherent representation of human parsing and pose to promote each task, we propose a joint human parsing and pose estimation network, which incorporates the two distinct coarse-to-fine schemes, i.e., multi-scale features and iterative refinement, together. The framework architecture is illustrated in Fig. 9 and the detailed configuration is presented in Table 7. We denote our joint human parsing and pose estimation network as JPPNet.

In general, the basic network of the parsing framework is a deep residual network [17], while the pose framework prefers a stacked hourglass network [33]. In our joint framework, we use a shared residual network to extract human image features, which is more efficient and concise. Then, we have two distinct networks to generate parsing and pose features and results. They are followed by a refinement network, which takes features and results as input to produce more accurate segmentation and joint localization.

Feature extraction. We employ convolution with upsampled filters, or "atrous convolution" [7], as a powerful tool to repurpose ResNet-101 [17] for dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also effectively enlarges the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. The first four stages of ResNet-101 (i.e., Res-1 to Res-4) are shared in our framework, while the deeper convolutional layers are learned separately for the two tasks.

Parsing and pose subnet. We use ResNet-101 with atrous convolution as the basic parsing subnet, which contains atrous spatial pyramid pooling (ASPP) as the output layer to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields of view, thus capturing objects and image context at multiple scales. Furthermore, to generate the context used in the refinement stage, two convolutions follow Res-5. For the pose subnet, we simply add several 3 × 3 convolutional layers to the fourth stage (Res-4) of ResNet-101 to generate pose features and heatmaps.
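The ASPP output layer mentioned above can be pictured as a set of parallel atrous convolutions with different sampling rates whose responses are fused into per-class score maps. The PyTorch sketch below is a simplified stand-in under that reading; the exact rates, fusion and branch structure used in DeepLab/JPPNet may differ.

import torch.nn as nn

class ASPPHead(nn.Module):
    # Atrous spatial pyramid pooling: parallel 3x3 atrous convolutions with
    # different dilation rates, summed into per-class score maps.
    def __init__(self, in_ch=2048, num_classes=20, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, num_classes, kernel_size=3, padding=r, dilation=r)
            for r in rates])

    def forward(self, x):
        # each branch sees a different effective field of view
        return sum(branch(x) for branch in self.branches)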
Refinement network. We also design a simple but efficient refinement network, which is able to iteratively refine both parsing and pose results. We reintegrate the intermediate parsing and pose predictions back into the feature space by mapping them to a larger number of channels with an additional 1 × 1 convolution. Then, we have four convolutional layers with an incremental kernel size that varies from 3 to 9 to capture a sufficient local context and to increase the receptive field size, which is crucial for learning long-term relationships. Next is another 1 × 1 convolution to generate the features for the next refinement stage. To refine pose, we concatenate the remapped pose and parsing results and the pose features from the last stage. For parsing, we concatenate the two remapped results and parsing features and use ASPP again to generate parsing predictions. The entire joint network with refinement can be trained end-to-end, feeding the output of the former stage into the next. Following other pose estimation methods that have demonstrated strong performance with multiple iterative stages and intermediate supervision [5], [33], [44], we apply a loss upon the prediction of intermediate maps.

[Figure 8: qualitative pose estimation results on LIP validation images; rows correspond to ResNet-101, CPM, Hourglass and JPPNet.]
Fig. 8. Visualized comparison of pose estimation results with three state-of-the-art methods, including ResNet-101, CPM and Hourglass, on the LIP validation set.

TABLE 7
The detailed configuration of our JPPNet.

Component            Name      Input                               Kernel size   Channels
Part module          conv-1    Res-5                               3×3           512
                     conv-2    conv-1                              3×3           256
Joint module         conv-1    Res-4                               3×3           512
                     conv-2    conv-1                              3×3           512
                     conv-3    conv-2                              3×3           256
                     conv-4    conv-3                              3×3           256
                     conv-5    conv-4                              3×3           256
                     conv-6    conv-5                              3×3           256
                     conv-7    conv-6                              1×1           512
                     conv-8    conv-7                              1×1           16
Pose refinement      remap-1   pose maps                           1×1           128
                     remap-2   parsing maps                        1×1           128
                     concat    remap-1, remap-2, pose context      -             512
                     conv-1    concat                              3×3           512
                     conv-2    conv-1                              5×5           256
                     conv-3    conv-2                              7×7           256
                     conv-4    conv-3                              9×9           256
                     conv-5    conv-4                              1×1           256
                     conv-6    conv-5                              1×1           16
Parsing refinement   remap-1   pose maps                           1×1           128
                     remap-2   parsing maps                        1×1           128
                     concat    remap-1, remap-2, parsing context   -             512
                     conv-1    concat                              3×3           512
                     conv-2    conv-1                              5×5           256
                     conv-3    conv-2                              7×7           256
                     conv-4    conv-3                              9×9           256
                     conv-5    conv-4                              1×1           256
                     ASPP      conv-5                              -             20
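Reading the pose-refinement rows of Table 7 literally, one refinement stage can be sketched as follows in PyTorch. This is an illustrative paraphrase rather than the released implementation: the ReLU activations, the resolution-preserving paddings and the 256-channel pose context (which makes the concatenated input sum to 512 channels) are assumptions.

import torch
import torch.nn as nn

class PoseRefinement(nn.Module):
    # One pose-refinement stage: remap the intermediate pose and parsing maps
    # to 128 channels each, concatenate them with the pose context, then apply
    # convolutions with kernel sizes 3, 5, 7, 9 and two 1x1 layers.
    def __init__(self, num_joints=16, num_parts=20, context_ch=256):
        super().__init__()
        self.remap_pose = nn.Conv2d(num_joints, 128, kernel_size=1)
        self.remap_parsing = nn.Conv2d(num_parts, 128, kernel_size=1)
        self.body = nn.Sequential(
            nn.Conv2d(128 + 128 + context_ch, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 256, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 1), nn.ReLU(inplace=True))  # features for the next stage
        self.to_heatmaps = nn.Conv2d(256, num_joints, kernel_size=1)

    def forward(self, pose_maps, parsing_maps, pose_context):
        x = torch.cat([self.remap_pose(pose_maps),
                       self.remap_parsing(parsing_maps),
                       pose_context], dim=1)
        feats = self.body(x)
        return self.to_heatmaps(feats), feats

The parsing-refinement branch would follow the same pattern, except that the final 1×1 layer is replaced by an ASPP head producing the 20 parsing score maps.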
                                                                            loss.
                                                                                Self-supervised Structure-sensitive Loss: Generally, for
and the pose features from the last stage. For parsing, we                  the human parsing task, no other extensive information is
concatenate the two remapped results and parsing features                   provided except the pixel-wise annotations. This situation
and use ASPP again to generate parsing predictions. The                     means that rather than using augmentative information,
entire joint network with refinement can be trained end-                    we have to find a structure-sensitive supervision from the
to-end, feeding the output of the former stage into the                     parsing annotations. Because the human parsing results are
next. Following other pose estimation methods that have                     semantic parts with pixel-level labels, we attempt to explore
demonstrated strong performance with multiple iterative                     pose information contained in human parsing results. We

[Figure 9: JPPNet architecture diagram — a shared ResNet-101 backbone (Res-4, Res-5) feeds a pose subnet (joint module, joint context) and a parsing subnet (part module, part context, ASPP), followed by pose and parsing refinement stages that output pose maps and parsing maps.]
Fig. 9. The proposed JPPNet learns to incorporate the image-level context, body joint context, body part context and refined context into a unified network, which consists of shared feature extraction, pixel-wise label prediction, keypoint heatmap prediction and iterative refinement. Given an input image, we use ResNet-101 to extract the shared feature maps. Then, a part module and a joint module are appended to capture the part context and keypoint context while simultaneously generating parsing score maps and pose heatmaps. Finally, a refinement network is applied to the predicted maps and generated context to produce better results. To clearly observe the correlation between the parsing and pose maps, we combine the pose heatmaps and the parsing score maps into one map each. For better viewing of all figures in this paper, please see the original zoomed-in color pdf file.

We define 9 joints to construct a pose structure, which are the centers of the regions of the head, upper body, lower body, left arm, right arm, left leg, right leg, left shoe and right shoe. The head region is generated by merging the parsing labels of hat, hair, sunglasses and face. Similarly, upper clothes, coat and scarf are merged into the upper body, and pants and skirt are merged into the lower body. The remaining regions can be obtained directly from the corresponding labels. Some examples of generated human joints for different humans are shown in Fig. 11. Following [36], for each parsing result and corresponding ground truth, we compute the center points of the regions to obtain joints represented as heatmaps, which makes training smoother. Then, we use the Euclidean metric to evaluate the quality of the generated joint structures, which also reflects the structural consistency between the predicted parsing results and the ground truth. Finally, the pixel-wise segmentation loss is weighted by the joint structure loss, which becomes our structure-sensitive loss. Consequently, the overall human parsing networks become self-supervised with the structure-sensitive loss.

[Figure 10: SS-JPPNet pipeline — an input image passes through the parsing networks to produce a parsing result; joints generated from the parsing result and from the parsing ground truth yield a joint structure loss, which weights the segmentation loss to form the structure-sensitive loss.]
Fig. 10. Illustration of our SS-JPPNet for human parsing. An input image goes through parsing networks, including several convolutional layers, to generate the parsing results. The generated joints and ground truths of joints, represented as heatmaps, are obtained by computing the center points of corresponding regions in parsing maps, including head (H), upper body (U), lower body (L), right arm (RA), left arm (LA), right leg (RL), left leg (LL), right shoe (RS), and left shoe (LS). The structure-sensitive loss is generated by weighting the segmentation loss with the joint structure loss. For a clear observation, we combine the nine heatmaps into one map here.

[Figure 11: example images, parsing results and generated joint structures; colour legend: Face, UpperClothes, Hair, RightArm, Pants, LeftArm, RightShoe, LeftShoe, Hat, Coat, RightLeg, LeftLeg, Socks.]
Fig. 11. Some examples of self-supervised human joints generated from our parsing results for different bodies.

Formally, given an image $I$, we define a list of joint configurations $C_I^P = \{c_i^p \mid i \in [1, N]\}$, where $c_i^p$ is the heatmap of the $i$-th joint computed according to the parsing result map. Similarly, $C_I^{GT} = \{c_i^{gt} \mid i \in [1, N]\}$ is obtained from the corresponding parsing ground truth. Here, $N$ is a variable decided by the human bodies in the input images, and it is equal to 9 for a full-body image. For the joints missed in the image, we simply replace the heatmaps with maps filled with zeros. The joint structure loss is the Euclidean (L2) loss, which is calculated as follows:

$L_{Joint} = \frac{1}{2N} \sum_{i=1}^{N} \lVert c_i^p - c_i^{gt} \rVert_2^2$    (1)

The final structure-sensitive loss, denoted as $L_{Structure}$, is the combination of the joint structure loss and the parsing segmentation loss, and it is calculated as follows:

$L_{Structure} = L_{Joint} \cdot L_{Parsing}$    (2)

where $L_{Parsing}$ is the pixel-wise softmax loss calculated based on the parsing annotations.
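A minimal sketch of Eqs. (1) and (2), assuming the nine joints are taken as the centres of the merged part regions and rendered as Gaussian heatmaps (the paper follows [36] for the heatmap representation, so the label grouping ids and the Gaussian width below are illustrative assumptions):

import numpy as np

# assumed grouping of LIP part labels into the nine self-supervised joints;
# the exact label ids are an illustration, not the released mapping
JOINT_GROUPS = {
    "head":       [1, 2, 4, 13],      # hat, hair, sunglasses, face
    "upper_body": [5, 7, 11],         # upper-clothes, coat, scarf
    "lower_body": [9, 12],            # pants, skirt
    "left_arm":   [14], "right_arm": [15],
    "left_leg":   [16], "right_leg": [17],
    "left_shoe":  [18], "right_shoe": [19],
}

def joint_heatmaps(label_map, sigma=7.0):
    # Compute region centres from a parsing label map and render them as
    # Gaussian heatmaps; a missing joint yields an all-zero map.
    h, w = label_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    maps = []
    for labels in JOINT_GROUPS.values():
        mask = np.isin(label_map, labels)
        if mask.any():
            cy, cx = np.argwhere(mask).mean(axis=0)
            maps.append(np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2)))
        else:
            maps.append(np.zeros((h, w)))
    return np.stack(maps)                              # (9, H, W)

def structure_sensitive_loss(pred_labels, gt_labels, parsing_loss):
    # L_Structure = L_Joint * L_Parsing, with L_Joint the squared L2 distance
    # between heatmaps from the prediction and the ground truth (Eqs. 1-2).
    c_p, c_gt = joint_heatmaps(pred_labels), joint_heatmaps(gt_labels)
    n = c_p.shape[0]
    l_joint = np.sum((c_p - c_gt) ** 2) / (2 * n)
    return l_joint * parsing_loss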
We name this learning strategy "self-supervised" because the above structure-sensitive loss can be generated from existing parsing results without any extra information. Our self-supervised structure-sensitive JPPNet (SS-JPPNet) thus has excellent adaptability and extensibility, and it can be injected into any advanced network to help incorporate rich high-level knowledge about human joints from a global perspective.

6 EXPERIMENTS

6.1 Experimental Settings
Network architecture: We utilize the publicly available model DeepLab (ResNet-101) [7] as the basic architecture of our JPPNet, which employs atrous convolution, multi-scale inputs with max-pooling to merge the results from all scales, and atrous spatial pyramid pooling. For SS-JPPNet, our basic network is Attention [8] due to its leading accuracy and competitive efficiency.

Training: To train JPPNet, the input image is scaled to 384 × 384. We first train ResNet-101 on the human parsing task for 30 epochs using the pre-trained models and network settings from [7]. Then, we train the joint framework end-to-end for another 30 epochs. We apply data augmentation, including randomly scaling the input images (from 0.75 to 1.25), random cropping and random left-right flipping during training.

When training SS-JPPNet, we use the pre-trained models and network settings provided by DeepLab [7]. The scale of the input images is fixed to 321 × 321 for training networks based on Attention [8]. Two training steps are employed. First, we train the basic network on our LIP dataset for 30 epochs. Then, we apply the "self-supervised" strategy to fine-tune our model with the structure-sensitive loss for approximately 20 epochs. We use both human parsing and pose annotations to train JPPNet and only parsing labels for SS-JPPNet.
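For concreteness, the training-time augmentation described above could be sketched as follows; the 384 × 384 crop matches the JPPNet input size, whereas the padding value, interpolation modes and the handling of left/right label swapping under flipping are assumptions.

import random
import torch.nn.functional as F

def augment(image, label, crop=384):
    # image: (3, H, W) float tensor, label: (H, W) long tensor.
    # Random scale in [0.75, 1.25], random crop, random horizontal flip.
    s = random.uniform(0.75, 1.25)
    image = F.interpolate(image[None], scale_factor=s, mode="bilinear",
                          align_corners=False)[0]
    label = F.interpolate(label[None, None].float(), scale_factor=s,
                          mode="nearest")[0, 0].long()
    ph, pw = max(crop - image.shape[1], 0), max(crop - image.shape[2], 0)
    image = F.pad(image, (0, pw, 0, ph))               # pad if smaller than crop
    label = F.pad(label, (0, pw, 0, ph), value=255)    # 255 = assumed ignore label
    y = random.randint(0, image.shape[1] - crop)
    x = random.randint(0, image.shape[2] - crop)
    image = image[:, y:y + crop, x:x + crop]
    label = label[y:y + crop, x:x + crop]
    if random.random() < 0.5:                          # horizontal flip
        image, label = image.flip(-1), label.flip(-1)
        # a faithful implementation also swaps left/right part labels (and joints) here
    return image, label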
Inference: To stabilize the predictions, we perform inference on multi-scale inputs (with scales = 0.75, 0.5, 1.25) and also on left-right flipped images. In particular, we average the probabilities from each scale and the flipped images to obtain the final result, which is the same for predicting both parsing and pose. The difference is that we utilize the predictions of all stages for parsing, but for pose, we only use the results of the last stage.
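A sketch of this multi-scale, flip-averaged inference for the parsing branch is given below; the model interface (returning per-pixel class logits) and the treatment of left/right classes after flipping are assumptions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_flip_parsing(model, image, scales=(0.75, 0.5, 1.25)):
    # image: (1, 3, H, W). Average class probabilities over rescaled and
    # horizontally flipped inputs, as in the inference setting above.
    h, w = image.shape[2:]
    prob_sum, count = 0.0, 0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        for flip in (False, True):
            inp = x.flip(-1) if flip else x
            logits = model(inp)                        # (1, C, h', w'), assumed output
            if flip:
                logits = logits.flip(-1)               # undo the flip
                # a faithful version would also swap left/right classes here
            logits = F.interpolate(logits, size=(h, w), mode="bilinear",
                                   align_corners=False)
            prob_sum = prob_sum + logits.softmax(dim=1)
            count += 1
    return prob_sum / count                            # averaged probabilities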
6.2 Results and Comparisons

6.2.1 Human Parsing
We compare our proposed approach with strong baselines on the LIP dataset, and we further evaluate SS-JPPNet on another public human parsing dataset.

LIP dataset: We report the results and the comparisons with five state-of-the-art methods on the LIP validation set and test set in Table 2 and Table 3. On the validation set, the proposed JPPNet framework improves the best performance from 44.80% to 51.37%. The simplified architecture can also provide a substantial enhancement in average IoU: 3.09% better than DeepLab (VGG-16) [7] and 1.81% better than Attention [8]. On the test set, JPPNet also considerably outperforms the other baselines. This superior performance demonstrates the effectiveness of our joint parsing and pose networks, which incorporate the body joint structure into the pixel-wise prediction.

In Fig. 5, we show the results with respect to the different challenging factors on our LIP validation set. With our unified framework that models the contextual information of body parts and joints, the performance on all types is improved, which demonstrates that the human joint structure is conducive to the human parsing task.

We further report per-class IoU on the LIP validation set to verify the detailed effectiveness of our approach, as presented in Table 4. With the consideration of human body joints, we achieve the best performance on almost all the classes. As observed from the reported results, the proposed JPPNet significantly improves the performance on labels such as arms, legs and shoes, which demonstrates its ability to resolve the left-right ambiguity. Furthermore, labels covering small regions, such as socks and gloves, are predicted with higher IoUs. This improvement further demonstrates the effectiveness of the unified framework, particularly for small labels.

TABLE 8
Human parsing performance comparison in terms of mean IoU. Left: different test sets. Right: different sizes of objects.

Method                      ATR      LIP      small    medium   large
SegNet [3]                  15.79    21.79    16.53    18.58    18.18
FCN-8s [31]                 34.44    32.28    22.37    29.41    28.09
DeepLab (VGG-16) [7]        48.64    43.97    28.77    40.74    43.02
Attention [8]               49.35    45.38    31.71    41.61    44.90
DeepLab (ResNet-101) [7]    53.28    46.99    31.70    43.14    47.62
JPPNet (with pose info)     54.45    53.99    44.56    50.52    52.58
SS-JPPNet                   52.69    46.85    33.48    43.12    46.73

TABLE 9
Comparison of human parsing performance with four state-of-the-art methods on the PASCAL-Person-Part dataset [9].

Method                  head    torso   u-arms  l-arms  u-legs  l-legs  Bkg     Avg
DeepLab-LargeFOV [7]    78.09   54.02   37.29   36.85   33.73   29.61   92.85   51.78
HAZN [45]               80.79   59.11   43.05   42.76   38.99   34.46   93.59   56.11
Attention [8]           81.47   59.06   44.15   42.50   38.28   35.62   93.65   56.39
LG-LSTM [25]            82.72   60.99   45.40   47.76   42.33   37.96   88.63   57.97
SS-JPPNet               83.26   62.40   47.80   45.58   42.32   39.48   94.68   59.36
                                                                TABLE 10
 Comparison of human pose estimation performance of the models trained on the LIP training set and evaluated on the MPII training set (11431
                                                         single person images).

                        Method                   Head     Shoulder      Elbow      Wrist      Hip    Knee    Ankle      Total
                    ResNet-101 [7]               89.2       86.7         79.5      77.7       75.5   66.7     61.8      77.9
                       CPM [44]                  86.6       83.6         75.8      72.1       70.9   62.0     59.1      74.0
                    Hourglass [33]               86.4       84.7         77.5      73.9       74.0   63.3     58.4      75.2
               JPPNet (with parsing info)        90.4       91.7         86.4      84.0       82.5   76.5     71.3      84.1
                                                                TABLE 11
        Human pose estimation comparison between different variants of the proposed JPPNet on the LIP test set using the PCKh metric.

                          Method            Head     Shoulder      Elbow      Wrist    Hip      Knee    Ankle     Total
                            Joint           93.1       87.2         81.1      79.3     66.8     72.6     72.9     79.6
                        Joint + MSC         92.9       88.1         82.8      81.0     67.8     74.3     75.7     80.9
                         Joint + S1         93.5       88.8         83.6      81.6     70.2     76.4     76.8     82.1
                     Joint + MSC + S1       93.4       88.9         84.2      82.1     70.8     77.2     77.3     82.5
                     Joint + MSC + S2       93.3       89.3         84.4      82.5     70.0     78.3     77.7     82.7

For a better understanding of our LIP dataset, we train all methods on LIP and evaluate them on ATR [26], as reported in Table 8 (left). As ATR contains 18 categories while LIP has 20, we test the models on the 16 common categories (hat, hair, sunglasses, upper clothes, dress, pants, scarf, skirt, face, right arm, left arm, right leg, left leg, right shoe, left shoe, and background). In general, the performance on ATR is better than that on LIP because the LIP dataset contains instances with more diverse poses, appearance patterns, occlusions and resolution issues, which is more consistent with real-world situations.

Following the MSCOCO dataset [28], we have conducted an empirical analysis on different object sizes, i.e., small (area < 153²), medium (153² ≤ area < 321²) and large (area ≥ 321²). The results of the five baselines and the proposed methods are reported in Table 8 (right). As shown, our methods show substantially superior performance for different sizes of objects, thus further demonstrating the advantage of incorporating the human body structure into the parsing model.
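For reference, the size buckets used in Table 8 (right) reduce to a simple thresholding of the instance pixel area:

def size_bucket(area):
    # small: area < 153^2, medium: 153^2 <= area < 321^2, large: area >= 321^2
    if area < 153 ** 2:
        return "small"
    if area < 321 ** 2:
        return "medium"
    return "large"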
PASCAL-Person-Part dataset [9]. The public PASCAL-Person-Part dataset, with 1,716 images for training and 1,817 for testing, focuses on human part segmentation as annotated by [9]. Following [8], [45], the annotations are merged into six person part classes and one background class, namely head, torso, upper/lower arms and upper/lower legs. We train and evaluate all methods using the training and testing data of the PASCAL-Person-Part dataset [9]. Table 9 shows the performance of our model and comparisons with four state-of-the-art methods on the standard IoU criterion. Our SS-JPPNet significantly outperforms the four baselines. For example, our best model achieves 59.36% IoU, which is 7.58% better than DeepLab-LargeFOV [7] and 2.97% better than Attention [8]. This large improvement demonstrates that our self-supervised strategy is significantly beneficial for the human parsing task.

6.2.2 Pose Estimation
LIP dataset: Table 5 and Table 6 report the comparison of the PCKh performance of our JPPNet and previous state-of-the-art methods at a normalized distance of 0.5. On the LIP test set, our method achieves a state-of-the-art PCKh score of 82.7%. In particular, for the most challenging body parts, e.g., hip and ankle, our method achieves 5.0% and 5.5% improvements, respectively, compared with the closest competitor. Similar improvements also occur on the validation set.

We present the results with respect to different challenging factors on our LIP validation set in Fig. 7. As expected, with our unified architecture, the results for all different appearances become better, demonstrating the positive effect of human parsing on pose estimation.

MPII Human Pose dataset [1]: To be more convincing, we also perform evaluations on the MPII dataset. The MPII dataset contains approximately 25,000 images, where each person is annotated with 16 joints. The images are extracted from YouTube videos, where the contents are everyday human activities. There are 18079 images in the training set, including 11431 single person images. We evaluate the models trained on our LIP training set on these 11431 single person images from MPII, as presented in Table 10. The margin between our approach and the others provides evidence of the higher generalization ability of our proposed JPPNet model.

6.3 Ablation Studies of JPPNet
We further evaluate the effectiveness of the two coarse-to-fine schemes of JPPNet, namely the multi-scale features and the iterative refinement. "Joint" denotes the JPPNet without multi-scale features ("MSC") or refinement networks. "S1" means one refinement stage and "S2" denotes two stages. The human parsing and pose estimation results are shown in Table 12 and Table 11. From the comparisons, we can learn that multi-scale features bring a large improvement for human parsing but only a slight one for pose estimation. In contrast, pose estimation considerably benefits from iterative refinement, which is not as helpful for human parsing, as two refinement stages decrease the parsing performance.

6.4 Qualitative Comparison
Human parsing: The qualitative comparisons of the parsing results on the LIP validation set are visualized in Fig. 6. As can be observed from these visual comparisons, our methods output more semantically meaningful and precise