VITON: An Image-based Virtual Try-on Network


Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, Larry S. Davis
University of Maryland, College Park
{xintong,zxwu,zhewu,richyu,lsd}@umiacs.umd.edu

arXiv:1711.08447v4 [cs.CV] 12 Jun 2018

Abstract

We present an image-based Virtual Try-On Network (VITON) without using 3D information in any form, which seamlessly transfers a desired clothing item onto the corresponding region of a person using a coarse-to-fine strategy. Conditioned upon a new clothing-agnostic yet descriptive person representation, our framework first generates a coarse synthesized image with the target clothing item overlaid on that same person in the same pose. We further enhance the initial blurry clothing area with a refinement network. The network is trained to learn how much detail to utilize from the target clothing item, and where to apply it to the person, in order to synthesize a photo-realistic image in which the target item deforms naturally with clear visual patterns. Experiments on our newly collected dataset demonstrate its promise in the image-based virtual try-on task over state-of-the-art generative models.¹

Figure 1: Virtual try-on results generated by our method. Each row shows a person virtually trying on different clothing items. Our model naturally renders the items onto a person while retaining her pose and preserving detailed characteristics of the target clothing items.

1. Introduction

Recent years have witnessed increasing demand for online shopping of fashion items. Online apparel and accessories sales in the US are expected to reach 123 billion in 2022, up from 72 billion in 2016 [1]. Despite the convenience online fashion shopping provides, consumers are concerned about how a particular fashion item in a product image would look on them when buying apparel online. Thus, allowing consumers to virtually try on clothes will not only enhance their shopping experience, transforming the way people shop for clothes, but also save costs for retailers. Motivated by this, various virtual fitting rooms and mirrors have been developed by companies such as TriMirror and Fits Me. However, the key enabling factor behind them is the use of 3D measurements of body shape, either captured directly by depth cameras [40] or inferred from a 2D image using training data [4, 45]. While these 3D modeling techniques enable realistic clothing simulations on the person, the high costs of installing hardware and collecting 3D annotated data inhibit their large-scale deployment.

We present an image-based virtual try-on approach, relying merely on plain RGB images without leveraging any 3D information. Our goal is to synthesize a photo-realistic new image by overlaying a product image seamlessly onto the corresponding region of a clothed person (as shown in Figure 1). The synthetic image is expected to be perceptually convincing, meeting the following desiderata: (1) the body parts and pose of the person are the same as in the original image; (2) the clothing item in the product image deforms naturally, conditioned on the pose and body shape of the person; (3) detailed visual patterns of the desired product are clearly visible, including not only low-level features like color and texture but also complicated graphics like embroidery, logos, etc. The non-rigid nature of clothes, which are frequently subject to deformations and occlusions, poses a significant challenge to satisfying these requirements simultaneously, especially without 3D information.

¹ Models and code are available at https://github.com/xthan/VITON
Conditional Generative Adversarial Networks (GANs), which have demonstrated impressive results on image generation [37, 26], image-to-image translation [20] and editing tasks [49], seem to be a natural approach for addressing this problem. In particular, they minimize an adversarial loss so that samples generated from a generator are indistinguishable from real ones as determined by a discriminator, conditioned on an input signal [37, 33, 20, 32]. However, they can only transform information like object classes and attributes roughly, and are unable to generate graphic details or accommodate geometric changes [50]. This limits their ability in tasks like virtual try-on, where visual details and realistic deformations of the target clothing item are required in generated samples.

To address these limitations, we propose a virtual try-on network (VITON), a coarse-to-fine framework that seamlessly transfers a target clothing item in a product image to the corresponding region of a clothed person in a 2D image. Figure 2 gives an overview of VITON. In particular, we first introduce a clothing-agnostic representation consisting of a comprehensive set of features to describe different characteristics of a person. Conditioned on this representation, we employ a multi-task encoder-decoder network to generate a coarse synthetic clothed person in the same pose wearing the target clothing item, together with a corresponding clothing region mask. The mask is then used as guidance to warp the target clothing item to account for deformations. Furthermore, we utilize a refinement network that is trained to learn how to composite the warped clothing item onto the coarse image, so that the desired item is transferred with natural deformations and detailed visual patterns. To validate our approach, we conduct a user study on our newly collected dataset, and the results demonstrate that VITON generates more realistic and appealing virtual try-on results, outperforming state-of-the-art methods.

2. Related Work

Fashion analysis. Extensive studies have been conducted on fashion analysis due to its huge profit potential. Most existing methods focus on clothing parsing [44, 28], clothing recognition by attributes [31], matching clothing seen on the street to online products [30, 14], fashion recommendation [19], visual compatibility learning [43, 16], and fashion trend prediction [2]. Compared to these lines of work, we focus on virtual try-on with only 2D images as input. Our task is also more challenging than recent work on interactive search that simply modifies attributes (e.g., color and textures) of a clothing item [25, 48, 15], since virtual try-on requires preserving the details of a target clothing image as much as possible, including exactly the same style, embroidery, logo, text, etc.

Image synthesis. GANs [12] are one of the most popular deep generative models for image synthesis, and have demonstrated promising results in tasks like image generation [8, 36] and image editing [49, 34]. To incorporate desired properties in generated samples, researchers also utilize different signals, in the form of class labels [33], text [37], attributes [41], etc., as priors to condition the image generation process. There are a few recent studies investigating the problem of image-to-image translation using conditional GANs [20], which transform a given input image into another one with a different representation, for example producing an RGB image from its corresponding edge map, semantic label map, etc., or vice versa. Recently, Chen and Koltun [6] trained a CNN using a regression loss as an alternative to GANs for this task, without adversarial training. These methods are able to produce photo-realistic images, but have limited success when geometric changes occur [50]. Instead, we propose a refinement network that pays attention to clothing regions and deals with clothing deformations for virtual try-on.

In the context of image synthesis for fashion applications, Yoo et al. [46] generated a clothed person conditioned on a product image, and vice versa, regardless of the person's pose. Lassner et al. [26] described a generative model of people in clothing, but it is not clear how to control the fashion items in the generated results. A more closely related work is FashionGAN [51], which replaces a fashion item on a person with a new one specified by a text description. In contrast, we are interested in the precise replacement of the clothing item in a reference image with a target item, and address this problem with a novel coarse-to-fine framework.

Virtual try-on. There is a large body of work on virtual try-on, mostly conducted in computer graphics. Guan et al. proposed DRAPE [13] to simulate 2D clothing designs on 3D bodies in different shapes and poses. Hilsmann and Eisert [18] retextured the garment dynamically based on a motion model for real-time visualization in a virtual mirror environment. Sekine et al. [40] introduced a virtual fitting system that adjusts 2D clothing images to users by inferring their body shapes with depth images. Recently, Pons-Moll et al. [35] utilized a multi-part 3D model of clothed bodies for clothing capture and retargeting. Yang et al. [45] recovered a 3D mesh of the garment from a single-view 2D image, which is further re-targeted to other human bodies. In contrast to relying on 3D measurements to perform precise clothes simulation, our work focuses on synthesizing a perceptually correct photo-realistic image directly from 2D images, which is more computationally efficient. In computer vision, limited work has explored the task of virtual try-on. Recently, Jetchev and Bergmann [21] proposed a conditional analogy GAN to swap fashion articles. However, during testing, it requires the product images of both the target item and the original item on the person, which makes it infeasible in practical scenarios. Moreover, without injecting any person representation or explicitly considering deformations, it fails to generate photo-realistic virtual try-on results.
3. VITON

The goal of VITON is, given a reference image I with a clothed person and a target clothing item c, to synthesize a new image Î where c is transferred naturally onto the corresponding region of the same person, whose body parts and pose information are preserved. Key to a high-quality synthesis is learning a proper transformation from product images to clothes on the body. A straightforward approach is to leverage training data of a person in a fixed pose wearing different clothes, together with the corresponding product images, which, however, is usually difficult to acquire.

In a practical virtual try-on scenario, only a reference image and a desired product image are available at test time. Therefore, we adopt the same setting for training, where a reference image I of a person wearing c and the product image of c are given as inputs (we will use c to refer to the product image of c in the remainder of the paper). The problem then becomes: given the product image c and the person's information, how to learn a generator that not only reproduces I during training but, more importantly, is able to generalize at test time, synthesizing a perceptually convincing image with an arbitrary desired clothing item.

To this end, we first introduce a clothing-agnostic person representation (Sec 3.1). We then synthesize the reference image with an encoder-decoder architecture (Sec 3.2), conditioned on the person representation as well as the target clothing image. The resulting coarse result is further improved to account for detailed visual patterns and deformations with a refinement network (Sec 3.3). The overall framework is illustrated in Figure 2.

Figure 2: An overview of VITON. VITON consists of two stages: (a) an encoder-decoder generator stage (Sec 3.2), and (b) a refinement stage (Sec 3.3).

3.1. Person Representation

A main technical challenge of virtual try-on synthesis is to deform the target clothing image to fit the pose of a person. To this end, we introduce a clothing-agnostic person representation, which contains a set of features (Figure 3), including pose, body parts, and face and hair, as a prior to constrain the synthesis process.

Figure 3: A clothing-agnostic person representation. Given a reference image I, we extract the pose, body shape, and face and hair regions of the person, and use this information as part of the input to our generator.

Pose heatmap. Variations in human poses lead to different deformations of clothing, and hence we explicitly model pose information with a state-of-the-art pose estimator [5]. The computed pose of a person is represented as the coordinates of 18 keypoints. To leverage their spatial layout, each keypoint is further transformed into a heatmap, with an 11 × 11 neighborhood around the keypoint filled in with ones and zeros elsewhere. The heatmaps from all keypoints are then stacked into an 18-channel pose heatmap.

Human body representation. The appearance of clothing highly depends on body shape, and thus how to transfer the target fashion item depends on the locations of different body parts (e.g., arms or torso) and on the body shape. A state-of-the-art human parser [11] is thus used to compute a human segmentation map, where different regions represent different parts of the human body like arms, legs, etc. We further convert the segmentation map to a 1-channel binary mask, where ones indicate the human body (except for face and hair) and zeros elsewhere. This binary mask, derived directly from I, is downsampled to a lower resolution (16 × 12, as shown in Figure 3) to avoid artifacts when the body shape and target clothing conflict, as in [51].

Face and hair segment. To maintain the identity of the person, we incorporate physical attributes like face, skin color, hair style, etc. We use the human parser [11] to extract the RGB channels of the face and hair regions of the person, injecting identity information when generating new images.

Finally, we resize these three feature maps to the same resolution and concatenate them to form a clothing-agnostic person representation p ∈ R^{m×n×k}, where m = 256 and n = 192 denote the height and width of the feature map, and k = 18 + 1 + 3 = 22 is the number of channels. The representation contains abundant information about the person, upon which convolutions are performed to model their relations. Note that our representation is more detailed than previous work [32, 51].
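To make the construction concrete, the sketch below assembles such a 256 × 192 × 22 representation with NumPy, assuming the 18 keypoints and the human parsing map have already been produced by off-the-shelf estimators such as [5] and [11]; the parsing label IDs, helper names, and the bilinear resizing choices here are illustrative assumptions, not the released implementation.

```python
import numpy as np
from PIL import Image

H, W = 256, 192  # m x n resolution used throughout the paper

def pose_heatmaps(keypoints, size=11):
    """18 binary heatmaps, each with an 11x11 block of ones around a keypoint."""
    maps = np.zeros((H, W, 18), dtype=np.float32)
    r = size // 2
    for k, (x, y) in enumerate(keypoints):      # keypoints in (x, y) pixel coordinates
        if x < 0 or y < 0:                      # missing detection: leave the channel empty
            continue
        y0, y1 = max(0, int(y) - r), min(H, int(y) + r + 1)
        x0, x1 = max(0, int(x) - r), min(W, int(x) + r + 1)
        maps[y0:y1, x0:x1, k] = 1.0
    return maps

def body_shape_mask(parsing, face_hair_labels, background_label=0):
    """1-channel body mask (face/hair excluded), blurred via a 16x12 downsample."""
    body = (parsing != background_label) & ~np.isin(parsing, face_hair_labels)
    small = Image.fromarray(body.astype(np.uint8) * 255).resize((12, 16))
    big = small.resize((W, H), Image.BILINEAR)          # back to 256x192 before concatenation
    return (np.array(big, dtype=np.float32) / 255.0)[..., None]

def face_hair_rgb(image, parsing, face_hair_labels):
    """RGB pixels of the face and hair regions only; everything else zeroed out."""
    keep = np.isin(parsing, face_hair_labels)[..., None]
    return image.astype(np.float32) * keep

def person_representation(image, keypoints, parsing, face_hair_labels=(1, 2)):
    """Concatenate pose (18) + body shape (1) + face/hair (3) channels -> 256x192x22.
    face_hair_labels are hypothetical parser label IDs for this illustration."""
    return np.concatenate([
        pose_heatmaps(keypoints),
        body_shape_mask(parsing, face_hair_labels),
        face_hair_rgb(image, parsing, face_hair_labels),
    ], axis=-1)
```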
3.2. Multi-task Encoder-Decoder Generator

Given the clothing-agnostic person representation p and the target clothing image c, we propose to synthesize the reference image I through reconstruction, such that a natural transfer from c to the corresponding region of p can be learned. In particular, we utilize a multi-task encoder-decoder framework that generates a clothed person image along with a clothing mask of the person. In addition to guiding the network to focus on the clothing region, the predicted clothing mask will be further utilized to refine the generated result, as will be discussed in Sec 3.3. The encoder-decoder is a general U-net architecture [38] with skip connections that directly share information between layers through bypassing connections.

Formally, let GC denote the function approximated by the encoder-decoder generator. It takes the concatenated c and p as its input and generates a 4-channel output (I′, M) = GC(c, p), where the first 3 channels represent a synthesized image I′ and the last channel M represents a segmentation mask of the clothing region, as shown at the top of Figure 2. We wish to learn a generator such that I′ is close to the reference image I and M is close to M0 (M0 is the pseudo ground truth clothing mask predicted by the human parser on I). A simple way to achieve this is to train the network with an L1 loss, which generates decent results when the target is a binary mask like M0. However, when the desired output is a colored image, an L1 loss tends to produce blurry images [20]. Following [22, 27, 7], we utilize a perceptual loss that models the distance between the corresponding feature maps of the synthesized image and the ground truth image, computed by a visual perception network. The loss function of the encoder-decoder can now be written as the sum of a perceptual loss and an L1 loss:

    L_{G_C} = \sum_{i=0}^{5} \lambda_i \| \phi_i(I') - \phi_i(I) \|_1 + \| M - M_0 \|_1,    (1)

where φ_i(y) in the first term is the feature map of image y at the i-th layer of the visual perception network φ, which is a VGG19 [42] network pre-trained on ImageNet. For layers i ≥ 1, we utilize 'conv1_2', 'conv2_2', 'conv3_2', 'conv4_2', 'conv5_2' of the VGG model, while for layer 0 we directly use the RGB pixel values. The hyperparameter λ_i controls the contribution of the i-th layer to the total loss. The perceptual loss forces the synthesized image to match the RGB values of the ground truth image as well as their activations at different layers of a visual perception model, allowing the synthesis network to learn realistic patterns. The second term in Eqn. 1 is a regression loss that encourages the predicted clothing mask M to be the same as M0.
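As a rough PyTorch illustration of Eqn. 1 (assuming a recent torchvision; the slice indices picking out the conv1_2 through conv5_2 activations, the uniform λ_i weights, and the mean-reduced norms are choices made for this sketch, not the authors' training code):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Indices into torchvision's vgg19().features marking the activations
# right after conv1_2, conv2_2, conv3_2, conv4_2 and conv5_2.
_ENDS = [3, 8, 13, 22, 31]
_vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def vgg_features(x):
    """Return [phi_1(x), ..., phi_5(x)]; phi_0(x) is simply the RGB image itself."""
    feats, start = [], 0
    for end in _ENDS:
        x = _vgg[start:end + 1](x)
        feats.append(x)
        start = end + 1
    return feats

def encoder_decoder_loss(I_coarse, I, M, M0, lambdas=(1.0,) * 6):
    """Eqn. 1: sum_i lambda_i ||phi_i(I') - phi_i(I)||_1 + ||M - M0||_1."""
    loss = lambdas[0] * F.l1_loss(I_coarse, I)           # layer 0: raw RGB values
    for lam, f_hat, f in zip(lambdas[1:], vgg_features(I_coarse), vgg_features(I)):
        loss = loss + lam * F.l1_loss(f_hat, f)          # perceptual terms
    loss = loss + F.l1_loss(M, M0)                       # clothing-mask regression
    return loss
```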
By minimizing Eqn. 1, the encoder-decoder learns how to transfer the target clothing conditioned on the person representation. While the synthetic clothed person conforms to the pose, body parts and identity in the original image (as illustrated in the third column of Figure 5), details of the target item such as text, logos, etc. are missing. This might be attributed to the limited ability of current state-of-the-art generators to control the synthesis process: they are typically optimized to synthesize images that look globally similar to the ground truth images, without knowing where and how to generate details. To address this issue, VITON uses a refinement network together with the predicted clothing mask M to improve the coarse result I′.

3.3. Refinement Network

The refinement network GR in VITON is trained to render the coarse blurry region by leveraging realistic details from a deformed target item.

Figure 4: Warping a clothing image. Given the target clothing image and a clothing mask predicted in the first stage, we use shape context matching to estimate the TPS transformation and generate a warped clothing image.

Warped clothing item. We borrow information directly from the target clothing image c to fill in the details in the generated region of the coarse sample. However, directly pasting the product image is not suitable, as clothes deform conditioned on the person's pose and body shape. Therefore, we warp the clothing item by estimating a thin plate spline (TPS) transformation with shape context matching [3], as illustrated in Figure 4. More specifically, we extract the foreground mask of c and compute shape context TPS warps [3] between this mask and the clothing mask M of the person estimated with Eqn. 1. These computed TPS parameters are then applied to transform the target clothing image c into a warped version c′. As a result, the warped clothing image conforms to the pose and body shape information of the person and fully preserves the details of the target item.
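The shape context matching step [3] that produces point correspondences is beyond a short sketch, but given matched control points (src_pts on the product clothing mask, dst_pts on the predicted clothing mask M), a self-contained NumPy/SciPy estimate-and-apply of a TPS warp could look like this simplified stand-in (not the paper's implementation):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def _U(r2):
    """TPS radial basis U(r) = r^2 log(r^2), with U(0) = 0."""
    return np.where(r2 == 0, 0.0, r2 * np.log(r2 + 1e-12))

def fit_tps(sites, values):
    """Fit a thin plate spline mapping sites (N, 2) -> values (N, 2)."""
    n = sites.shape[0]
    d2 = np.sum((sites[:, None, :] - sites[None, :, :]) ** 2, axis=-1)
    K = _U(d2)
    P = np.hstack([np.ones((n, 1)), sites])        # affine basis [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    b = np.zeros((n + 3, 2))
    b[:n] = values
    params = np.linalg.solve(A, b)                 # [w; a] for both output dimensions
    return sites, params

def apply_tps(tps, points):
    """Evaluate the fitted TPS at arbitrary (M, 2) points."""
    sites, params = tps
    d2 = np.sum((points[:, None, :] - sites[None, :, :]) ** 2, axis=-1)
    basis = np.hstack([_U(d2), np.ones((points.shape[0], 1)), points])
    return basis @ params

def warp_clothing(product_img, src_pts, dst_pts):
    """Inverse-warp the product image so that src_pts land on dst_pts."""
    H, W = product_img.shape[:2]
    tps = fit_tps(dst_pts, src_pts)                # output coordinates -> source coordinates
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    src = apply_tps(tps, grid)                     # where to sample in the product image
    coords = [src[:, 1].reshape(H, W), src[:, 0].reshape(H, W)]   # (row, col) order
    channels = [map_coordinates(product_img[..., c], coords, order=1)
                for c in range(product_img.shape[-1])]
    return np.stack(channels, axis=-1)
```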
The idea is similar to recent 2D/3D texture warping methods for face synthesis [52, 17], where 2D facial keypoints and 3D pose estimation are utilized for warping. In contrast, we rely on shape context-based warping due to the lack of accurate annotations for clothing items. Note that a potential alternative to estimating TPS with shape context matching is to learn the TPS parameters through a Siamese network as in [23]. However, this is particularly challenging for non-rigid clothes, and we empirically found that directly using shape context matching offers better warping results for virtual try-on.

Learn to composite. The composition of the warped clothing item c′ onto the coarse synthesized image I′ is expected to combine c′ seamlessly with the clothing region and handle occlusion properly in cases where arms or hair are in front of the body. Therefore, we learn how to composite with a refinement network. As shown at the bottom of Figure 2, we first concatenate c′ and the coarse output I′ as the input of our refinement network GR. The refinement network then generates a 1-channel composition mask α ∈ (0, 1)^{m×n}, indicating how much information is utilized from each of the two sources, i.e., the warped clothing item c′ and the coarse image I′. The final virtual try-on output of VITON, Î, is a composition of c′ and I′:

    \hat{I} = \alpha \odot c' + (1 - \alpha) \odot I',    (2)

where ⊙ denotes element-wise matrix multiplication. To learn the optimal composition mask, we minimize the discrepancy between the generated result Î and the reference image I with a perceptual loss Lperc similar to Eqn. 1:

    L_{perc}(\hat{I}, I) = \sum_{i=3}^{5} \lambda_i \| \phi_i(\hat{I}) - \phi_i(I) \|_1,    (3)

where φ denotes the visual perception network VGG19. Here we only use 'conv3_2', 'conv4_2' and 'conv5_2' for calculating this loss. Since lower layers of a visual perception network care more about the detailed pixel-level information of an image rather than its content [10], small displacements between I and Î (usually caused by imperfect warping) lead to a large mismatch between the feature maps of lower layers ('conv1' and 'conv2'), which, however, is acceptable in a virtual try-on setting. Hence, by only using higher layers, we encourage the model to ignore the effects of imperfect warping, so that it is able to select the warped target clothing image and preserve more details.

We further regularize the generated composition mask output by GR with an L1 norm and a total-variation (TV) norm. The full objective function for the refinement network then becomes:

    L_{G_R} = L_{perc}(\hat{I}, I) - \lambda_{warp} \| \alpha \|_1 + \lambda_{TV} \| \nabla \alpha \|_1,    (4)

where λ_warp and λ_TV denote the weights of the L1 norm and the TV norm, respectively. Minimizing the negative L1 term encourages our model to utilize more information from the warped clothing image and render more details. The total-variation regularizer ||∇α||_1 penalizes the gradients of the generated composition mask α to make it spatially smooth, so that the transition from the warped region to the coarse result looks more natural.
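A hedged PyTorch sketch of Eqns. 2-4 follows; `feature_fn` stands for any callable returning the conv3_2, conv4_2 and conv5_2 feature maps of VGG19 (for example, a restriction of the `vgg_features` helper sketched for Eqn. 1). The weights λ_warp = 0.1 and λ_TV = 5e-6 follow Sec 4.2, while the mean-reduced norms are a simplification of this illustration.

```python
import torch
import torch.nn.functional as F

def compose(alpha, warped_c, coarse_I):
    """Eqn. 2: I_hat = alpha * c' + (1 - alpha) * I' (element-wise)."""
    return alpha * warped_c + (1.0 - alpha) * coarse_I

def refinement_objective(alpha, warped_c, coarse_I, I, feature_fn,
                         lam_warp=0.1, lam_tv=5e-6):
    """Eqn. 4: L_perc(I_hat, I) - lam_warp * ||alpha||_1 + lam_tv * ||grad(alpha)||_1.
    feature_fn(x) -> [phi_3(x), phi_4(x), phi_5(x)], e.g. lambda x: vgg_features(x)[2:]."""
    I_hat = compose(alpha, warped_c, coarse_I)
    # Eqn. 3: perceptual loss restricted to the higher VGG19 layers.
    perc = sum(F.l1_loss(fh, fr) for fh, fr in zip(feature_fn(I_hat), feature_fn(I)))
    l1_alpha = alpha.abs().mean()                          # encourages copying from c'
    tv_alpha = (alpha[..., :, 1:] - alpha[..., :, :-1]).abs().mean() + \
               (alpha[..., 1:, :] - alpha[..., :-1, :]).abs().mean()
    return perc - lam_warp * l1_alpha + lam_tv * tv_alpha
```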
Figure 5: Output of different steps in our method. Coarse synthetic results generated by the encoder-decoder are further improved by learning a composition mask to account for details and deformations.

Figure 5 visualizes the results generated at different steps of our method. Given the target clothing item and the representation of a person, the encoder-decoder produces a coarse result with the pose, body shape and face of the person preserved, while details like graphics and textures on the target clothing item are missing. Based on the clothing mask, our refinement stage warps the target clothing image and predicts a composition mask to determine which regions should be replaced in the coarse synthesized image. Consequently, important details (the material in the 1st example, the text in the 2nd example, and the patterns in the 3rd example) are "copied" from the target clothing image and "pasted" onto the corresponding clothing region of the person.

4. Experiments

4.1. Dataset

The dataset used in [21] is a good choice for conducting virtual try-on experiments, but it is not publicly available. We therefore collected a similar dataset. We first crawled around 19,000 frontal-view woman and top² clothing image pairs and then removed noisy images with no parsing results, yielding 16,253 pairs. The remaining images are further split into a training set and a testing set of 14,221 and 2,032 pairs, respectively.

² Note that we focus on tops since they are representative of attire with diverse visual graphics and significant deformations. Our method is general and can also be trained for pants, skirts, outerwear, etc.
Note that during testing, the person should wear a different clothing item than the target one, as in real-world scenarios, so we randomly shuffled the clothing product images in these 2,032 test pairs for evaluation.

4.2. Implementation Details

Training setup. Following recent work using encoder-decoder structures [21, 36], we use the Adam [24] optimizer with β1 = 0.5, β2 = 0.999, and a fixed learning rate of 0.0002. We train the encoder-decoder generator for 15K steps and the refinement network for 6K steps, both with a batch size of 16. The resolution of the synthetic samples is 256 × 192.

Encoder-decoder generator. Our network for the coarse stage contains 6 convolutional layers for encoding and 6 for decoding. All encoding layers consist of 4 × 4 spatial filters with a stride of 2, and their numbers of filters are 64, 128, 256, 512, 512, 512, respectively. For decoding, similar 4 × 4 spatial filters are adopted with a stride of 1/2 for all layers, whose numbers of channels are 512, 512, 256, 128, 64, 4. The choice of activation functions and batch normalization is the same as in [20]. Skip connections [38] are added between the encoder and decoder to improve performance. λ_i in Eqn. 1 is chosen to scale the loss of each term properly [6].

Refinement network. The network is a four-layer fully convolutional model. Each of the first three layers has 3 × 3 × 64 filters followed by Leaky ReLUs, and the last layer outputs the composition mask with 1 × 1 spatial filters followed by a sigmoid activation function to scale the output to (0, 1). λ_i in Eqn. 4 is the same as in Eqn. 1, λ_warp = 0.1 and λ_TV = 5e−6.

Runtime. The runtime of each component in VITON is: human parsing (159 ms), pose estimation (220 ms), encoder-decoder (27 ms), TPS (180 ms), and refinement (20 ms). Results other than TPS are obtained on a K40 GPU. We expect a further speedup of TPS when it is implemented on a GPU.
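For concreteness, here is a PyTorch sketch that follows these layer counts and filter sizes; the output activations, normalization placement and padding mirror common pix2pix-style conventions [20] and are our assumptions, so this is an illustration rather than the authors' released model.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2, inplace=True))

def up(cin, cout, last=False):
    layers = [nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1)]  # "stride 1/2" upsampling
    if not last:
        layers += [nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class CoarseGenerator(nn.Module):
    """Sketch of the 6-layer encoder-decoder with skip connections (Sec 4.2).
    Input: person representation p (22 ch) concatenated with clothing image c (3 ch).
    Output: 3-channel coarse image I' and 1-channel clothing mask M."""
    def __init__(self, in_ch=22 + 3):
        super().__init__()
        self.e = nn.ModuleList([down(in_ch, 64), down(64, 128), down(128, 256),
                                down(256, 512), down(512, 512), down(512, 512)])
        self.d = nn.ModuleList([up(512, 512), up(1024, 512), up(1024, 256),
                                up(512, 128), up(256, 64), up(128, 4, last=True)])

    def forward(self, p_and_c):
        feats, x = [], p_and_c
        for enc in self.e:
            x = enc(x)
            feats.append(x)
        x = self.d[0](feats[-1])
        for i, dec in enumerate(self.d[1:], start=1):
            x = dec(torch.cat([x, feats[-(i + 1)]], dim=1))   # U-Net skip connection
        # Output activations are our assumption (tanh image, sigmoid mask).
        return torch.tanh(x[:, :3]), torch.sigmoid(x[:, 3:])

class RefinementNetwork(nn.Module):
    """Four fully convolutional layers: three 3x3x64 + LeakyReLU, then 1x1 -> sigmoid,
    producing the composition mask alpha in (0, 1)."""
    def __init__(self, in_ch=3 + 3):   # warped clothing c' concatenated with coarse I'
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 1, 1), nn.Sigmoid())

    def forward(self, warped_c, coarse_I):
        return self.net(torch.cat([warped_c, coarse_I], dim=1))
```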
4.3. Compared Approaches

To validate the effectiveness of our framework, we compare with the following alternative methods.

GANs with Person Representation (PRGAN) [32, 51]. Existing methods that leverage GANs conditioned on either pose [32] or body shape information [51] are not directly comparable since they are not designed for the virtual try-on task. To achieve fair comparisons, we enrich the input of [51, 32] to be the same as our model (the 22-channel representation p plus the target clothing image c) and adopt their GAN structure to synthesize the reference image.

Conditional Analogy GAN (CAGAN) [21]. CAGAN formulates the virtual try-on task as an image analogy problem: it treats the original item and the target clothing item together as a condition when training a Cycle-GAN [50]. However, at test time it also requires the product image of the original clothing in the reference image, which makes it infeasible in practical scenarios; we nevertheless compare with this approach for completeness. Note that, for fairness, we modify their encoder-decoder generator to have the same structure as ours, so that it can also generate 256 × 192 images. Other implementation details are the same as in [21].

Cascaded Refinement Network (CRN) [6]. CRN leverages a cascade of refinement modules, where each module takes the output of its previous module and a downsampled version of the input to generate a high-resolution synthesized image. Without adversarial training, CRN regresses to a target image using a CNN. To compare with CRN, we feed the same input as our generator to CRN and output a 256 × 192 synthesized image.

Encoder-decoder generator. We only use the network of our first stage to generate the target virtual try-on effect, without the TPS warping and the refinement network.

Non-parametric warped synthesis. Without using the coarse output of our encoder-decoder generator, we estimate the TPS transformation using shape context matching and paste the warped garment onto the reference image. A similar baseline is also presented in [51].

The first three state-of-the-art approaches are directly compared with our encoder-decoder generator, which does not explicitly model deformations through warping, while the last Non-parametric warped synthesis method is adopted to demonstrate the importance of learning a composition based on the coarse results.

4.4. Qualitative Results

Figure 6: Qualitative comparisons of different methods. Our method effectively renders the target clothing onto a person.

Figure 6 presents a visual comparison of different methods. CRN and the encoder-decoder create blurry and coarse results without knowing where and how to render the details of target clothing items. Methods with adversarial training produce sharper edges, but also cause undesirable artifacts. Our Non-parametric baseline directly pastes the warped target image onto the person regardless of the inconsistencies between the original and target clothing items, which results in unnatural images. In contrast to these methods, VITON accurately and seamlessly generates detailed virtual try-on results, confirming the effectiveness of our framework.
However, there are some artifacts around the neckline in the last row, which result from the fact that our model cannot determine which regions near the neck should be visible (e.g., the neck tag should be hidden in the final result; see the supplementary material for more discussion). In addition, pants are also generated by our model without providing any product images of them. This indicates that our model implicitly learns the co-occurrence between different fashion items. VITON is also able to keep the original pants if the pants regions are handled in the same way as the face and hair (i.e., extracting the pants regions and taking them as input to the encoder). More results and analysis are presented in the supplementary material.

Person representation analysis. To investigate the effectiveness of pose and body shape in the person representation, we remove them from the representation individually and compare with our full representation. Sampled coarse results are illustrated in Figure 7. We can see that for a person with a complicated pose, using body shape information alone is not sufficient to handle occlusion and pose ambiguity. Body shape information is also critical to adjust the target item to the right size. This confirms that the proposed clothing-agnostic representation is indeed more comprehensive and effective than prior work.

Figure 7: Effect of removing pose and body shape from the person representation. For each method, we show its coarse result and the predicted clothing mask output by the corresponding encoder-decoder generator.

Failure cases. Figure 8 demonstrates two failure cases of our method, caused by a rarely-seen pose (example on the left) or a huge mismatch between the current and target clothing shapes (right arm in the right example).

Figure 8: Failure cases of our method.

In-the-wild results. In addition to experimenting with constrained images, we also utilize in-the-wild images from the COCO dataset [29], by cropping human body regions and running our method on them. Sample results are shown in Figure 9, which suggests our method has potential in applications like generating people in clothing [26].
Figure 9: In-the-wild results. Our method is applied to images from COCO.

Method             IS              Human
PRGAN [32, 51]     2.688 ± 0.098   27.3%
CAGAN [21]         2.981 ± 0.087   21.8%
CRN [6]            2.449 ± 0.070   69.1%
Encoder-Decoder    2.455 ± 0.110   58.4%
Non-parametric     3.373 ± 0.142   46.4%
VITON (Ours)       2.514 ± 0.130   77.2%
Real Data          3.312 ± 0.098   -

Table 1: Quantitative evaluation on our virtual try-on dataset.

4.5. Quantitative Results

We also compare VITON with the alternative methods quantitatively, based on Inception Score [39] and a user study.

Inception Score. Inception Score (IS) [39] is commonly used to quantitatively evaluate the synthesis quality of image generation models [32, 33, 47]. Models producing visually diverse and semantically meaningful images will have higher Inception Scores, and this metric correlates well with human evaluations on image datasets like CIFAR10.
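For reference, the standard Inception Score can be sketched with torchvision as below; this is a generic implementation of the metric as defined in [39], not the exact evaluation script used here, and the preprocessing to 299 × 299 Inception inputs as well as the number of splits are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision import models

@torch.no_grad()
def inception_score(images, splits=10, eps=1e-12):
    """IS = exp(E_x[KL(p(y|x) || p(y))]); `images` is an (N, 3, 299, 299) batch
    already normalized the way Inception v3 expects."""
    net = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1).eval()
    probs = F.softmax(net(images), dim=1)              # class posteriors p(y|x)
    scores = []
    for chunk in probs.chunk(splits, dim=0):
        p_y = chunk.mean(dim=0, keepdim=True)          # marginal p(y) within this split
        kl = (chunk * ((chunk + eps).log() - (p_y + eps).log())).sum(dim=1)
        scores.append(kl.mean().exp())
    scores = torch.stack(scores)
    return scores.mean().item(), scores.std().item()
```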
Perceptual user study. Although Inception Score can be used as an indicator of image synthesis quality, it cannot reflect whether the details of the target clothing are naturally transferred or whether the pose and body of the clothed person are preserved in the synthesized image. Thus, similar to [6, 9], we conducted a user study on the Amazon Mechanical Turk (AMT) platform. In each trial, a worker is given a person image, a target clothing image and two virtual try-on results generated by two different methods (both at 256 × 192), and is asked to choose the one that is more realistic and accurate in a virtual try-on situation. Each AMT job contains 5 such trials with a time limit of 200 seconds. The percentage of trials in which one method is rated better than the other is adopted as the Human evaluation metric, following [6] (chance is 50%).

Quantitative comparisons are summarized in Table 1. Note that the Human score evaluates whether the virtual try-on results (synthetic images of a person wearing the target item) are realistic; we do not have such ground-truth images of the same person in the same pose wearing the target item. (IS measures the characteristics of a set, so we use all reference images in the test set to estimate the IS of real data.)

According to this table, we make the following observations. (a) Automatic measures like Inception Score are not suitable for evaluating tasks like virtual try-on, for two reasons. First, these measures tend to reward sharper image content generated by adversarial training or direct image pasting, since such content yields higher activation values of neurons in the Inception model than smooth images do; this even leads to a higher IS for the Non-parametric baseline than for real images. Moreover, they are not aware of the task and cannot measure the desired properties of a virtual try-on system. For example, CRN has the lowest IS, but ranks 2nd in the user study. Similar phenomena are also observed in [27, 6]. (b) Person representation guided methods (PRGAN, CRN, Encoder-Decoder, VITON) are preferred by humans. CAGAN and Non-parametric directly take the original person image as input, so they cannot deal with cases where there are inconsistencies between the original and target clothing item, e.g., rendering a short-sleeve T-shirt on a person wearing a long-sleeve shirt. (c) By compositing the coarse result with a warped clothing image, VITON performs better than each individual component. VITON also obtains a higher human evaluation score than state-of-the-art generative models and outputs more photo-realistic virtual try-on effects.

To better understand the noise in the study, we follow [6, 32] and perform a time-limited (0.25 s) real-or-fake test on AMT, which shows that 17.18% of generated images are rated as real, and 11.46% of real images are rated as generated.

5. Conclusion

We presented a virtual try-on network (VITON), which is able to transfer a clothing item in a product image to a person relying only on RGB images. A coarse sample is first generated with a multi-task encoder-decoder conditioned on a detailed clothing-agnostic person representation. The coarse results are further enhanced with a refinement network that learns the optimal composition. We conducted experiments on a newly collected dataset, and promising results are achieved both quantitatively and qualitatively. This indicates that our 2D image-based synthesis pipeline can be used as an alternative to expensive 3D-based methods.
References

[1] U.S. fashion and accessories e-retail revenue 2016-2022. https://www.statista.com/statistics/278890/us-apparel-and-accessories-retail-e-commerce-revenue/. 1
[2] Z. Al-Halah, R. Stiefelhagen, and K. Grauman. Fashion forward: Forecasting visual style in fashion. In ICCV, 2017. 2
[3] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE TPAMI, 2002. 4
[4] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In ECCV, 2016. 1
[5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017. 3
[6] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017. 2, 6, 8
[7] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016. 4
[8] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015. 2
[9] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng. Stylenet: Generating attractive visual captions with styles. In CVPR, 2017. 8
[10] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016. 5
[11] K. Gong, X. Liang, X. Shen, and L. Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, 2017. 3
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. 2
[13] P. Guan, L. Reiss, D. A. Hirshberg, A. Weiss, and M. J. Black. Drape: Dressing any person. ACM TOG, 2012. 2
[14] M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. Where to buy it: Matching street clothing photos in online shops. In ICCV, 2015. 2
[15] X. Han, Z. Wu, P. X. Huang, X. Zhang, M. Zhu, Y. Li, Y. Zhao, and L. S. Davis. Automatic spatially-aware fashion concept discovery. In ICCV, 2017. 2
[16] X. Han, Z. Wu, Y.-G. Jiang, and L. S. Davis. Learning fashion compatibility with bidirectional lstms. In ACM Multimedia, 2017. 2
[17] T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective face frontalization in unconstrained images. In CVPR, 2015. 5
[18] A. Hilsmann and P. Eisert. Tracking and retexturing cloth for real-time virtual clothing applications. In MIRAGE, 2009. 2
[19] Y. Hu, X. Yi, and L. S. Davis. Collaborative fashion recommendation: a functional tensor factorization approach. In ACM Multimedia, 2015. 2
[20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. 2, 4, 6
[21] N. Jetchev and U. Bergmann. The conditional analogy gan: Swapping fashion articles on people images. In ICCVW, 2017. 2, 5, 6, 8
[22] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016. 4
[23] A. Kanazawa, D. W. Jacobs, and M. Chandraker. Warpnet: Weakly supervised matching for single-view reconstruction. In CVPR, 2016. 5
[24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6
[25] A. Kovashka, D. Parikh, and K. Grauman. Whittlesearch: Image search with relative attribute feedback. In CVPR, 2012. 2
[26] C. Lassner, G. Pons-Moll, and P. V. Gehler. A generative model of people in clothing. In ICCV, 2017. 2, 7
[27] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017. 4, 8
[28] X. Liang, L. Lin, W. Yang, P. Luo, J. Huang, and S. Yan. Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval. IEEE TMM, 2016. 2
[29] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 7
[30] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. In CVPR, 2012. 2
[31] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, 2016. 2
[32] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In NIPS, 2017. 2, 4, 6, 8
[33] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. In ICML, 2017. 2, 8
[34] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional gans for image editing. In NIPS Workshop, 2016. 2
[35] G. Pons-Moll, S. Pujades, S. Hu, and M. Black. Clothcap: Seamless 4d clothing capture and retargeting. ACM TOG, 2017. 2
[36] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. 2, 6
[37] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016. 2
[38] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 4, 6
[39] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016. 8
[40] M. Sekine, K. Sugita, F. Perbet, B. Stenger, and M. Nishiyama. Virtual fitting by single-shot body shape estimation. In 3D Body Scanning Technologies, 2014. 1, 2
[41] W. Shen and R. Liu. Learning residual images for face attribute manipulation. In CVPR, 2017. 2
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 4
[43] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. Belongie. Learning visual clothing style with heterogeneous dyadic co-occurrences. In CVPR, 2015. 2
[44] K. Yamaguchi, M. Hadi Kiapour, and T. L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV, 2013. 2
[45] S. Yang, T. Ambert, Z. Pan, K. Wang, L. Yu, T. Berg, and M. C. Lin. Detailed garment recovery from a single-view image. In ICCV, 2017. 1, 2
[46] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. In ECCV, 2016. 2
[47] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017. 8
[48] B. Zhao, J. Feng, X. Wu, and S. Yan. Memory-augmented attribute manipulation networks for interactive fashion search. In CVPR, 2017. 2
[49] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016. 2
[50] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017. 2, 6
[51] S. Zhu, S. Fidler, R. Urtasun, D. Lin, and C. L. Chen. Be your own prada: Fashion synthesis with structural coherence. In ICCV, 2017. 2, 3, 4, 6, 8
[52] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015. 5
VITON: An Image-based Virtual Try-on Network
                                         Supplemental Material

Detailed Network Structures
   We illustrate the detailed network structure of our encoder-decoder generator in Figure 1, and that of our refinement
network in Figure 2.

[Figure 1 diagram: the encoder takes two inputs of size 256×192×22 and 256×192×3 and produces feature maps of size 128×96×64, 64×48×128, 32×24×256, 16×12×512, 8×6×512, and 4×3×512; a mirrored decoder, connected to the corresponding encoder layers via skip connections, upsamples back through 8×6×512, 16×12×512, 32×24×256, 64×48×128, and 128×96×64 to a 256×192×4 output. Encoding blocks use Leaky ReLU, convolution, and batch normalization; decoding blocks use ReLU, deconvolution, batch normalization, and dropout, with a final tanh.]

Figure 1: Network structure of our encoder-decoder generator. Blue rectangles indicate the encoding layers and green ones are the decoding layers. Convolution denotes 4 × 4 convolution with a stride of 2. The negative slope of Leaky ReLU is 0.2. Deconvolution denotes 4 × 4 convolution with a stride of 1/2. The dropout probability is set to 0.5.
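For readers who prefer code to diagrams, below is a minimal PyTorch sketch of such a U-Net-style encoder-decoder, following the layer types and feature-map sizes annotated in Figure 1. The module names, the choice of framework, and details such as which decoder block carries dropout are our own assumptions for illustration, not the authors' released implementation; we also interpret the 256×192×22 and 256×192×3 inputs as the person representation and the target clothing image, and the 256×192×4 output as the coarse RGB image plus a clothing mask.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Leaky ReLU -> 4x4 stride-2 convolution -> batch norm, as in Figure 1."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

class DecoderBlock(nn.Module):
    """ReLU -> 4x4 stride-1/2 deconvolution -> batch norm (-> optional dropout)."""
    def __init__(self, in_ch, out_ch, dropout=False):
        super().__init__()
        layers = [nn.ReLU(inplace=True),
                  nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                  nn.BatchNorm2d(out_ch)]
        if dropout:
            layers.append(nn.Dropout(0.5))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class CoarseGenerator(nn.Module):
    """U-Net style encoder-decoder: 22-channel person representation plus
    3-channel target clothing in, 4-channel output (coarse RGB image + mask)."""
    def __init__(self, in_ch=22 + 3, out_ch=4):
        super().__init__()
        # First encoding layer: plain convolution (no activation/norm in front).
        self.enc1 = nn.Conv2d(in_ch, 64, kernel_size=4, stride=2, padding=1)  # 128x96x64
        self.enc2 = EncoderBlock(64, 128)    # 64x48x128
        self.enc3 = EncoderBlock(128, 256)   # 32x24x256
        self.enc4 = EncoderBlock(256, 512)   # 16x12x512
        self.enc5 = EncoderBlock(512, 512)   # 8x6x512
        self.enc6 = EncoderBlock(512, 512)   # 4x3x512 bottleneck
        # Decoder mirrors the encoder; skip connections double the input channels.
        self.dec1 = DecoderBlock(512, 512, dropout=True)   # 8x6x512
        self.dec2 = DecoderBlock(512 + 512, 512)           # 16x12x512
        self.dec3 = DecoderBlock(512 + 512, 256)           # 32x24x256
        self.dec4 = DecoderBlock(256 + 256, 128)           # 64x48x128
        self.dec5 = DecoderBlock(128 + 128, 64)            # 128x96x64
        self.dec6 = nn.Sequential(nn.ReLU(inplace=True),
                                  nn.ConvTranspose2d(64 + 64, out_ch, 4, 2, 1),
                                  nn.Tanh())               # 256x192x4

    def forward(self, person_repr, clothing):
        x = torch.cat([person_repr, clothing], dim=1)
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        e5 = self.enc5(e4)
        e6 = self.enc6(e5)
        d1 = self.dec1(e6)
        d2 = self.dec2(torch.cat([d1, e5], dim=1))
        d3 = self.dec3(torch.cat([d2, e4], dim=1))
        d4 = self.dec4(torch.cat([d3, e3], dim=1))
        d5 = self.dec5(torch.cat([d4, e2], dim=1))
        return self.dec6(torch.cat([d5, e1], dim=1))
```

With 256×192 inputs, this sketch reproduces the feature-map sizes listed in the figure (128×96×64 down to 4×3×512 and back up to 256×192×4).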

[Figure 2 diagram: two 256×192×3 inputs pass through convolution + Leaky ReLU layers producing 256×192×64 feature maps, followed by a convolution and sigmoid that produce a 256×192×1 output.]

Figure 2: Network structure of our refinement network. Convolution denotes 3 × 3 convolution with a stride of 1. The negative slope of Leaky ReLU is 0.2.
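A corresponding minimal PyTorch sketch of the refinement network, again read off Figure 2 rather than taken from the authors' code. We assume the two 256×192×3 inputs are the coarse result and the warped clothing image, and that the 1-channel sigmoid output serves as a composition mask, consistent with the compositing strategy described in the main paper:

```python
import torch
import torch.nn as nn

class RefinementNetwork(nn.Module):
    """Predicts a 1-channel composition mask from two 256x192x3 inputs
    using 3x3 stride-1 convolutions with Leaky ReLU and a final sigmoid."""
    def __init__(self, in_ch=3 + 3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=3, stride=1, padding=1),
            nn.Sigmoid(),  # composition mask in [0, 1]
        )

    def forward(self, coarse_result, warped_clothing):
        x = torch.cat([coarse_result, warped_clothing], dim=1)
        alpha = self.net(x)  # 256x192x1 mask
        # Composite the warped clothing over the coarse result to obtain the
        # refined try-on image.
        refined = alpha * warped_clothing + (1.0 - alpha) * coarse_result
        return refined, alpha
```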

Person Representation Analysis
   To investigate the effectiveness of pose and body shape in the person representation, we remove each of them individually and train the corresponding encoder-decoder generators, comparing them with the generator trained on our full representation (as in Figure 7 in the main paper). Here, we present more qualitative results and analysis.
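To make these ablations concrete, the sketch below shows one plausible way to assemble the person representation and to drop individual components before training. The channel split (18 pose-keypoint heatmaps, a 1-channel body-shape mask, and a 3-channel face/hair image, giving the 22 channels seen in Figure 1) and the function name are illustrative assumptions rather than the exact pipeline:

```python
import numpy as np

def build_person_representation(pose_heatmaps, body_shape_mask, face_hair_rgb,
                                use_pose=True, use_shape=True):
    """Stack a clothing-agnostic person representation for the generator.

    pose_heatmaps:   (18, 256, 192) one heatmap per body keypoint
    body_shape_mask: (1, 256, 192)  coarse body silhouette
    face_hair_rgb:   (3, 256, 192)  RGB pixels of the preserved face/hair region

    Dropping pose or body shape yields the smaller representations used to
    train the ablation generators described above.
    """
    parts = []
    if use_pose:
        parts.append(pose_heatmaps)
    if use_shape:
        parts.append(body_shape_mask)
    parts.append(face_hair_rgb)
    return np.concatenate(parts, axis=0)  # (22, 256, 192) for the full representation
```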
Person Representation without Pose
    In Figure 3, we show more examples where ignoring the pose representation of the person leads to unsatisfactory virtual try-on results. Although the body shape representation and face/hair information are preserved, the model that does not capture the person's pose fails to determine where the arms and hands should appear. For example, in the first, third, and fifth examples in Figure 3, the hands are almost invisible without modeling pose. In the second and fourth examples, given only the silhouette (body shape) of the person, the model without pose produces ambiguous results when generating the forearms, because the forearms could be in front of or behind the body, or on either side of it. Note that for these ablation results, we only show the coarse result of each model for simplicity.

[Figure 3 panels, left to right: Reference Image, Target Clothing, Our Method, Without Pose]

Figure 3: Comparison between the outputs (coarse result + clothing mask) of the encoder-decoder generator trained using our person representation and the generator trained using the representation without pose.
Person Representation without Body Shape
    Figure 4 demonstrates cases where body shape is essential for generating a good synthesized result, e.g., virtually trying on plus-size clothing. Interestingly, we also find that body shape can help preserve the pose of the person (second and fifth examples), which suggests that body shape and pose can benefit from each other.

[Figure 4 panels, left to right: Reference Image, Target Clothing, Our Method, Without Body Shape]

Figure 4: Comparison between the outputs (coarse result + clothing mask) of the encoder-decoder generator trained using our person representation and the generator trained using the representation without body shape.

Quantitative Evaluation
   Furthermore, we conducted the same user study to compare our full person representation with the representations without body shape and without pose, respectively. Our method is preferred in 67.6% of trials over the representation without body shape, and in 77.4% of trials over the representation without pose.
More Qualitative Results
   In the following, we present more qualitative comparisons with alternative methods in Figure 5, and more results of our full method in Figure 6. The same conclusions as in our main paper can be drawn.

[Figure 5 columns, left to right: Reference Image, Target Clothing, PRGAN [32, 51], CAGAN [21], CRN [6], Encoder-Decoder, Non-parametric, VITON (ours)]

   Figure 5: Comparisons of VITON with other methods. Reference numbers correspond to the ones in the main paper.
Figure 6: Virtual try-on results of VITON. Each row corresponds to the same person virtually trying on different target
clothing items.
Artifacts near Neck
    As mentioned in the main paper, some artifacts may be observed near the neck regions in the synthesized images. This stems from the fact that, during warping (see Figure 4 in the main paper), our model warps the whole clothing foreground region as input to the refinement network; however, regions such as the neck tag and inner collar should be ignored in the refined result because they are occluded when the garment is worn.
    One way to eliminate these artifacts is to train a segmentation model [2] and remove these regions before warping. We annotated these regions for 2,000 images in our training dataset to train a simple FCN model [2] for this purpose. At test time, this segmentation model is applied to target clothing images to remove the neck regions that should not be visible in the refined results. In this way, we can get rid of the aforementioned artifacts. Some examples are shown in Figure 7. Note that in the main paper, the results are obtained without this segmentation pre-processing step to ensure fair comparisons with alternative methods.
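For concreteness, here is a minimal sketch of this pre-processing step. The neck_segmenter callable, the array conventions, and the white-background fill are our assumptions for illustration; the actual FCN is trained as described above:

```python
import numpy as np

def remove_neck_regions(clothing_rgb, clothing_mask, neck_segmenter):
    """Zero out neck-tag / inner-collar pixels in the target clothing image and
    its foreground mask before warping, so they never reach the refinement step.

    clothing_rgb:   (H, W, 3) float array, the target clothing product image
    clothing_mask:  (H, W)    binary foreground mask of the clothing item
    neck_segmenter: callable mapping an RGB image to a (H, W) binary map of
                    regions that would be occluded when the garment is worn
    """
    neck = neck_segmenter(clothing_rgb).astype(bool)
    cleaned_mask = np.where(neck, 0, clothing_mask)
    cleaned_rgb = clothing_rgb.copy()
    cleaned_rgb[neck] = 1.0  # paint removed pixels with the white background
    return cleaned_rgb, cleaned_mask
```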

[Figure 7 panels: Reference Image; target clothing images without neck artifact removal; target clothing images with neck artifact removal]

Figure 7: We train an FCN model to segment and remove the regions that cause artifacts during testing. All original results shown in these examples are also included in Figure 6. The artifacts near neck regions (highlighted using blue boxes) are eliminated by removing the unwanted neck areas.

   One may notice that there are some discrepancies between the collar style and shape in the predicted image and the target item image. The main reason behind these artifacts is that the human parser [1] does not have annotations for necks and treats neck/collar regions as background, and hence the model keeps the original collar style of the reference image. To avoid this, we can augment our human parser with [3] to correctly segment neck regions and retrain our model. Figure 8 illustrates examples from our updated model, which can handle the change in collars.

[Figure 8 columns, left to right (two groups of examples): Reference Image, Target Clothing, Coarse Result, Result w/ parser [3]]

Figure 8: We combine the human parsers in [1] and [3] to correctly segment neck regions. As a result, the inconsistency between the collar style in the target clothing and the result is properly addressed. Only the coarse results are shown for simplicity.

Keeping the Original Pants Regions
   A straightforward way to keep the original pants regions would be to simply keep the original pixels outside the clothing mask. However, this causes two problems: (a) if the target item is shorter than the original one, the kept region will be too small, producing a gap between the leg and the target item (see the Non-parametric results in the 1st and 4th rows in Figure 6 of our main paper); we therefore re-generate the legs for a seamless overlay. As for the face and hair regions, they can be isolated directly, since they are rarely occluded by the clothing. (b) Comparisons with baselines like PRGAN and CRN might not be fair, since they generate a whole image without using a mask.
    For completeness, we segment the leg region and use it as an input to VITON; the original pants are then preserved, with some artifacts near the waist region, as shown in Figure 9.
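Treating the pants in a similar fashion as face/hair amounts to adding the segmented pants region to the set of pixels pasted back from the reference photo. A minimal sketch of that compositing step, with illustrative array names rather than the authors' code:

```python
import numpy as np

def paste_preserved_regions(generated_rgb, reference_rgb, preserved_mask):
    """Copy regions that should survive the try-on (face/hair, and optionally
    the segmented pants/leg region) from the reference photo into the result.

    generated_rgb:  (H, W, 3) synthesized try-on image
    reference_rgb:  (H, W, 3) original person photo
    preserved_mask: (H, W)    binary mask of pixels to keep from the reference
    """
    m = preserved_mask[..., None].astype(generated_rgb.dtype)
    return m * reference_rgb + (1.0 - m) * generated_rgb
```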

[Figure 9 panels, repeated for three examples: Reference Image, Target Clothing, Coarse Result]

Figure 9: By treating the pants regions in a similar fashion as face/hair, VITON can keep the original pants regions. Only the coarse results are shown for simplicity.
More Results
   In Figure 10, we show more final (refined) results of VITON with the aforementioned updates: artifacts and inconsistencies near the neck regions are removed, and the original pants regions are preserved.

Figure 10: More results of VITON with artifacts near the neck region removed and original pants regions preserved. The top two rows are from our virtual try-on dataset and the last row contains in-the-wild results from COCO.
References
[1] K. Gong, X. Liang, X. Shen, and L. Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, 2017. 6, 7
[2] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 6
[3] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV, 2016. 6, 7