Smooth Deep Image Generator from Noises
Tianyu Guo^{1,2,3}, Chang Xu^2, Boxin Shi^4, Chao Xu^{1,3}, Dacheng Tao^2

^1 Key Laboratory of Machine Perception (MOE), School of EECS, Peking University, China
^2 UBTECH Sydney AI Centre, School of Computer Science, FEIT, University of Sydney, Australia
^3 Cooperative Medianet Innovation Center, Peking University, China
^4 National Engineering Laboratory for Video Technology, School of EECS, Peking University, China

tianyuguo@pku.edu.cn, c.xu@sydney.edu.au, shiboxin@pku.edu.cn, xuchao@cis.pku.edu.cn, dacheng.tao@sydney.edu.au

Abstract

Generative Adversarial Networks (GANs) have demonstrated a strong ability to fit complex distributions since they were introduced, especially in the field of generating natural images. Linear interpolation in the noise space produces a continuous change in the image space, which is an impressive property of GANs. However, this property receives no special consideration in the objective function of GANs or its derived models. This paper analyzes perturbations on the input of the generator and their influence on the generated images. A smooth generator is then developed by investigating the tolerable input perturbation. We further integrate this smooth generator with a gradient-penalized discriminator, and design a smooth GAN that generates stable and high-quality images. Experiments on real-world image datasets demonstrate the necessity of studying the smooth generator and the effectiveness of the proposed algorithm.

Figure 1: Interpolation images generated from WGAN-GP (Gulrajani et al. 2017) (a: blurry and distorted) and the proposed smooth GAN (b: clear and high-quality).

Introduction
Deep generative models have attracted increasing attention from researchers, especially for the task of natural image generation. Representative techniques include the Variational Auto-Encoder (VAE) (Kingma and Welling 2013), PixelCNN (van den Oord et al. 2016), and Generative Adversarial Networks (GANs) (Goodfellow et al. 2014). GANs translate Gaussian inputs into natural images by discovering the equilibrium of a max-min game. The generator in vanilla GANs transforms noise vectors into images, while the discriminator aims to distinguish the generated samples from real samples. Convincing images generated from noise vectors through GANs could be employed to augment image datasets, which would alleviate the shortage of training data in some tasks. Moreover, image-to-image translation (Chen et al. 2018; 2019) based on GANs has also gained popularity.

However, vanilla GANs have flaws in their stability, and many promising works alleviate this problem by modifying the network architecture or proposing improved loss functions (Radford, Metz, and Chintala 2015; Nguyen et al. 2017; Karras et al. 2017; Mao et al. 2017; Berthelot, Schumm, and Metz 2017a; Arjovsky, Chintala, and Bottou 2017; Gulrajani et al. 2017). Besides, a considerable body of work has been conducted to arbitrarily manipulate generated images according to different factors, e.g., the category, illumination, and style (Chen et al. 2016). Beyond meaningless noise input in GANs, interpretable features can be discovered by investigating label information in conditional GANs (Mirza and Osindero 2014), exploring the mutual information between elements of the input in InfoGAN (Chen et al. 2016), or leveraging the discriminator on the latent space in AAE (Makhzani et al. 2015).

Noise vector inputs for GANs can be taken as low-dimensional representations of images. As widely accepted in representation learning, the closeness of two data points is supposed to be preserved before and after transformation. Most of these improved GAN methods implicitly assume that the generator translates linear interpolation in the input noise space into semantic interpolation in the output image space (Bojanowski et al. 2017). Although this kind of experimental result, showing interesting visual effects, attracts readers' attention, the quality of images generated through interpolation can be very noisy and fragile, and some of these images look obviously unnatural or even meaningless, as demonstrated in Figure 1(a).
Efforts have been devoted to generating high-quality images and to stabilizing the training of GANs, but how to ensure the success of semantic interpolation in GANs has rarely been investigated.

In this paper, we propose a smooth deep image generator that can suppress the influence of input perturbations on generated images. By investigating the connections between input noises and generated images, we theoretically present the largest input perturbation that can be tolerated for an output image of desired precision. A gradient-based loss function is then introduced to reduce the variation of generated images caused by perturbations on input noises, which encourages a smooth interpolation of images. Combining the smooth generator with a gradient-penalized discriminator, we show that it is beneficial for improving the quality of interpolation samples, as demonstrated in Figure 1(b). Experimental results on the real-world datasets MNIST (LeCun et al. 1998), CIFAR-10 (Krizhevsky and Hinton 2009), and CelebA (Liu et al. 2015) demonstrate that the generator produced by the proposed method is essential for the success of smooth and high-quality interpolation of images.

Related Work

In this section, we briefly review related work on generative adversarial networks (GANs).

Although GANs have powerful image generation capabilities, they are often trapped by unstable training and difficulty in convergence. Several methods have been proposed to solve this problem. DCGAN (Radford, Metz, and Chintala 2015) introduced a network structure that works well and trains stably. WGAN (Arjovsky, Chintala, and Bottou 2017) exposed the defect of the vanilla adversarial loss and proposed the Wasserstein distance to measure the distance between the generated data distribution and the real data distribution. However, the weight clipping used in WGAN to keep D Lipschitz continuous costs the network capacity. To solve this problem, WGAN-GP (Gulrajani et al. 2017) proposed a gradient penalty instead of the weight clipping operation to satisfy the Lipschitz continuity condition. BEGAN (Berthelot, Schumm, and Metz 2017a) proposed a novel concept of equilibrium that helps GANs achieve considerable results with standard training methods that do not incorporate tricks; similar to the Wasserstein distance, this degree of equilibrium can also estimate the degree of convergence of the model. MMD GAN (Li et al. 2017b) connected moment matching networks and GANs and achieved competitive performance with state-of-the-art GANs. Kim et al. (Kim and Bengio 2016) and VGAN (Zhai et al. 2016) integrated GANs with energy-based models and improved the performance of generative models.

GANs have achieved remarkable results in image generation. LapGAN (Denton et al. 2015) generated high-resolution images from low-resolution ones with the help of the Laplacian pyramid framework. Furthermore, ProgGAN (Karras et al. 2017) proposed to train the generator and discriminator progressively at increasing resolution levels, which can produce extremely high-quality 2K-resolution images. In semi-supervised learning, TripleGAN (Li et al. 2017a) introduced a classifier C to perform generation tasks under semi-supervised conditions. DCGAN (Radford, Metz, and Chintala 2015) introduced interpolation in the latent space to generate smooth transitions in the image space. However, the adversarial loss provides no guarantee of such smooth transitions. This paper therefore analyzes the constraint required for smooth transitions in the image space and introduces a method to strengthen this property in GANs.

Proposed Method

In this section, we analyze the conditions required for smooth transitions in the image space and develop a smooth generator within GANs.

Generative Adversarial Nets

A discriminator D and a generator G play a max-min game in GANs, in which the discriminator D is responsible for distinguishing real samples from generated samples, while the generator G tries to deceive the discriminator D. When the game reaches equilibrium, the generator G is able to fit the complicated distribution of real samples.

Formally, given a sample $x$ from the real distribution $P_d$ and noise $z$ drawn from the noise distribution $P_z$ (e.g., a Gaussian or uniform distribution), the optimal generator G transforming the noise distribution into the real data distribution can be solved from the following min-max optimization problem:

$$\min_G \max_D \ \mathbb{E}_{x \sim P_d}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))]. \qquad (1)$$

We denote the distribution of the generated samples $G(z)$ by $P_G$. By alternately optimizing the generator G and the discriminator D in the min-max problem, we expect the generated distribution $P_G$ to become gradually consistent with the real data distribution $P_d$.

Smooth Generator

In the noise-to-image generation task, it is difficult to know what type of perturbation could happen in practice. Hence we consider general perturbations on pixels, and Euclidean distance is adopted for the measurement. Under a continuous translation or rotation, generated images are still expected to evolve smoothly, and thus pixel values should avoid sudden changes. Given the input noise vector $z$ and the generator G, the generated image can be written as $G(z)$. We suppose that the value of the $i$-th pixel of the image $G(z)$ is determined by $G_i(z)$, where $G_i(\cdot)$ is reduced from $G(\cdot)$. A smooth generator is then expected to have the following pixel-wise property:

$$|G_i(z + \delta) - G_i(z)| < \epsilon, \qquad (2)$$

where $\delta$ stands for a small perturbation of the input noise $z$, and $\epsilon > 0$ is a small constant. Since linear interpolation around $z$ in the noise space can be approximated as imposing a perturbation $\delta$ on $z$, Eq. (2) encourages the image generated from the noise interpolation to stay close to the original image.
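
To make the pixel-wise property in Eq. (2) concrete, the sketch below measures the largest per-pixel change produced by one random noise perturbation of bounded norm. It is a diagnostic sketch under assumptions (a PyTorch generator `G` taking batches of 128-dimensional noise), not part of the authors' released code.

```python
import torch

def max_pixel_change(G, z, radius=1e-2):
    """Largest per-pixel change |G_i(z + delta) - G_i(z)| under one random
    perturbation delta with ||delta||_2 = radius (cf. Eq. (2))."""
    delta = torch.randn_like(z)
    delta = radius * delta / delta.norm(dim=1, keepdim=True)  # project onto the sphere
    with torch.no_grad():
        diff = (G(z + delta) - G(z)).abs()
    return diff.flatten(1).max(dim=1).values  # one worst-case pixel change per sample

# Hypothetical usage: z = torch.randn(16, 128); print(max_pixel_change(G, z))
```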
In addition, Eq. (2) can help improve the robustness of the generator, so that it is not easily disabled by adversarial noise inputs with slight perturbations. However, it is difficult to integrate Eq. (2) directly into the objective function of GANs because $\delta$ is unspecified. We next analyze the appropriate $\delta$ that satisfies Eq. (2) in the following theorem.

Theorem 1. Fix $\epsilon > 0$. Given $z \in \mathbb{R}^d$ as the noise input of generator $G_i$, if the perturbation $\delta \in \mathbb{R}^d$ satisfies

$$\|\delta\|_q < \frac{\epsilon}{\max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{z}} G_i(\hat{z})\|_p}, \qquad (3)$$

with $\frac{1}{p} + \frac{1}{q} = 1$, then $|G_i(z + \delta) - G_i(z)| < \epsilon$.

Proof. Without loss of generality, we first suppose $G_i(z) \le G_i(z + \delta)$. Our aim is then to demonstrate what condition $\delta$ should obey to realize

$$0 \le G_i(z + \delta) - G_i(z) < \epsilon. \qquad (4)$$

By the fundamental theorem of calculus, we have

$$G_i(z + \delta) = G_i(z) + \int_0^1 \langle \nabla_z G_i(z + t\delta), \delta \rangle \, dt, \qquad (5)$$

so that

$$0 \le G_i(z + \delta) - G_i(z) = \int_0^1 \langle \nabla_z G_i(z + t\delta), \delta \rangle \, dt. \qquad (6)$$

Consider the fact that

$$\int_0^1 \langle \nabla_z G_i(z + t\delta), \delta \rangle \, dt \le \|\delta\|_q \int_0^1 \|\nabla_z G_i(z + t\delta)\|_p \, dt, \qquad (7)$$

where Hölder's inequality is applied and the $q$-norm is dual to the $p$-norm with $\frac{1}{p} + \frac{1}{q} = 1$. Suppose that $\hat{z} = z + t\delta$ lies in a sphere centered at $z$ with radius $R$, and define the sphere as $B_p(z,R) = \{\hat{z} \in \mathbb{R}^d \mid \|z - \hat{z}\|_p \le R\}$. Hence, we have

$$\int_0^1 \|\nabla_z G_i(z + t\delta)\|_p \, dt \le \max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{z}} G_i(\hat{z})\|_p. \qquad (8)$$

By combining Eqs. (7) and (8), Eq. (6) can be rewritten as

$$0 \le G_i(z + \delta) - G_i(z) \le \|\delta\|_q \max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{z}} G_i(\hat{z})\|_p. \qquad (9)$$

If the right side of Eq. (9) is always upper bounded by $\epsilon$, i.e.,

$$\|\delta\|_q \max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{z}} G_i(\hat{z})\|_p < \epsilon, \qquad (10)$$

we reach the conclusion that $0 \le G_i(z + \delta) - G_i(z) < \epsilon$. According to Eq. (10), $\delta$ should satisfy

$$\|\delta\|_q < \frac{\epsilon}{\max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{z}} G_i(\hat{z})\|_p}. \qquad (11)$$

By setting $z := z + \delta$ and $\delta := -\delta$, we obtain the same constraint (i.e., Eq. (11)) on $\delta$ to achieve $-\epsilon < G_i(z + \delta) - G_i(z)$. The proof is completed.

By minimizing the denominator in Eq. (3), the model is expected to tolerate a larger perturbation $\delta$ under a fixed difference $\epsilon$ on the $i$-th pixel. If all pixels of the generated image are investigated simultaneously, we then have

$$L = \min_G \max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{z}} G(\hat{z})\|_p. \qquad (12)$$

However, $\max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{z}} G(\hat{z})\|_p$ is difficult to calculate. Since $\hat{z}$ lies in a local region around $z$, it is reasonable to assume that a sample drawn from $P_z$ approximates $\hat{z}$ well. Hence, we can reformulate Eq. (12) as

$$L = \min_G \ \mathbb{E}_{z \sim P_z} \|\nabla_z G(z)\|_p. \qquad (13)$$

Though minimizing $\|\nabla_z G(z)\|_p$ will increase the perturbation $\delta$ that can be tolerated by the generator, it is inappropriate to expect an enormously large $\delta$, which could damage the diversity of generated images. If the generator is extremely insensitive to changes in the input, linear interpolation in the noise space would always lead to the same output. As a result, we introduce a constant $k$ as a margin to constrain the value of $\|\nabla_z G(z)\|_p$,

$$L = \mathbb{E}_{z \sim P_z} \max(0, \|\nabla_z G(z)\|_p^2 - k). \qquad (14)$$

If the value of $\|\nabla_z G(z)\|_p^2$ is larger than $k$, the generator is penalized. Otherwise, the value of $\|\nabla_z G(z)\|_p$ is considered small enough to admit an appropriate $\delta$ for the generator. This hinge loss is advantageous over a classical squared loss that expects the gradient magnitude to be exactly $k$, as it is unreasonable to require the same gradient magnitude for all data points from the distribution $P_z$.
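
In practice, Eq. (14) can be implemented directly with automatic differentiation. The sketch below is one possible reading with $p = 2$; since $G(z)$ is vector-valued, it follows the common shortcut of differentiating the summed output rather than forming the full Jacobian, and the function name and default margin are illustrative assumptions, not the authors' implementation.

```python
import torch

def generator_smoothness_penalty(G, z, k=10.0):
    """Hinge penalty E_z[max(0, ||grad_z G(z)||_2^2 - k)] from Eq. (14)."""
    z = z.detach().clone().requires_grad_(True)
    out = G(z)
    # create_graph=True keeps the graph so the penalty can itself be
    # backpropagated through when updating the generator.
    grad = torch.autograd.grad(out.sum(), z, create_graph=True)[0]
    sq_norm = grad.flatten(1).pow(2).sum(dim=1)      # squared 2-norm per sample
    return torch.clamp(sq_norm - k, min=0.0).mean()  # penalize only above margin k
```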
Smooth GAN

So far, we have mainly focused on the smoothness of generated images while neglecting their quality. Considering the generation network and the discriminant network within the framework of GANs, we suggest that the proposed smooth generator is also beneficial for improving the quality of generated images.

Well-trained deep neural networks have recently been found vulnerable to adversarial examples that are imperceptible to humans. Most studies on adversarial examples address the image classification problem, but in the image generation task we can easily discover failure cases of well-trained generators as well. The noises resulting in these failure cases can thus be regarded as adversarial noise inputs. WGAN-GP (Gulrajani et al. 2017) is a recent promising variant of the vanilla GAN,

$$\min_D \ \mathbb{E}_{z \sim P_z}[D(G(z))] - \mathbb{E}_{x \sim P_d}[D(x)]. \qquad (15)$$

The loss function of WGAN-GP reflects the image quality, in contrast to the loss of the vanilla GAN, which measures how well the generator fools the discriminator. The second term in Eq. (15) involves only real samples and has no connection with the generator; a larger value of $D(G(z))$ in Eq. (15) therefore indicates a higher quality of generated images. If a noise vector $z$ generates a high-quality image, we expect that its neighboring point $z + \delta$ generates an image of high quality as well. To decrease the quality gap between images generated from close noise inputs, we need to ensure that $D(G(z))$ does not drop significantly when the input varies, i.e.,

$$|D[G(z + \delta)] - D[G(z)]| < \epsilon. \qquad (16)$$

In the following theorem, we analyze what conditions the perturbation $\delta$ should satisfy to guarantee the image quality.

Theorem 2. Fix $\epsilon > 0$. Consider generator G and discriminator D in GANs. Given a noise input $z \in \mathbb{R}^d$, the generated image is $\hat{x} = G(\hat{z})$. If the perturbation $\delta \in \mathbb{R}^d$ satisfies

$$\|\delta\|_q < \frac{\epsilon}{\max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{x}} D(\hat{x})\|_p \|\nabla_{\hat{z}} G(\hat{z})\|_p}, \qquad (17)$$

with $\frac{1}{p} + \frac{1}{q} = 1$, we have $|D[G(z + \delta)] - D[G(z)]| < \epsilon$.

Proof. Without loss of generality, we first suppose $D[G(z + \delta)] \ge D[G(z)]$. Following the proof of Theorem 1, we can draw a similar conclusion,

$$0 \le D[G(z + \delta)] - D[G(z)] \le \|\delta\|_q \max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{z}} D[G(\hat{z})]\|_p. \qquad (18)$$

According to the chain rule, we have

$$\nabla_{\hat{z}} D[G(\hat{z})] = \nabla_{\hat{x}} D(\hat{x}) \, \nabla_{\hat{z}} G(\hat{z}), \qquad (19)$$

where $\hat{x} = G(\hat{z})$ is the generated image. Given the fact that

$$\|\nabla_{\hat{x}} D(\hat{x}) \, \nabla_{\hat{z}} G(\hat{z})\|_p \le \|\nabla_{\hat{x}} D(\hat{x})\|_p \|\nabla_{\hat{z}} G(\hat{z})\|_p, \qquad (20)$$

where $\frac{1}{p} + \frac{1}{q} = 1$, Eq. (18) can be rewritten as

$$\|\delta\|_q \max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{z}} D[G(\hat{z})]\|_p \le \|\delta\|_q \max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{x}} D(\hat{x})\|_p \|\nabla_{\hat{z}} G(\hat{z})\|_p. \qquad (21)$$

If the right side of Eq. (21) is always upper bounded by $\epsilon$, i.e.,

$$\|\delta\|_q \max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{x}} D(\hat{x})\|_p \|\nabla_{\hat{z}} G(\hat{z})\|_p < \epsilon, \qquad (22)$$

we then have

$$\|\delta\|_q < \frac{\epsilon}{\max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{x}} D(\hat{x})\|_p \|\nabla_{\hat{z}} G(\hat{z})\|_p}. \qquad (23)$$

In a similar manner, we can get the same constraint (i.e., Eq. (23)) on $\delta$ to achieve $-\epsilon < D[G(z + \delta)] - D[G(z)]$. The proof is completed.

Based on Theorem 2, we propose to minimize

$$\max_{\hat{z} \in B_p(z,R)} \|\nabla_{\hat{x}} D(\hat{x})\|_p \|\nabla_{\hat{z}} G(\hat{z})\|_p, \qquad (24)$$

so that the upper bound on $\delta$ is enlarged and the GAN model is expected to tolerate more drastic perturbations. Since it is difficult to discover the optimal $\hat{z} \in B_p(z,R)$, we again suppose that it is approximated by a sample drawn from the distribution $P_z$. The loss function can then be reformulated as

$$\min_{G,D} \ \mathbb{E}_{z \sim P_z} \|\nabla_x D(x)\|_p \|\nabla_z G(z)\|_p, \qquad (25)$$

where $x \sim P_G$ is the generated sample $G(z)$. Two terms, $\|\nabla_x D(x)\|_p$ and $\|\nabla_z G(z)\|_p$, are involved in Eq. (25). WGAN-GP (Gulrajani et al. 2017) proposed the gradient penalty

$$L_{GPoD} = \mathbb{E}_{\tilde{x} \sim P_{\tilde{x}}}[(\|\nabla_{\tilde{x}} D(\tilde{x})\|_2 - 1)^2], \qquad (26)$$

where $P_{\tilde{x}}$ covers both the real sample distribution $P_d$ and the generated sample distribution $P_G$. By concentrating on generated samples $x \sim P_G$, Eq. (26) encourages $\|\nabla_x D(x)\|_2$ to move towards 1, and has been shown in experiments to successfully constrain the norm of the discriminator gradient $\|\nabla_x D(x)\|_2$. The remaining term $\|\nabla_z G(z)\|_p$ in Eq. (25) is therefore our only focus. In a similar manner, we encourage the norm of the generator gradient to stay at a low level and reformulate Eq. (25) as

$$L_{GPoG} = \mathbb{E}_{z \sim P_z} \max(0, \|\nabla_z G(z)\|_2^2 - k), \qquad (27)$$

where we set $p = q = 2$. This is exactly the same as Eq. (14).

By integrating Eqs. (26) and (27) with WGAN, we obtain the resulting objective function:

$$L = \mathbb{E}_{z \sim P_z}[D(G(z))] - \mathbb{E}_{x \sim P_d}[D(x)] + \lambda \, \mathbb{E}_{\tilde{x} \sim P_{\tilde{x}}}[(\|\nabla_{\tilde{x}} D(\tilde{x})\|_2 - 1)^2] + \gamma \, \mathbb{E}_{z \sim P_z} \max(0, \|\nabla_z G(z)\|_2^2 - k), \qquad (28)$$

where $\lambda$ and $\gamma$ are constants that balance the different terms. Our complete training procedure is summarized in Algorithm 1.

Algorithm 1: Smooth GAN
Require: the number of critic iterations per generator iteration $n_{critic}$, the batch size $m$, Adam hyperparameters $\alpha$, $\beta_1$, $\beta_2$, and the loss-balancing coefficients $\lambda$, $\gamma$.
Require: initial discriminator parameters $w_0$, initial generator parameters $\theta_0$.
repeat
 1: for $t = 1, \ldots, n_{critic}$ do
 2:   for $i = 1, \ldots, m$ do
 3:     Sample real data $x \sim P_d$, latent variable $z \sim P_z$, and a random number $u \sim U[0,1]$.
 4:     Compute the fake sample $G_\theta(z)$ and the interpolated sample $\tilde{x} \leftarrow u x + (1 - u) G_\theta(z)$.
 5:     Compute the loss $L_D^{(i)} \leftarrow D_w[G_\theta(z)] - D_w(x) + \lambda (\|\nabla_{\tilde{x}} D_w(\tilde{x})\|_2 - 1)^2$.
 6:   end for
 7:   Update the discriminator parameters: $w \leftarrow \mathrm{Adam}(\nabla_w \frac{1}{m}\sum_{i=1}^m L_D^{(i)}, w, \alpha, \beta_1, \beta_2)$.
 8: end for
 9: Sample a batch of latent variables $\{z^{(i)}\}_{i=1}^m \sim P_z$.
10: Compute the loss $L_G^{(i)} \leftarrow -D_w[G_\theta(z^{(i)})] + \gamma \max(0, \|\nabla_{z^{(i)}} G_\theta(z^{(i)})\|_2^2 - k)$.
11: Update the generator parameters: $\theta \leftarrow \mathrm{Adam}(\nabla_\theta \frac{1}{m}\sum_{i=1}^m L_G^{(i)}, \theta, \alpha, \beta_1, \beta_2)$.
until $\theta$ has converged
Ensure: a smooth generator network G.
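
A condensed PyTorch sketch of one outer iteration of Algorithm 1 follows. It is an illustrative reading of the pseudocode, not the authors' implementation: `G`, `D`, the two Adam optimizers, the 128-dimensional latent space, and the default values of `lam`, `gamma`, and `k` are assumptions, and the noise gradient is again taken through the summed generator output.

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    """WGAN-GP term lam * E[(||grad D(x~)||_2 - 1)^2] on interpolates (Eq. (26))."""
    u = torch.rand(real.size(0), 1, 1, 1, device=real.device)   # u ~ U[0, 1]
    x_tilde = (u * real + (1 - u) * fake).requires_grad_(True)  # line 4 of Algorithm 1
    grad = torch.autograd.grad(D(x_tilde).sum(), x_tilde, create_graph=True)[0]
    return lam * (grad.flatten(1).norm(dim=1) - 1).pow(2).mean()

def train_step(G, D, opt_G, opt_D, real, n_critic=5, lam=10.0, gamma=0.1, k=10.0):
    # Critic updates: lines 1-8 of Algorithm 1.
    for _ in range(n_critic):
        z = torch.randn(real.size(0), 128, device=real.device)
        fake = G(z).detach()
        loss_D = D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake, lam)
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

    # Generator update with the smoothness hinge of Eq. (27): lines 9-11.
    z = torch.randn(real.size(0), 128, device=real.device, requires_grad=True)
    fake = G(z)
    grad_z = torch.autograd.grad(fake.sum(), z, create_graph=True)[0]
    hinge = torch.clamp(grad_z.flatten(1).pow(2).sum(dim=1) - k, min=0.0).mean()
    loss_G = -D(fake).mean() + gamma * hinge
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```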
Figure 2: Illustration of image interpolations on the MNIST dataset generated from WGAN-GP (Gulrajani et al. 2017) (left)
and the proposed method (middle and right).

Experimental Results

In this section, we conduct comprehensive experiments on a toy dataset and three real-world image datasets, MNIST (LeCun et al. 1998), CIFAR-10 (Krizhevsky and Hinton 2009), and CelebA (Liu et al. 2015).

Datasets and Settings

In this part, we introduce the real image datasets used in the experiments, the corresponding experimental settings, and the network structure. All images are normalized to pixel values in [-1, +1]. We utilize different network architectures for different datasets, as detailed below. The common points are: i) no nonlinear activation is attached to the end of the discriminators; ii) the minibatch size used in training is 64 for both the generator and the discriminator; iii) the Adam optimizer is used with learning rate 0.0001 and momentum 0.5; iv) the noise dimension is 128 for the generator; v) weights are initialized from a Gaussian $N(0, 0.01)$.

MNIST (LeCun et al. 1998) is a handwritten digit dataset (from 0 to 9) composed of 28 x 28 pixel greyscale images from ten categories. The whole dataset of 70,000 images is split into 60,000 training and 10,000 test images. In the experiments on the MNIST dataset, we use the 10,000 test images as a validation set in the calculation of FID.

CIFAR-10 (Krizhevsky and Hinton 2009) is a dataset of 32 x 32 pixel RGB color images drawn from 10 categories. There are 60,000 images in the CIFAR-10 dataset, split into 50,000 training and 10,000 testing images. We calculate the FID with 3,000 images randomly selected from the test set.

CelebA (Liu et al. 2015) is a dataset consisting of 202,599 portraits of celebrities. We use the aligned and cropped version, which preprocesses each image to a size of 64 x 64 pixels. 3,000 examples are randomly selected as the test set, and the remaining samples form the training set.

Evaluation Metrics

We evaluate the proposed method mainly in terms of three metrics well suited to the image domain.

Inception score (IS) (Salimans et al. 2016), which rewards high quality and high variability of samples, can be expressed as $\exp(\mathbb{E}_x[D_{KL}(p(y|x)\,\|\,p(y))])$, where $p(y) = \frac{1}{N}\sum_{i=1}^{N} p(y \mid x^i = G(z^i))$ is the marginal distribution and $p(y|x)$ is the conditional distribution of the samples. In this paper, we estimate the IS using an Inception model (Szegedy et al. 2016) pretrained in torchvision of PyTorch.

Fréchet Inception Distance (FID) (Heusel et al. 2017) describes the distance between two distributions. FID is computed as follows:

$$\mathrm{FID} = \|\mu_g - \mu_r\|_2^2 + \mathrm{Tr}\big(\Sigma_g + \Sigma_r - 2(\Sigma_g \Sigma_r)^{\frac{1}{2}}\big), \qquad (29)$$

where $(\mu_g, \Sigma_g)$ and $(\mu_r, \Sigma_r)$ are the mean and covariance of embedded samples from the generated distribution $P_g$ and the real image distribution $P_r$, respectively. In this paper, we regard the feature maps obtained from a specific layer of the pretrained Inception model as the embedding of the samples. FID is more sensitive to the diversity of samples belonging to the same category, and it fixes the drawback of the Inception score that it is easily fooled by a model which generates only one image per class. We describe the quality of models using the FID and IS scores together.

Multi-scale Structural Similarity for Image Quality (MS-SSIM) (Wang, Simoncelli, and Bovik 2003) was proposed to measure the similarity of two images. This metric is well suited to evaluating the quality of samples belonging to one class. As a result, we apply it to describe the smoothness of samples obtained from the interpolation operation. If two algorithms receive similar FID and IS scores, their samples can be considered equivalent in quality and diversity, and a higher MS-SSIM score then means a smoother conversion process.

Interpolation

We implement image interpolation on the MNIST and CelebA datasets and show a few results in Figures 2 and 3. We feed a series of noises, obtained by linearly interpolating between two points in the noise space, into the model, and expect the resulting images to show a smooth transition in the image space.

In Figure 2, the left part and the middle part, showing similar transitions, are generated from models trained with WGAN-GP and the Smooth GAN, respectively. The left part shows more meaningless images (highlighted in red) during the transition. When changing an image from one class to another, we want to keep the whole process smooth and meaningful. As shown in the right part of Figure 2, these images accomplish such transitions by taking both image quality and image semantics into account. For example, the digit '6' first becomes '1' and then becomes '9'; obviously, '1' is closer to '9' than '6' is. Interpolation results on the other digits also illustrate this phenomenon. In the results on the CelebA dataset shown in Figure 3, we achieve high quality while maintaining smoothness, which illustrates the effectiveness of our approach.
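
The interpolation slices discussed above only require evaluating the generator along a straight line in the noise space; a minimal sketch (assuming a generator `G` and two endpoint noise vectors, names chosen for illustration) is:

```python
import torch

def interpolate_images(G, z0, z1, steps=10):
    """Images along the segment (1 - a) * z0 + a * z1 for a in [0, 1]."""
    alphas = torch.linspace(0.0, 1.0, steps, device=z0.device).view(-1, 1)
    z_path = (1 - alphas) * z0 + alphas * z1   # (steps, latent_dim) by broadcasting
    with torch.no_grad():
        return G(z_path)                       # one image per interpolation step

# Hypothetical usage:
# z0, z1 = torch.randn(1, 128), torch.randn(1, 128)
# frames = interpolate_images(G, z0, z1, steps=10)
```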
Figure 3: Image interpolations on the CelebA dataset. (a) Interpolation images generated from WGAN-GP. (b) Interpolation images generated from the proposed Smooth GAN.

Figure 4: FID and sliced MS-SSIM obtained by different models (DCGAN, WGAN-GP, and Smooth GAN) on the CIFAR-10 dataset.

Taking two images and their interpolation images generated from the linear interpolations in noise space as an interpolation slice, we generate several such slices on the CelebA dataset to illustrate the effectiveness of the proposed method in a more convincing way. We generated these slices with different models and calculated the FID and MS-SSIM of the slices for comparison. Different from calculating the MS-SSIM score over the whole set of resulting images, we calculate it for every interpolation slice independently. Higher MS-SSIM values correspond to perceptually more similar images, but also to lower diversity and mode collapse (Odena, Olah, and Shlens 2016; Fedus et al. 2017). Meanwhile, a low FID score ensures diversity and prevents the GANs from mode collapse. As a result, considered together with FID, the MS-SSIM score can focus on indicating the similarity of images, which is consistent with a smooth transition within an interpolation slice. The sliced MS-SSIM score used in this experiment can be described as:

$$\frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \operatorname{MS-SSIM}(S_i, S_j), \qquad (30)$$

where $S_i$ is the $i$-th slice of samples in the resulting group. We can then estimate the effectiveness with FID and sliced MS-SSIM. We report the FID and sliced MS-SSIM obtained on the CelebA dataset in Figure 4. Our model not only has the lowest FID score, but its MS-SSIM score also exceeds that of all other models. This is consistent with our observations and demonstrates the effectiveness of our approach.
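
Eq. (30) is the mean MS-SSIM over all pairs of interpolation slices. A minimal sketch, assuming a pairwise scorer `ms_ssim(a, b)` is available (for example from the third-party pytorch-msssim package; passing it in as an argument keeps the sketch library-agnostic):

```python
from itertools import combinations

def sliced_ms_ssim(slices, ms_ssim):
    """Mean pairwise MS-SSIM over interpolation slices (Eq. (30)).
    `slices` is a list of N image batches; `ms_ssim` scores one pair."""
    pairs = list(combinations(range(len(slices)), 2))  # N(N - 1) / 2 pairs
    total = sum(ms_ssim(slices[i], slices[j]) for i, j in pairs)
    return total / len(pairs)
```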
Figure 5: FID scores (a) and Wasserstein distance convergence curves under different values of k (b: k = 10, c: k = 3, d: k = 0.1).

Hyperparameter Analysis. To illustrate the choice of the value of k, we show FID scores and Wasserstein distance convergence curves for different k values on the CIFAR-10 dataset in Figure 5. Figure 5(a) shows the FID scores obtained with four k values, and k = 10 provides the best score. Figure 5(b) shows that the experiment with k = 10 also achieves the best convergence. Setting k to 3 hampers the training progress and reaches a slightly higher Wasserstein distance than Figure 5(b) at convergence. The generator fails to produce sufficiently realistic images when k is set to 0.1, because too small a k value suppresses the diversity of the generator output. Conversely, a large value of k relaxes the constraint on the gradient and can result in insufficient smoothness of the network output. In our experiments, we therefore chose k per dataset based on such experimental results.
Figure 6: Samples generated from the proposed model trained on MNIST (left), CIFAR-10 (middle), and CelebA (right).

Figure 7: FID distribution obtained on several subsets for DCGAN, WGAN-GP, and Smooth GAN (x-axis: FID, binned from 45-46 to 55-56).

Table 1: Inception scores (higher is better) and FIDs (lower is better) on the CIFAR-10 dataset.

Method                                        IS ↑         FID ↓
DCGAN (Radford, Metz, and Chintala 2015)      6.40 ± .05   36.7
BEGAN (Berthelot, Schumm, and Metz 2017b)     5.62 ± .07   32.3
WGAN (Arjovsky, Chintala, and Bottou 2017)    3.82 ± .06   42.8
D2GAN (Nguyen et al. 2017)                    7.15 ± .07   30.6
WGAN-GP (Gulrajani et al. 2017)               7.36 ± .07   26.5
Smooth GAN (Ours)                             7.66 ± .05   24.1

Image Generation

We conduct experiments on three real image datasets to investigate the capabilities of the proposed method. Table 1 reports the Inception scores and FIDs on the CIFAR-10 dataset obtained from the proposed model and the baseline methods. In these results, the proposed method outperforms most state-of-the-art methods. Therefore, the proposed method provides considerable quality on the three datasets.

Figure 6 shows several samples generated by the model learned with the proposed method. The samples on the MNIST dataset show a variety of digits and styles. Dogs, trucks, boats, and fish can also be found in the samples on the CIFAR-10 dataset. Age and gender diversity can also be observed in the results on the CelebA dataset. These results confirm the capabilities of the proposed method.

Samples Quality Distribution

In this section, we introduce a way to describe the quality distribution of samples and demonstrate that the proposed method can effectively reduce the number of low-quality images. FID is a good indicator to evaluate the quality of the generated samples; however, FID only provides an average quality over the whole set of test images. First, we generate a sufficiently large set of images (50,000 in our experiments). Next, we randomly sample 512 images to form a subset and calculate the FID of this subset. We repeat the second step 120 times and collect the FID scores. By comparing these FIDs obtained from subsets, we can roughly estimate the quality distribution of the samples. Figure 7 shows the distribution of FID scores calculated over subsets for the three models. Compared to the other models, our model obtains lower FID scores, and larger FID values occur less often. Moreover, the FID scores obtained from the proposed model are mainly concentrated between 46 and 50, and only 8 subset scores fall outside this region, while the other two algorithms yield looser distributions of FID scores. Therefore, our method can effectively reduce the number of low-quality samples among the generated samples.
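
The subset procedure can be sketched as follows: Eq. (29) is evaluated on Inception embeddings of random 512-image subsets, repeated 120 times. The embedding extraction is omitted, and the function names are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_from_stats(mu_g, sigma_g, mu_r, sigma_r):
    """Frechet Inception Distance of Eq. (29) from embedding statistics."""
    covmean = sqrtm(sigma_g @ sigma_r).real  # discard tiny imaginary parts
    return float(np.sum((mu_g - mu_r) ** 2)
                 + np.trace(sigma_g + sigma_r - 2.0 * covmean))

def subset_fid_scores(embeddings, mu_r, sigma_r, subset=512, repeats=120, seed=0):
    """FID of random fixed-size subsets of generated-image embeddings."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        idx = rng.choice(len(embeddings), size=subset, replace=False)
        sub = embeddings[idx]
        scores.append(fid_from_stats(sub.mean(axis=0),
                                     np.cov(sub, rowvar=False),
                                     mu_r, sigma_r))
    return scores
```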
Conclusions

Here we analyzed the relationship between perturbations on the input of the generator and their influence on the output images. By investigating the tolerable input perturbation, we developed a smooth generator. We further integrated this smooth generator with a gradient-penalized discriminator, producing a smooth GAN that generates stable and high-quality images. Experiments on real-world image datasets demonstrate the necessity of studying the smooth generator and show that the proposed method is capable of learning a smooth GAN.

Acknowledgments

We thank the support of the NSFC under Grants 61876007 and 61872012, ARC Projects FL-170100117, DP-180103424, IH-180100002, and DE-180101438, and Beijing Major Science and Technology Project Z171100000117008.
References

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Berthelot, D.; Schumm, T.; and Metz, L. 2017a. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.

Berthelot, D.; Schumm, T.; and Metz, L. 2017b. BEGAN: Boundary equilibrium generative adversarial networks. CoRR abs/1703.10717.

Bojanowski, P.; Joulin, A.; Lopez-Paz, D.; and Szlam, A. 2017. Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776.

Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2172-2180.

Chen, X.; Xu, C.; Yang, X.; and Tao, D. 2018. Attention-GAN for object transfiguration in wild images. In The European Conference on Computer Vision (ECCV).

Chen, X.; Xu, C.; Yang, X.; Song, L.; and Tao, D. 2019. Gated-GAN: Adversarial gated networks for multi-collection style transfer. IEEE Transactions on Image Processing 28(2):546-560.

Denton, E. L.; Chintala, S.; Szlam, A.; and Fergus, R. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. CoRR abs/1506.05751.

Fedus, W.; Rosca, M.; Lakshminarayanan, B.; Dai, A. M.; Mohamed, S.; and Goodfellow, I. 2017. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672-2680.

Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 5767-5777.

Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Klambauer, G.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500.

Karras, T.; Aila, T.; Laine, S.; and Lehtinen, J. 2017. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

Kim, T., and Bengio, Y. 2016. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439.

Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.

Li, C.; Xu, K.; Zhu, J.; and Zhang, B. 2017a. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291.

Li, C.-L.; Chang, W.-C.; Cheng, Y.; Yang, Y.; and Póczos, B. 2017b. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, 2203-2213.

Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, 3730-3738.

Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; and Frey, B. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644.

Mao, X.; Li, Q.; Xie, H.; Lau, R. Y.; Wang, Z.; and Smolley, S. P. 2017. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2813-2821. IEEE.

Mirza, M., and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

Nguyen, T.; Le, T.; Vu, H.; and Phung, D. 2017. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, 2670-2680.

Odena, A.; Olah, C.; and Shlens, J. 2016. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585.

Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2234-2242.

Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818-2826.

van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Vinyals, O.; Graves, A.; et al. 2016. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, 4790-4798.

Wang, Z.; Simoncelli, E. P.; and Bovik, A. C. 2003. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, volume 2, 1398-1402. IEEE.

Zhai, S.; Cheng, Y.; Feris, R.; and Zhang, Z. 2016. Generative adversarial networks as variational training of energy based models. arXiv preprint arXiv:1611.01799.