Deep Image Synthesis from Intuitive User Input: A Review and Perspectives

Yuan Xue1, Yuan-Chen Guo2, Han Zhang3, Tao Xu4, Song-Hai Zhang2, Xiaolei Huang1

1 The Pennsylvania State University, University Park, PA, USA
2 Tsinghua University, Beijing, China
3 Google Brain, Mountain View, CA, USA
4 Facebook, Menlo Park, CA, USA

arXiv:2107.04240v2 [cs.CV] 30 Sep 2021

Abstract

In many applications of computer graphics, art and design, it is desirable for a user to provide intuitive non-image input, such as text, sketch, stroke, graph or layout, and have a computer system automatically generate photo-realistic images that adhere to the input content. While classic works that allow such automatic image content generation have followed a framework of image retrieval and composition, recent advances in deep generative models such as generative adversarial networks (GANs), variational autoencoders (VAEs), and flow-based methods have enabled more powerful and versatile image generation tasks. This paper reviews recent works for image synthesis given intuitive user input, covering advances in input versatility, image generation methodology, benchmark datasets, and evaluation metrics. This motivates new perspectives on input representation and interactivity, cross-pollination between major image generation paradigms, and evaluation and comparison of generation methods.

Keywords: Image Synthesis, Intuitive User Input, Deep Generative Models, Synthesized Image Quality Evaluation

1 Introduction

Machine learning and artificial intelligence have given computers the ability to mimic or even defeat humans in tasks like playing chess and Go, recognizing objects in images, and translating from one language to another. An interesting next pursuit would be: can computers mimic creative processes, such as mimicking painters in making pictures, or assisting artists or architects in making artistic or architectural designs? In fact, in the past decade, we have witnessed advances in systems that synthesize an image from a text description [143, 98, 152, 142] or from a learned style constant [50], paint a picture given a sketch [106, 27, 25, 73], render a photorealistic scene from a wireframe [61, 134], and create virtual reality content from images and videos [121], among others. A comprehensive review of such systems can inform about the current state of the art in such pursuits, reveal open challenges, and illuminate future directions. In this paper, we make an attempt at a comprehensive review of image synthesis and rendering techniques given simple, intuitive user inputs such as text, sketches or strokes, semantic label maps, poses, visual attributes, graphs and layouts. We first present ideas on what makes a good paradigm for image synthesis from intuitive user input and review popular metrics for evaluating the quality of generated images. We then introduce several mainstream methodologies for image synthesis given user inputs, and review algorithms developed for application scenarios specific to different formats of user inputs. We also summarize major benchmark datasets used by current methods, and advances and trends in image synthesis methodology. Last, we provide our perspective on future directions towards developing image synthesis models capable of generating complex images that are closely aligned with the user input condition, have high visual realism, and adhere to constraints of the physical world.

2 What Makes a Good Paradigm for Image Synthesis from Intuitive User Input?

2.1 What Types of User Input Do We Need?

For an image synthesis model to be user-friendly and applicable in real-world applications, user inputs that
are intuitive, easy for interactive editing, and commonly used in the design and creation processes are desired. We define an input modality to be intuitive if it has the following characteristics:

• Accessibility. The input should be easy to access, especially for non-professionals. Take sketches as an example: even people without any trained drawing skills can express rough ideas through sketching.

• Expressiveness. The input should be expressive enough to allow someone to convey not only simple concepts but also complex ideas.

• Interactivity. The input should be interactive to some extent, so that users can modify the input content interactively and fine-tune the synthesized output in an iterative fashion.

Taking painting as an example, a sketch is an intuitive input because it is what humans use to design the composition of a painting. On the other hand, being intuitive often means that the information provided by the input is limited, which makes the generation task more challenging. Moreover, for different types of applications, the suitable forms of user input can be quite different.

For image synthesis with intuitive user input, the most relevant and well-investigated approach is conditional image generation. In other words, user inputs are treated as conditioning input that guides the generation process of a conditional generative model. In this review, we will mainly discuss mainstream conditional image generation applications, including those using text descriptions, sketches or strokes, semantic maps, poses, visual attributes, or graphs as intuitive input. The processing and representation of user input are usually application- and modality-dependent. When text descriptions are given as input, pretrained text embeddings are often used to convert the text into a vector representation of the input words. Image-like inputs, such as sketches, semantic maps and poses, are often represented as images and processed accordingly. In particular, one-hot encoding can be used in semantic maps to represent different categories, and keypoint maps can be used to encode poses where each channel represents the position of a body keypoint; both result in multi-channel image-like tensors as input. Using visual attributes as input is most similar to general conditional generation tasks, where attributes can be provided in the form of class vectors. For graph-like user inputs, additional processing steps are required to extract the relationship information represented in the graphs. For instance, graph convolutional networks (GCNs) [53] can be applied to extract node features from input graphs. More details of the processing and representation methods of the various input types will be reviewed and discussed in Sec. 4.
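As a concrete illustration of these image-like encodings, the following sketch (not taken from any surveyed paper; the array shapes and the Gaussian standard deviation are assumptions for illustration) converts an integer semantic label map into a one-hot tensor and a set of 2D body keypoints into per-joint Gaussian heatmaps. Either tensor can then be fed to a synthesis network as a multi-channel image.

```python
import numpy as np

def one_hot_semantic_map(label_map, num_classes):
    """Convert an (H, W) integer label map into a (num_classes, H, W) one-hot tensor."""
    h, w = label_map.shape
    one_hot = np.zeros((num_classes, h, w), dtype=np.float32)
    for c in range(num_classes):
        one_hot[c] = (label_map == c).astype(np.float32)
    return one_hot

def keypoint_heatmaps(keypoints, height, width, sigma=4.0):
    """Encode K (x, y) body keypoints as a (K, H, W) stack of isotropic Gaussian heatmaps."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps

# Example: a 4-class label map and three keypoints on a 128x128 canvas.
labels = np.random.randint(0, 4, size=(128, 128))
semantic_tensor = one_hot_semantic_map(labels, num_classes=4)                 # (4, 128, 128)
pose_tensor = keypoint_heatmaps([(30, 40), (64, 64), (90, 100)], 128, 128)    # (3, 128, 128)
```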
2.2 How Do We Evaluate the Output Synthesized Images?

The goodness of an image synthesis method depends on how well its output adheres to the user input, whether the output is photorealistic or structurally coherent, and whether it can generate a diverse pool of images that satisfy the requirements. There have been general metrics designed for evaluating the quality, and sometimes the diversity, of synthesized images. Widely adopted metrics use different methods to extract features from images and then calculate different scores or distances. Such metrics include the Peak Signal-to-Noise Ratio (PSNR), Inception Score (IS), Fréchet Inception Distance (FID), Structural Similarity Index Measure (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS).

Peak Signal-to-Noise Ratio (PSNR) measures the physical quality of a signal by the ratio between the maximum possible power of the signal and the power of the noise affecting it. For images, PSNR can be represented as

    PSNR = 10 \log_{10} \frac{\mathrm{DR}^2}{\frac{1}{3m} \sum_{k} \sum_{i,j} (t_{i,j,k} - y_{i,j,k})^2},    (1)

where k indexes the color channels, DR is the dynamic range of the image (255 for 8-bit images), m is the number of pixels, i, j are indices iterating over every pixel, and t and y are the reference image and the synthesized image, respectively.

The Inception Score (IS) [103] uses a pre-trained Inception [112] network to compute the KL-divergence between the conditional class distribution and the marginal class distribution. The Inception Score is defined as

    IS = \exp\big(\mathbb{E}_{x}\, \mathrm{KL}(P(y \mid x) \,\|\, P(y))\big),    (2)

where x is an input image and y is the label predicted by an Inception model. A high Inception Score indicates that the generated images are diverse and semantically meaningful.

Fréchet Inception Distance (FID) [34] is a popular evaluation metric for image synthesis tasks, especially for Generative Adversarial Network (GAN) based models. It computes the divergence between the synthetic data distribution and the real data distribution:

    FID = \|\hat{m} - m\|_2^2 + \mathrm{Tr}\big(\hat{C} + C - 2(C\hat{C})^{1/2}\big),    (3)

where m, C and \hat{m}, \hat{C} represent the mean and covariance of the feature embeddings of the real and the synthetic distributions, respectively. The feature embedding is extracted from a pre-trained Inception-v3 [112] model.
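To make Eqs. (1) and (3) concrete, the sketch below (an illustrative implementation, not code from any surveyed paper) computes PSNR directly from a pair of images and FID from pre-extracted Inception feature arrays; the feature-extraction step itself is assumed to happen elsewhere.

```python
import numpy as np
from scipy import linalg

def psnr(reference, synthesized, dynamic_range=255.0):
    """Eq. (1): PSNR between two (H, W, 3) images with the same dynamic range."""
    mse = np.mean((reference.astype(np.float64) - synthesized.astype(np.float64)) ** 2)
    return 10.0 * np.log10(dynamic_range ** 2 / mse)

def fid(real_features, fake_features):
    """Eq. (3): FID from two (N, D) arrays of Inception embeddings."""
    m, m_hat = real_features.mean(axis=0), fake_features.mean(axis=0)
    c, c_hat = np.cov(real_features, rowvar=False), np.cov(fake_features, rowvar=False)
    covmean = linalg.sqrtm(c.dot(c_hat))
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((m_hat - m) ** 2) + np.trace(c + c_hat - 2.0 * covmean))
```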

The Structural Similarity Index Measure (SSIM) [126] and the multi-scale structural similarity (MS-SSIM) metric [127] give a relative similarity score to an image against a reference one, which is different from absolute measures like PSNR. The SSIM is defined as

    \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},    (4)

where \mu and \sigma indicate the average and variance of the two windows x and y, and c_1 and c_2 are two constants that stabilize the division when the denominator is weak. SSIM measures perceived image quality considering structural information. When applied pair-wise between generated images, a lower score indicates higher diversity of the generated images (i.e., less mode collapse).

Another metric based on features extracted from pre-trained CNNs is the Learned Perceptual Image Patch Similarity (LPIPS) score [145]. The distance is calculated as

    d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left(\hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw}\right) \right\|_2^2,    (5)

where \hat{y}^l, \hat{y}^l_0 \in \mathbb{R}^{H_l \times W_l \times C_l} are unit-normalized feature stacks from the l-th layer of a pre-trained CNN and w_l denotes channel-wise weights. LPIPS evaluates perceptual similarity between image patches using the learned deep features of trained neural networks.

For flow-based models [102, 52] and autoregressive models [118, 117, 104], the average negative log-likelihood (i.e., bits per dimension) [118] is often used to evaluate the quality of generated images. It is calculated as the negative log-likelihood with log base 2 divided by the number of pixels, which is interpretable as the number of bits that a compression scheme based on the model would need to compress every RGB color value [118].
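As a small worked example of this conversion (illustrative only; the variable names and the assumption that the model reports a summed per-image log-likelihood in nats are mine), bits per dimension can be obtained from a log-likelihood as follows.

```python
import numpy as np

def bits_per_dim(log_likelihood_nats, image_shape=(3, 32, 32)):
    """Convert a per-image log-likelihood (in nats) into bits per dimension.

    bits/dim = -(log2 likelihood) / (number of RGB values per image)
    """
    num_dims = int(np.prod(image_shape))
    log_likelihood_bits = log_likelihood_nats / np.log(2.0)   # nats -> bits
    return -log_likelihood_bits / num_dims

# A model assigning log p(x) = -8000 nats to a 3x32x32 image scores about 3.76 bits/dim.
print(bits_per_dim(-8000.0))
```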
Besides metrics designed for general purposes, specific evaluation metrics have been proposed for different applications with various input types. For instance, with text descriptions as input, R-precision [133] evaluates whether a generated image is well conditioned on the given text description; it is measured by retrieving relevant text given an image query. For sketch-based image synthesis, classification accuracy is used to measure the realism of the synthesized objects [27, 25] and how well the identities of synthesized results match those of real images [77]. Also, the similarity between input sketches and the edges of synthesized images can be measured to evaluate the correspondence between the input and the output [25]. In the scenario of pose-guided person image synthesis, "masked" versions of IS and SSIM, Mask-IS and Mask-SSIM, are often used to ignore the effects of the background [79, 80, 107, 111, 154], since we only want to focus on the synthesized human body. Similar to sketch-based synthesis, a detection score (DS) is used to evaluate how well the synthesized person can be detected [107, 154], and keypoint accuracy can be used to measure the level of correspondence between keypoints [154]. For semantic maps, a commonly used metric tries to restore the semantic-map input from the generated images using a pre-trained segmentation network and then compares the restored semantic map with the original input by the Intersection over Union (IoU) score or other segmentation accuracy measures. Similarly, with visual attributes as input, a pre-trained attribute classifier or regressor can be used to assess the attribute correctness of generated images.
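The semantic-map faithfulness check described above can be sketched as follows (a minimal illustration assuming integer label maps of the same shape; the mean-IoU definition is the standard per-class intersection over union).

```python
import numpy as np

def mean_iou(restored_labels, input_labels, num_classes):
    """Mean per-class IoU between a segmentation of the generated image and the input semantic map."""
    ious = []
    for c in range(num_classes):
        pred, target = restored_labels == c, input_labels == c
        union = np.logical_or(pred, target).sum()
        if union == 0:            # class absent in both maps: skip it
            continue
        ious.append(np.logical_and(pred, target).sum() / union)
    return float(np.mean(ious))
```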
3 Overview of Mainstream Conditional Image Synthesis Paradigms

Image synthesis models with intuitive user inputs often involve different types of generative models, more specifically, conditional generative models that treat the user input as an observed conditioning variable. Two major goals of the synthesis process are high realism of the synthesized images and correct correspondence between input conditions and output images. In the existing literature, methods vary from more traditional retrieval-and-composition based methods to more recent deep learning based algorithms. In this section, we give an overview of the architectures and main components of different conditional image synthesis models.

3.1 Retrieval and Composition

Traditional image synthesis techniques mainly take a retrieval-and-composition paradigm. In the retrieval stage, candidate images or image fragments are fetched from a large image collection, under some user-provided constraints, like texts, sketches and semantic label maps. Methods like edge extraction, saliency detection, object detection and semantic segmentation are used to pre-process images in the collection according to the different input modalities and generation purposes, after which the retrieval can be performed using shallow image features like HoG and Shape Context [5]. The user may interact with the system to improve the quality of the retrieved candidates. In the composition stage, the selected images or image fragments are combined by Poisson blending, alpha blending, or a hybrid of both [15], resulting in the final output image.

The biggest advantage of synthesizing images through retrieval and composition is its controllability and interpretability. The user can simply intervene in the generation process at any stage, and easily find out whether the output image looks the way it should
be. But it cannot generate instances that do not appear in the collection, which restricts the range and diversity of the output.

3.2 Conditional Generative Adversarial Networks (cGANs)

Generative Adversarial Networks (GANs) [29] have achieved tremendous success in various image generation tasks. A GAN model typically consists of two networks: a generator network that learns to generate realistic synthetic images, and a discriminator network that learns to differentiate between real images and synthetic images produced by the generator. The two networks are optimized alternately through adversarial training. Vanilla GAN models are designed for unconditional image generation and implicitly model the distribution of images. To gain more control over the generation process, conditional GANs (cGANs) [86] synthesize images based on both a random noise vector and a condition vector provided by users. The objective of training a cGAN as a minimax game is

    \min_{\theta_G} \max_{\theta_D} L_{cGAN} = \mathbb{E}_{(x,y) \sim p_{data}(x,y)}[\log D(x, y)] + \mathbb{E}_{z \sim p(z),\, y \sim p_{data}(y)}[\log(1 - D(G(z, y), y))],    (6)

where x is the real image, y is the user input, and z is the random noise vector. There are different ways of incorporating the user input in the discriminator, such as inserting it at the beginning of the discriminator [86], in the middle of the discriminator [88], or at the end of the discriminator [91].
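The following PyTorch-style sketch illustrates one training step under objective (6), with the condition injected by simple concatenation; it is a generic illustration rather than the formulation of any specific surveyed model, and the network definitions (G, D), optimizers and data loading are assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

def cgan_step(G, D, opt_g, opt_d, real_images, conditions, z_dim=128):
    """One alternating update of discriminator and generator for a conditional GAN (Eq. 6)."""
    batch = real_images.size(0)
    z = torch.randn(batch, z_dim, device=real_images.device)
    fake_images = G(z, conditions)

    # Discriminator: push D(x, y) toward 1 and D(G(z, y), y) toward 0.
    d_real = D(real_images, conditions)
    d_fake = D(fake_images.detach(), conditions)
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: push D(G(z, y), y) toward 1 (the common non-saturating form of Eq. 6).
    g_fake = D(fake_images, conditions)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```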
3.3 Variational Auto-encoders (VAEs)

Variational auto-encoders (VAEs), proposed in [51], extend the idea of the auto-encoder and introduce variational inference to approximate the latent representation z encoded from the input data x. The encoder converts x into z in a latent space, from which the decoder tries to reconstruct x. Similar to GANs, which typically assume the input noise vector follows a Gaussian distribution, VAEs use variational inference to approximate the posterior p(z|x), given that p(z) follows a Gaussian distribution. After a VAE is trained, the decoder is used as a generator, similar to the generator in a GAN, which can draw samples from the latent space and generate new synthetic data. Based on the vanilla VAE, Sohn et al. proposed the conditional VAE (cVAE) [109, 54, 44], which is a conditional directed graphical model whose input observations modulate the latent variables that generate the outputs. Similar to cGANs, cVAEs allow the user to provide guidance to the image synthesis process via user input. The training objective for the cVAE is

    \max_{\theta,\phi} L_{cVAE} = \mathbb{E}_{z \sim Q_{\phi}}[\log P_{\theta}(x \mid z, y)] - D_{KL}\big[Q_{\phi}(z \mid x, y) \,\|\, p(z \mid y)\big],    (7)

where x is the real image, y is the user input, z is the latent variable, and p(z | y) is the prior distribution of the latent variable, such as a Gaussian distribution. \phi and \theta are the parameters of the encoder Q and the decoder P networks, respectively. An illustration of cGAN and cVAE can be found in Fig. 1.

Figure 1: A general illustration of (a) cGAN and (b) cVAE that can be applied to image synthesis with intuitive user inputs. During inference, the generator in cGAN and the decoder in cVAE generate new images x̂ under the guidance of user input y and a noise vector or latent variable z.
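A minimal PyTorch-style sketch of objective (7) follows (illustrative; the encoder and decoder modules, and the choice of a standard-normal prior and Bernoulli reconstruction likelihood, are assumptions made for the example).

```python
import torch
import torch.nn.functional as F

def cvae_loss(encoder, decoder, x, y):
    """Negative ELBO for a conditional VAE (Eq. 7) with a standard-normal prior p(z|y) = N(0, I)."""
    mu, logvar = encoder(x, y)                      # Q_phi(z | x, y)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)            # reparameterization trick
    x_recon = decoder(z, y)                         # P_theta(x | z, y)

    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")     # -E[log P(x | z, y)]
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    # D_KL(Q || N(0, I))
    return recon + kl    # minimizing this maximizes the ELBO in Eq. (7)
```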
3.4 Other Learning-based Methods

Other learning-based conditional image synthesis models include hybrid methods such as combinations of VAE and GAN models [57, 4], autoregressive models, and normalizing flow-based models. Among these methods, autoregressive models such as PixelRNN [118], PixelCNN [117], and PixelCNN++ [104] provide a tractable likelihood over priors such as class conditions. The generation process is similar to autoregression: while classic autoregression models predict future information based on past observations, image autoregressive models synthesize the next image pixels based on previously generated or existing nearby pixels.

Flow-based models [102], or normalizing-flow based methods, consist of a sequence of invertible transformations which can convert a simple distribution (e.g., Gaussian) into a more complex one with the same dimension. While flow-based methods have not been widely applied to image synthesis with intuitive user inputs, a few works [52] show that they have great potential in visual-attribute-guided synthesis and may be applicable to broader scenarios.

Among the aforementioned mainstream paradigms, traditional retrieval and composition methods have the advantage of better controllability and interpretability, although the diversity of the synthesized images and the flexibility of the models are limited. In comparison, deep learning based methods generally have stronger
feature representation capacity, with GANs having the potential to generate images of the highest quality. While GAN models have been successfully applied to various image synthesis tasks due to their flexibility, they lack tractable and explicit likelihood estimation. On the contrary, autoregressive models admit tractable likelihood estimation and can assign a probability to a single sample. VAEs, with latent representation learning, provide better feature representation power and can be more interpretable. Compared with VAEs and autoregressive models, normalizing-flow methods provide both feature representation power and tractable likelihood estimation.

4 Methods Specific to Applications with Various Input Types

In this section, we review works in the literature that target application scenarios with specific input types. We will review methods for image synthesis from text descriptions, sketches and strokes, semantic label maps, poses, and other input modalities including visual attributes, graphs and layouts. Among the different input types, text descriptions are flexible, expressive and user-friendly, yet comprehending the input content and responding to interactive editing can be challenging for the generative models; example applications of text-to-image systems are computer-generated art, image editing, computer-aided design, interactive storytelling, and visual chat for education and language learning. Image-like inputs such as sketches and semantic maps contain richer information and can better guide the synthesis process, but may require more effort from users to provide adequate input; such inputs can be used in applications such as image and photo editing, computer-assisted painting and rendering. Other inputs such as visual attributes, graphs and layouts allow appearance, structural or other constraints to be given as conditional input and can help guide the generation of images that preserve the visual properties of objects and the geometric relations between objects; they can be used in various computer-aided design applications for architecture, manufacturing, publishing, arts, and fashion.

4.1 Text Description as Input

The task of text-to-image synthesis (Fig. 2) is to use descriptive sentences as input to guide the generation of corresponding images. The generated image types vary from single-object images [90, 128] to multi-object images with complex backgrounds [72]. Descriptive sentences in a natural language offer a general and flexible way of describing visual concepts and objects. As text is one of the most intuitive types of user input, text-to-image synthesis has gained much attention from the research community, and numerous efforts have been made towards developing better text-to-image synthesis models. In this subsection, we will review state-of-the-art text-to-image synthesis models and discuss recent advances.

Learning Correspondence Between Text and Image Representations. One of the major challenges of the text-to-image synthesis task is that the input text and the output image are in different modalities, which requires learning the correspondence between text and image representations. This multi-modal nature and the need to learn text-to-image correspondence motivated Reed et al. [100] to first propose to solve the task using a GAN model. In [100], the authors proposed to generate images conditioned on the embedding of text descriptions, instead of class labels as in traditional cGANs [86]. To learn the text embedding from input sentences, a deep convolutional image encoder and a character-level convolutional-recurrent text encoder are trained jointly so that the text encoder can learn a vector representation of the input text descriptions. Adapted from the DCGAN architecture [99], the learned text encoding is then concatenated with both the input noise vector in the generator and the image features in the discriminator along the depth dimension. The method [100] generated encouraging results on both the Oxford-102 dataset [90] and the CUB dataset [128], with the limitation that the resolution of the generated images is relatively low (64 × 64).
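The conditioning-by-concatenation scheme used in [100] can be sketched as follows (an illustrative PyTorch fragment under assumed tensor shapes, not the authors' code): the text embedding is concatenated with the noise vector before the generator, and spatially replicated before being concatenated with discriminator feature maps.

```python
import torch

def condition_generator_input(z, text_embedding):
    """Concatenate a noise vector (B, Z) with a text embedding (B, T) along the channel axis."""
    return torch.cat([z, text_embedding], dim=1)          # (B, Z + T)

def condition_discriminator_features(feature_map, text_embedding):
    """Spatially replicate a text embedding (B, T) and concatenate it with a (B, C, H, W) feature map."""
    b, t = text_embedding.shape
    _, _, h, w = feature_map.shape
    tiled = text_embedding.view(b, t, 1, 1).expand(b, t, h, w)
    return torch.cat([feature_map, tiled], dim=1)         # (B, C + T, H, W)
```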
Another work proposed around the same time is by Mansimov et al. [81], which proposes a combination of a recurrent variational autoencoder with an attention model that iteratively draws patches on a canvas while attending to the relevant words in the description. Input text descriptions are represented as a sequence of consecutive words, and images are represented as a sequence of patches drawn on a canvas. For image generation, which samples from a Gaussian distribution, the Gaussian mean and variance depend on the previous hidden states of the generative LSTM. Experiments by [81] on the MS-COCO dataset show reasonable results that correspond well to the text descriptions.

To further improve the visual quality and realism of images generated from text descriptions, Han et al. proposed multi-stage GAN models, StackGAN [143] and StackGAN++ [144], to enable multi-scale, incremental refinement in the image generation process. Given text descriptions, StackGAN [143] decomposes the text-to-image generative process into two stages, where Stage-I captures basic object features and the background layout, and Stage-II refines details of
the objects and generates a higher-resolution image. Unlike [100], which transforms the high-dimensional text encoding into low-dimensional latent variables, StackGAN adopts a Conditioning Augmentation technique, which samples the latent variables from an independent Gaussian distribution parameterized by the text encoding. Experiments on the Oxford-102 [90], CUB [128] and COCO [72] datasets show that StackGAN can generate compelling images with resolutions up to 256 × 256. In StackGAN++ [144], the authors extended the original StackGAN into a more general and robust model which contains multiple generators and discriminators to handle images at different resolutions. Then, Zhang et al. [146] extended the multi-stage generation idea by proposing the HDGAN model with a single-stream generator and multiple hierarchically-nested discriminators for high-resolution image synthesis. The hierarchically-nested discriminators distinguish outputs from intermediate layers of the generator to capture hierarchical visual features. The training of HDGAN is done via optimizing a pair loss [100] and a patch-level discriminator loss [43].
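The Conditioning Augmentation step used by StackGAN, as described above, can be sketched as follows (a schematic PyTorch fragment written from that description, with assumed embedding and latent dimensions, rather than the official StackGAN implementation).

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sample a latent text condition c ~ N(mu(e), diag(sigma(e)^2)) from a sentence embedding e."""
    def __init__(self, embed_dim=1024, latent_dim=128):
        super().__init__()
        self.fc = nn.Linear(embed_dim, latent_dim * 2)   # predicts mean and log-variance

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=1)
        c = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        return c, mu, logvar   # mu/logvar are also typically used for a KL regularizer
```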
Figure 2: Example bird image synthesis results given text descriptions as input, with an attention mechanism. Key words in the input sentences are correctly captured and represented in the generated images. Image taken from AttnGAN [133].

In addition to generation via multi-stage refinement [143, 144], the attention mechanism has been introduced to improve text-to-image synthesis at a more fine-grained level. Xu et al. introduced AttnGAN [133], an attention-driven image synthesis model that generates images by focusing on different regions described by different words of the text input. A Deep Attentional Multimodal Similarity Model (DAMSM) module is also proposed to match the learned embeddings between image regions and text at the word level. To achieve better semantic consistency between text and image, Qiao et al. [98] proposed MirrorGAN, which guides the image generation with both sentence- and word-level attention and further tries to reconstruct the original text input to guarantee image-text consistency. The backbone of MirrorGAN uses a multi-scale generator as in [144]. The proposed text reconstruction model is pre-trained to stabilize the training of MirrorGAN. Zhu et al. [152] introduces a gating mechanism where a writing gate writes selected important textual features from the given sentence into a dynamic memory, and a response gate adaptively reads from the memory and from the visual features of some initially generated images. The proposed DM-GAN relies less on the quality of the initial images and can refine poorly-generated initial images with wrong colors and rough shapes.

To learn expression variants in different text descriptions of the same image, Yin et al. proposes SD-GAN [136] to distill the shared semantics from texts that describe the same image. The authors propose a Siamese structure with a contrastive loss to minimize the distance between images generated from descriptions of the same image, and maximize the distance between those generated from descriptions of different images. To retain the semantic diversity for fine-grained image generation, a semantic-conditioned batch normalization is also introduced for enhanced visual-semantic embedding.

Location and Layout Aware Generation. With advances in correspondence learning between text and image, the content described in the input text can already be well captured in the generated image. However, to achieve finer control over generated images, such as object locations, additional inputs or intermediate steps are often required. For text-based and location-controllable synthesis, Reed et al. [101] proposes to generate images conditioned on both the text description and object locations. Built upon the similar idea of inferring scene structure for image generation, Hong et al. [37] introduces a novel hierarchical approach for text-to-image synthesis by inferring the semantic layout from the text description. Bounding boxes are first generated from the text input through an auto-regressive model, then semantic layouts are refined from the generated bounding boxes using a convolutional recurrent neural network. Conditioned on both the text and the semantic layouts, the authors adopt a combination of the pix2pix [43] and CRN [12] image-to-image translation models to generate the final images. With predicted semantic layouts, this work [37] has potential in generating more realistic images containing complex objects such as those in the MS-COCO [72] dataset. Li et al. [63] extends the work of [37] and introduces Obj-GAN, which generates salient objects given a text description. The semantic layout is first generated as in [37] and later converted into the synthetic image. A Fast R-CNN [28] based object-wise discriminator is developed to retain the matching between the generated objects and the input text and layout. Experiments on the MS-COCO dataset show improved performance in generating complex scenes compared to
previous methods.

Compared to [37], Johnson et al. [46] includes another intermediate step which converts the input sentences into scene graphs before generating the semantic layouts. A graph convolutional network is developed to generate embedding vectors for each object. Bounding boxes and segmentation masks for each object, constituting the scene layout, are converted from the object embedding vectors. Final images are synthesized by a CRN model [12] from the noise vectors and scene layouts. In addition to text input, [46] also allows direct generation from input scene graphs. Experiments are conducted on the Visual Genome [56] dataset and the COCO-Stuff [7] dataset, which is augmented from a subset of the MS-COCO [72] dataset, and show better depiction of complex sentences with many objects than the previous method [143].

Without taking the complete semantic layout as additional input, Hinz et al. [35] introduces a model consisting of a global pathway and an object pathway for finer control of object location and size within an image. The global pathway is responsible for creating a general layout of the global scene, while the object pathway generates object features within the given bounding boxes. Then the outputs of the global and object pathways are combined to generate the final synthetic image. When there is no text description available, [35] can take a noise vector and the individual object bounding boxes as input.

Taking an approach different from GAN based methods, Tan et al. [113] proposes a Text2Scene model for text-to-scene generation, which learns to sequentially generate objects and their attributes, such as location, size, and appearance, at every time step. With a convolutional recurrent module and an attention module, Text2Scene can generate abstract scenes and object layouts directly from descriptive sentences. For image synthesis, Text2Scene retrieves patches from real images to generate the image composites.

Fusion of Conditional and Unconditional Generation. While most existing text-to-image synthesis models are based on conditional image generation, Bodla et al. [6] proposes FusedGAN, which combines unconditional image generation and conditional image generation. An unconditional generator produces a structure prior independent of the condition, and the other, conditional generator refines details and creates an image that matches the input condition. FusedGAN is evaluated on both the text-to-image generation task and the attribute-to-face generation task, which will be discussed later in Sec. 4.3.1.

Evaluation Metrics for Text to Image Synthesis. Widely used metrics for image synthesis such as IS [103] lack awareness of the matching between the text and the generated images. Recently, more efforts have been focused on proposing more accurate evaluation metrics for text-to-image synthesis and for evaluating the correspondence between the generated image content and the input condition. R-precision is proposed in [133] to evaluate whether a generated image is well conditioned on the given text description. Hinz et al. proposes the Semantic Object Accuracy (SOA) score [36], which uses a pre-trained object detector to check whether the generated image contains the objects described in the caption, especially for the MS-COCO dataset. SOA shows better correlation with human perception than IS in a user study and provides better guidance for training text-to-image synthesis models.

Benchmark Datasets. For text-guided image synthesis tasks, popular benchmark datasets include datasets with a single object category and datasets with multiple object categories. For single-object-category datasets, the Oxford-102 dataset [90] contains 102 different types of flowers common in the UK. The CUB dataset [128] contains photos of 200 bird species, most of which are from North America. Datasets with multiple object categories and complex relationships can be used to train models for more challenging image synthesis tasks. One such dataset is MS-COCO [72], which has a training set with 80k images and a validation set with 40k images. Each image in the COCO dataset has five text descriptions.

4.2 Image-like Inputs

In this section, we summarize image synthesis works based on three types of intuitive inputs, namely sketch, semantic map and pose. We call them "image-like inputs" because all of them can be, and have been, represented as rasterized images. Therefore, synthesizing images from these image-like inputs can be regarded as an image-to-image translation problem. Several works provide general solutions to this problem, like pix2pix [43] and pix2pixHD [124]. In this survey, we focus on works that deal with a specific type of input.

4.2.1 Sketches and Strokes as Input

Sketches, or line drawings, can be used to express users' intentions in an intuitive way, even for those without professional drawing skills. With the widespread use of touch screens, it has become very easy to create sketches, and the research community is paying increasingly more attention to the understanding and processing of hand-drawn sketches, especially in applications such as sketch-based image retrieval and sketch-to-image generation. Generating realistic images from sketches is not a trivial task, since the synthesized images need to be aligned spatially with the given sketches while maintaining semantic coherence.

Figure 3: A classical pipeline of retrieval-and-composition methods for synthesis. Candidate images are generated by composing image segments retrieved from a pre-built image database. Image taken from [15].

Retrieval-and-Composition based Approaches. Early approaches to generating images from sketches mainly take a retrieval-and-composition strategy. For each object in the user-given sketch, they search for candidate images in a pre-built object-level image (fragment) database, using some similarity metric to evaluate how well the sketch matches the image. The final image is synthesized as the composition of the retrieved results, mainly by image blending algorithms. Chen et al. [15] presented a system called Sketch2Photo, which composes a realistic image from a simple free-hand sketch annotated with text labels. The authors proposed a contour-based filtering scheme to search for appropriate photographs according to the given sketch and text labels, and proposed a novel hybrid blending algorithm, which is a combination of alpha blending and Poisson blending, to improve the synthesis quality. Eitz et al. [24] created Photosketcher, a system that finds semantically relevant regions from appropriate images in a large image collection and composes the regions automatically. Users can also interact with the system by drawing scribbles on the retrieved images to improve region segmentation quality, re-sketching to find better candidates, or choosing among different blending strategies. Hu et al. [38] introduced PatchNet, a hierarchical representation of image regions that summarizes a homogeneous image patch by a graph node and represents geometric relationships between regions by labeled graph edges. PatchNet was shown to be a compact representation that can be used efficiently for sketch-based, library-driven, interactive image editing. Wang et al. [120] proposed a sketch-based image synthesis method that compares sketches with the contours of object regions via the GF-HOG descriptor; novel images are composited by GrabCut followed by Poisson blending or alpha blending. For generating images of a single object, like an animal under user-specified poses and appearances, Turmukhambetov et al. [115] presented a sketch-based interactive system that generates the target image by composing patches of nearest-neighbour images on the joint manifold of ellipses and contours for object parts.

Deep Learning based Approaches. In recent years, deep convolutional neural networks (CNNs) have achieved significant progress in image-related tasks. CNNs have been used to map sketches to images, with the benefit of being able to synthesize novel images that are different from those in pre-built databases. One challenge to using deep CNNs is that training such networks requires paired sketch-image data, which can be expensive to acquire. Hence, various techniques have been proposed to generate synthetic sketches from images, and then use the synthetic sketch and image pairs for training. Methods for synthetic sketch generation include boundary detection algorithms such as Canny and Holistically-nested Edge Detection (HED) [132], and stylization algorithms for image-to-sketch conversion [130, 48, 64, 62, 26]. Post-processing steps are adopted for small stroke removal, spline fitting [32] and stroke simplification [108]. A few works utilize crowd-sourced free-hand sketches for training [25, 73]. They either construct pseudo-paired data by matching sketches and images [25], or propose a method that does not require paired data [73]. Another aspect of CNN training that has been investigated is the representation of sketches. In some works [16, 68], the input sketches are transformed into distance fields to obtain a dense representation, but no experimental comparisons have been done to demonstrate which form of input is more suitable for CNNs to process. Next, we review specific works that utilize a deep-learning based approach for sketch-to-image generation.
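A typical data-preparation step of this kind can be sketched as follows (an illustrative recipe using OpenCV's Canny detector and SciPy's Euclidean distance transform; the thresholds are arbitrary choices, and HED or XDoG could be substituted for the edge detector).

```python
import cv2
import numpy as np
from scipy import ndimage

def pseudo_sketch_and_distance_field(image_bgr, low=100, high=200):
    """Turn a photo into an edge-map 'pseudo-sketch' and a dense distance-field representation."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)                 # 255 on edges, 0 elsewhere
    sketch = (edges > 0).astype(np.uint8)
    # Distance from every pixel to the nearest stroke pixel gives a dense input representation.
    distance_field = ndimage.distance_transform_edt(1 - sketch).astype(np.float32)
    return sketch, distance_field
```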
Treating a sketch as an "image-like" input, several works use a fully convolutional neural network architecture to generate photorealistic images. Gucluturk et al. [30] first attempted to use deep neural networks to tackle the problem of sketch-based synthesis. They developed three different models to generate face images from three different types of sketches, namely line sketches, grayscale sketches and color sketches. An encoder-decoder fully convolutional neural network is adopted and trained with various loss terms. A total variation loss is proposed to encourage smoothness. Sangkloy et al. [106] proposed Scribbler, a system that can generate realistic images from human sketches and color strokes. The XDoG filter is used for boundary detection to generate image-sketch pairs, and color strokes are sampled to provide color constraints in training. The authors also use an encoder-decoder network architecture and adopt loss functions similar to those in [30]. Users can interact with the system in real time. The authors also provide applications for colorization of grayscale images.

Generative Adversarial Networks have also been used for sketch-to-image synthesis. Chen et al. [16] proposed a novel GAN-based architecture with multi-scale inputs for the problem. The generator and discriminator both consist of several Masked Residual Unit (MRU) blocks.

An MRU block takes in a feature map and an image, and outputs a new feature map, which allows the network to repeatedly condition on an input image, similar to a recurrent network. They also adopt a novel data augmentation technique, which generates sketch-image pairs automatically through edge detection and some post-processing steps including binarization, thinning, small component removal, erosion, and spur removal. To encourage diversity of the generated images, the authors proposed a diversity loss, which maximizes the L1 distance between the outputs for two identical input sketches with different noise vectors.
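In code, such a diversity term can be sketched as follows (a schematic PyTorch fragment based on the description above, not the authors' implementation; maximizing the term is usually done by adding its negative to the total loss).

```python
import torch

def diversity_loss(generator, sketch, z_dim=64):
    """L1 distance between two outputs of the same sketch under different noise vectors."""
    z1 = torch.randn(sketch.size(0), z_dim, device=sketch.device)
    z2 = torch.randn(sketch.size(0), z_dim, device=sketch.device)
    out1, out2 = generator(sketch, z1), generator(sketch, z2)
    return torch.mean(torch.abs(out1 - out2))   # maximize this (or minimize its negative)
```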
Lu et al. [77] considered the sketch-to-image synthesis problem as an image completion task and proposed a contextual GAN for the task. Unlike a traditional image completion task where only part of an object is masked, the entire real image is treated as the missing piece in a joint image that consists of both the sketch and the corresponding photo. The advantage of using such a joint representation is that, instead of using the sketch as a hard constraint, the sketch part of the joint image serves as a weak contextual constraint. Furthermore, the same framework can also be used for image-to-sketch generation, where the sketch would be the masked or missing piece to be completed. Ghosh et al. [27] presents an interactive GAN-based sketch-to-image translation system. As the user draws a sketch of a desired object type, the system automatically recommends completions and fills the shape with class-conditioned texture. The result changes as the user adds or removes strokes over time, which enables a feedback loop that the user can leverage for interactive editing. The system consists of a shape completion stage based on a non-image generation network [84], and a class-conditioned appearance translation stage based on the encoder-decoder model from MUNIT [41]. To perform class-conditioning more effectively, the authors propose a soft gating mechanism, instead of using a simple concatenation of class codes and features.

Several works focus on sketch-based synthesis of human face images. Portenier et al. [94] developed an interactive system for face photo editing. The user can provide shape and color constraints by sketching on the original photo, to get an edited version of it. The editing process is done by a CNN, which is trained on randomly masked face photos with sampled sketches and color strokes in an adversarial manner. Xia et al. [131] proposed a two-stage network for sketch-based portrait synthesis. The stroke calibration network is responsible for converting the input poorly-drawn sketch to a more detailed and calibrated one that resembles an edge map. Then the refined sketch is used in the image synthesis network to get a photo-realistic portrait image. Li et al. [68] proposed a self-attention module to capture long-range connections of sketch structures, where a self-attention mechanism is adopted to aggregate features from all positions of the feature map by the calculated self-attention map. A multi-scale discriminator is used to distinguish patches of different receptive fields, to simultaneously ensure local and global realism. Chen et al. [14] introduced DeepFaceDrawing, a local-to-global approach for generating face images from sketches that uses input sketches as soft constraints and is able to produce high-quality face images even from rough and/or incomplete sketches. The key idea is to learn feature embeddings of key face components and then train a deep neural network to map the embedded component features to realistic images.

While most works in sketch-to-image synthesis with deep learning techniques have focused on synthesizing object-level images from sketches, Gao et al. [25] explored synthesis at the scene level by proposing a deep learning framework for scene-level image generation from freehand sketches. The framework first segments the sketch into individual objects, recognizes their classes, and categorizes them into foreground/background objects. Then the foreground objects are generated by an EdgeGAN module that learns a common vector representation for images and sketches and maps the vector representation of an input sketch to an image. The background generation module is based on the pix2pix [43] architecture. The synthesized foregrounds along with the background sketches are fed to a network to get the final generated scene. To train the network and evaluate their method, the authors constructed a composite dataset called SketchyCOCO based on the Sketchy database [105], the TU-Berlin dataset [23], the QuickDraw dataset, and COCO Stuff [8].

Considering that collecting paired training data can be labor intensive, learning from unpaired sketch-photo data in an unsupervised setting is an interesting direction to explore. Liu et al. [73] proposed an unsupervised solution by decomposing the synthesis process into a shape translation stage and a content enrichment stage. The shape translation network transforms an input sketch into a gray-scale image, trained using unpaired sketches and images under the supervision of a cycle-consistency loss. In the content enrichment stage, a reference image can be provided as style guidance, whose information is injected into the synthesis process following the AdaIN framework [40].

Benchmark Datasets. For synthesis from sketches, various datasets covering multiple types of objects are used [139, 55, 137, 138, 128, 76, 49, 105, 125, 72, 8]. However, only a few of them [139, 105, 125] have paired image-sketch data. For the other datasets, edge maps or line strokes are extracted using edge extraction or style transfer techniques and used as fake sketch data for training and validation. SketchyCOCO [25] built a paired image-sketch dataset from existing image datasets [8] and sketch datasets [105, 23] by looking for
the most similar sketch with the same class label for each foreground object in a natural image.

4.2.2 Semantic Label Maps as Input

Figure 4: Illustration of image synthesis from semantic label maps (columns: Semantic Map, Ground Truth, Pix2PixHD, SPADE, SEAN). Image taken from [153].

Synthesizing photorealistic images from semantic label maps is the inverse problem of semantic image segmentation. It has applications in controllable image synthesis and image editing. Existing methods either work with a traditional retrieval-and-composition approach [47, 3], a deep learning based method [13, 58, 93, 74, 155, 114], or a hybrid of the two [96]. Different types of datasets are utilized to allow synthesizing images of various scenes or subjects, such as indoor/outdoor scenes, or human bodies.

Retrieval-and-Composition based Methods. Non-parametric methods follow the traditional retrieval-and-composition strategy. Johnson et al. [47] first proposed to synthesize images from semantic concepts. Given an empty canvas, the user can paint regions with corresponding keywords at desired locations. The algorithm searches for candidate images in the stock and uses a graph-cut based seam optimization process to generate realistic photographs for each combination. The best combination with the minimum seam cost is chosen as the final result. Bansal et al. [3] proposed a non-parametric matching and hierarchical composition strategy to synthesize realistic images from semantic maps. The strategy consists of four stages: a global consistency stage to retrieve relevant samples based on indicator vectors of the present categories, a shape consistency stage to find candidate segments based on the shape context similarity between the input label mask and the ones in the database, and a part consistency stage and a pixel consistency stage that re-synthesize patches and pixels based on best-matching areas as measured by Hamming distance. The proposed method outperforms state-of-the-art parametric methods like pix2pix [43] and pix2pixHD [124] both qualitatively and quantitatively.

Deep Learning based Methods. Methods based on deep learning mainly vary in network architecture design and optimization objective. Chen et al. [13] proposed a regression approach for synthesizing realistic images from semantic maps, without the need for adversarial training. To improve synthesis quality, they proposed the Cascaded Refinement Network (CRN), which progressively generates images from low resolution to high resolution (up to 2 megapixels at 1024x2048 pixel resolution) through a cascade of refinement modules. To encourage diversity in the generated images, the authors proposed a diversity loss, which lets the network output multiple images at a time and optimizes diversity within the collection. Wang et al. [123] proposed a style-consistent GAN framework that generates images given a semantic label map input and an exemplary image indicating style. A novel style-consistent discriminator is designed to determine whether a pair of images are consistent in style, and an adaptive semantic consistency loss is optimized to ensure correspondence between the generated image and the input semantic label map.

Having found that directly synthesizing images from semantic maps through a sequence of convolutions sometimes produces unsatisfactory results because of semantic information loss during forward propagation, some works seek to make better use of the input semantic map and preserve semantic information in all stages of the synthesis network. Park et al. [93] proposed a spatially-adaptive normalization layer (SPADE), which is a normalization layer with learnable parameters that utilizes the original semantic map to help retain semantic information in the feature maps after the traditional batch normalization. The authors incorporated their SPADE layers into the pix2pixHD architecture and produced state-of-the-art results on multiple datasets.
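The idea behind SPADE can be sketched as follows (a simplified PyTorch module written from the description above, with assumed channel sizes; the official implementation differs in details such as the exact parameter-free normalization and hidden width).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSPADE(nn.Module):
    """Normalize features, then re-modulate them with scale/bias maps predicted from the semantic map."""
    def __init__(self, feature_channels, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feature_channels, affine=False)   # parameter-free normalization
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feature_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feature_channels, 3, padding=1)

    def forward(self, features, segmap):
        segmap = F.interpolate(segmap, size=features.shape[2:], mode="nearest")
        hidden = self.shared(segmap)
        return self.norm(features) * (1 + self.gamma(hidden)) + self.beta(hidden)
```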
Liu et al. [74] argues that the convolutional network should be sensitive to semantic layouts at different locations. They thus proposed Conditional Convolution Blocks (CC Blocks), where the parameters of the convolution kernels are predicted from the semantic layouts. They also proposed a feature pyramid semantics-embedding (FPSE) discriminator, which predicts semantic alignment scores in addition to real/fake scores. It explicitly forces the generated images to be better aligned semantically with the given semantic map. Zhu et al. [155] proposed the Group Decreasing Network (GroupDNet). GroupDNet utilizes group convolutions in the generator, and the group number in the decoder decreases progressively. Inspired by SPADE, the authors also proposed a novel normalization layer to make better use of the information in the input semantic map. Experiments show that the GroupDNet architecture is more suitable for the semantically multi-modal image synthesis (SMIS) task and can produce plausible results.

Observing that results from existing methods often
lack detailed local texture, resulting from large objects dominating the training, Tang et al. [114] aims for better synthesis of small objects in the image. In their design, each class has its own class-level generation network that is trained with feedback from a classification loss, and all the classes share an image-level global generator. The class-level generator generates the parts of the image that correspond to each class, from masked feature maps. All the class-specific image parts are then combined and fused with the image-level generation result. In another work, to provide more fine-grained interactivity, Zhu et al. [153] proposed semantic region-adaptive normalization (SEAN), which allows manipulation of each semantic region individually, to improve image quality.

Integration methods. While deep learning based generative methods are better able to synthesize novel images, traditional retrieval-and-composition methods generate images with more reliable texture and fewer artifacts. To combine the advantages of both parametric and non-parametric methods, Qi et al. [96] presented a semi-parametric approach. They built a memory bank offline, containing segments of different classes of objects. Given an input semantic map, segments are first retrieved using a similarity metric defined by the IoU score of the masks. The retrieved segments are fed to a spatial transformer network where they are aligned, and further put onto a canvas by an ordering network. The canvas is refined by a synthesis network to get the final result. This combination of retrieval-and-composition and deep-learning based methods allows high-fidelity image generation, but it takes more time during inference and the framework is not end-to-end trainable.

Benchmark Datasets. For synthesis from semantic label maps, experiments are mainly conducted on datasets of the human body [69, 70, 75], the human face [59], indoor scenes [149, 150, 89] and outdoor scenes [18]. Lassner et al. [58] augmented the Chictopia10K [69, 70] dataset by adding 2D keypoint locations and fitted SMPL body models, and the augmented dataset is used by Bem et al. [19]. Park et al. [93] and Zhu et al. [153] collected images from the Internet and applied state-of-the-art semantic segmentation models [10, 11] to build paired datasets.

4.2.3 Poses as Input

Given a reference person image, its corresponding pose, and a novel pose, pose-based image synthesis methods can generate an image of the person in that novel pose. Different from synthesizing images from sketches or semantic maps, pose-guided synthesis requires novel views to be generated, which cannot be done by the retrieval-and-composition pipeline. Thus we focus on reviewing deep learning-based methods [2, 79, 80, 107, 95, 19, 22, 65, 111, 154]. In these methods, a pose is often represented as a set of well-defined body keypoints. Each of the keypoints can be modeled as an isotropic Gaussian that is centered at the ground-truth joint location and has a small standard deviation, giving rise to a heatmap. The concatenation of the joint-centered heatmaps can then be used as the input to the image synthesis network. Heatmaps of rigid parts and of the whole body can also be utilized [19].

Supervised Deep Learning Methods. In a supervised setting, ground-truth target images under the target poses are required for training. Thus, datasets with the same person in multiple poses are needed. Ma et al. [79] proposed the Pose Guided Person Generation Network for generating person images under given poses. It adopts a GAN-like architecture and generates images in a coarse-to-fine manner. In the coarse stage, an image of a person along with a novel pose are fed into the U-Net based generator, where the pose is represented as heatmaps of body keypoints. The coarse output is then concatenated again with the person image, and a refinement network is trained to learn a difference map that can be added to the coarse output to get the final refined result. The discriminator is trained to distinguish synthesized outputs from real images. Besides the GAN loss, an L1 loss is used to measure the dissimilarity between the generated output and the target image. Since the target image may have a different background from the input condition image, the L1 loss is modified to give higher weight to the human body, utilizing a pose mask derived from the pose skeleton.
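The pose-masked L1 loss described above can be sketched as follows (an illustrative PyTorch fragment based on the description, not the authors' code; the mask is assumed to be a soft map in [0, 1] derived from the target pose skeleton, and the weighting scheme shown is one simple choice).

```python
import torch

def pose_masked_l1(generated, target, body_mask, body_weight=1.0):
    """L1 loss that up-weights pixels inside the body mask relative to the background."""
    weights = 1.0 + body_weight * body_mask    # background pixels: 1, body pixels: 1 + body_weight
    return torch.mean(weights * torch.abs(generated - target))
```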
Although GANs have achieved great success in image synthesis, there are still some difficulties when it comes to pose-based synthesis, one of which is the deformation problem. The given novel pose can be drastically different from the original pose, resulting in large deformations in both shape and texture in the synthesized image and making it hard to directly train a network that is able to generate images without artifacts. Existing works mainly adopt transformation strategies to overcome this problem, because a transformation makes it explicit which body part will be moved to which place, being aware of the original and target poses. These methods usually transform body parts of the original image [2], the human parsing map [22], or the feature map [107, 22, 154]. Balakrishnan et al. [2] explicitly separate the human body from the background and synthesize person images of unseen poses and backgrounds in separate steps. Their method consists of four modules: a segmentation module that produces masks of the whole body and each body part based on the source image and pose; a transformation module that calculates and applies an affine transformation to each body part and the corresponding feature maps; a background generation module that applies inpainting
to fill the body-removed foreground region; and a final integration module that uses the transformed feature maps and the target pose to get the synthesized foreground, which is then combined with the inpainted background to get the final result. To train the network, they use a VGG-19 perceptual loss along with a GAN loss. Siarohin et al. [107] noted that it is hard for the generator to directly capture large body movements because of the restricted receptive field, and introduced deformable GANs to tackle the problem. The method decomposes the body joints into several semantic parts, and calculates an affine transform from the source to the target pose for each part. The affine transforms are used to align the feature maps of the source image with the target pose. The transformed feature maps are then concatenated with the target pose features and decoded to synthesize the output image. The authors also proposed a novel nearest-neighbor loss based on feature maps, instead of using an L1 or L2 loss. Their method is more robust to large pose changes and produces higher quality images compared to [79]. Dong et al. [22] utilize parsing results as a proxy to achieve better synthesis results. They first estimate parsing results for the target pose, then fit a Thin Plate Spline (TPS) transformation between the original and estimated parsing maps. The TPS transformation is further applied to warp the feature maps for feature alignment, and a soft-gated warping block is developed to provide controllability over the transformation degree. The final image is synthesized based on the transformed feature maps. Zhu et al. [154] proposed that large deformations can be divided into a sequence of small deformations, which are more friendly to network training. In this way, the original pose can be transformed progressively, through many intermediate poses. They proposed a Pose-Attentional Transfer Block (PATB), which transforms the feature maps under the guidance of an attention mask. By stacking multiple PATBs, the feature maps undergo several transformations, and the transformed maps are used to synthesize the final result.

While most of the deep learning based methods for synthesis from poses adopt an adversarial training paradigm, Bem et al. [19] proposed a conditional-VAEGAN architecture that combines a conditional-VAE framework and a GAN discriminator module to generate realistic natural images of people in a unified probabilistic framework, where the body pose and appearance are kept as separated and interpretable variables, allowing the sampling of people with independent variations of pose and appearance. The loss function used includes both conditional-VAE and GAN losses, composed of an L1 reconstruction loss, a closed-form KL-divergence loss between the recognition and prior distributions, and a discriminator cross-entropy loss.

Unsupervised Deep Learning Methods. The aforementioned pose-to-image synthesis methods require ground-truth images under the target poses for training because of their use of L1, L2 or perceptual losses. To eliminate the need for target images, some works focus on the unsupervised setting of this problem [95, 111], where the training process does not require a ground-truth image of the target pose. The basic idea is to ensure cycle consistency. After the forward pass, the synthesized result along with the target pose is treated as the reference and used to synthesize the image under the original reference pose; this synthesized image should be consistent with the original reference image. Pumarola et al. [95] further utilize a pose estimator to ensure pose consistency. Song et al. [111] use parsing maps as supervision instead of poses. They predict parsing maps under the new target poses and use them to synthesize the corresponding images. Since the parsing maps under the target poses are not available when operating in the unsupervised setting, the authors proposed a pseudo-label selection technique to get "fake" parsing maps by searching for the ones with the same clothes type and minimum transformation energy.

Benchmark Datasets. For synthesis from poses, the DeepFashion [75] and Market-1501 [148] datasets are most widely used. The DeepFashion dataset was built for clothes recognition but has also been used for pose-based image synthesis because of the rich annotations available, such as clothing landmarks, as well as images with corresponding foregrounds but diverse backgrounds. The Market-1501 dataset was initially introduced for the purpose of person re-identification, and it contains a large number of person images produced using a pedestrian detector and annotated bounding boxes; also, each identity has multiple images from different camera views.

4.3 Other Input Modalities

Besides text descriptions and image-like inputs, there are other intuitive user inputs such as class labels, attribute vectors, and graph-like inputs.

4.3.1 Visual Attributes as Input

In this subsection, we mainly focus on works that use one of the fine-grained class conditional labels or vectors, i.e., visual attributes, as inputs. Visual attributes provide a simple and accurate way of describing major features present in images, such as describing the attributes of a certain category of birds or the details of a person's face. Current methods either take a discrete one-hot vector as attribute labels, or a continuous vector as visual attribute input.

Yan et al. [135] proposes a disentangling CVAE (disCVAE) for conditioned image generation from visual attributes
