A Deep Factorization of Style and Structure in Fonts


 Nikita Srivatsan1 Jonathan T. Barron2 Dan Klein3 Taylor Berg-Kirkpatrick4
 1 Language Technologies Institute, Carnegie Mellon University, nsrivats@cmu.edu
 2 Google Research, barron@google.com
 3 Computer Science Division, University of California, Berkeley, klein@cs.berkeley.edu
 4 Computer Science and Engineering, University of California, San Diego, tberg@eng.ucsd.edu

Figure 1: Example fonts from the Capitals64 dataset. The task of font reconstruction involves generating missing
glyphs from partially observed novel fonts.

Abstract

We propose a deep factorization model for typographic analysis that disentangles content from style. Specifically, a variational inference procedure factors each training glyph into the combination of a character-specific content embedding and a latent font-specific style variable. The underlying generative model combines these factors through an asymmetric transpose convolutional process to generate the image of the glyph itself. When trained on corpora of fonts, our model learns a manifold over font styles that can be used to analyze or reconstruct new, unseen fonts. On the task of reconstructing missing glyphs from an unknown font given only a small number of observations, our model outperforms both a strong nearest neighbors baseline and a state-of-the-art discriminative model from prior work.

1 Introduction

One of the most visible attributes of digital language data is its typography. A font makes use of unique stylistic features in a visually consistent manner across a broad set of characters while preserving the structure of each underlying form in order to be human readable – as shown in Figure 1.

Modeling these stylistic attributes and how they compose with underlying character structure could aid typographic analysis and even allow for automatic generation of novel fonts. Further, the variability of these stylistic features presents a challenge for optical character recognition systems, which typically presume a library of known fonts. In the case of historical document recognition, for example, this problem is more pronounced due to the wide range of lost, ancestral fonts present in such data (Berg-Kirkpatrick et al., 2013; Berg-Kirkpatrick and Klein, 2014). Models that capture this wide stylistic variation of glyph images may eventually be useful for improving optical character recognition on unknown fonts.

In this work we present a probabilistic latent variable model capable of disentangling stylistic features of fonts from the underlying structure of each character. Our model represents the style of each font as a vector-valued latent variable, and parameterizes the structure of each character as a learned embedding. Critically, each style latent variable is shared by all characters within a font, while character embeddings are shared by characters of the same type across all fonts. Thus, our approach is related to a long literature on using tensor factorization as a method for disentangling style and content (Freeman and Tenenbaum, 1997; Tenenbaum and Freeman, 2000; Vasilescu and Terzopoulos, 2002; Tang et al., 2013) and to recent deep tensor factorization techniques (Xue et al., 2017).

Inspired by neural methods' ability to disentangle loosely coupled phenomena in other domains, including both language and vision (Hu et al., 2017; Yang et al., 2017; Gatys et al., 2016; Zhu et al., 2017), we parameterize the distribution that combines style and structure in order to generate glyph images as a transpose convolutional neural decoder (Dumoulin and Visin, 2016). Further, the decoder is fed character embeddings early on in the process, while the font latent variables directly parameterize the convolution filters. This architecture biases the model to capture the asymmetric process by which structure and style combine to produce an observed glyph.

We evaluate our learned representations on the task of font reconstruction. After being trained on a set of observed fonts, the system reconstructs missing glyphs in a set of previously unseen fonts, conditioned on a small observed subset of glyph images. Under our generative model, font reconstruction can be performed via posterior inference. Since the posterior is intractable, we demonstrate how a variational inference procedure can be used to perform both learning and accurate font reconstruction. In experiments, we find that our proposed latent variable model is able to substantially outperform both a strong nearest-neighbors baseline as well as a state-of-the-art discriminative system on a standard dataset for font reconstruction. Further, in qualitative analysis, we demonstrate how the learned latent space can be used to interpolate between fonts, hinting at the practicality of more creative applications.

2 Related Work

Discussion of computational style and content separation in fonts dates at least as far back as the writings of Hofstadter (1983, 1995). Some prior work has tackled this problem through the use of bilinear factorization models (Freeman and Tenenbaum, 1997; Tenenbaum and Freeman, 2000), while others have used discriminative neural models (Zhang et al., 2018, 2020) and adversarial training techniques (Azadi et al., 2018). In contrast, we propose a deep probabilistic approach that combines aspects of both these lines of past work. Further, while some prior approaches to modeling fonts rely on stroke or topological representations of observed glyphs (Campbell and Kautz, 2014; Phan et al., 2015; Suveeranont and Igarashi, 2010), ours directly models pixel values in rasterized glyph representations and allows us to more easily generalize to fonts with variable glyph topologies.

Finally, while we focus our evaluation on font reconstruction, our approach has an important relationship with style transfer – a framing made explicit by Zhang et al. (2018, 2020) – as the goal of our analysis is to learn a smooth manifold of font styles that allows for stylistic inference given a small sample of glyphs. However, many other style transfer tasks in the language domain (Shen et al., 2017) suffer from ambiguity surrounding the underlying division between style and semantic content. By contrast, in this setting the distinction is clearly defined, with content (i.e. the character) observed as a categorical label denoting the coarse overall shape of a glyph, and style (i.e. the font) explaining lower-level visual features such as boldness, texture, and serifs. The modeling approach taken here might inform work on more complex domains where the division is less clear.

3 Font Reconstruction

We can view a collection of fonts as a matrix, X, where each column corresponds to a particular character type, and each row corresponds to a specific font. Each entry in the matrix, xij, is an image of the glyph for character i in the style of a font j, which we observe as a 64 × 64 grayscale image as shown in Figure 1. In a real world setting, the equivalent matrix would naturally have missing entries wherever the encoding for a character type in a font is undefined. In general, not all fonts contain renderings of all possible character types; many will only support one particular language or alphabet and leave out uncommon symbols. Further, for many commercial applications, only the small subset of characters that appears in a specific advertisement or promotional message will have been designed by the artist – the majority of glyphs are missing. As a result, we may wish to have models that can infer these missing glyphs, a task referred to as font reconstruction.

Following recent prior work (Azadi et al., 2018), we define the task setup as follows: During training we have access to a large collection of observed fonts for the complete character set. At test time we are required to predict the missing glyph images in a collection of previously unseen fonts with the same character set. Each test font will contain observable glyph images for a small randomized subset of the character set. Based on the style of this subset, the model must reconstruct glyphs for the rest of the character set.
Figure 2: Depiction of the generative process of our model. Each observed glyph image is generated conditioned
on the latent variable of the corresponding font and the embedding parameter of the corresponding character type.
For a more detailed description of the decoder architecture and hyperparameters, see Appendix A.
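To make the matrix view of Section 3 concrete, the sketch below (our own illustration, not code from the paper's released implementation) stores a font collection as a glyph tensor together with a boolean observation mask; font reconstruction then amounts to filling in the masked entries of a row.

```python
import torch

# Hypothetical layout for a collection of J fonts over I character types,
# following the matrix view of Section 3: X[j, i] is the 64x64 grayscale
# glyph for character i rendered in font j.
I, J = 26, 10682                                 # character types, fonts (Capitals64 sizes)
X = torch.zeros(J, I, 64, 64)                    # glyph images, intensities scaled to [0, 1]
observed = torch.zeros(J, I, dtype=torch.bool)   # True where the glyph exists

def missing_glyphs(font_idx: int) -> torch.Tensor:
    """Indices of character types whose glyphs must be reconstructed for one font."""
    return (~observed[font_idx]).nonzero(as_tuple=True)[0]
```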

Font reconstruction can be thought of as a form of matrix completion; given various observations in both a particular row and column, we wish to reconstruct the element at their intersection. Alternatively we can view it as a few-shot style transfer task, in that we want to apply the characteristic attributes of a new font (e.g. serifs, italicization, drop-shadow) to a letter using a small number of examples to infer those attributes. Past work on font reconstruction has focused on discriminative techniques. For example Azadi et al. (2018) used an adversarial network to directly predict held out glyphs conditioned on observed glyphs. By contrast, we propose a generative approach using a deep latent variable model. Under our approach fonts are generated based on an unobserved style embedding, which we can perform inference over given any number of observations.

4 Model

Figure 2 depicts our model's generative process. Given a collection of images of glyphs consisting of I character types across J fonts, our model hypothesizes a separation of character-specific structural attributes and font-specific stylistic attributes into two different representations. Since all characters are observed in at least one font, each character type is represented as an embedding vector which is part of the model's parameterization. In contrast, only a subset of fonts is observed during training and our model will be expected to generalize to reconstructing unseen fonts at test time. Thus, our representation of each font is treated as a vector-valued latent variable rather than a deterministic embedding.

More specifically, for each font in the collection, a font embedding variable, zj ∈ R^k, is sampled from a fixed multivariate Gaussian prior, p(zj) = N(0, I_k). Next, each glyph image, xij, is generated independently, conditioned on the corresponding font variable, zj, and a character-specific parameter vector, ei ∈ R^k, which we refer to as a character embedding. Thus, glyphs of the same character type share a character embedding, while glyphs of the same font share a font variable. A corpus of I ∗ J glyphs is modeled with only I character embeddings and J font variables, as seen in the left half of Figure 2. This modeling approach can be thought of as a form of deep matrix factorization, where the content at any given cell is purely a function of the vector representations of the corresponding row and column. We denote the full corpus of glyphs as a matrix X = ((x11, ..., x1J), ..., (xI1, ..., xIJ)) and denote the corresponding character embeddings as E = (e1, ..., eI) and font variables as Z = (z1, ..., zJ).

Under our model, the probability distribution over each image, conditioned on zj and ei, is parameterized by a neural network, described in the next section and depicted in Figure 2. We denote this decoder distribution as p(xij | zj; ei, φ), and let φ represent parameters, shared by all glyphs, that govern how font variables and character embeddings combine to produce glyph images. In contrast, the character embedding parameters, ei, which feed into the decoder, are only shared by glyphs of the same character type. The font variables, zj, are unobserved during training and will be inferred at test time. The joint probability under our model is given by:

    p(X, Z; E, φ) = ∏_{i,j} p(xij | zj; ei, φ) p(zj)
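The sketch below renders this factorization in PyTorch; it is a minimal illustration under our own naming, with the decoder's per-glyph likelihood left as an abstract callable since its architecture is described in the next section.

```python
import torch

k = 32                                   # latent/embedding dimensionality (see Section 6.3)
I, J = 26, 7649                          # character types, training fonts

# Character embeddings e_i are ordinary parameters shared across all fonts;
# font variables z_j are latent and drawn from a standard Gaussian prior.
E = torch.nn.Parameter(torch.randn(I, k))

def sample_font_variable() -> torch.Tensor:
    return torch.randn(k)                # z_j ~ N(0, I_k)

def log_joint(X, Z, decoder_log_likelihood):
    """log p(X, Z; E, phi) = sum_j log p(z_j) + sum_{i,j} log p(x_ij | z_j; e_i)."""
    prior = torch.distributions.Normal(0.0, 1.0)
    log_p = prior.log_prob(Z).sum()
    for j in range(Z.shape[0]):
        for i in range(E.shape[0]):
            log_p = log_p + decoder_log_likelihood(X[j, i], Z[j], E[i])
    return log_p
```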
4.1 Decoder Architecture

One way to encourage the model to learn disentangled representations of style and content is by choosing an architecture that introduces helpful inductive bias. For this domain, we can think of the character type as specifying the overall shape of the image, and the font style as influencing the finer details; we formulate our decoder with this difference in mind. We hypothesize that a glyph can be modeled in terms of a low-resolution character representation to which a complex operator specific to that font has been applied.

The success of transpose convolutional layers at generating realistic images suggests a natural way to apply this intuition. A transpose convolution (sometimes erroneously referred to as a "deconvolution") is a convolution performed on an undecimated input (i.e. with zeros inserted in between pixels in alternating rows and columns), resulting in an upscaled output. Transpose convolutional architectures generally start with a low resolution input which is passed through several such layers, iteratively increasing the resolution and decreasing the number of channels until the final output dimensions are reached. We note that the asymmetry between the coarse input and the convolutional filters closely aligns with the desired inductive biases, and therefore use this framework as a starting point for our architecture.

Broadly speaking, our architecture represents the underlying shape that defines the specific character type (but not the font) as coarse-grained information that therefore enters the transpose convolutional process early on. In contrast, the stylistic content that specifies attributes of the specific font (such as serifs, drop shadow, texture) is represented as finer-grained information that enters into the decoder at a later stage, by parameterizing filters, as shown in the right half of Figure 2.

Specifically we form our decoder as follows: first the character embedding is projected to a low resolution matrix with a large number of channels. Following that, we apply several transpose convolutional layers which increase the resolution and reduce the number of channels. Critically, the convolutional filter at each step is not a learned parameter of the model, but rather the output of a small multilayer perceptron whose input is the font latent variable z. Between these transpose convolutions, we insert vanilla convolutional layers to fine-tune following the increase in resolution.

Overall, the decoder consists of four blocks, where each block contains a transpose convolution, which upscales the previous layer and reduces the number of channels by a factor of two, followed by two convolutional layers. Each (transpose) convolution is followed by an instance norm and a ReLU activation. The convolution filters all have a kernel size of 5 × 5. The character embedding is reshaped via a two-layer MLP into an 8 × 8 × 256 tensor before being fed into the decoder. The final 64 × 64 dimensional output layer is treated as a grid of parameters which defines the output distribution on pixels. We describe the specifics of this distribution in the next section.
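A minimal sketch of one such font-conditioned block is given below. It assumes the hypernetwork-style setup described above, in which the transpose-convolution kernel and bias are emitted by a small MLP applied to z; the channel sizes, module name, and use of a single MLP for kernel and bias together are illustrative rather than the exact configuration in Appendix A.

```python
import torch
import torch.nn.functional as F

class FontConditionedUpBlock(torch.nn.Module):
    """One transpose-convolution block whose filters are predicted from the
    font variable z by a small MLP, as described in Section 4.1 (a sketch)."""
    def __init__(self, z_dim, c_in, c_out, ksize=5):
        super().__init__()
        self.c_in, self.c_out, self.ksize = c_in, c_out, ksize
        n_weights = c_in * c_out * ksize * ksize + c_out   # kernel + bias
        self.hyper = torch.nn.Sequential(
            torch.nn.Linear(z_dim, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, n_weights))
        self.norm = torch.nn.InstanceNorm2d(c_out)

    def forward(self, x, z):
        # x: (1, c_in, H, W) feature map for one glyph; z: (z_dim,) font variable.
        params = self.hyper(z)
        kernel = params[:-self.c_out].view(self.c_in, self.c_out, self.ksize, self.ksize)
        bias = params[-self.c_out:]
        x = F.conv_transpose2d(x, kernel, bias, stride=2, padding=2, output_padding=1)
        return torch.relu(self.norm(x))

# Usage: start from an 8x8x256 tensor produced from the character embedding,
# then upsample with font-conditioned blocks until reaching 64x64.
z = torch.randn(32)
h = torch.randn(1, 256, 8, 8)
h = FontConditionedUpBlock(32, 256, 128)(h, z)    # -> (1, 128, 16, 16)
```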
4.2 Projected Loss

The conventional approach for computing loss on image observations is to use an independent output distribution, typically a Gaussian, on each pixel's intensity. However, deliberate analysis of the statistics of natural images has shown that images are not well-described in terms of statistically independent pixels, but are instead better modeled in terms of edges (Field, 1987; Huang and Mumford, 1999). It has also been demonstrated that images of text have similar statistical distributions as natural images (Melmer et al., 2013). Following this insight, as our reconstruction loss we use a heavy-tailed (leptokurtotic) distribution placed on a transformed representation of the image, similar to the approach of Barron (2019). Modeling the statistics of font glyphs in this fashion results in sharper samples, while modeling independent pixels with a Gaussian distribution results in blurry, oversmoothed results.

More specifically, we adopt one of the strategies employed in Barron (2019), and transform image observations using the orthonormal variant of the 2-Dimensional Discrete Cosine Transform (2-D DCT-II) (Ahmed et al., 1974), which we denote as f : R^{64×64} → R^{64×64} for our 64 × 64 dimensional image observations. We transform both the observed glyph image and the corresponding grid of parameters produced by our decoder before computing the observation's likelihood.

This procedure projects observed images onto a grid of orthogonal bases comprised of shifted and scaled 2-dimensional cosine functions. Because the DCT-II is orthonormal, this transformation is volume-preserving, and so likelihoods computed in the projected space correspond to valid measurements in the original pixel domain.

Note that because the DCT-II is simply a rotation in our vector space, imposing a normal distribution in this transformed space should have little effect (ignoring the scaling induced by the diagonal of the covariance matrix of the Gaussian distribution) as Euclidean distance is preserved under rotations. For this reason we impose a heavy-tailed distribution in this transformed space, specifically a Cauchy distribution. This gives the following probability density function:

    g(x; x̂, γ) = 1 / ( πγ [ 1 + ( (f(x) − f(x̂)) / γ )² ] )

where x is an observed glyph, x̂ is the location parameter grid output by our decoder, and γ is a hyperparameter which we set to γ = 0.001.

The Cauchy distribution accurately captures the heavy-tailed structure of the edges in natural images. Intuitively, it models the fact that images tend to be mostly smooth, with a small amount of non-smooth variation in the form of edges. Computing this heavy-tailed loss over the frequency decomposition provided by the DCT-II instead of the raw pixel values encourages the decoder to generate sharper images without needing either an adversarial discriminator or a vectorized representation of the characters during training. Note that while our training loss is computed in DCT-II space, at test time we treat the raw grid of parameter outputs x̂ as the glyph reconstruction.
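The sketch below shows one way to implement this projected loss: build the orthonormal DCT-II basis explicitly, transform both the glyph and the decoder's location-parameter grid, and sum the Cauchy negative log-likelihood over the coefficients. The function names and the explicit basis construction are our own.

```python
import math
import torch

def dct2_matrix(n: int) -> torch.Tensor:
    """Orthonormal 1-D DCT-II basis; the 2-D transform of an image is C @ img @ C.T."""
    k = torch.arange(n).unsqueeze(1).float()
    m = torch.arange(n).unsqueeze(0).float()
    C = torch.cos(math.pi * (2 * m + 1) * k / (2 * n)) * math.sqrt(2.0 / n)
    C[0] = C[0] / math.sqrt(2.0)
    return C

C = dct2_matrix(64)

def cauchy_dct_nll(x: torch.Tensor, x_hat: torch.Tensor, gamma: float = 1e-3) -> torch.Tensor:
    """Negative log-likelihood of a 64x64 glyph under a Cauchy distribution placed
    on its orthonormal 2-D DCT-II coefficients (a sketch of the projected loss)."""
    fx = C @ x @ C.T            # transform the observed glyph
    fx_hat = C @ x_hat @ C.T    # transform the decoder's location-parameter grid
    resid = (fx - fx_hat) / gamma
    return (math.log(math.pi * gamma) + torch.log1p(resid ** 2)).sum()

# Example: loss for a random glyph against a random reconstruction.
loss = cauchy_dct_nll(torch.rand(64, 64), torch.rand(64, 64))
```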
Figure 3: Depiction of the computation graph of the amortized variational lower bound (for simplicity, only one font is shown). The encoder approximates the generative model's true posterior over the font style latent variables given the observations. It remains insensitive to the number of observations by pooling high-level features across glyphs. For a more specific description of the encoder architecture details, see Appendix A.

5 Learning and Inference

Note that in our training setting, the font variables Z are completely unobserved, and we must induce their manifold with learning. As our model is generative, we wish to optimize the marginal probability of just the observed X with respect to the model parameters E and φ:

    p(X; E, φ) = ∫ p(X, Z; E, φ) dZ

However, the integral over Z is computationally intractable, given that the complex relationship between Z and X does not permit a closed form solution. Related latent variable models such as Variational Autoencoders (VAE) (Kingma and Welling, 2014) with intractable marginals have successfully performed learning by optimizing a variational lower bound on the log marginal likelihood. This surrogate objective, called the evidence lower bound (ELBO), introduces a variational approximation, q(Z|X) = ∏_j q(zj | x1j, ..., xIj), to the model's true posterior, p(Z|X). Our model's ELBO is as follows:

    ELBO = Σ_j [ E_q[ log p(x1j, ..., xIj | zj) ] − KL( q(zj | x1j, ..., xIj) || p(zj) ) ]

where the approximation q is parameterized via a neural encoder network. This lower bound can be optimized by stochastic gradient ascent if q is a Gaussian, via the reparameterization trick described in (Kingma and Welling, 2014; Rezende et al., 2014) to sample the expectation under q while still permitting backpropagation.
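A per-font sketch of this objective is shown below, assuming an encoder that outputs a diagonal Gaussian and a decoder negative log-likelihood such as the projected Cauchy loss above; all names are ours.

```python
import torch

def elbo_for_font(decoder_nll, observed_glyphs, char_embeddings, mu, log_sigma):
    """Single-font term of the ELBO, using the reparameterization trick.

    `mu` and `log_sigma` are the encoder's Gaussian posterior parameters for z_j,
    `decoder_nll(x, z, e)` returns the negative log-likelihood of one glyph, and
    the KL term is taken against the N(0, I) prior."""
    eps = torch.randn_like(mu)
    z = mu + eps * log_sigma.exp()           # z ~ q(z_j | x_1j, ..., x_Ij)

    log_lik = 0.0
    for x, e in zip(observed_glyphs, char_embeddings):
        log_lik = log_lik - decoder_nll(x, z, e)

    # KL(N(mu, sigma^2) || N(0, I)) in closed form.
    kl = 0.5 * (mu ** 2 + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum()
    return log_lik - kl
```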
Practically speaking, a key property which we desire is the ability to perform consistent inference over z given a variable number of observed glyphs in a font. We address this in two ways: through the architecture of our encoder, and through a special masking process in training; both of which are shown in Figure 3.

5.1 Posterior Approximation

Observation Subsampling: To get reconstructions from only partially observed fonts at test time, the encoder network must be able to infer zj from any subset of (x1j, ..., xIj). One approach for achieving robustness to the number of observations is through the training procedure. Specifically, when computing the approximate posterior for a particular font in our training corpus, we mask out a randomly selected subset of the characters before passing them to the encoder. This incentivizes the encoder to produce reasonable estimates without becoming too reliant on the features extracted from any one particular character, which more closely matches the setup at test time.

Encoder Architecture: Another way to encourage this robustness is through inductive bias in the encoder architecture. Specifically, we use a convolutional neural network which takes in a batch of characters from a single font, concatenated with their respective character type embedding. Following the final convolutional layer, we perform an elementwise max operation across the batch, reducing to a single vector representation for the entire font which we pass through further fully-connected layers to obtain the output parameters of q as shown in Figure 3. By including this accumulation across the elements of the batch, we combine the features obtained from each character in a manner that is largely invariant to the total number and types of characters observed. This provides an inductive bias that encourages the model to extract similar features from each character type, which should therefore represent stylistic as opposed to structural properties.

Overall, the encoder over each glyph consists of three blocks, where each block consists of a convolution followed by a max pool with a stride of two, an instance norm (Ulyanov et al., 2016), and a ReLU. The activations are then pooled across the characters via an elementwise max into a single vector, which is then passed through four fully-connected layers, before predicting the parameters of the Gaussian approximate posterior.
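The following sketch captures the pooling structure described above: per-glyph convolutional features are reduced by an elementwise max over however many glyphs were observed, then mapped to a posterior mean and log variance. For brevity it omits the concatenated character-type embeddings, and the layer sizes are illustrative rather than the exact Appendix A configuration.

```python
import torch

class FontStyleEncoder(torch.nn.Module):
    """Approximate posterior q(z_j | observed glyphs), sketched with our own names."""
    def __init__(self, z_dim=32, in_ch=1):
        super().__init__()
        def block(c_in, c_out):
            return [torch.nn.Conv2d(c_in, c_out, 5, padding=2),
                    torch.nn.MaxPool2d(2),
                    torch.nn.InstanceNorm2d(c_out),
                    torch.nn.ReLU()]
        self.conv = torch.nn.Sequential(*block(in_ch, 64), *block(64, 128), *block(128, 256))
        self.head = torch.nn.Sequential(
            torch.nn.Linear(256 * 8 * 8, 1024), torch.nn.ReLU(),
            torch.nn.Linear(1024, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, 2 * z_dim))

    def forward(self, glyphs):
        # glyphs: (n_observed, 1, 64, 64) for a single font; n_observed may vary.
        feats = self.conv(glyphs).flatten(1)      # per-glyph feature vectors
        pooled = feats.max(dim=0).values          # order- and count-invariant pooling
        mu, log_sigma = self.head(pooled).chunk(2)
        return mu, log_sigma

mu, log_sigma = FontStyleEncoder()(torch.rand(3, 1, 64, 64))
```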
Reconstruction via Inference: At test time, we pass an observed subset of a new font to our encoder in order to estimate the posterior over zj, and take the mean of that distribution as the inferred font representation. We then pass this encoding to the decoder along with the full set of character embeddings E in order to produce reconstructions of every glyph in the font.

6 Experiments

We now provide an overview of the specifics of the dataset and training procedure, and describe our experimental setup and baselines.

6.1 Data

We compare our model against baseline systems at font reconstruction on the Capitals64 dataset (Azadi et al., 2018), which contains the 26 capital letters of the English alphabet as grayscale 64 × 64 pixel images across 10,682 fonts. These are broken down into training, dev, and test splits of 7649, 1473, and 1560 fonts respectively.

Upon manual inspection of the dataset, it is apparent that several fonts have an almost visually indistinguishable nearest neighbor, making the reconstruction task trivial using a naive algorithm (or an overfit model with high capacity) for those particular font archetypes. Because these datapoints are less informative with respect to a model's ability to generalize to previously unseen styles, we additionally evaluate on a second test set designed to avoid this redundancy. Specifically, we choose the 10% of test fonts that have maximal L2 distance from their closest equivalent in the training set, which we call "Test Hard".
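A sketch of how such a split could be constructed is shown below; this is our own reading of the selection rule, not the authors' released script.

```python
import torch

def select_hard_fonts(train_fonts: torch.Tensor, test_fonts: torch.Tensor, frac: float = 0.1):
    """Pick the fraction of test fonts farthest (in L2 over all glyph pixels)
    from their nearest training font. Tensor shapes are (num_fonts, 26, 64, 64)."""
    tr = train_fonts.flatten(1)                   # (n_train, 26*64*64)
    te = test_fonts.flatten(1)                    # (n_test,  26*64*64)
    dists = torch.cdist(te, tr)                   # pairwise L2 distances
    nearest = dists.min(dim=1).values             # distance to the closest training font
    k = max(1, int(frac * te.shape[0]))
    return nearest.topk(k).indices                # indices of the hardest test fonts
```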
                         Test Full                              Test Hard
Observations        1        2        4        8          1        2        4        8
NN               483.13   424.49   386.81   363.97     880.22   814.67   761.29   735.18
GlyphNet         669.36   533.97   455.23   416.65     935.01   813.50   718.02   653.57
Ours (FC)        353.63   316.47   293.67   281.89     596.57   556.21   527.50   513.25
Ours (Conv)      352.07   300.46   271.03   254.92     615.87   556.03   511.05   489.58

Table 1: L2 reconstruction per glyph by number of observed characters. “Full” includes the entire test set while
“Hard” is measured only over the 10% of test fonts with the highest L2 distance from the closest font in train.

6.2 Baselines

As stated previously, many fonts fall into visually similar archetypes. Based on this property, we use a nearest neighbors algorithm for our first baseline. Given a partially observed font at test time, this approach simply "reconstructs" by searching the training set for the font with the lowest L2 distance over the observed characters, and copying its glyphs verbatim for the missing characters.
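A compact sketch of this baseline, under our own tensor layout assumptions:

```python
import torch

def nearest_neighbor_reconstruct(train_fonts, partial_font, observed_idx):
    """Nearest-neighbor baseline sketch: find the training font closest in L2
    over the observed character positions and copy its glyphs for the rest.
    train_fonts: (n_train, 26, 64, 64); partial_font: (26, 64, 64)."""
    obs_train = train_fonts[:, observed_idx].flatten(1)
    obs_query = partial_font[observed_idx].flatten().unsqueeze(0)
    best = torch.cdist(obs_query, obs_train).argmin().item()
    return train_fonts[best]                      # copied verbatim as the reconstruction
```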
For our second comparison, we use the GlyphNet model from Azadi et al. (2018). This approach is based on a generative adversarial network, which uses discriminators to encourage the model to generate outputs that are difficult to distinguish from those in the training set. We test from the publicly available epoch 400 checkpoint, with modifications to the evaluation script to match the setup described above.

We also perform an ablation using fully-connected instead of convolutional layers. For more architecture details see Appendix A.

6.3 Training Details

We train our model to maximize the expected log likelihood using the Adam optimization algorithm (Kingma and Ba, 2015) with a step size of 10^-5 (default settings otherwise), and perform early stopping based on the approximate log likelihood on a hard subset of dev selected by the process described earlier. To encourage robustness in the encoder, we randomly drop out glyphs during training with a probability of 0.7 (rejecting samples where all characters in a font are dropped). All experiments are run with a dimensionality of 32 for the character embeddings and font latent variables. Our implementation (https://bitbucket.org/NikitaSrivatsan/DeepFactorizationFontsEMNLP19) is built in PyTorch (Paszke et al., 2017) version 1.1.0.

7 Results

We now present quantitative results from our experiments in both automated and human annotated metrics, and offer qualitative analysis of reconstructions and the learned font manifold.

7.1 Quantitative Evaluation

Automatic Evaluation: We show font reconstruction results for our system against nearest neighbors and GlyphNet in Table 1. Each model is given a random subsample of glyphs from each test font (we measure at 1, 2, 4, and 8 observed characters), with their character labels. We measure the average L2 distance between the image reconstructions for the unobserved characters and the ground truth, after scaling intensities to [0, 1].

Our system achieves the best performance for both the overall and hard subset of test for all numbers of observed glyphs. Nearest neighbors provides a strong baseline on the full test set, even outperforming GlyphNet. However it performs much worse on the hard subset. This makes sense as we expect nearest neighbors to do extremely well on any test fonts that have a close equivalent in train, but suffer in fidelity on less traditional styles. GlyphNet similarly performs worse on test hard, which could reflect the missing modes problem of GANs failing to capture the full diversity of the data distribution (Che et al., 2016; Tolstikhin et al., 2017). The fully-connected ablation is also competitive, although we see that the convolutional architecture is better able to infer style from larger numbers of observations. On the hard test set, the fully-connected network even outperforms the convolutional system when only one observation is present, perhaps indicating that its lower-capacity architecture better generalizes from very limited data.
Figure 4: Reconstructions of partially observed fonts in the hard subset from our model, GlyphNet, and nearest neighbors. Given images of glyphs for 'A' and 'B' in each font, we visualize reconstructions of the remaining characters. Fonts are chosen such that the L2 loss of our model on these examples closely matches the average loss over the full evaluation set.

Figure 5: t-SNE projection of latent font variables inferred for the full training set, colored by k-means clustering with k = 10. The glyph for the letter "A" for each centroid is shown in overlay.

Human Evaluation: To measure how consistent these perceptual differences are, we also perform a human evaluation of our model's reconstructions against GlyphNet using Amazon Mechanical Turk (AMT). In our setup, turkers were asked to compare the output of our model against the output of GlyphNet for a single font given one observed character, which they were also shown. Turkers selected a ranking based on which reconstructed font best matched the style of the observed character, and a separate ranking based on which was more realistic. On the full test set (1560 fonts, with 5 turkers assigned to each font), humans preferred our system over GlyphNet 81.3% and 81.8% of the time for style and realism respectively. We found that on average 76% of turkers shown a given pair of reconstructions selected the same ranking as each other on both criteria, suggesting high annotator agreement.

7.2 Qualitative Analysis

In order to fully understand the comparative behavior of these systems, we also qualitatively compare the reconstruction output of these systems to analyze their various failure modes, showing examples in Figure 4. We generally find that our approach tends to produce reconstructions that, while occasionally blurry at the edges, are generally faithful at reproducing the principal stylistic features of the font. For example, we see that for font (1) in Figure 4, we match not only the overall shape of the letters, but also the drop shadow and to an extent the texture within the lettering, while GlyphNet does not produce fully enclosed letters or match the texture. The output of nearest neighbors, while well-formed, does not respect the style of the font as closely, as it fails to find a font in training that matches these stylistic properties. In font (2) the systems all produce a form of gothic lettering, but the output of GlyphNet is again lacking in certain details, and nearest neighbors makes subtle but noticeable changes to the shape of the letters. In the final example (3) we even see that our system appears to attempt to replicate the pixelated outline, while nearest neighbors ignores this subtlety. GlyphNet is in this case somewhat inconsistent, doing reasonably well on some letters, but much worse on others. Overall, nearest neighbors will necessarily output well-formed glyphs, but with lower fidelity to the style, particularly on more unique fonts. While GlyphNet does pick up on some subtle features, our model tends to produce the most coherent output on harder fonts.

7.3 Analysis of Learned Manifold

Since our model attempts to learn a smooth manifold over the latent style, we can also perform interpolation between the inferred font representations, something which is not directly possible using either of the baselines.
Figure 6: Interpolation between font variants from the same font family, showing smoothness of the latent manifold. Linear combinations of the embedded fonts correspond to outputs that lie intuitively "in between".


 Figure 7: t-SNE projection of latent font variables inferred on Google Fonts, colored by weight and category.

In this analysis, we take two fonts from the same font family, which differ along one property, and pass them through our encoder to obtain the latent font variable for each. We then interpolate between these values, passing the result at various steps into our decoder to produce new fonts that exist in between the observations. In Figure 6 we see how our model can apply serifs, italicization, and boldness gradually while leaving the font unchanged in other respects. This demonstrates that our manifold is smooth and interpretable, not just at the points corresponding to those in our dataset. This could be leveraged to modify existing fonts with respect to a particular attribute to generate novel fonts efficiently.
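A sketch of this procedure, assuming encoder and decoder interfaces like those sketched earlier (all names are ours):

```python
import torch

def interpolate_fonts(encoder, decoder, glyphs_a, glyphs_b, char_embeddings, steps=5):
    """Infer a font variable for each of two fonts (posterior mean), blend them
    linearly, and decode every character at each blend weight."""
    z_a, _ = encoder(glyphs_a)
    z_b, _ = encoder(glyphs_b)
    fonts = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z_a + alpha * z_b        # point on the latent path
        fonts.append([decoder(z, e) for e in char_embeddings])
    return fonts
```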
Beyond looking at the quality of reconstructions, we also wish to analyze properties of the latent space learned by our model. To do this, we use our encoder to infer latent font variables z for each of the fonts in our training data, and use a t-SNE projection (Van Der Maaten, 2013) to plot them in 2D, shown in Figure 5. Since the broader, long-distance groupings may not be preserved by this transformation, we perform a k-means clustering (Lloyd, 1982) with k = 10 on the high-dimensional representations to visualize these high-level groupings. Additionally, we display a sample glyph for the centroid for each cluster. We see that while the clusters are not completely separated, the centroids generally correspond to common font styles, and are largely adjacent to or overlapping with those with similar stylistic properties; for example script, handwritten, and gothic styles are very close together.

To analyze how well our latent embeddings correspond to human defined notions of font style, we also run our system on the Google Fonts dataset (https://github.com/google/fonts), which despite containing fewer fonts and less diversity in style, lists metadata including numerical weight and category (e.g. serif, handwriting, monospace). In Figure 7 we show t-SNE projections of latent font variables from our model trained on Google Fonts, colored accordingly. We see that the embeddings do generally cluster by weight as well as category, suggesting that our model is learning latent information consistent with how humans perceive font style.
8 Conclusion

We presented a latent variable model of glyphs which learns disentangled representations of the structural properties of underlying characters from stylistic features of the font. We evaluated our model on the task of font reconstruction and showed that it outperformed both a strong nearest neighbors baseline and prior work based on GANs, especially for fonts highly dissimilar to any instance in the training set. In future work, it may be worth extending this model to learn a latent manifold on content as well as style, which could allow for reconstruction of previously unseen character types, or generalization to other domains where the notion of content is higher dimensional.

Acknowledgments

This project is funded in part by the NSF under grant 1618044 and by the NEH under grant HAA-256044-17. Special thanks to Si Wu for help constructing the Google Fonts figures.

References

Nasir Ahmed, T Natarajan, and Kamisetty R Rao. 1974. Discrete cosine transform. IEEE Transactions on Computers.

Samaneh Azadi, Matthew Fisher, Vladimir G Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. 2018. Multi-content GAN for few-shot font style transfer. CVPR.

Jonathan T. Barron. 2019. A general and adaptive robust loss function. CVPR.

Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. 2013. Unsupervised transcription of historical documents. ACL.

Taylor Berg-Kirkpatrick and Dan Klein. 2014. Improved typesetting models for historical OCR. ACL.

Neill DF Campbell and Jan Kautz. 2014. Learning a manifold of fonts. ACM TOG.

Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. 2016. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136.

Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.

David J. Field. 1987. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A.

William T Freeman and Joshua B Tenenbaum. 1997. Learning bilinear models for two-factor problems in vision. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 554–560. IEEE.

Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. CVPR.

Douglas R Hofstadter. 1983. Metamagical themas. Scientific American, 248(5).

Douglas R Hofstadter. 1995. Fluid concepts and creative analogies: Computer models of the fundamental mechanisms of thought. Basic Books.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Controllable text generation. arXiv preprint arXiv:1703.00955.

Jinggang Huang and David Mumford. 1999. Statistics of natural images and models. CVPR.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.

Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. ICLR.

Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory.

Tamara Melmer, Seyed Ali Amirshahi, Michael Koch, Joachim Denzler, and Christoph Redies. 2013. From regular text to artistic writing and artworks: Fourier statistics of images with low and high aesthetic appeal. Frontiers in Human Neuroscience.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Huy Quoc Phan, Hongbo Fu, and Antoni B Chan. 2015. Flexyfont: Learning transferring rules for flexible typeface synthesis. In Computer Graphics Forum, volume 34, pages 245–256. Wiley Online Library.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. ICML.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. NIPS.

Rapee Suveeranont and Takeo Igarashi. 2010. Example-based automatic font generation. In International Symposium on Smart Graphics, pages 127–138. Springer.

Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. 2013. Tensor analyzers. In International Conference on Machine Learning, pages 163–171.

Joshua B Tenenbaum and William T Freeman. 2000. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283.

Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. 2017. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5424–5433.

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2016. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

Laurens Van Der Maaten. 2013. Barnes-Hut-SNE. arXiv preprint arXiv:1301.3342.

M Alex O Vasilescu and Demetri Terzopoulos. 2002. Multilinear analysis of image ensembles: Tensorfaces. In European Conference on Computer Vision, pages 447–460. Springer.

Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. IJCAI.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. ICML.

Yexun Zhang, Ya Zhang, and Wenbin Cai. 2018. Separating style and content for generalized style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8447–8455.

Yexun Zhang, Ya Zhang, and Wenbin Cai. 2020. A unified framework for generalizable style transfer: Style and content separation. IEEE Transactions on Image Processing, 29:4085–4098.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV.

A Appendix

A.1 Architecture Notation

We now provide further details on the specific architectures used to parameterize our model and inference network. The following abbreviations are used to represent various components:

• Fi : fully-connected layer with i hidden units
• R : ReLU activation
• M : batch max pool
• S : 2 × 2 spatial max pool
• Ci : convolution with i filters of size 5 × 5, 2 pixel zero-padding, stride of 1, dilation of 1
• I : instance normalization
• Ti,j,k : transpose convolution with i filters of size 5 × 5, 2 pixel zero-padding, stride of j, k pixel output padding, dilation of 1, where kernel and bias are the output of an MLP (described below)
• H : reshape to 26 × 256 × 8 × 8

A.2 Network Architecture

Our fully-connected encoder is:
F128-R-F128-R-F128-R-F1024-R-M-F128-R-F128-R-F128-R-F64

Our convolutional encoder is:
C64-S-I-R-C128-S-I-R-C256-S-I-R-F1024-M-R-F128-R-F128-R-F128-R-F64

Our fully-connected decoder is:
F128-R-F128-R-F128-R-F128-R-F128-R-F4096

Our transpose convolutional decoder is:
F128-R-F16384-R-H-T256,2,1-R-C256-I-R-C256-I-R-T128,2,1-R-C128-I-R-C128-I-R-T64,2,1-I-R-C64-I-R-C64-I-R-T32,1,0-I-R-C32-I-R-C1

The MLP used to compute a transpose convolutional parameter of size j is:
F128-R-Fj