Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

Page created by Ruben Shelton

Science

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in
 Euclidean Latent Space

 Keizo Kato 1 Jing Zhou 2 Tomotake Sasaki 1 Akira Nakagawa 1

 Abstract model with univariate Gaussian priors (Kingma & Welling,
 To analyze high-dimensional and complex data 2014). For more flexible estimation of Pz (z), successor
 in the real world, deep generative models such as models has been proposed, such as using Gaussian mixture
 variational autoencoder (VAE) embed data in a model (GMM) (Zong et al., 2018), combining univariate
 reduced dimensional latent space and learn the Gaussian model and GMM (Liao et al., 2018), etc.
 probabilistic model in the latent space. How- In the tasks where the quantitative analysis is vital, Px (x)
 ever, they struggle to reproduce the probability should be reproduced from Pz (z). For instance, in anomaly
 distribution function (PDF) in the input space detection, the anomaly likelihood is calculated based on
 from that of the latent space accurately. If the PDF value of data sample (Chalapathy & Chawla, 2019).
 embedding were isometric, this problem can be However, the embedding of VAEs is not isometric, that
 solved since PDFs in both spaces become propor- is, distance between data points x(1) and data x(2) is in-
 tional. To achieve isometric property, we propose consistent to the distance of corresponding latent variables
 Rate-Distortion Optimization guided autoencoder z (1) and z (2) (Chen et al., 2018; Shao et al., 2018; Geng
 inspired by orthonormal transform coding. We et al., 2020). Obviously mere estimation of Pz (z) cannot
 show our method has the following properties: (i) be the substitution of estimation for Px (x) under such sit-
 the columns of the Jacobian matrix between two uation. As McQueen et al. (2016) mentioned, for reliable
 spaces is constantly-scaled orthonormal system data analysis, the isometric embedding in low-dimensional
 and enable to embed data in latent space isomet- space is necessary. In addition, to utilize the standard PDF
 rically; (ii) the PDF of the latent space is propor- estimation techniques, the latent space is preferred to be a
 tional to that of the data observation space. Fur- Euclidean space. Despite of its importance, this point is not
 thermore, our method outperforms state-of-the-art considered even in methods developed for the quantitative
 methods in unsupervised anomaly detection with analysis of PDF (Johnson et al., 2016; Zong et al., 2018;
 four public datasets. Liao et al., 2018; Zenati et al., 2018; Song & Ou, 2018).
 According to the Nash embedding theorem, an arbitrary
 smooth and compact Riemannian manifold M can be em-
1. Introduction
 bedded in a Euclidean space RN (N ≥ dim M + 1, suffi-
Capturing the inherent features of a dataset from high- ciently large) isometrically (Han & Hong, 2006). Besides
dimensional and complex data is an essential issue in ma- that, the manifold hypothesis argues that real-world data
chine learning. Generative model approach learns the prob- presented in a high-dimensional space concentrates in the
ability distribution of data, aiming at data generation, unsu- vicinity of a much lower dimensional manifold Mx ⊂ RM
pervised learning, disentanglement, etc. (Tschannen et al., (Bengio et al., 2013). Based on these theories, it is ex-
2018). It is generally difficult to directly estimate a prob- pected that the input data x can be embedded isomet-
ability density function (PDF) Px (x) of high-dimensional rically in a low-dimensional Euclidean space RN when
data x ∈ RM . Instead, one promising approach is to embed dim Mx < N  M . Although the existence of the iso-
x in a low-dimensional latent variable z ∈ RN (N < M ), metric embedding was proved, the method to achieve it
and capture PDF Pz (z). Variational autoencoder (VAE) is a has been challenging. Some previous works proposed al-
widely used generative model to capture z as a probabilistic gorithms to do that (McQueen et al., 2016; Bernstein et al.,
 1
 2000). Yet, they do not deal with high-dimensional input
 FUJITSU LABORATORIES LTD. 2 Fujitsu R&D Center Co., data such as images. Another thing to consider is the dis-
Ltd.. Correspondence to: kato.keizo, anaka .
 tance on Mx may be defined by the data tendency. For
Proceedings of the 37 th International Conference on Machine instance, we can choose binary cross entropy (BCE) for
Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the binary data, structured similarity (SSIM) for image, etc.
the author(s).

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

As a whole, our challenge is to develop a deep generative holds. Here v and w are tangent vectors in Tp M (tan-
model which guarantees the isometric embedding even for gent space of M at p ∈ M) represented as elements of
the high-dimentional data observed in the Mx endowed RM and dg is the differential of g (this can be written
with an arbitrary distance definition. as a Jacobian matrix). hv, wip = v > AM (p)w, where
 AM (p) ∈ RM ×M is a metric tensor represented as a posi-
Mathematically, the condition of isometric embedding to
 tive define matrix. The inner product in the right side is also
Euclidean space is equivalent to that the columns of the
 defined by another metric tensor AN (q) ∈ RN ×N . AM (p)
Jacobian matrix between two spaces form an orthonormal
 or AN (q) is an identity matrix for a Euclidean case and the
system. When we turn our sight to conventional image
 inner product becomes the standard one (the dot product).
compression area, orthonormal transform is necessary for
efficient compression. This is proved by Rate-Distortion In this paper, we slightly abuse the terminology and call a
theory (Berger, 1971) . Furthermore, the empirical method map g isometric if
for optimal compression with orthonormal transform coding
is established as Rate-Distortion Optimization (RDO) (Sul- hv, wip = Chdg(v), dg(w)ig(p) (2)
livan & Wiegand, 1998). It is intuitive to regard data embed-
ding to a low-dimensional latent space as an analogy of effi- holds for some constant C > 0, since √ Eq. (1) can be
cient data compression. Actually, deep learning based image achieved by replacing g with g̃ = (1/ C)g.
compression (DIC) methods with RDO (Ballé et al., 2018;
Zhou et al., 2019) have been proposed and they achieve good 2. Related Work
compression performance. Although it is not discussed in
Ballé et al. (2018); Zhou et al. (2019), we guess that behind Flow based model: Flow based generative models generate
the success of DIC, there should be theoretical relation to image with astonishing quality (Kingma & Dhariwal, 2018;
RDO of convetional transform coding. Dinh et al., 2015). Flow mechanism explicitly takes Jaco-
 bian of x and z into account. The transformation function
Hence, in this paper, we investigate the theoretical property z = f (x) is learned, calculating and storing Jacobian of
and dig out the proof that RDO guides deep-autoencoder to x and z. Unlike ordinary autoencoder, which reverse z to
have the orthonormal property. Based on this finding, we x with function g(·) different from f (·), inverse function
propose an approach which enables isometric data embed- transforms z to x as x = f −1 (z). Since the model stores
ding and allow us to comprehensive data analysis, named Jacobian, Px (x) can be estimated from Pz (z). However,
Rate-Distortion Optimization Guided Autoencoder for Gen- in these approaches, the form of f (·) is limited so that the
erative Analysis (RaDOGAGA). We show the validity of explicit calculation of Jacobian is manageable, such as f (·)
RaDOGAGA in the following steps. can not reduce the dimension of x.
(1) We show that RaDOGAGA has the following properties Data interpolation with autoencoder: For smooth data
both theoretically and experimentally. interpolation, in Chen et al. (2018); Shao et al. (2018), a
• Jacobian matrix between the data observation space (inner function learns to map latent variables to geodesic (shortest
 product space endowed with a metric tensor) and latent path in a manifold) space, in which the distance corresponds
 space forms constantly-scaled orthonormal system. Thus, to the metric of data space. In Geng et al. (2020); Pai et al.
 data can be embedded in a Euclidean latent space isomet- (2019), a penalty for the anisometricity of map is added
 rically. to training loss. Although these approaches may remedy
 scale inconsistency, they do not deal with PDF estimation.
• Thanks to the property above, Pz (z) are proportional to
 Furthermore, the distance for the input data is assumed to
 the PDF of x in the data observation space. Thus, PDF
 be Euclidean distance and the cases for other distances are
 of x in the data observation space can be estimated by
 not considered.
 maximizing log-likelihood of parametric PDF Pz,ψ (z) in
 the low-dimensional Euclidean space. Deep image compression (DIC) with RDO: RD theory
 is a part of Shannon’s information theory for lossy com-
(2) Thanks to (1), RaDOGAGA outperforms the current pression which formulates the optimal condition between
state-of-the-art method in unsupervised anomaly detection information rate and distortion. The signal is decorrelated by
task with four public datasets. orthonormal transformation such as Karhunen-Loève trans-
 form (KLT) (Rao & Yip, 2000) and discrete cosine transform
Isometric Map and Notions of Differential Geometry (DCT). In RDO, a cost L = R + λD is minimized at given
Given two Riemannian manifolds M ⊂ RM and N ⊂ RN , Lagrange parameter λ. Recently, DIC methods with RDO
a map g : M → N is called isometric if (Ballé et al., 2018; Zhou et al., 2019) have been proposed.
 In these works, instead of orthonormal transform in the con-
 hv, wip = hdg(v), dg(w)ig(p) (1) ventional lossy compression method, a deep autoencoder is

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

trained for RDO. In the next section, we explain the idea of the rate-distortion trade-off at desirable rate can be realized
RDO guided autoencoder and its relationship with VAE. as discussed in Alemi et al. (2018).
Note that the beta-VAE and the RDO in image compression
3. Overview of RDO Guided Approach are analogous to each other. That is, β −1 , −Lkl , and Lrec
correspond to λ, a rate R, and a distortion D respectively.
3.1. Derivation from RDO in Transform Coding
However, the probability models of latent variables are quite
Figure 1 shows the overview of our idea based on Rate- different. VAE uses a fixed prior distribution. This causes a
Distortion Optimization (RDO) inspired by transform cod- non-linear scaling relationship between real data and latent
ing. In the transform coding, the optimal method to en- variables. Figure 2 shows the conditions to achieve RDO in
code data with Gaussian distribution is as follows (Goyal, both VAE and RaDOGAGA. In VAE, for RDO condition,
2001). At first, the data are transformed deterministically to a nonlinear scaling of the data distribution is necessary to
decorrelated data using the orthonormal transform such as fit prior. To achieve it, Brekelmans et al. (2019) suggest to
Karhunen-Loève transform (KLT) and discrete cosine trans- control a noise as a posterior variance precisely for each
form (DCT). Then these decorrelated data are quantized channel.
stochastically with uniform quantizer for all channels such
Actually, as proved in Rolı́nek et al. (2019), in the optimal
that quantization noise for each channel is equal. At last
condition, Jacobian matrix of VAE forms orthogonal sys-
optimal entropy encoding is applied to quantized data where
tem, but the norm is not constant. In RaDOGAGA, uniform
the rate can be calculated by the logarithm of symbol’s es-
quantization noises are added to all channels. Instead, para-
timated probability. From this fact, we have an intuition
metric probability distribution should be estimated as a prior.
that the columns of Jacobian matrix of autoencoder forms
As a result, the Jacobian matrix forms orthonormal system
orthonormal system if the data were compressed based on
because both orthogonality and scaling normalization are
RDO with uniform quantized noise and parametric distri-
simultaneously achieved. As discussed above, the precise
bution of latent variables. Inspired by this, we propose
noise control in VAE and parametric prior optimization in
autoencoder which scales latent variables according to the
RaDOGAGA are essentially the same. Accordingly, com-
definition of metrics of data.
plexities in both methods are estimated to be at the same
Model of Transform coding based on Rate-Distortion Theory degree.
Uniform
Image Orthonormal Compressed
Quantization / Conditions to achieve Rate-Distortion Optimization
Data Transform Decorrelated Data
Entropy coding
data Method PDF model Noise Jacobi Matrix
Source Coding Channel Coding VAE with fixed Fixed as prior Variable for Orthogonal
prior each data and (Variable
Analogy between transform coding and RDO guided autoencoder
channels scaling)
Method Source Coding Channel Coding Rate-Distortion
RDO guided Variable Uniform for all Orthonormal
(Deterministic) (Stochastic) Relation Autoencoder parametric PDF data and (Constant
Transform Condition: Condition: Result: channels scaling)
coding Orthonormal Uniform Rate-Distortion
Relationship between PDF, Noise, and Jacobian Matrix
KLT / PCA / DCT quantization / Optimization is RDO guided autoencoder
VAE
transform Entropy coding achieved x x
RDO guided Expected Result: Condition: Condition: Orthogonal Orthonormal
(Variable
Autoencoder Autoencoder with Uniform noise Rate-Distortion scaling)
Fixed PDF as prior
(Constant
Parametric PDF
scaling)
orthonormal for all channel / based Loss
Jacobi matrix Rate estimation function
z z
Posterior with varying variance Posterior with constant variance
(variable quantization noise) ( uniform quantization noise)

Figure 1. Overview of our idea. Orthonormal transformation
and uniform quantization noise bring RDO. Our idea is that Figure 2. The condition of RDO in VAE and our method. In VAE,
uniform quantization noise and RDO lead an antoencoder to be to fit the fixed prior (blue-line), data is transformed anisometrically
orthonormal. with precisely controlled noise as a posterior variance (gray area).
A wider distribution of noise make the PDF of transformed data
smaller. In our method, a parametric prior distribution is estimated,
and data is transformed isometrically with uniform noise.
3.2. Relationship with VAE
There is a number of VAE studies taking rate-distortion
trade-off into account. In VAEs, it is common to maximize 4. METHOD AND THEORETICAL
ELBO instead of maximizing log-likelihood of Px (x) di- PROPERTIES
rectly. In beta-VAE (Higgins et al., 2017), the objective
4.1. Method
function LV AE is described as LV AE = Lrec − βLkl . Here
Lkl is the KL divergence between encoder output and prior Our method is based on the RDO of the autoencoder for
distribution, usually Gaussian distribution. By changing β, image compression proposed in Ballé et al. (2018) with

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

some modifications. In Ballé et al. (2018), the cost function decomposition, we can avoid the interference between better
 reconstruction and RDO trade-off durning the training. λ1
 L = R + λD (3) controls the degree of reconstruction, and λ2 (' β −1 of
 beta-VAE) controls a scaling between data and latent spaces
consists of (i) reconstruction error D between input data and respectively.
decoder output with noise to latent variable and (ii) Rate
R of latent variable. This is analogous to beta-VAE where h(·) in the second term of Eq. (7) is a monotonically in-
λ = β −1 . creasing function. In experiments in this paper, we use
 h(d) = log(d). In the theory shown in Appendix A, better
Figure 3 depicts the architecture of our method. The details reconstruction provide much rigid orthogonality. We find
are given in the following. First, let x be an M -dimensional h(d) = log(d) is much appropriate for this purpose than
domain data, RM be a data space and Px (x) be the PDF of h(d) = d as detailed in Appendix C.
x. Then x is converted to an N -dimensional latent variable
z in the latent space RN by encoder. Let fθ (x), gφ (z), and For metric function D(· , ·), a variety of function is ap-
Pz,ψ (z) be parametric encoder, decoder, and PDF of latent plicable. As long as the function can be approximated by
variable with parameters θ, φ, and ψ. Note that both encoder the following quadratic form in the neighborhood of x, the
and decoder are deterministic while the encoder of VAE is property of the model holds:
stochastic. Then latent variable z and decoded data x̂ are
generated as bellow: D(x, x + ∆x) ' ∆x> A(x) ∆x. (8)

 z = fθ (x), x̂ = gφ (z). (4) Here, ∆x is an arbitrary infinitesimal variation of x, A(x)
 is an M ×M positive definite matrix depending on x. When
Let  ∈ RN be a noise vector to emulate uniform quantiza- D(· , ·) is square of L2 norm, A(x) is the identity ma-
tion, where each component is independent from others and trix. For another instance, a cost with structure similarity
has an equal average 0 and an equal variance σ 2 : (SSIM (Wang et al., 2004)) and binary cross entropy (BCE)
 can also be approximated in a quadratic form. This is ex-
  = (1 , 2 , ..N ), E [i ] = 0, E [i j ] = δij σ 2 , (5) plained in Appendix D. By deriving parameters that mini-
 mize the average of Eq. (7) according to x ∼ Px (x) and
where δij denotes the Kronecker’s delta. Then, given the  ∼ P (), the encoder, decoder, and probability distribu-
sum of latent variable z and noise , the decoder output x̆ tion of the latent space are trained as
is obtained as
 x̆ = gφ (z + ). (6) θ, φ, ψ = arg min(Ex∼Px (x), ∼P () [ L ]). (9)
 θ,φ,ψ
This is analogous to the stochastic sampling and decoding
procedure in VAE. Here, the cost function is defined based
on Eq. (3) with some modifications as follows: ෝ))
 ℎ( ( , 
 
 Encoder Decoder
L = − log(Pz,ψ (z)) + λ1 h (D (x, x̂)) + λ2 D (x̂, x̆) . (7) Input data ෝ
 
The first term corresponds to the estimated rate of the la- D(ෝ ෕)
 , 

tent variable. We can use arbitrary probabilistic model as Parametric PDF of 
 +
 Decoder
 , , ( ) ( + ) ෕
 
Pz,ψ (z). For example, Ballé et al. (2018) uses univariate = −log( , , )
 QN ~ ( )
independent (factorized) model Pz,ψ (z) = i=1 Pzi ,ψ (zi ).
In this work, a parametric function cψ (zi ) outputs cumula-
tive distribution function of zi . A rate for quantized symbol Figure 3. Architecture of RaDOGAGA
is calculated by cψ (z + 12 )−cψ (z − 21 ), assuming the symbol
is quantized with the side length of 1. To model by GMM
 4.2. Theoretical Properties
like Zong et al. (2018) is another instance.
 In this section, we explain the theoretical properties of the
D(x(1) , x(2) ) in the second and the third term is a met-
 method. To show the essence in a simple form, we first
ric function between points x(1) and x(2) ∈ RM . Actu-
 (formally) consider the case M = N . The theory for M >
ally, the second and the third term is decomposition of
 N is then outlined. All details are given in Appendices.
D (x, x̆) ' D (x, x̂) + D (x̂, x̆) as shown in Rolı́nek et al.
(2019). The second term in Eq. (7) purely calculate re- We begin with examining the condition to minimize the
construction loss as an autoencoder. In the RDO, the con- loss function analytically, assuming that the reconstruc-
sideration is trade-off between rate (the first term) and the tion part is trained enough so that x ' x̂. In this case,
distortion by the quantization noise (the third term). By this the second term in Eq. (7) can be ignored. Let J (z) =

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

∂x/∂z ∈ RN ×N be the Jacobian matrix between the data determined by A(x) (data observation space). In practice,
space and the latent space, which is assumed to be full- the generative probability of x is typically considered in
rank Pat every point. Then, x̆ − x̂ can be approximated as the Euclidean space. Therefore, it is useful to convert the
 N
´ = i=1 i (∂x/∂zi ) by Taylor expansion. By applying PDF of x considered in the data observation space to that
E[i j ] = σ 2 δij and Eq. (8), the expectation of the third considered in the Euclidean space.
term in Eq. (7) is re-written as
 First, we consider the case M = N . Note that Eq. (13)
 N  >   can also be expressed as follows: J (z)> A(x)J (z) =
  > X ∂x ∂x
  = σ2 (1/2λ2 σ 2 )IN (IN is the N × N identity matrix). We get
 
 E ´ A(x)´ A(x) . (10)
∼P ()
 j=1
 ∂zj ∂zj the following equation by taking the determinants of the
 both sides of this equation and using the properties of the de-
As is well known, the relation between Pz (z) and Px (x) terminant: | det(J (z))| = (1/2λ2 σ 2 )N/2 det(A(x))−1/2 .
in such case is described as Pz (z) = | det(J (z))|Px (x). QN
 Note that det(A(x)) = j=1 αj (A(x)), where 0 <
Expectation of L in Eq. (7) is approximated as
 α1 (A(x)) ≤ · · · ≤ αN (A(x)) are the eigenvalues of
 E [L] ' − log(| det(J (z))|) − log(Px (x)) A(x). Thus, Pz (z) and Px (x) (PDF considered in the
 ∼P ()
 Euclidean space) is related in the following form:
 N  >  
 2
 X ∂x ∂x  N2  Y
 N − 12
 + λ2 σ A(x) .
 
 (11) 1
 ∂zj ∂zj Pz (z) = αj (A(x)) Px (x). (15)
 j=1 2λ2 σ 2 j=1
By differentiating Eq. (11) by ∂x/∂zj , the following equa-
tion is derived as a condition to minimize the expected loss: To consider the relationship of Pz (z) and Px (x) for M >
   N , we follow the manifold hypothesis and assume the sit-
 ∂x 1
 2λ2 σ 2 A(x) = J˜(z):,j , (12) uation where the data x substantially exists in the vicin-
 ∂zj det(J (z)) ity of a low dimensional manifold Mx , and z ∈ RN
 can sufficiently capture its feature. In such case, we
where J˜(z):,j ∈ RN is j-th column vector of the cofac-
 can regard the distribution of x away from Mx is neg-
tor matrix of J (z). According to the trait of cofactor ma-
 ligible and the ratio of Pz (z) and Px (x) is equivalent
trix, (∂x/∂zi )> J˜(z):,j = δij det(J (z)) holds. Thus, the
 to that of the volumes of corresponding regions in RN
following relation is obtained by multiplying Eq. (12) by
 and RM . This ratio is shown to be Jsv (z), the product
(∂x/∂zi )> from the left and rearranging the results:
 of the singular values of J (z), and we get the relation
 
 ∂x
 > 
 ∂x
 
 1 Pz (z) = Jsv (z)Px (x). We can further show that Jsv (z)
 QN
 A(x) = δij . (13) is also (1/2λ2 σ 2 )N/2 ( j=1 αj (A(x)))−1/2 under a cer-
 ∂zi ∂zj 2λ2 σ 2
 tain condition that includes the case A(x) = IM (see
This means that the columns of the Jacobian matrix of two Appendix B). Consequently, Eq. (15) holds even for the
spaces form a constantly-scaled orthonormal system with case M > N . As a result, when we obtain a param-
respect to the inner product defined by A(x) for all z, x. eter ψ attaining Pz,ψ (z) ' Pz (z) by training, Px (x)
 is proportional to Pz,ψ (z) with metric depending scaling
Given tangent vectors vz and wz in the tangent space of QN
 ( j=1 αj (A(x)))1/2 as:
RN at z, let vx and wx be corresponding tangent vectors in
the data space. The following relation holds due to Eq. (13), N
 Y  12
which means the map is isometric in the sense of Eq. (2): Px (x) ∝ αj (A(x)) Pz,ψ (z). (16)
 N X
 N  >   j=1
 X ∂x ∂x
 vx A(x)wx = vzi A(x) wzj In the case of A(x) = IM (or more generally κIM for a
 i=0 j=0
 ∂zi ∂zj
 constant κ > 0 ), Px (x) is simply propotional to Pz,ψ (z):
 N
 1 X 1
 = vz w z = vz · wz . (14) Px (x) ∝ Pz,ψ (z). (17)
 2λ2 σ 2 i=0 i i 2λ2 σ 2

Even for the case M > N , equations in the same form as 5. Experimental Validations
Eqs. (13) and (14) can be derived essentially in the same
 We here show the properties of our method experimentally.
manner (Appendix A), namely, RaDOGAGA achieves iso-
 In Section 5.1, we examine the isometricity of the map
metric data embedding for the case M > N as well.
 as in Eq. (2) with real data. In Section 5.2, we confirm
Since the data embedding is isometric, Pz (z) is propor- the proportionality of PDFs as in Eq. (15) with toy data. In
tional to that of x considered in the inner product space Section 5.3, the usefulness is validated in anomaly detection.

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space
 × − × − 
5.1. Isometric Embedding 1.5 6
 r = 0.71 r = 0.63

 beta-VAE
 ( ) 

 ( ) 
In this section, we confirm that our model can embed data
 0 0
in the latent space isometrically. First, randomly picked
data point x is mapped to z(= f (x)). Then, let vz be a -1.5 -6
small tangent vector in the latent space. The corresponding − . 0 . − . 0 . 
 � � 
tangent vector in the data space vx is approximated by g(z + × − × − 
 3 1

 RaDOGAGA
vz )−g(z). Given randomly generated two different tangent r = 0.97 r = 0.97

 ( ) 

 ( ) 
vectors vz and wz , vz · wz is compared with vx > A(x)wx
 0 0
for randomly picked up x. We use CelebA dataset (Liu
et al., 2015)1 which consists of 202,599 celebrity images. -3 -1
Images are center-cropped with a size of 64 x 64. − . 0 . − . 0 . 
 � � 
 (a) M SE (b) 1 − SSIM
5.1.1. CONFIGURATION
In this experiment, factorized distributions (Ballé et al., Figure 4. Plot of vz · wz (horizontal axis) and vx > A(x)wx (ver-
2018) is used to estimate Pz,ψ (z) 2 . Encoder part is con- tical axis). In beta-VAE (top row), the correlation is week while in
structed with 4 convolution (CNN) layers and 2 fully con- our method (bottom row) we can observe proportionality.
nected (FC) layers. For CNN layers, the kernel size is 9×9
for the first one and 5×5 for the rest. The dimension is 1.0 1.0

 Variance of z

 Variance of z
64, stride size is 2, and activation function is generalized
divisive normalization (GDN) (Ballé et al., 2016), which
is suitable for image compression, for all layers. The di-
mension of FC layers is 8192 and 256. For the first one, 0.0 0.0
sof tplus function is attached. The decoder part is the in- 0 127 255 0 127 255
 z z
verse form of the encoder. For comparison, we evaluate
beta-VAE with the same form of autoencoder with 256- (a) beta-VAE (b) RaDOGAGA
dimensional z. In this experiment, we test two different met- Figure 5. Variance of z. In beta-VAE, variance of all dimension is
 1
rics M SE, where A(x) = M IM  . and 1 − SSIM , where trained to be 1. In RaDOGAGA, the energy is concentrated in few
A(x) = 1
 2µx 2 Wm + 1
 2σx 2 Wv . Wm ∈ RM ×M is a ma- dimensions.
trix such that allelement is M12 and Wv = M 1
 IM − Wm .
Note that, in practice, 1 − SSIM for an image is calcu- vx > A(x)wx are almost proportional regardless of metric
lated with a small window. In this experiment the window function. The correlation coefficients r reach to 0.97, while
size is 11×11, and this local calculation is done for entire that of VAE are around 0.7. It can be seen that our method
image with the stride size of 1. The cost is the average enables isometric embedding to a Euclidean space even
of local values. For the second term in Eq. (7), h(d) is with this large scale real dataset. For the reader with interest,
log(d) and A(x) = M 1
 IM . For beta-VAE, we set β −1 as we provide the experimental result with MNIST dataset in
 5 4
1 × 10 and 1 × 10 regarding to the training with M SE Appendix F.
and 1 − SSIM respectively. For RaDOGAGA, (λ1 , λ2 )
is (0.1, 0.1) and (0.2, 0.1). Optimization is done by Adam 5.1.3. C ONSISTENCY TO NASH E MBEDDING T HEOREM
optimizer (Kingma & Ba, 2014) with learning rate 1 × 10−4 .
All models are trained so that the 1 − SSIM between input As explained in Introduction, the Nash embedding theo-
and reconstructed images is around 0.05. rem and the manifold hypothesis is behind our exploration
 for isometric embedding of real data. Here, the question
5.1.2. R ESULTS is whether the trained model satisfied the condition that
 dim Mx < N . With RaDOGAGA, we can confirm it by
Figure 4 depicts vz · wz (horizontal axis) and vx > A(x)wx observing the variance of each latent variable. Because the
(vertical axis). The top row is result of VAE and the bottom Jacobian matrix forms orthonormal system, RaDOGAGA
row shows that of our model. In our method, vz · wz and can work like principal component analysis (PCA) and eval-
 1 uates the importance of each latent variable. The theoretical
 http://mmlab.ie.cuhk.edu.hk/projects/
CelebA.html proof for this property is described in Appendix E. Figure 5
 2
 Implementation is done with a library for Tensor- shows the variance of each dimension of the model trained
Flow provided at https://github.com/tensorflow/ with M SE. The variance concentrates on the few dimen-
compression with default parameters. sions. This means that RN is large enough to represent the
 data. Figure 13a shows decoder outputs when each com-

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

 − + − + 
 0 0
 1 1
 2 2
 3 3
 4 4
 5 5
 6 6
 7 7
 8 8
 (a) beta-VAE (b) RaDOGAGA

Figure 6. Latent space traversal. In beta-VAE, some latent variables does not influence on the visual so much even though they have
almost the same variance. In RaDOGAGA, all latent variables with large variance have important information for image.

ponent zi is traversed from −2σ to 2σ, fixing rest of z as points s = (s1 , s2 ...s10,000 ) from three-dimensional GMM
mean µ. Note the index i is arranged in descending order consists of three mixture-components with mixture weight
of σ 2 . Here, σ 2 and µ for i-th dimension of z(= f (x)) are π = (1/3, 1/3, 1/3), mean µ1 = (0, 0, 0), µ2 = (15, 0, 0),
V ar[zi ] and E[zi ] respectively with all data samples. From µ2 = (15, 15, 15), and covariance Σk = diag(1, 2, 3). k is
the top, each row corresponds to z0 , z1 , z2 ..., and the center the index for mixture-component. Then, we scatter s with
 uniform random noise u ∈ R3×16 , udm ∼ Ud − 12 , 21 ,
 
column is mean. In Fig. 6b, the image changes visually
in any dimension of z, while in Fig. 6a, depending on the where d and m are index for dimension of sampled data and
dimension i, there are cases where no significant changes scattered data. The ud s are uncorrelated withP each other.
 3
can be seen (such as z1 , z2 , z3 , and so on). In summary, we We produce 16-dimensional input data as x = d=1 ud sd
can qualitatively observe that V ar[zi ] corresponds to the with a sample number of 10,000 in the end. The appear-
eigenvalue of PCA, that is, a latent variable with larger σ ance probability of input data Px (x) equals to generation
have bigger impact on image. probability of s.
These results suggest the important information to express
 5.2.1. CONFIGURATION
data is concentrated in the lower few dimension and the
dimension number of 256 was large enough to satisfy In the experiment, we estimate the Pz,ψ (z) using GMM
dim Mx < N . To confirm the sufficiency of the dimension with parameter ψ as in DAGMM (Zong et al., 2018). Instead
is difficult in beta-VAE because the σ 2 should be 1 for all of EM algorithm, GMM parameters are learned using Esti-
dimension since it is trained to fit to prior. However, some mation Network (EN), which consists of multi-layer neural
dimension has a large impact on the image, meaning the σ network. When the GMM consists of N -dimensional Gaus-
does not work as the measure of importance. sian distribution NN (z; µ, Σ) with K mixture-components,
 and L is the size of batch samples, EN outputs the mixture-
We believe this PCA-like trait is very useful for the interpre-
 components membership prediction as K-dimensional vec-
tation of latent variables. For instance, if the metric function
 tor γ
 b as follows:
were designed so as to reflect semantics, an important vari-
able for a semantics is easily found. Furthermore, we argue p = EN (z; ψ), γ
 b = sof tmax(p). (18)
this is a promising way to capture the minimal feature to
express data, that is one of the goals of machine learning. Then, k-th mixture weight π bk , mean µ b k , covariance Σb k,
 PL R of z are further
 and entropy Pcalculated P as follows:
5.2. PDF Estimation with Toy Data π
 bk = l=1 γ blk /L, µ bk = L l=1 γ
 b lk z l /
 L
 l=1 γ
 blk ,
 P L >
 P L
In this section, we describe our experiment using toy data to Σ
 bk = γ
 Plk l
 l=1 b (z − µ
 b k )(zl − µ
 b )
 k / γ
 l=1 lk ,
 b
 K
demonstrate whether the probability density of the input data R = − log k=1NN (z; µbk , Σ
 b k) .
Px (x) and that of estimated in the latent space Pz,ψ (z) are
proportional to each other as in theory. First, we sample data Overall network is trained by Eqs. (7) and (9). In this ex-
 periment, we set D(x1 , x2 ) as square of the L2 norm be-

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

 PDF
cause the input data is generated obeying the PDF in the (High) S3 × − 

Euclidean space. We test two types of h(·), h(d) = d S1 6
 z2
and h(d) = log(d) and denote models trained with these S2 2
h(·) as RaDOGAGA(d) and RaDOGAGA(log(d)) respec- -2
 -9
 (Low)
 -14 z1
 z0
tively. We used DAGMM as a baseline method. DAGMM 0

also consists of encoder, decoder, and EN. In DAGMM, (a) Input source s (b) DAGMM
to avoid falling into the trivial solution that entropy is 2.0 1.0
minimized when the diagonal component of the covari- z2 z2
ance matrix is 0, the inverse of the diagonal component 0.0
 0.4
 b = PK PN Σ
P (Σ) b −1 -0.5 -0.9
 k=1 i=1 ki is added to the cost: -2.0 z1 0.0 z1
 z0 0.0 -3.0 z0 0.6 1.4

 (c) RaDOGAGA(d) (d) RaDOGAGA(log(d))

 L = kx − x̂k2 + λ1 (− log(Pz,ψ (z))) + λ2 P (Σ).
 b (19) Figure 7. Plot of source of input data s and latent variables z. The
 color bar located left of (a) corresponds to normalized PDF. Both
 DAGMM and RaDOGAGA capture three mixture-components, but
The only differences between our model and DAGMM is the PDF in DAGMM looks different from input data source. Points
 with high PDF do not concentrate on the center of the cluster
regulation term P (Σ)b is replaced by D(x̂, x̆). The model
 especially in upper right one.
size (the number of parameter) is the same Thus, model
complexity such as the number of parameter is the same.
 , ( ) , ( ) , ( )
For both RaDOGAGA and DAGMM, the autoencoder part 400 800 700

is constructed with fully connected (FC) layers with sizes
 200 400 350
of 64, 32, 16, 3, 16, 32, and 64. For all FC layers except
for the last of the encoder and the decoder, we use tanh
 0 0.02 0 0.02 0 0.02
 ( ) ( ) ( )
as the activation function. The EN part is also constructed
with FC layer with a size of 10, 3. For the first layer, we use (a) DAGMM (b) Ours (d) (c) Ours (log(d))
the tanh as activation function and dropout (ratio=0.5). For
the last one, sof tmax is used. (λ1 , λ2 ) is set as (1 × 10−4 , Figure 8. Plot of Px (x) vs Pz,ψ (z). In RaDOGAGA, Px (x) and
1 × 10−9 ), (1 × 106 , 1 × 103 ) and (1 × 103 , 1 × 103 ) Pz,ψ (z) are proportional while we can not see that in DAGMM.
for DAGMM, RaDOGAGA(d) and RaDOGAGA(log(d))
respectively. We optimize all models by Adam with learning
rate 1 × 10−4 . We set σ 2 as 1/12. 5.3. Anomaly Detection Using Real Data

5.2.2. R ESULTS We here examine whether the clear relationship between
 Px (x) and Pz,ψ (z) is useful in anomaly detection in which
Figure 7 displays the distribution of input data source s and PDF estimation is key issue. We use four public datasets‡ :
latent variable z. Even though both methods can capture KDDCUP99, Thyroid, Arrhythmia, and KDDCUP-Rev.
that s is generated from three mixture-components, there The (instance number, dimension, anomaly ratio(%)) of
is a difference in PDFs. Since the data is generated from each dataset is (494021, 121, 20), (3772, 6, 2.5), (452, 274,
GMM, the value of PDF gets higher as the sample gets 15), and (121597, 121, 20). Detail of datasets is given in
closer to the centers of clusters. Although, in DAGMM, Appendix G.
this tendency looks distorted. This difference is further
demonstrated in Fig. 8 which shows a plot of Px (x) (hori- 5.3.1. EXPERIMENTAL SETUP
zontal axis) against Pz,ψ (z) (vertical axis). In our method,
 For a fair comparison with previous works, we follow the
Px (x) and Pz,ψ (z) are almost proportional to each other as
 setting in Zong et al. (2018). Randomly extracted 50%
in theory while we cannot observe such proportionality in
 of the data is assigned to training and the rest to testing.
DAGMM. This difference is quantitatively obvious as well.
 During training, only normal data is used. During the test,
That is, correlation coefficients between Px (x) and Pz,ψ (z)
 the entropy R for each sample is calculated as the anomaly
are 0.882 (DAGMM), 0.997 (RaDOGAGA(d)), and 0.998
 score, and if the anomaly score is higher than a threshold,
(RaDOGAGA(log(d))). We can also observe that in RaDO-
 it is detected as an anomaly. The threshold is determined
GAGA(d), there is a slight distortion in its proportionality
 by the ratio of anomaly data in each data set. For example,
in the area of Px (x) < 0.01. When Pz,ψ (z) is sufficiently
 in KDDCup99, data with R in the top 20 % is detected as
fitted, h(d) = log(d) makes Px (x) and Pz,ψ (z) be propor-
tional more rigidly. More detail is described in Appendix C. ‡
 Datasets can be downloaded at https://kdd.ics.uci.
 edu/ and http://odds.cs.stonybrook.edu.

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

 Table 1. Average and standard deviations (in brackets) of Precision, Recall and F1
 Dataset Methods Precision Recall F1
 ALAD∗ 0.9427 (0.0018) 0.9577 (0.0018) 0.9501 (0.0018)
 INRF∗ 0.9452 (0.0105) 0.9600 (0.0113) 0.9525 (0.0108)
 GMVAE∗ 0.952 0.9141 0.9326
 KDDCup DAGMM 0.9427 (0.0052) 0.9575 (0.0053) 0.9500 (0.0052)
 RaDOGAGA(d) 0.9550 (0.0037) 0.9700 (0.0038) 0.9624 (0.0038)
 RaDOGAGA(log(d)) 0.9563 (0.0042) 0.9714 (0.0042) 0.9638 (0.0042)
 GMVAE∗ 0.7105 0.5745 0.6353
 DAGMM 0.4656 (0.0481) 0.4859 (0.0502) 0.4755 (0.0491)
 Thyroid RaDOGAGA(d) 0.6313 (0.0476) 0.6587 (0.0496) 0.6447 (0.0486)
 RaDOGAGA(log(d)) 0.6562 (0.0572) 0.6848 (0.0597) 0.6702 (0.0585)
 ALAD∗ 0.5000 (0.0208) 0.5313 (0.0221) 0.5152 (0.0214)
 GMVAE∗ 0.4375 0.4242 0.4308
 DAGMM 0.4985 (0.0389) 0.5136 (0.0401) 0.5060 (0.0395)
 Arrythmia
 RaDOGAGA(d) 0.5353 (0.0461) 0.5515 (0.0475) 0.5433 (0.0468)
 RaDOGAGA(log(d)) 0.5294 (0.0405) 0.5455 (0.0418) 0.5373 (0.0411)
 DAGMM 0.9778 (0.0018) 0.9779 (0.0017) 0.9779 (0.0018)
 KDDCup-rev RaDOGAGA(d) 0.9768 (0.0033) 0.9827 (0.0012) 0.9797 (0.0015)
 RaDOGAGA(log(d)) 0.9864 (0.0009) 0.9865 (0.0009) 0.9865 (0.0009)
 ∗
 Scores are cited from Zenati et al. (2018) (ALAD), Song & Ou (2018) (INRF), and Liao et al. (2018) (GMVAE).

an anomaly. As metrics, precision, recall, and F1 score are except for Arrhythmia. This result suggests a much rigid
calculated. We run experiments 20 times for each dataset orthonormality likely to bring better performance.
split by 20 different random seeds.
 6. Conclusion
5.3.2. BASELINE METHODS
 In this paper, we propose RaDOGAGA which embeds data
As in the previous section, DAGMM is taken as a baseline
 in a low-dimensional Euclidean space isometrically. With
method. We also compare with the scores reported in previ-
 RaDOGAGA, the relation of latent variables and data is
ous works that conducted the same experiment (Zenati et al.,
 quantitatively tractable. For instance, Pz,ψ (z) obtained by
2018; Song & Ou, 2018; Liao et al., 2018).
 the proposed method is proportional to the PDF of x in the
 data observation space. Furthermore, thanks to these prop-
5.3.3. CONFIGURATION
 erties, we achieve state-of-the-art performance in anomaly
As in Zong et al. (2018), in addition to the output from the detection.
 0 0
encoder, kx−x
 kxk2
 k2
 and kxkx·x 0
 2 kx k2
 are concatenated to z. It is Although we focused on the PDF estimation as a practical
sent to EN. Note that z is sent to the decoder before con- task in this paper, the properties of RaDOGAGA will benefit
catenation. Other configuration except for hyper parameter a variety of applications. For instance, data interpolation
is same as in the previous experiment. Hyper parameter will be easier because a straight line in the latent space is
for each dataset is described in Appendix G. Input data is geodesic in the data space. It also may help the unsupervised
max-min normalized. or semi-supervised learning since distance of z reliably re-
 flects distance of x. Furthermore, our method will promote
5.3.4. R ESULTS disentanglement because thanks to the orthonormality, PCA-
Table 1 reports the average scores and standard deviations like analysis is possible.
(in brackets). Compared to DAGMM, RaDOGAGA has a To capture essential features of data, it is important to fairly
better performance regardless of types of h(·). Note that, evaluate the significance of latent variables. Since isometric
our model does not increase model complexity at all. Sim- embedding promise this fairness, we believe RaDOGAGA
ply introducing the RDO mechanism to the autoencoder is will bring a Breakthru for generative analysis. As future
effective for anomaly detection. Moreover, our approach work, we explore the usefulness in various tasks mentioned
achieves the highest performance compared to other meth- above.
ods. RaDOGAGA(log(d)) is superior to RaDOGAGA(d)

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

Acknowledgement Geng, C., Wang, J., Chen, L., Bao, W., Chu, C., and Gao, Z.
 Uniform interpolation constrained geodesic learning on
We gratitude to Naoki Hamada, Ramya Srinivasan, Kentaro data manifold. arXiv preprint arXiv:2002.04829, 2020.
Takemoto, Moyuru Yamada, Tomoya Iwakura, and Hiyori
Yoshikawa for constructive discussion. Goyal, V. K. Theoretical foundations of transform coding.
 IEEE Signal Processing Magazine, 18(5):9–21, 2001.
References
 Han, Q. and Hong, J.-X. Isometric Embedding of Rieman-
Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., nian Manifolds in Euclidean Spaces. American Mathe-
 and Murphy, K. Fixing a broken ELBO. In Proceedings matical Society, 2006.
 of the International Conference on Machine Learning
 (ICML), pp. 159–168, 2018. Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X.,
 Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE:
Ballé, J., Laparra, V., and Simoncelli, E. Density modeling Learning basic visual concepts with a constrained vari-
 of images using a generalized normalization transforma- ational framework. In Proceedings of the International
 tion. In Proceedings of the International Conference on Conference on Learning Representations (ICLR), 2017.
 Learning Representations (ICLR), 2016.
 Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge
Ballé, J., Minnen, D., Singh, S., Hwang, S. J., and Johnston, University Press, 2nd edition, 2013.
 N. Variational image compression with a scale hyper-
 prior. In Proceedings of the International Conference on Johnson, M., Duvenaud, D., Wiltschko, A. B., Datta, S. R.,
 Learning Representations (ICLR), 2018. and Adams, R. P. Composing graphical models with
 neural networks for structured representations and fast
Bengio, Y., Courville, C., and Vincent, P. Representation
 inference. In Proceedings of the Advances in Neural
 learning: A review and new perspectives. IEEE Transac-
 Information Processing Systems, 2016.
 tions on Pattern Analysis and Machine Intelligence, 35
 (8):798–1828, 2013. Kingma, D. P. and Ba, J. Adam: A method for stochastic
 optimization. arXiv preprint arXiv:1412.6980, 2014.
Berger, T. (ed.). Rate Distortion Theory: A Mathematical
 Basis for Data Compression. Prentice Hall, 1971. Kingma, D. P. and Dhariwal, P. Glow: Generative flow
Bernstein, M., De Silva, V., Langford, J. C., and Tenenbaum, with invertible 1x1 convolutions. In Proceedings of the
 J. B. Graph approximations to geodesics on embedded Advances in Neural Information Processing Systems, pp.
 manifolds. Technical report, Stanford University, 2000. 10215–10224, 2018.

Brekelmans, R., Moyer, D., Galstyan, A., and Ver Steeg, Kingma, D. P. and Welling, M. Auto-encoding variational
 G. Exact rate-distortion in autoencoders via echo noise. Bayes. In Proceedings of the International Conference
 In Proceedings of the Advances in Neural Information on Learning Representations (ICLR), 2014.
 Processing Systems, pp. 3889–3900, 2019.
 LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-
Chalapathy, R. and Chawla, S. Deep learning for anomaly based learning applied to document recognition. Proceed-
 detection: A survey. arXiv preprint arXiv:1901.03407, ings of the IEEE, 86(11):2278–2324, 1998.
 2019.
 Liao, W., Guo, Y., Chen, X., and Li, P. A unified unsuper-
Chen, N., , Klushyn, A., Kurle, Jiang, X., Bayer, J., and vised Gaussian mixture variational autoencoder for high
 van der Smagt, P. Metrics for deep generative models. In dimensional outlier detection. In Proceedings of the IEEE
 Proceedings of the International Conference on Artifcial International Conference on Big Data, pp. 1208–1217,
 Intelligence and Statistics (AISTATS), pp. 1540–1550, 2018.
 2018.
 Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning
Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear in- face attributes in the wild. In Proceedings of the IEEE
 dependent components estimation. In Proceedings of the International Conference on Computer Vision (ICCV), pp.
 International Conference on Learning Representations 3730–3738, 2015.
 (ICLR) Workshop, 2015.
 McQueen, J., Meila, M., and Joncas, D. Nearly isometric
Dua, D. and Graff, C. UCI machine learning repository. embedding by relaxation. In Advances in Neural Infor-
 http://archive.ics.uci.edu/ml, 2019. mation Processing Systems, pp. 2631–2639, 2016.

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

Pai, G., Talmon, R., Alex, B., and Kimmel, R. DIMAL:
 Deep isometric manifold learning using sparse geodesic
 sampling. In IEEE Winter Conference on Applications of
 Computer Vision (WACV), pp. 819–828, 2019.
Rao, K. R. and Yip, P. (eds.). The Transform and Data
 Compression Handbook. CRC Press, Inc., 2000.
Rolı́nek, M., Zietlow, D., and Martius, G. Variational au-
 toencoders pursue PCA directions (by accident). In Pro-
 ceedings of the IEEE Conference on Computer Vision and
 Pattern Recognition (CVPR), pp. 12406–12415, 2019.
Shao, H., Kumar, A., and Fletcher, P. T. The Riemannian
 geometry of deep generative models. In Proceedings of
 the IEEE Conference on Computer Vision and Pattern
 Recognition (CVPR) Workshops, pp. 315–323, 2018.
Song, Y. and Ou, Z. Learning neural random fields
 with inclusive auxiliary generators. arXiv preprint
 arXiv:1806.00271, 2018.
Strang, G. Linear Algebra and its Applications. Cengage
 Learning, 4th edition, 2006.
Sullivan, G. J. and Wiegand, T. Rate-distortion optimization
 for video compression. IEEE Signal Processing Maga-
 zine, 15(6):74–90, 1998.
Teoh, H. S. Formula for vector rotation in arbitrary
 planes in Rn . http://eusebeia.dyndns.org/
 4d/genrot.pdf, 2005.
Tschannen, M., Bachem, O., and Lucic, M. Recent ad-
 vances in autoencoder-based representation learning. In
 Proceedings of the Third Workshop on Bayesian Deep
 Learning (NeurIPS 2018 Workshop), 2018.
Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli,
 E. P. Image quality assessment: from error visibility
 to structural similarity. IEEE Transactions on Image
 Processing, 13(4):600–612, 2004.
Zenati, H., Romain, M., Foo, C.-S., Lecouat, B., and Chan-
 drasekhar, V. Adversarially learned anomaly detection.
 In Proceedings of the IEEE International Conference on
 Data Mining (ICDM), pp. 727–736, 2018.
Zhou, J., Wen, S., Nakagawa, A., Kazui, K., and Tan, Z.
 Multi-scale and context-adaptive entropy model for im-
 age compression. In Proceedings of the Workshop and
 Challenge on Learned Image Compression (CVPR 2019
 Workshop), pp. 4321–4324, 2019.
Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu,
 C., Cho, D., and Chen, H. Deep autoencoding Gaussian
 mixture model for unsupervised anomaly detection. In
 Proceedings of the International Conference on Learning
 Representations (ICLR), 2018.

You can also read