# Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space

←

→

**Page content transcription**

If your browser does not render page correctly, please read the page content below

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space Keizo Kato 1 Jing Zhou 2 Tomotake Sasaki 1 Akira Nakagawa 1 Abstract model with univariate Gaussian priors (Kingma & Welling, To analyze high-dimensional and complex data 2014). For more flexible estimation of Pz (z), successor in the real world, deep generative models such as models has been proposed, such as using Gaussian mixture variational autoencoder (VAE) embed data in a model (GMM) (Zong et al., 2018), combining univariate reduced dimensional latent space and learn the Gaussian model and GMM (Liao et al., 2018), etc. probabilistic model in the latent space. How- In the tasks where the quantitative analysis is vital, Px (x) ever, they struggle to reproduce the probability should be reproduced from Pz (z). For instance, in anomaly distribution function (PDF) in the input space detection, the anomaly likelihood is calculated based on from that of the latent space accurately. If the PDF value of data sample (Chalapathy & Chawla, 2019). embedding were isometric, this problem can be However, the embedding of VAEs is not isometric, that solved since PDFs in both spaces become propor- is, distance between data points x(1) and data x(2) is in- tional. To achieve isometric property, we propose consistent to the distance of corresponding latent variables Rate-Distortion Optimization guided autoencoder z (1) and z (2) (Chen et al., 2018; Shao et al., 2018; Geng inspired by orthonormal transform coding. We et al., 2020). Obviously mere estimation of Pz (z) cannot show our method has the following properties: (i) be the substitution of estimation for Px (x) under such sit- the columns of the Jacobian matrix between two uation. As McQueen et al. (2016) mentioned, for reliable spaces is constantly-scaled orthonormal system data analysis, the isometric embedding in low-dimensional and enable to embed data in latent space isomet- space is necessary. In addition, to utilize the standard PDF rically; (ii) the PDF of the latent space is propor- estimation techniques, the latent space is preferred to be a tional to that of the data observation space. Fur- Euclidean space. Despite of its importance, this point is not thermore, our method outperforms state-of-the-art considered even in methods developed for the quantitative methods in unsupervised anomaly detection with analysis of PDF (Johnson et al., 2016; Zong et al., 2018; four public datasets. Liao et al., 2018; Zenati et al., 2018; Song & Ou, 2018). According to the Nash embedding theorem, an arbitrary smooth and compact Riemannian manifold M can be em- 1. Introduction bedded in a Euclidean space RN (N ≥ dim M + 1, suffi- Capturing the inherent features of a dataset from high- ciently large) isometrically (Han & Hong, 2006). Besides dimensional and complex data is an essential issue in ma- that, the manifold hypothesis argues that real-world data chine learning. Generative model approach learns the prob- presented in a high-dimensional space concentrates in the ability distribution of data, aiming at data generation, unsu- vicinity of a much lower dimensional manifold Mx ⊂ RM pervised learning, disentanglement, etc. (Tschannen et al., (Bengio et al., 2013). Based on these theories, it is ex- 2018). It is generally difficult to directly estimate a prob- pected that the input data x can be embedded isomet- ability density function (PDF) Px (x) of high-dimensional rically in a low-dimensional Euclidean space RN when data x ∈ RM . Instead, one promising approach is to embed dim Mx < N M . Although the existence of the iso- x in a low-dimensional latent variable z ∈ RN (N < M ), metric embedding was proved, the method to achieve it and capture PDF Pz (z). Variational autoencoder (VAE) is a has been challenging. Some previous works proposed al- widely used generative model to capture z as a probabilistic gorithms to do that (McQueen et al., 2016; Bernstein et al., 1 2000). Yet, they do not deal with high-dimensional input FUJITSU LABORATORIES LTD. 2 Fujitsu R&D Center Co., data such as images. Another thing to consider is the dis- Ltd.. Correspondence to: kato.keizo, anaka . tance on Mx may be defined by the data tendency. For Proceedings of the 37 th International Conference on Machine instance, we can choose binary cross entropy (BCE) for Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the binary data, structured similarity (SSIM) for image, etc. the author(s).

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space As a whole, our challenge is to develop a deep generative holds. Here v and w are tangent vectors in Tp M (tan- model which guarantees the isometric embedding even for gent space of M at p ∈ M) represented as elements of the high-dimentional data observed in the Mx endowed RM and dg is the differential of g (this can be written with an arbitrary distance definition. as a Jacobian matrix). hv, wip = v > AM (p)w, where AM (p) ∈ RM ×M is a metric tensor represented as a posi- Mathematically, the condition of isometric embedding to tive define matrix. The inner product in the right side is also Euclidean space is equivalent to that the columns of the defined by another metric tensor AN (q) ∈ RN ×N . AM (p) Jacobian matrix between two spaces form an orthonormal or AN (q) is an identity matrix for a Euclidean case and the system. When we turn our sight to conventional image inner product becomes the standard one (the dot product). compression area, orthonormal transform is necessary for efficient compression. This is proved by Rate-Distortion In this paper, we slightly abuse the terminology and call a theory (Berger, 1971) . Furthermore, the empirical method map g isometric if for optimal compression with orthonormal transform coding is established as Rate-Distortion Optimization (RDO) (Sul- hv, wip = Chdg(v), dg(w)ig(p) (2) livan & Wiegand, 1998). It is intuitive to regard data embed- ding to a low-dimensional latent space as an analogy of effi- holds for some constant C > 0, since √ Eq. (1) can be cient data compression. Actually, deep learning based image achieved by replacing g with g̃ = (1/ C)g. compression (DIC) methods with RDO (Ballé et al., 2018; Zhou et al., 2019) have been proposed and they achieve good 2. Related Work compression performance. Although it is not discussed in Ballé et al. (2018); Zhou et al. (2019), we guess that behind Flow based model: Flow based generative models generate the success of DIC, there should be theoretical relation to image with astonishing quality (Kingma & Dhariwal, 2018; RDO of convetional transform coding. Dinh et al., 2015). Flow mechanism explicitly takes Jaco- bian of x and z into account. The transformation function Hence, in this paper, we investigate the theoretical property z = f (x) is learned, calculating and storing Jacobian of and dig out the proof that RDO guides deep-autoencoder to x and z. Unlike ordinary autoencoder, which reverse z to have the orthonormal property. Based on this finding, we x with function g(·) different from f (·), inverse function propose an approach which enables isometric data embed- transforms z to x as x = f −1 (z). Since the model stores ding and allow us to comprehensive data analysis, named Jacobian, Px (x) can be estimated from Pz (z). However, Rate-Distortion Optimization Guided Autoencoder for Gen- in these approaches, the form of f (·) is limited so that the erative Analysis (RaDOGAGA). We show the validity of explicit calculation of Jacobian is manageable, such as f (·) RaDOGAGA in the following steps. can not reduce the dimension of x. (1) We show that RaDOGAGA has the following properties Data interpolation with autoencoder: For smooth data both theoretically and experimentally. interpolation, in Chen et al. (2018); Shao et al. (2018), a • Jacobian matrix between the data observation space (inner function learns to map latent variables to geodesic (shortest product space endowed with a metric tensor) and latent path in a manifold) space, in which the distance corresponds space forms constantly-scaled orthonormal system. Thus, to the metric of data space. In Geng et al. (2020); Pai et al. data can be embedded in a Euclidean latent space isomet- (2019), a penalty for the anisometricity of map is added rically. to training loss. Although these approaches may remedy scale inconsistency, they do not deal with PDF estimation. • Thanks to the property above, Pz (z) are proportional to Furthermore, the distance for the input data is assumed to the PDF of x in the data observation space. Thus, PDF be Euclidean distance and the cases for other distances are of x in the data observation space can be estimated by not considered. maximizing log-likelihood of parametric PDF Pz,ψ (z) in the low-dimensional Euclidean space. Deep image compression (DIC) with RDO: RD theory is a part of Shannon’s information theory for lossy com- (2) Thanks to (1), RaDOGAGA outperforms the current pression which formulates the optimal condition between state-of-the-art method in unsupervised anomaly detection information rate and distortion. The signal is decorrelated by task with four public datasets. orthonormal transformation such as Karhunen-Loève trans- form (KLT) (Rao & Yip, 2000) and discrete cosine transform Isometric Map and Notions of Differential Geometry (DCT). In RDO, a cost L = R + λD is minimized at given Given two Riemannian manifolds M ⊂ RM and N ⊂ RN , Lagrange parameter λ. Recently, DIC methods with RDO a map g : M → N is called isometric if (Ballé et al., 2018; Zhou et al., 2019) have been proposed. In these works, instead of orthonormal transform in the con- hv, wip = hdg(v), dg(w)ig(p) (1) ventional lossy compression method, a deep autoencoder is

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space trained for RDO. In the next section, we explain the idea of the rate-distortion trade-off at desirable rate can be realized RDO guided autoencoder and its relationship with VAE. as discussed in Alemi et al. (2018). Note that the beta-VAE and the RDO in image compression 3. Overview of RDO Guided Approach are analogous to each other. That is, β −1 , −Lkl , and Lrec correspond to λ, a rate R, and a distortion D respectively. 3.1. Derivation from RDO in Transform Coding However, the probability models of latent variables are quite Figure 1 shows the overview of our idea based on Rate- different. VAE uses a fixed prior distribution. This causes a Distortion Optimization (RDO) inspired by transform cod- non-linear scaling relationship between real data and latent ing. In the transform coding, the optimal method to en- variables. Figure 2 shows the conditions to achieve RDO in code data with Gaussian distribution is as follows (Goyal, both VAE and RaDOGAGA. In VAE, for RDO condition, 2001). At first, the data are transformed deterministically to a nonlinear scaling of the data distribution is necessary to decorrelated data using the orthonormal transform such as fit prior. To achieve it, Brekelmans et al. (2019) suggest to Karhunen-Loève transform (KLT) and discrete cosine trans- control a noise as a posterior variance precisely for each form (DCT). Then these decorrelated data are quantized channel. stochastically with uniform quantizer for all channels such Actually, as proved in Rolı́nek et al. (2019), in the optimal that quantization noise for each channel is equal. At last condition, Jacobian matrix of VAE forms orthogonal sys- optimal entropy encoding is applied to quantized data where tem, but the norm is not constant. In RaDOGAGA, uniform the rate can be calculated by the logarithm of symbol’s es- quantization noises are added to all channels. Instead, para- timated probability. From this fact, we have an intuition metric probability distribution should be estimated as a prior. that the columns of Jacobian matrix of autoencoder forms As a result, the Jacobian matrix forms orthonormal system orthonormal system if the data were compressed based on because both orthogonality and scaling normalization are RDO with uniform quantized noise and parametric distri- simultaneously achieved. As discussed above, the precise bution of latent variables. Inspired by this, we propose noise control in VAE and parametric prior optimization in autoencoder which scales latent variables according to the RaDOGAGA are essentially the same. Accordingly, com- definition of metrics of data. plexities in both methods are estimated to be at the same Model of Transform coding based on Rate-Distortion Theory degree. Uniform Image Orthonormal Compressed Quantization / Conditions to achieve Rate-Distortion Optimization Data Transform Decorrelated Data Entropy coding data Method PDF model Noise Jacobi Matrix Source Coding Channel Coding VAE with fixed Fixed as prior Variable for Orthogonal prior each data and (Variable Analogy between transform coding and RDO guided autoencoder channels scaling) Method Source Coding Channel Coding Rate-Distortion RDO guided Variable Uniform for all Orthonormal (Deterministic) (Stochastic) Relation Autoencoder parametric PDF data and (Constant Transform Condition: Condition: Result: channels scaling) coding Orthonormal Uniform Rate-Distortion Relationship between PDF, Noise, and Jacobian Matrix KLT / PCA / DCT quantization / Optimization is RDO guided autoencoder VAE transform Entropy coding achieved x x RDO guided Expected Result: Condition: Condition: Orthogonal Orthonormal (Variable Autoencoder Autoencoder with Uniform noise Rate-Distortion scaling) Fixed PDF as prior (Constant Parametric PDF scaling) orthonormal for all channel / based Loss Jacobi matrix Rate estimation function z z Posterior with varying variance Posterior with constant variance (variable quantization noise) ( uniform quantization noise) Figure 1. Overview of our idea. Orthonormal transformation and uniform quantization noise bring RDO. Our idea is that Figure 2. The condition of RDO in VAE and our method. In VAE, uniform quantization noise and RDO lead an antoencoder to be to fit the fixed prior (blue-line), data is transformed anisometrically orthonormal. with precisely controlled noise as a posterior variance (gray area). A wider distribution of noise make the PDF of transformed data smaller. In our method, a parametric prior distribution is estimated, and data is transformed isometrically with uniform noise. 3.2. Relationship with VAE There is a number of VAE studies taking rate-distortion trade-off into account. In VAEs, it is common to maximize 4. METHOD AND THEORETICAL ELBO instead of maximizing log-likelihood of Px (x) di- PROPERTIES rectly. In beta-VAE (Higgins et al., 2017), the objective 4.1. Method function LV AE is described as LV AE = Lrec − βLkl . Here Lkl is the KL divergence between encoder output and prior Our method is based on the RDO of the autoencoder for distribution, usually Gaussian distribution. By changing β, image compression proposed in Ballé et al. (2018) with

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space some modifications. In Ballé et al. (2018), the cost function decomposition, we can avoid the interference between better reconstruction and RDO trade-off durning the training. λ1 L = R + λD (3) controls the degree of reconstruction, and λ2 (' β −1 of beta-VAE) controls a scaling between data and latent spaces consists of (i) reconstruction error D between input data and respectively. decoder output with noise to latent variable and (ii) Rate R of latent variable. This is analogous to beta-VAE where h(·) in the second term of Eq. (7) is a monotonically in- λ = β −1 . creasing function. In experiments in this paper, we use h(d) = log(d). In the theory shown in Appendix A, better Figure 3 depicts the architecture of our method. The details reconstruction provide much rigid orthogonality. We find are given in the following. First, let x be an M -dimensional h(d) = log(d) is much appropriate for this purpose than domain data, RM be a data space and Px (x) be the PDF of h(d) = d as detailed in Appendix C. x. Then x is converted to an N -dimensional latent variable z in the latent space RN by encoder. Let fθ (x), gφ (z), and For metric function D(· , ·), a variety of function is ap- Pz,ψ (z) be parametric encoder, decoder, and PDF of latent plicable. As long as the function can be approximated by variable with parameters θ, φ, and ψ. Note that both encoder the following quadratic form in the neighborhood of x, the and decoder are deterministic while the encoder of VAE is property of the model holds: stochastic. Then latent variable z and decoded data x̂ are generated as bellow: D(x, x + ∆x) ' ∆x> A(x) ∆x. (8) z = fθ (x), x̂ = gφ (z). (4) Here, ∆x is an arbitrary infinitesimal variation of x, A(x) is an M ×M positive definite matrix depending on x. When Let ∈ RN be a noise vector to emulate uniform quantiza- D(· , ·) is square of L2 norm, A(x) is the identity ma- tion, where each component is independent from others and trix. For another instance, a cost with structure similarity has an equal average 0 and an equal variance σ 2 : (SSIM (Wang et al., 2004)) and binary cross entropy (BCE) can also be approximated in a quadratic form. This is ex- = (1 , 2 , ..N ), E [i ] = 0, E [i j ] = δij σ 2 , (5) plained in Appendix D. By deriving parameters that mini- mize the average of Eq. (7) according to x ∼ Px (x) and where δij denotes the Kronecker’s delta. Then, given the ∼ P (), the encoder, decoder, and probability distribu- sum of latent variable z and noise , the decoder output x̆ tion of the latent space are trained as is obtained as x̆ = gφ (z + ). (6) θ, φ, ψ = arg min(Ex∼Px (x), ∼P () [ L ]). (9) θ,φ,ψ This is analogous to the stochastic sampling and decoding procedure in VAE. Here, the cost function is defined based on Eq. (3) with some modifications as follows: ෝ)) ℎ( ( , Encoder Decoder L = − log(Pz,ψ (z)) + λ1 h (D (x, x̂)) + λ2 D (x̂, x̆) . (7) Input data ෝ The first term corresponds to the estimated rate of the la- D(ෝ ) , tent variable. We can use arbitrary probabilistic model as Parametric PDF of + Decoder , , ( ) ( + ) Pz,ψ (z). For example, Ballé et al. (2018) uses univariate = −log( , , ) QN ~ ( ) independent (factorized) model Pz,ψ (z) = i=1 Pzi ,ψ (zi ). In this work, a parametric function cψ (zi ) outputs cumula- tive distribution function of zi . A rate for quantized symbol Figure 3. Architecture of RaDOGAGA is calculated by cψ (z + 12 )−cψ (z − 21 ), assuming the symbol is quantized with the side length of 1. To model by GMM 4.2. Theoretical Properties like Zong et al. (2018) is another instance. In this section, we explain the theoretical properties of the D(x(1) , x(2) ) in the second and the third term is a met- method. To show the essence in a simple form, we first ric function between points x(1) and x(2) ∈ RM . Actu- (formally) consider the case M = N . The theory for M > ally, the second and the third term is decomposition of N is then outlined. All details are given in Appendices. D (x, x̆) ' D (x, x̂) + D (x̂, x̆) as shown in Rolı́nek et al. (2019). The second term in Eq. (7) purely calculate re- We begin with examining the condition to minimize the construction loss as an autoencoder. In the RDO, the con- loss function analytically, assuming that the reconstruc- sideration is trade-off between rate (the first term) and the tion part is trained enough so that x ' x̂. In this case, distortion by the quantization noise (the third term). By this the second term in Eq. (7) can be ignored. Let J (z) =

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space ∂x/∂z ∈ RN ×N be the Jacobian matrix between the data determined by A(x) (data observation space). In practice, space and the latent space, which is assumed to be full- the generative probability of x is typically considered in rank Pat every point. Then, x̆ − x̂ can be approximated as the Euclidean space. Therefore, it is useful to convert the N ´ = i=1 i (∂x/∂zi ) by Taylor expansion. By applying PDF of x considered in the data observation space to that E[i j ] = σ 2 δij and Eq. (8), the expectation of the third considered in the Euclidean space. term in Eq. (7) is re-written as First, we consider the case M = N . Note that Eq. (13) N > can also be expressed as follows: J (z)> A(x)J (z) = > X ∂x ∂x = σ2 (1/2λ2 σ 2 )IN (IN is the N × N identity matrix). We get E ´ A(x)´ A(x) . (10) ∼P () j=1 ∂zj ∂zj the following equation by taking the determinants of the both sides of this equation and using the properties of the de- As is well known, the relation between Pz (z) and Px (x) terminant: | det(J (z))| = (1/2λ2 σ 2 )N/2 det(A(x))−1/2 . in such case is described as Pz (z) = | det(J (z))|Px (x). QN Note that det(A(x)) = j=1 αj (A(x)), where 0 < Expectation of L in Eq. (7) is approximated as α1 (A(x)) ≤ · · · ≤ αN (A(x)) are the eigenvalues of E [L] ' − log(| det(J (z))|) − log(Px (x)) A(x). Thus, Pz (z) and Px (x) (PDF considered in the ∼P () Euclidean space) is related in the following form: N > 2 X ∂x ∂x N2 Y N − 12 + λ2 σ A(x) . (11) 1 ∂zj ∂zj Pz (z) = αj (A(x)) Px (x). (15) j=1 2λ2 σ 2 j=1 By differentiating Eq. (11) by ∂x/∂zj , the following equa- tion is derived as a condition to minimize the expected loss: To consider the relationship of Pz (z) and Px (x) for M > N , we follow the manifold hypothesis and assume the sit- ∂x 1 2λ2 σ 2 A(x) = J˜(z):,j , (12) uation where the data x substantially exists in the vicin- ∂zj det(J (z)) ity of a low dimensional manifold Mx , and z ∈ RN can sufficiently capture its feature. In such case, we where J˜(z):,j ∈ RN is j-th column vector of the cofac- can regard the distribution of x away from Mx is neg- tor matrix of J (z). According to the trait of cofactor ma- ligible and the ratio of Pz (z) and Px (x) is equivalent trix, (∂x/∂zi )> J˜(z):,j = δij det(J (z)) holds. Thus, the to that of the volumes of corresponding regions in RN following relation is obtained by multiplying Eq. (12) by and RM . This ratio is shown to be Jsv (z), the product (∂x/∂zi )> from the left and rearranging the results: of the singular values of J (z), and we get the relation ∂x > ∂x 1 Pz (z) = Jsv (z)Px (x). We can further show that Jsv (z) QN A(x) = δij . (13) is also (1/2λ2 σ 2 )N/2 ( j=1 αj (A(x)))−1/2 under a cer- ∂zi ∂zj 2λ2 σ 2 tain condition that includes the case A(x) = IM (see This means that the columns of the Jacobian matrix of two Appendix B). Consequently, Eq. (15) holds even for the spaces form a constantly-scaled orthonormal system with case M > N . As a result, when we obtain a param- respect to the inner product defined by A(x) for all z, x. eter ψ attaining Pz,ψ (z) ' Pz (z) by training, Px (x) is proportional to Pz,ψ (z) with metric depending scaling Given tangent vectors vz and wz in the tangent space of QN ( j=1 αj (A(x)))1/2 as: RN at z, let vx and wx be corresponding tangent vectors in the data space. The following relation holds due to Eq. (13), N Y 12 which means the map is isometric in the sense of Eq. (2): Px (x) ∝ αj (A(x)) Pz,ψ (z). (16) N X N > j=1 X ∂x ∂x vx A(x)wx = vzi A(x) wzj In the case of A(x) = IM (or more generally κIM for a i=0 j=0 ∂zi ∂zj constant κ > 0 ), Px (x) is simply propotional to Pz,ψ (z): N 1 X 1 = vz w z = vz · wz . (14) Px (x) ∝ Pz,ψ (z). (17) 2λ2 σ 2 i=0 i i 2λ2 σ 2 Even for the case M > N , equations in the same form as 5. Experimental Validations Eqs. (13) and (14) can be derived essentially in the same We here show the properties of our method experimentally. manner (Appendix A), namely, RaDOGAGA achieves iso- In Section 5.1, we examine the isometricity of the map metric data embedding for the case M > N as well. as in Eq. (2) with real data. In Section 5.2, we confirm Since the data embedding is isometric, Pz (z) is propor- the proportionality of PDFs as in Eq. (15) with toy data. In tional to that of x considered in the inner product space Section 5.3, the usefulness is validated in anomaly detection.

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space × − × − 5.1. Isometric Embedding 1.5 6 r = 0.71 r = 0.63 beta-VAE ( ) ( ) In this section, we confirm that our model can embed data 0 0 in the latent space isometrically. First, randomly picked data point x is mapped to z(= f (x)). Then, let vz be a -1.5 -6 small tangent vector in the latent space. The corresponding − . 0 . − . 0 . � � tangent vector in the data space vx is approximated by g(z + × − × − 3 1 RaDOGAGA vz )−g(z). Given randomly generated two different tangent r = 0.97 r = 0.97 ( ) ( ) vectors vz and wz , vz · wz is compared with vx > A(x)wx 0 0 for randomly picked up x. We use CelebA dataset (Liu et al., 2015)1 which consists of 202,599 celebrity images. -3 -1 Images are center-cropped with a size of 64 x 64. − . 0 . − . 0 . � � (a) M SE (b) 1 − SSIM 5.1.1. CONFIGURATION In this experiment, factorized distributions (Ballé et al., Figure 4. Plot of vz · wz (horizontal axis) and vx > A(x)wx (ver- 2018) is used to estimate Pz,ψ (z) 2 . Encoder part is con- tical axis). In beta-VAE (top row), the correlation is week while in structed with 4 convolution (CNN) layers and 2 fully con- our method (bottom row) we can observe proportionality. nected (FC) layers. For CNN layers, the kernel size is 9×9 for the first one and 5×5 for the rest. The dimension is 1.0 1.0 Variance of z Variance of z 64, stride size is 2, and activation function is generalized divisive normalization (GDN) (Ballé et al., 2016), which is suitable for image compression, for all layers. The di- mension of FC layers is 8192 and 256. For the first one, 0.0 0.0 sof tplus function is attached. The decoder part is the in- 0 127 255 0 127 255 z z verse form of the encoder. For comparison, we evaluate beta-VAE with the same form of autoencoder with 256- (a) beta-VAE (b) RaDOGAGA dimensional z. In this experiment, we test two different met- Figure 5. Variance of z. In beta-VAE, variance of all dimension is 1 rics M SE, where A(x) = M IM . and 1 − SSIM , where trained to be 1. In RaDOGAGA, the energy is concentrated in few A(x) = 1 2µx 2 Wm + 1 2σx 2 Wv . Wm ∈ RM ×M is a ma- dimensions. trix such that allelement is M12 and Wv = M 1 IM − Wm . Note that, in practice, 1 − SSIM for an image is calcu- vx > A(x)wx are almost proportional regardless of metric lated with a small window. In this experiment the window function. The correlation coefficients r reach to 0.97, while size is 11×11, and this local calculation is done for entire that of VAE are around 0.7. It can be seen that our method image with the stride size of 1. The cost is the average enables isometric embedding to a Euclidean space even of local values. For the second term in Eq. (7), h(d) is with this large scale real dataset. For the reader with interest, log(d) and A(x) = M 1 IM . For beta-VAE, we set β −1 as we provide the experimental result with MNIST dataset in 5 4 1 × 10 and 1 × 10 regarding to the training with M SE Appendix F. and 1 − SSIM respectively. For RaDOGAGA, (λ1 , λ2 ) is (0.1, 0.1) and (0.2, 0.1). Optimization is done by Adam 5.1.3. C ONSISTENCY TO NASH E MBEDDING T HEOREM optimizer (Kingma & Ba, 2014) with learning rate 1 × 10−4 . All models are trained so that the 1 − SSIM between input As explained in Introduction, the Nash embedding theo- and reconstructed images is around 0.05. rem and the manifold hypothesis is behind our exploration for isometric embedding of real data. Here, the question 5.1.2. R ESULTS is whether the trained model satisfied the condition that dim Mx < N . With RaDOGAGA, we can confirm it by Figure 4 depicts vz · wz (horizontal axis) and vx > A(x)wx observing the variance of each latent variable. Because the (vertical axis). The top row is result of VAE and the bottom Jacobian matrix forms orthonormal system, RaDOGAGA row shows that of our model. In our method, vz · wz and can work like principal component analysis (PCA) and eval- 1 uates the importance of each latent variable. The theoretical http://mmlab.ie.cuhk.edu.hk/projects/ CelebA.html proof for this property is described in Appendix E. Figure 5 2 Implementation is done with a library for Tensor- shows the variance of each dimension of the model trained Flow provided at https://github.com/tensorflow/ with M SE. The variance concentrates on the few dimen- compression with default parameters. sions. This means that RN is large enough to represent the data. Figure 13a shows decoder outputs when each com-

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space − + − + 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 (a) beta-VAE (b) RaDOGAGA Figure 6. Latent space traversal. In beta-VAE, some latent variables does not influence on the visual so much even though they have almost the same variance. In RaDOGAGA, all latent variables with large variance have important information for image. ponent zi is traversed from −2σ to 2σ, fixing rest of z as points s = (s1 , s2 ...s10,000 ) from three-dimensional GMM mean µ. Note the index i is arranged in descending order consists of three mixture-components with mixture weight of σ 2 . Here, σ 2 and µ for i-th dimension of z(= f (x)) are π = (1/3, 1/3, 1/3), mean µ1 = (0, 0, 0), µ2 = (15, 0, 0), V ar[zi ] and E[zi ] respectively with all data samples. From µ2 = (15, 15, 15), and covariance Σk = diag(1, 2, 3). k is the top, each row corresponds to z0 , z1 , z2 ..., and the center the index for mixture-component. Then, we scatter s with uniform random noise u ∈ R3×16 , udm ∼ Ud − 12 , 21 , column is mean. In Fig. 6b, the image changes visually in any dimension of z, while in Fig. 6a, depending on the where d and m are index for dimension of sampled data and dimension i, there are cases where no significant changes scattered data. The ud s are uncorrelated withP each other. 3 can be seen (such as z1 , z2 , z3 , and so on). In summary, we We produce 16-dimensional input data as x = d=1 ud sd can qualitatively observe that V ar[zi ] corresponds to the with a sample number of 10,000 in the end. The appear- eigenvalue of PCA, that is, a latent variable with larger σ ance probability of input data Px (x) equals to generation have bigger impact on image. probability of s. These results suggest the important information to express 5.2.1. CONFIGURATION data is concentrated in the lower few dimension and the dimension number of 256 was large enough to satisfy In the experiment, we estimate the Pz,ψ (z) using GMM dim Mx < N . To confirm the sufficiency of the dimension with parameter ψ as in DAGMM (Zong et al., 2018). Instead is difficult in beta-VAE because the σ 2 should be 1 for all of EM algorithm, GMM parameters are learned using Esti- dimension since it is trained to fit to prior. However, some mation Network (EN), which consists of multi-layer neural dimension has a large impact on the image, meaning the σ network. When the GMM consists of N -dimensional Gaus- does not work as the measure of importance. sian distribution NN (z; µ, Σ) with K mixture-components, and L is the size of batch samples, EN outputs the mixture- We believe this PCA-like trait is very useful for the interpre- components membership prediction as K-dimensional vec- tation of latent variables. For instance, if the metric function tor γ b as follows: were designed so as to reflect semantics, an important vari- able for a semantics is easily found. Furthermore, we argue p = EN (z; ψ), γ b = sof tmax(p). (18) this is a promising way to capture the minimal feature to express data, that is one of the goals of machine learning. Then, k-th mixture weight π bk , mean µ b k , covariance Σb k, PL R of z are further and entropy Pcalculated P as follows: 5.2. PDF Estimation with Toy Data π bk = l=1 γ blk /L, µ bk = L l=1 γ b lk z l / L l=1 γ blk , P L > P L In this section, we describe our experiment using toy data to Σ bk = γ Plk l l=1 b (z − µ b k )(zl − µ b ) k / γ l=1 lk , b K demonstrate whether the probability density of the input data R = − log k=1NN (z; µbk , Σ b k) . Px (x) and that of estimated in the latent space Pz,ψ (z) are proportional to each other as in theory. First, we sample data Overall network is trained by Eqs. (7) and (9). In this ex- periment, we set D(x1 , x2 ) as square of the L2 norm be-

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space PDF cause the input data is generated obeying the PDF in the (High) S3 × − Euclidean space. We test two types of h(·), h(d) = d S1 6 z2 and h(d) = log(d) and denote models trained with these S2 2 h(·) as RaDOGAGA(d) and RaDOGAGA(log(d)) respec- -2 -9 (Low) -14 z1 z0 tively. We used DAGMM as a baseline method. DAGMM 0 also consists of encoder, decoder, and EN. In DAGMM, (a) Input source s (b) DAGMM to avoid falling into the trivial solution that entropy is 2.0 1.0 minimized when the diagonal component of the covari- z2 z2 ance matrix is 0, the inverse of the diagonal component 0.0 0.4 b = PK PN Σ P (Σ) b −1 -0.5 -0.9 k=1 i=1 ki is added to the cost: -2.0 z1 0.0 z1 z0 0.0 -3.0 z0 0.6 1.4 (c) RaDOGAGA(d) (d) RaDOGAGA(log(d)) L = kx − x̂k2 + λ1 (− log(Pz,ψ (z))) + λ2 P (Σ). b (19) Figure 7. Plot of source of input data s and latent variables z. The color bar located left of (a) corresponds to normalized PDF. Both DAGMM and RaDOGAGA capture three mixture-components, but The only differences between our model and DAGMM is the PDF in DAGMM looks different from input data source. Points with high PDF do not concentrate on the center of the cluster regulation term P (Σ)b is replaced by D(x̂, x̆). The model especially in upper right one. size (the number of parameter) is the same Thus, model complexity such as the number of parameter is the same. , ( ) , ( ) , ( ) For both RaDOGAGA and DAGMM, the autoencoder part 400 800 700 is constructed with fully connected (FC) layers with sizes 200 400 350 of 64, 32, 16, 3, 16, 32, and 64. For all FC layers except for the last of the encoder and the decoder, we use tanh 0 0.02 0 0.02 0 0.02 ( ) ( ) ( ) as the activation function. The EN part is also constructed with FC layer with a size of 10, 3. For the first layer, we use (a) DAGMM (b) Ours (d) (c) Ours (log(d)) the tanh as activation function and dropout (ratio=0.5). For the last one, sof tmax is used. (λ1 , λ2 ) is set as (1 × 10−4 , Figure 8. Plot of Px (x) vs Pz,ψ (z). In RaDOGAGA, Px (x) and 1 × 10−9 ), (1 × 106 , 1 × 103 ) and (1 × 103 , 1 × 103 ) Pz,ψ (z) are proportional while we can not see that in DAGMM. for DAGMM, RaDOGAGA(d) and RaDOGAGA(log(d)) respectively. We optimize all models by Adam with learning rate 1 × 10−4 . We set σ 2 as 1/12. 5.3. Anomaly Detection Using Real Data 5.2.2. R ESULTS We here examine whether the clear relationship between Px (x) and Pz,ψ (z) is useful in anomaly detection in which Figure 7 displays the distribution of input data source s and PDF estimation is key issue. We use four public datasets‡ : latent variable z. Even though both methods can capture KDDCUP99, Thyroid, Arrhythmia, and KDDCUP-Rev. that s is generated from three mixture-components, there The (instance number, dimension, anomaly ratio(%)) of is a difference in PDFs. Since the data is generated from each dataset is (494021, 121, 20), (3772, 6, 2.5), (452, 274, GMM, the value of PDF gets higher as the sample gets 15), and (121597, 121, 20). Detail of datasets is given in closer to the centers of clusters. Although, in DAGMM, Appendix G. this tendency looks distorted. This difference is further demonstrated in Fig. 8 which shows a plot of Px (x) (hori- 5.3.1. EXPERIMENTAL SETUP zontal axis) against Pz,ψ (z) (vertical axis). In our method, For a fair comparison with previous works, we follow the Px (x) and Pz,ψ (z) are almost proportional to each other as setting in Zong et al. (2018). Randomly extracted 50% in theory while we cannot observe such proportionality in of the data is assigned to training and the rest to testing. DAGMM. This difference is quantitatively obvious as well. During training, only normal data is used. During the test, That is, correlation coefficients between Px (x) and Pz,ψ (z) the entropy R for each sample is calculated as the anomaly are 0.882 (DAGMM), 0.997 (RaDOGAGA(d)), and 0.998 score, and if the anomaly score is higher than a threshold, (RaDOGAGA(log(d))). We can also observe that in RaDO- it is detected as an anomaly. The threshold is determined GAGA(d), there is a slight distortion in its proportionality by the ratio of anomaly data in each data set. For example, in the area of Px (x) < 0.01. When Pz,ψ (z) is sufficiently in KDDCup99, data with R in the top 20 % is detected as fitted, h(d) = log(d) makes Px (x) and Pz,ψ (z) be propor- tional more rigidly. More detail is described in Appendix C. ‡ Datasets can be downloaded at https://kdd.ics.uci. edu/ and http://odds.cs.stonybrook.edu.

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space Table 1. Average and standard deviations (in brackets) of Precision, Recall and F1 Dataset Methods Precision Recall F1 ALAD∗ 0.9427 (0.0018) 0.9577 (0.0018) 0.9501 (0.0018) INRF∗ 0.9452 (0.0105) 0.9600 (0.0113) 0.9525 (0.0108) GMVAE∗ 0.952 0.9141 0.9326 KDDCup DAGMM 0.9427 (0.0052) 0.9575 (0.0053) 0.9500 (0.0052) RaDOGAGA(d) 0.9550 (0.0037) 0.9700 (0.0038) 0.9624 (0.0038) RaDOGAGA(log(d)) 0.9563 (0.0042) 0.9714 (0.0042) 0.9638 (0.0042) GMVAE∗ 0.7105 0.5745 0.6353 DAGMM 0.4656 (0.0481) 0.4859 (0.0502) 0.4755 (0.0491) Thyroid RaDOGAGA(d) 0.6313 (0.0476) 0.6587 (0.0496) 0.6447 (0.0486) RaDOGAGA(log(d)) 0.6562 (0.0572) 0.6848 (0.0597) 0.6702 (0.0585) ALAD∗ 0.5000 (0.0208) 0.5313 (0.0221) 0.5152 (0.0214) GMVAE∗ 0.4375 0.4242 0.4308 DAGMM 0.4985 (0.0389) 0.5136 (0.0401) 0.5060 (0.0395) Arrythmia RaDOGAGA(d) 0.5353 (0.0461) 0.5515 (0.0475) 0.5433 (0.0468) RaDOGAGA(log(d)) 0.5294 (0.0405) 0.5455 (0.0418) 0.5373 (0.0411) DAGMM 0.9778 (0.0018) 0.9779 (0.0017) 0.9779 (0.0018) KDDCup-rev RaDOGAGA(d) 0.9768 (0.0033) 0.9827 (0.0012) 0.9797 (0.0015) RaDOGAGA(log(d)) 0.9864 (0.0009) 0.9865 (0.0009) 0.9865 (0.0009) ∗ Scores are cited from Zenati et al. (2018) (ALAD), Song & Ou (2018) (INRF), and Liao et al. (2018) (GMVAE). an anomaly. As metrics, precision, recall, and F1 score are except for Arrhythmia. This result suggests a much rigid calculated. We run experiments 20 times for each dataset orthonormality likely to bring better performance. split by 20 different random seeds. 6. Conclusion 5.3.2. BASELINE METHODS In this paper, we propose RaDOGAGA which embeds data As in the previous section, DAGMM is taken as a baseline in a low-dimensional Euclidean space isometrically. With method. We also compare with the scores reported in previ- RaDOGAGA, the relation of latent variables and data is ous works that conducted the same experiment (Zenati et al., quantitatively tractable. For instance, Pz,ψ (z) obtained by 2018; Song & Ou, 2018; Liao et al., 2018). the proposed method is proportional to the PDF of x in the data observation space. Furthermore, thanks to these prop- 5.3.3. CONFIGURATION erties, we achieve state-of-the-art performance in anomaly As in Zong et al. (2018), in addition to the output from the detection. 0 0 encoder, kx−x kxk2 k2 and kxkx·x 0 2 kx k2 are concatenated to z. It is Although we focused on the PDF estimation as a practical sent to EN. Note that z is sent to the decoder before con- task in this paper, the properties of RaDOGAGA will benefit catenation. Other configuration except for hyper parameter a variety of applications. For instance, data interpolation is same as in the previous experiment. Hyper parameter will be easier because a straight line in the latent space is for each dataset is described in Appendix G. Input data is geodesic in the data space. It also may help the unsupervised max-min normalized. or semi-supervised learning since distance of z reliably re- flects distance of x. Furthermore, our method will promote 5.3.4. R ESULTS disentanglement because thanks to the orthonormality, PCA- Table 1 reports the average scores and standard deviations like analysis is possible. (in brackets). Compared to DAGMM, RaDOGAGA has a To capture essential features of data, it is important to fairly better performance regardless of types of h(·). Note that, evaluate the significance of latent variables. Since isometric our model does not increase model complexity at all. Sim- embedding promise this fairness, we believe RaDOGAGA ply introducing the RDO mechanism to the autoencoder is will bring a Breakthru for generative analysis. As future effective for anomaly detection. Moreover, our approach work, we explore the usefulness in various tasks mentioned achieves the highest performance compared to other meth- above. ods. RaDOGAGA(log(d)) is superior to RaDOGAGA(d)

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space Acknowledgement Geng, C., Wang, J., Chen, L., Bao, W., Chu, C., and Gao, Z. Uniform interpolation constrained geodesic learning on We gratitude to Naoki Hamada, Ramya Srinivasan, Kentaro data manifold. arXiv preprint arXiv:2002.04829, 2020. Takemoto, Moyuru Yamada, Tomoya Iwakura, and Hiyori Yoshikawa for constructive discussion. Goyal, V. K. Theoretical foundations of transform coding. IEEE Signal Processing Magazine, 18(5):9–21, 2001. References Han, Q. and Hong, J.-X. Isometric Embedding of Rieman- Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., nian Manifolds in Euclidean Spaces. American Mathe- and Murphy, K. Fixing a broken ELBO. In Proceedings matical Society, 2006. of the International Conference on Machine Learning (ICML), pp. 159–168, 2018. Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Ballé, J., Laparra, V., and Simoncelli, E. Density modeling Learning basic visual concepts with a constrained vari- of images using a generalized normalization transforma- ational framework. In Proceedings of the International tion. In Proceedings of the International Conference on Conference on Learning Representations (ICLR), 2017. Learning Representations (ICLR), 2016. Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge Ballé, J., Minnen, D., Singh, S., Hwang, S. J., and Johnston, University Press, 2nd edition, 2013. N. Variational image compression with a scale hyper- prior. In Proceedings of the International Conference on Johnson, M., Duvenaud, D., Wiltschko, A. B., Datta, S. R., Learning Representations (ICLR), 2018. and Adams, R. P. Composing graphical models with neural networks for structured representations and fast Bengio, Y., Courville, C., and Vincent, P. Representation inference. In Proceedings of the Advances in Neural learning: A review and new perspectives. IEEE Transac- Information Processing Systems, 2016. tions on Pattern Analysis and Machine Intelligence, 35 (8):798–1828, 2013. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Berger, T. (ed.). Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice Hall, 1971. Kingma, D. P. and Dhariwal, P. Glow: Generative flow Bernstein, M., De Silva, V., Langford, J. C., and Tenenbaum, with invertible 1x1 convolutions. In Proceedings of the J. B. Graph approximations to geodesics on embedded Advances in Neural Information Processing Systems, pp. manifolds. Technical report, Stanford University, 2000. 10215–10224, 2018. Brekelmans, R., Moyer, D., Galstyan, A., and Ver Steeg, Kingma, D. P. and Welling, M. Auto-encoding variational G. Exact rate-distortion in autoencoders via echo noise. Bayes. In Proceedings of the International Conference In Proceedings of the Advances in Neural Information on Learning Representations (ICLR), 2014. Processing Systems, pp. 3889–3900, 2019. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient- Chalapathy, R. and Chawla, S. Deep learning for anomaly based learning applied to document recognition. Proceed- detection: A survey. arXiv preprint arXiv:1901.03407, ings of the IEEE, 86(11):2278–2324, 1998. 2019. Liao, W., Guo, Y., Chen, X., and Li, P. A unified unsuper- Chen, N., , Klushyn, A., Kurle, Jiang, X., Bayer, J., and vised Gaussian mixture variational autoencoder for high van der Smagt, P. Metrics for deep generative models. In dimensional outlier detection. In Proceedings of the IEEE Proceedings of the International Conference on Artifcial International Conference on Big Data, pp. 1208–1217, Intelligence and Statistics (AISTATS), pp. 1540–1550, 2018. 2018. Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear in- face attributes in the wild. In Proceedings of the IEEE dependent components estimation. In Proceedings of the International Conference on Computer Vision (ICCV), pp. International Conference on Learning Representations 3730–3738, 2015. (ICLR) Workshop, 2015. McQueen, J., Meila, M., and Joncas, D. Nearly isometric Dua, D. and Graff, C. UCI machine learning repository. embedding by relaxation. In Advances in Neural Infor- http://archive.ics.uci.edu/ml, 2019. mation Processing Systems, pp. 2631–2639, 2016.

Rate-Distortion Optimization Guided Autoencoder for Isometric Embedding in Euclidean Latent Space Pai, G., Talmon, R., Alex, B., and Kimmel, R. DIMAL: Deep isometric manifold learning using sparse geodesic sampling. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 819–828, 2019. Rao, K. R. and Yip, P. (eds.). The Transform and Data Compression Handbook. CRC Press, Inc., 2000. Rolı́nek, M., Zietlow, D., and Martius, G. Variational au- toencoders pursue PCA directions (by accident). In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12406–12415, 2019. Shao, H., Kumar, A., and Fletcher, P. T. The Riemannian geometry of deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 315–323, 2018. Song, Y. and Ou, Z. Learning neural random fields with inclusive auxiliary generators. arXiv preprint arXiv:1806.00271, 2018. Strang, G. Linear Algebra and its Applications. Cengage Learning, 4th edition, 2006. Sullivan, G. J. and Wiegand, T. Rate-distortion optimization for video compression. IEEE Signal Processing Maga- zine, 15(6):74–90, 1998. Teoh, H. S. Formula for vector rotation in arbitrary planes in Rn . http://eusebeia.dyndns.org/ 4d/genrot.pdf, 2005. Tschannen, M., Bachem, O., and Lucic, M. Recent ad- vances in autoencoder-based representation learning. In Proceedings of the Third Workshop on Bayesian Deep Learning (NeurIPS 2018 Workshop), 2018. Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. Zenati, H., Romain, M., Foo, C.-S., Lecouat, B., and Chan- drasekhar, V. Adversarially learned anomaly detection. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 727–736, 2018. Zhou, J., Wen, S., Nakagawa, A., Kazui, K., and Tan, Z. Multi-scale and context-adaptive entropy model for im- age compression. In Proceedings of the Workshop and Challenge on Learned Image Compression (CVPR 2019 Workshop), pp. 4321–4324, 2019. Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., and Chen, H. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

You can also read