Generative Zero-shot Network Quantization


 Xiangyu He Qinghao Hu Peisong Wang Jian Cheng
 NLPR, CASIA
 xiangyu.he@nlpr.ia.ac.cn
arXiv:2101.08430v1 [cs.CV] 21 Jan 2021

Abstract

Convolutional neural networks are able to learn realistic image priors from numerous training samples in low-level image generation and restoration [66]. We show that, for high-level image recognition tasks, we can further reconstruct "realistic" images of each category by leveraging intrinsic Batch Normalization (BN) statistics without any training data. Inspired by the popular VAE/GAN methods, we regard the zero-shot optimization of synthetic images as generative modeling that matches the distribution of BN statistics. The generated images then serve as a calibration set for the subsequent zero-shot network quantization. Our method meets the need to quantize models built on sensitive information where, e.g., due to privacy concerns, no data is available. Extensive experiments on benchmark datasets show that, with the help of generated data, our approach consistently outperforms existing data-free quantization methods.

1. Introduction

Deep convolutional neural networks have achieved great success in several computer vision tasks [26, 61]; however, we still understand little about why they perform well. Plenty of pioneering works aim at peeking inside these networks. Feature visualization [57, 51, 56] is a group of mainstream methods that optimize an input image from noise to elicit a particular interpretation. The generated images illustrate how neural networks build up their understanding of images [57] and, as a byproduct, open a path towards data-free model compression [70].

Network quantization is an efficient and effective way to compress deep neural networks into a small memory footprint and low latency. It is common practice to introduce a calibration set to quantize activations, and, to recover the degraded accuracy, training-aware quantization even requires re-training on the labeled dataset. However, in real-world applications, the original (labeled) data is often not available due to privacy and security concerns. In this case, zero-shot/data-free quantization becomes indispensable. The question that follows is how to sample data x from the finite dataset X in the absence of the original training data.

It is intuitive to introduce noise u ∼ N(0, 1) as the input data to estimate the distributions of the intermediate layers. Unfortunately, since the single-Gaussian assumption can be too simple, the results for low-bit activation quantization are far from satisfactory [69]. Due to the zero-shot setting, it is also hard to apply the well-developed GANs to image synthesis without learning the image prior from the original dataset. Very recently, a large body of work suggests that the running mean µ and variance σ² in the Batch Normalization layers have captured the prior distribution of the data [11, 25, 15, 69]. The squared loss on µ and σ² (details in Equations (11) and (10)) coupled with a cross-entropy loss achieves empirical success in zero-shot network quantization. Though the performance has been further improved over early works [7, 47, 57, 56] such as DeepDream [51], it remains unclear why these learning targets should lead to meaningful data after training. Therefore, instead of directly presenting several loss functions, we hope to better describe the training process via generative modeling, which might provide another insight into zero-shot network quantization.

In this work, we consider generative modeling that deals with the distributions of the mean and variance in Batch Normalization layers [35]. That is, for some random data augmentations, the mean µ and variance σ² generated by the synthesized images I should, with high probability, look like their counterparts extracted from real images. Recently developed generative models such as GANs commonly capture p(·) rather than a single image, whereas we regard the synthesized images I as model parameters optimized in zero-shot learning. The input transformations introduce the randomness that allows sampling of µ and σ². Besides, our method presents a potential interpretation of the popular Batch Normalization matching loss [11, 25, 15, 69, 70]. Due to the insufficient sampling in each iteration, we further propose a prior regularization term to narrow the parameter space of I. Since the accuracy drop of low-bit activation quantization heavily relies on the image quality of the calibration set, we conduct extensive experiments on the zero-shot network quantization task. The 4-bit networks, quantizing both weights and activations, show consistent improvements over baseline methods on both CIFAR and ImageNet. Hopefully, this work is not restricted to image synthesis but also sheds some light on the interpretability of CNNs by generating interpretable input images that agree with the deep networks' internal knowledge.

Figure 1: The standard VAE represented as a graphical model (left subfigure), i.e., sampling from z N times to generate something similar to X with fixed parameters θ [20]. In this work, we regard the synthetic samples/images I as model parameters θ to be optimized during the training process. We have a family of deterministic functions f_I(z) parameterized by I. Since z is random, f_I(z) is a random variable¹. We hope to optimize I such that, for every z_i drawn from p(z), f_I(z_i) causes the pre-trained network to generate µ, σ² and, with high probability, these random variables will be like the Batch-Normalization statistics [35] generated by real images.

¹ Given z_i, we have a deterministic function f_I(z_i) which applies flipping/jitter/shift to the synthetic images I according to z_i. We can sample z_i from the probability density function p(z) N times to allow backpropagation.

2. Related Work

Model quantization has been one of the workhorses in industrial applications, as it enjoys a low memory footprint and low inference latency. Due to its great practical value, low-bit quantization has become popular in the recent literature.

Training-aware quantization. Previous works [52, 24, 34, 75] mainly focus on recovering the accuracy of the quantized model via backward propagation, i.e., label-based fine-tuning. Since the training process is similar to its floating-point counterpart, follow-up works further prove that low-bit networks can still achieve performance comparable to full-precision networks by training from scratch [18, 74, 45, 36]. Ultra-low-bit networks such as binary [23, 17, 33, 60] and ternary [2, 13, 41, 76], which benefit from bitwise operations that replace the computing-intensive MACs, form another group of methods. Leading schemes have reached less than five points of accuracy drop on ImageNet [42, 28, 48, 38]. While training-aware methods achieve good performance, they suffer from the inevitable re-training with enormous amounts of labeled data.

Label-free quantization. Since a sufficiently large open training dataset is inaccessible for many real-world applications, such as medical diagnosis [19], drug discovery, and toxicology [10], it is imperative to avoid retraining or to require no training data. Label-free methods take a step forward by relying only on limited unlabeled data [27, 53, 4, 73, 3, 16]. Most works share the idea of minimizing the quantization error or matching the distribution of full-precision weights via the quantized parameters. [4, 22, 67, 53] observe that the changes of the layer outputs play a core role in the accuracy drop. By introducing the "bias correction" technique (minimizing the difference between E[Wx] and E[Ŵx]), the performance of the quantized model can be further improved.

Zero-shot quantization. Recent works show that deep neural networks pre-trained on classification tasks can learn prior knowledge of the underlying data distribution [66]. The statistics of intermediate features, i.e., "metadata", are assumed to be provided and help to discriminate samples from the same image domain as the original dataset [7, 44]. To circumvent the need for extra "metadata", [55] treats the weights of the last fully-connected layer as class templates and then exploits the class similarities learned by the teacher network via Dirichlet sampling. Another group of methods focuses on the stored running statistics of the Batch Normalization layer [35]. [70, 25] produce synthetic images without generators by directly optimizing the input images through backward propagation. Given arbitrary input noise, [12, 49, 71, 15, 37, 11] introduce a generator network g_θ that yields synthetic images to perform Knowledge Distillation [29] between teacher and student networks.

Since we have no access to the original training dataset in zero-shot learning, it is hard to find the optimal generator g_θ. Generator-based methods have to optimize g_θ indirectly via a KL divergence between categorical probability distributions instead of max E_{x∼p̂}[ln p(x)] as in VAE/GANs. In light of this, we formulate the optimization of the synthetic images I as generative modeling that maximizes E[ln p(µ, σ²; I)].

3. Approach

3.1. Preliminary

The purpose of generative modeling is to map random variables to samples and to generate samples distributed according to p̂(x), defined over datapoints X. We are interested in estimating p̂ using maximum likelihood estimation, i.e., finding the best p(x) that approximates p̂ as measured by the Kullback-Leibler divergence,

    L(x; θ) = KL(p̂(x) || p(x; θ)) = E_{p̂(x)}[ln p̂(x) − ln p(x; θ)].    (1)

Here we use a parametric model for the distribution p. Since the first term of the expectation does not depend on θ, minimizing L amounts to optimizing θ so as to maximize the log likelihood E_{p̂(x)}[ln p(x; θ)].

Ideally, p(x; θ) should be sufficiently expressive and flexible to describe the true distribution p̂.
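To make the relation in Eq. (1) concrete, here is a small numerical check of our own (the categorical distributions are made up for illustration): because E_p̂[ln p̂(x)] is a constant, the candidate model with the smallest KL divergence from p̂ is exactly the one with the largest expected log-likelihood.

import numpy as np

# A toy categorical "data" distribution p_hat over 4 symbols (illustrative values only).
p_hat = np.array([0.5, 0.2, 0.2, 0.1])

def kl(p, q):
    """KL(p || q) for categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def expected_loglik(p_hat, q):
    """E_{p_hat}[ln q(x)] for a categorical model q."""
    return float(np.sum(p_hat * np.log(q)))

# A few candidate models p(x; theta).
candidates = {
    "theta_a": np.array([0.25, 0.25, 0.25, 0.25]),
    "theta_b": np.array([0.40, 0.30, 0.20, 0.10]),
    "theta_c": np.array([0.50, 0.20, 0.20, 0.10]),  # equals p_hat
}

for name, q in candidates.items():
    print(name, "KL =", round(kl(p_hat, q), 4),
          "E[ln p] =", round(expected_loglik(p_hat, q), 4))
# The candidate with the smallest KL also has the largest expected log-likelihood,
# since KL(p_hat || q) = E_{p_hat}[ln p_hat] - E_{p_hat}[ln q] and the first term is constant.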
To this end, we introduce a latent variable z,

    p(x; θ) = ∫ p(x|z; θ) p(z) dz = E_{p(z)}[p(x|z; θ)],    (2)

so that the marginal distribution p, computed from the product of diverse distributions (i.e., the joint distribution), can better approximate p̂.

3.2. Generative Zero-shot Quantization

We now describe the optimization procedure of Generative Zero-shot Quantization (GZNQ) with batch normalization and draw the resemblance to generative modeling.

The feed-forward function of a deep neural network at the l-th layer can be described as

    F^l = W_l φ(W_{l−1} ... φ(W_2 φ(W_1 f_I(z)))),    (3)

    µ_batch = (1/N) Σ_{i=1}^{N} F_i^l,    σ²_batch = (1/N) Σ_{i=1}^{N} (F_i^l − µ_batch)²,    (4)

where φ(·) is an element-wise nonlinearity function and W_l is the weight of the l-th layer (frozen during the optimization process). The mean µ_batch and the variance σ²_batch are calculated per channel over the mini-batch. Furthermore, we denote the input to the network as f_I(z). That is, we have a vector of latent variables z in some high-dimensional space Z², but we can easily sample z_i according to p(z). We then have a family of deterministic functions f_I(z_i), parameterized by the synthetic samples/images I in the pixel space I. z_i determines parameters such as the number of places by which the pixels of the image are shifted and whether to apply the flipping/jitter/shift function f to I. Though I, W are fixed and the mapping Z × I → µ, σ² is deterministic, if z is random, then µ_batch and σ²_batch are random variables in the space of µ and σ². We wish to find the optimal I* such that, even with random flipping/jitter/shift, the computed µ_batch, σ²_batch are still very similar to the Batch-Normalization (BN) statistics [35] generated by real images.

² Formally, say z is a multivariate random variable. The distribution of each of the component random variables can be Bernoulli(p) or Unif(a, b).

Recalling Equation (1), minimizing the KL divergence is equivalent to maximizing the following log likelihood:

    I* = arg max_I ln E_{p(z)}[p(µ, σ²|z; I)].    (5)

To perform stochastic gradient descent, we need to compute the expectation. However, taking the expectation with respect to p(z) in closed form is not possible in practice. Instead, we take a Monte Carlo (MC) estimate by sampling from p(z):

    E_{p(z)}[p(µ, σ²|z; I)] ≈ (1/n) Σ_{i=1}^{n} p(µ, σ²|z_i; I).    (6)

Then, we may approximate the distributions of the mean and variance of a mini-batch, i.e., p(µ) and p(σ²).

Distribution matching. For the mean variable, we have µ_batch = (1/N) Σ_{i=1}^{N} F_i, where the F_i are features in the sampled mini-batch. We assume that the samples of the random variable are i.i.d.; then, by the central limit theorem (CLT), we obtain

    µ_batch ∼ N(µ, σ²/N)    (7)

for sufficiently large N, given µ = E[F] and σ² = E[(F − µ_batch)²]. Similarly, we get

    σ²_batch ∼ N(σ², Var[(F^l − µ)²] / N),    (8)

where N accounts for the batch size and Var[·] is the finite variance (details in the Appendix). Then, we further rewrite Equation (5) through (6)-(8) as follows:

    L_DM = (N / (2σ²)) (µ_batch − µ)² + (N / (2 Var[(F^l − µ)²])) (σ²_batch − σ²)²    [term I]
           + (1/2) ln Var[(F^l − µ)²].    (9)

Note that the popular Batch-Normalization matching loss in recent works [11, 70, 69],

    min ‖µ_batch − µ‖²₂ + ‖σ²_batch − σ²‖²₂,

can be regarded as a simplified term I in Eq. (9), leaving out the correlation between µ and σ² (i.e., the coefficients). Another group of recent methods [15, 25] presents

    min log(µ_batch / µ) − (1/2) (1 − (σ² + (σ² − σ²_batch)²) / σ²_batch),    (10)

which actually minimizes the following objective:

    min KL(N(µ, σ²) || p(F^l)),    F^l ∼ N(µ_batch, σ²_batch).

That is, it approximates the distribution p(F^l) defined over the features F^l rather than the Batch-Normalization statistics µ, σ² in Eq. (5). Since the parameter space of the feature maps is much larger than that of µ, σ², we adopt L_DM to facilitate the learning process.

Pseudo-label generator. Unfortunately, the Monte Carlo estimate in (6) can be inaccurate given limited sampling³, which may lead to a large gradient variance and poor synthesized results. Hence, it is common practice to introduce regularizations via prior knowledge of I, e.g.,

    min L_CE(φ_ω(I), y),    (11)

where φ_ω(I) produces the categorical probability and y accounts for the ground truth; this can be regarded as a prior regularization on I.

³ Consider I_T = (1/T) Σ_t f(x_t); I_T → E_{x∼p}[f(x)] holds as T → ∞.
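The distribution-matching loss L_DM in Eq. (9) can be implemented directly from the stored BN statistics. The sketch below is our own PyTorch rendering rather than the authors' released code; the restriction to a single BatchNorm2d layer and the final averaging over channels are simplifications.

import torch
import torch.nn as nn

def distribution_matching_loss(feat: torch.Tensor, bn: nn.BatchNorm2d) -> torch.Tensor:
    """Sketch of the per-layer loss in Eq. (9).

    feat : pre-BN feature map of the synthetic batch, shape (N, C, H, W).
    bn   : the matching BatchNorm2d layer of the frozen pre-trained network,
           whose running_mean / running_var play the role of mu and sigma^2.
    """
    n = feat.size(0)                          # batch size N in Eqs. (7)-(9)
    x = feat.permute(1, 0, 2, 3).flatten(1)   # (C, N*H*W): per-channel samples

    mu_batch = x.mean(dim=1)                  # per-channel batch mean
    var_batch = x.var(dim=1, unbiased=False)  # per-channel batch variance

    mu = bn.running_mean.detach()
    sigma2 = bn.running_var.detach()
    # Var[(F - mu)^2], the finite variance appearing in Eqs. (8)-(9).
    var_sq = ((x - mu.unsqueeze(1)) ** 2).var(dim=1, unbiased=False).clamp_min(1e-8)

    term_mean = 0.5 * n / sigma2.clamp_min(1e-8) * (mu_batch - mu) ** 2
    term_var = 0.5 * n / var_sq * (var_batch - sigma2) ** 2   # "term I" with its coefficient
    log_term = 0.5 * torch.log(var_sq)

    # Aggregated over channels; a full pipeline would hook every BN layer in the
    # last block (Section 4.1) and sum the per-layer terms.
    return (term_mean + term_var + log_term).mean()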
[Figure 2 diagram: blocks labelled Noise, Model, Synthesized Data, Pseudo-label Generator (Original / Quantized / Pruned / Low-rank Model), Training-aware Quantization, Quantized Models; loss terms: BN Distribution Matching Loss, Cross-Entropy Loss, KL Divergence Loss; the legend distinguishes forward & backward propagation from forward-only propagation.]

Figure 2: Generative Zero-shot Network Quantization (GZNQ) setup: GZNQ uses a generative model to produce the mean and variance of the Batch-Normalization (BN) layers [35] and meanwhile optimizes the synthesized images, which are then used to perform the subsequent training-aware quantization. The pseudo-label generator consists of several data-free post-training compressed models, which serve as a multi-model ensemble or voting classifier.
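To make the first stage of Figure 2 concrete, the following self-contained toy loop (our own illustration, not the authors' implementation) optimizes a small batch of synthetic images against a frozen stand-in network: the images themselves are the trainable parameters, random flip/shift plays the role of f_I(z), and the objective combines a simplified BN-statistics matching term with the cross-entropy prior of Eq. (11). The ensemble KL term and channel gating of the full method are omitted here.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# A toy frozen classifier standing in for the pre-trained ResNet; its BN running
# statistics are left at their defaults here, whereas in GZNQ they come from real
# pre-training.
teacher = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Stage 1: the synthetic images I themselves are the trainable parameters.
images = torch.randn(8, 3, 32, 32, requires_grad=True)
labels = torch.randint(0, 10, (8,))              # assumed target classes
opt = torch.optim.Adam([images], lr=0.4, betas=(0.3, 0.9))  # settings from Section 4.2

def bn_matching_loss(model, x):
    """Squared distance between batch statistics and BN running statistics
    (the simplified 'term I' form; Eq. (9) adds the CLT coefficients)."""
    loss, h = 0.0, x
    for m in model:
        if isinstance(m, nn.BatchNorm2d):
            mu = h.mean(dim=(0, 2, 3))
            var = h.var(dim=(0, 2, 3), unbiased=False)
            loss = loss + ((mu - m.running_mean) ** 2).sum() \
                        + ((var - m.running_var) ** 2).sum()
        h = m(h)
    return loss

for step in range(200):
    opt.zero_grad()
    # Random flip/shift plays the role of f_I(z).
    x = images.flip(dims=[-1]) if torch.rand(()) < 0.5 else images
    x = torch.roll(x, shifts=int(torch.randint(-3, 4, (1,))), dims=-1)
    loss = 0.03 * bn_matching_loss(teacher, x) + F.cross_entropy(teacher(x), labels)
    loss.backward()
    opt.step()
# Stage 2 (not shown): quantize the network and fine-tune it with KD on the
# synthesized set, as described in Sections 4.2-4.3.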

Recently, [31, 32] showed that compressed models are more vulnerable to challenging or complicated examples. Inspired by these findings, we wished to introduce a post-training quantized low-bit network as the pseudo-label generator to help produce "hard" samples. However, the selection of the bitwidth can be tricky. High-bit networks yield nearly the same distribution as the full-precision counterpart, which results in noisy synthetic results (as pointed out in the adversarial attack literature, high-frequency noise can easily fool the network). Low-bit alternatives fail on easy samples, which damages the image diversity. To solve this, we turn to the model ensemble technique, shown in Figure 2, and then reveal the similarity between (12) and training with multiple generators/discriminators in GANs [30, 21].

Figure 3: An ensemble of compressed models (same architecture using different post-training compression schemes, e.g., weight MSE quantization + weight magnitude-based unstructured pruning + SVD decomposition of 3×3 and FC layers) generates a relatively reliable pseudo-label, which is more similar to the distribution of the original network outputs than every single compressed model when the accuracy is comparable. (Panels: (a) ResNet-18, (b) ResNet-50; axes: Top-1 Acc. (%) vs. D_KL(y_compressed || y_original); curves: Ensemble, Low-rank, Pruned, Quantized, Original.)

An ensemble of different post-training compressed models generates a categorical distribution similar to that of the original network (illustrated in Figure 3), and it is more flexible for adjusting the regularization strength than discrete bitwidth selection. Note that ensemble modeling still obtains a small KL distance when the accuracy is relatively low. Here, we get the prior regularization on I as

    L_KL = KL( φ_ω(f_I(z)) || (1/M) Σ_{i=1}^{M} φ_ω̂i(f_I(z)) )
         = Σ_j φ_ωj(f_I(z)) log [ φ_ωj(f_I(z)) / ( (1/M) Σ_{i=1}^{M} φ_ω̂i,j(f_I(z)) ) ],    (12)

where ω̂i refers to the i-th compressed model. For notational simplicity, we shall in the remaining text denote φ_ωj(f_I(z)) as φ_ωj. By simply applying the AM-GM inequality to (12), we can easily prove that (12) serves as a "lower bound" for the objective with multiple compressed models:

    Σ_{j=1}^{N} φ_ωj log [ φ_ωj / ( (1/M) Σ_i φ_ω̂i,j ) ] ≤ Σ_{j=1}^{N} φ_ωj log [ φ_ωj / ( Π_i φ_ω̂i,j )^{1/M} ]
         = (1/M) Σ_{i=1}^{M} KL( φ_ω(f_I(z)) || φ_ω̂i(f_I(z)) ).    (13)

[21] shows that multiple discriminators can alleviate the mode collapse problem in GANs. Here, (13) encourages the synthesized images to be generally "hard" for all compressed models (i.e., a large KL divergence corresponds to disagreement between the full-precision network and the compressed models).
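A minimal sketch of the divergence in Eq. (12), assuming the full-precision network and the M compressed models are given as callables that return logits (a hypothetical setup; the sign and weight with which this term enters the overall objective, i.e., how strongly it pushes samples to be "hard" for the ensemble, are left to the training loop):

import torch
import torch.nn.functional as F
from typing import Callable, Sequence

def ensemble_kl_loss(full_model: Callable[[torch.Tensor], torch.Tensor],
                     compressed_models: Sequence[Callable[[torch.Tensor], torch.Tensor]],
                     x: torch.Tensor) -> torch.Tensor:
    """L_KL of Eq. (12): KL between the full-precision categorical distribution
    and the average of the compressed models' distributions."""
    p_full = F.softmax(full_model(x), dim=1)                        # phi_omega(f_I(z))
    p_avg = torch.stack([F.softmax(m(x), dim=1)
                         for m in compressed_models]).mean(dim=0)   # (1/M) sum_i phi_omega_hat_i
    # KL(p_full || p_avg), summed over classes and averaged over the batch.
    return (p_full * (p_full.clamp_min(1e-12).log()
                      - p_avg.clamp_min(1e-12).log())).sum(dim=1).mean()

# Example with dummy linear "models" on random inputs:
models = [torch.nn.Linear(8, 5) for _ in range(3)]
x = torch.randn(4, 8)
print(ensemble_kl_loss(torch.nn.Linear(8, 5), models, x))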
Channel gating. From the view of connectionism, "memory" is created by modifying the strength of the connections between neural units [64, 9], i.e., the weight matrix. In light of this, we wish to perform the optimal surgery [40, 68] to enhance the "memory" when generating samples belonging to a specific category. More specifically, we use per-sample channel pruning to encourage learning more entangled representations, as shown in Figure 4. Since channel pruning may severely damage the BN statistics, we only apply this setting to the last convolution layer. Fortunately, high-level neurons in deep CNNs are more semantically meaningful, which meets our needs.

Figure 4: Illustration of the channel gating module: a per-sample binary channel gating matrix G ∈ {0, 1}^{N×C} is applied to the feature maps of the images in a mini-batch. Note that each sample has its own binary gating mask G_i ∈ {0, 1}^{1×c}. In this case, channel = 4.

We use the common Gumbel-Softmax trick [46] to perform channel gating:

    G_{i,j} = 1  if δ( (α_{i,j} + log U − log(1 − U)) / τ ) ≥ 0.5,  and 0 otherwise,    (14)

where δ is the sigmoid function, U ∼ Uniform(0, 1), and τ is the temperature that controls the difference between the softmax and argmax functions. Compared with channel pruning [6] and DARTS [43], we introduce a 2D trainable matrix α ∈ R^{C×N} instead of a vector to better describe each sample/category. Figure 5b further illustrates the effect of channel gating: samples of the same main category yield similar masks after training.

Figure 5: We visualize the learned channel gating masks of ten samples in (a), generated by an ImageNet ResNet-50. There are plenty of zero elements, which illustrates the neural redundancy. We further show the correlation matrix of the gating masks of the ten samples in (b). Images belonging to the same main category produce higher responses than irrelevant classes, e.g., the gating mask of cardoon is more similar to daisy than to cheeseburger. (Panels: (a) channel gating, image ID vs. channel; (b) correlation matrix over the classes hand-held computer, cellphone, Pekinese, golden retriever, white wolf, red fox, restaurant, cheeseburger, daisy, cardoon.)

4. Experiments

We perform experiments on the small-scale CIFAR-10/100 datasets (32 × 32 pixels) and the complex ImageNet dataset (224 × 224 pixels, 1k classes). The quantization and fine-tuning processes are strictly data-free. We then report the Top-1 accuracy on the validation set.

4.1. Ablation study

In this section, we evaluate the effect of each component. All the ablation experiments are conducted on the ImageNet dataset with a pre-trained standard ResNet-50 [26]. We use the popular Inception Score (IS) [62] to measure the sample quality despite its notable flaws [5]. As shown in Table 3, introducing prior regularization contributes significantly to higher inception scores, and the other components further improve the IS. Since the activations become more semantically meaningful as layers go deeper, we apply (9) to the convolution layers in the last block of the ResNets. Table 4 shows that ensemble modeling leads to better performance than a single compressed model.

4.2. Generative settings

As shown in Figure 2, GZNQ is a two-stage scheme. In this section, we detail the generative settings of the first stage. For the CIFAR models, we first train full-precision networks from scratch (initial learning rate 0.1 with a cosine scheduler; weight decay 1e-4; all networks are trained for 300 epochs with the SGD optimizer). Then, we utilize the proposed distribution matching loss, KL divergence loss, channel gating, and CE loss to optimize the synthesized images. More specifically, we use the Adam optimizer (beta1 is set to 0.3 and beta2 to 0.9) with a learning rate of 0.4 and generate 32 × 32 images in a mini-batch of 128. The weight of the BN matching loss is 0.03, and 0.05 for the KL divergence loss. We first halve the image size to speed up training via 2 × 2 average downsampling; after 2k iterations, we use the full-resolution images to optimize for another 2k iterations. In this work, we assume z_i to be a three-dimensional random variable with z_{i,0} ∼ B(0.5) and z_{i,1}, z_{i,2} ∼ U(−30, 30), which determines whether to flip the images and by how many pixels to shift them along each dimension. We follow most of the CIFAR-10/100 settings in the ImageNet experiments. Since the BN statistics and pseudo-labels are dataset-dependent, we adjust the weights of the BN matching loss and the KL divergence loss to 0.01 and 0.1, respectively. The ImageNet pre-trained models are downloaded directly from the torchvision model zoo [58].
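A sketch of the per-sample channel gate in Eq. (14) above; the straight-through pass and the explicit 0.5 threshold on the sigmoid output are standard choices for this kind of hard gate and should be read as our assumptions rather than the paper's exact implementation.

import torch

def channel_gate(alpha: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Per-sample binary channel mask G in {0,1}^(C x N) from trainable logits
    alpha in R^(C x N), following Eq. (14); Gumbel-Sigmoid with a straight-through pass.
    """
    u = torch.rand_like(alpha).clamp(1e-6, 1 - 1e-6)    # U ~ Uniform(0, 1)
    logits = (alpha + torch.log(u) - torch.log(1.0 - u)) / tau
    soft = torch.sigmoid(logits)                         # delta(.) in Eq. (14)
    hard = (soft >= 0.5).float()                         # binarize
    # Straight-through: forward uses the hard mask, backward uses the soft one.
    return hard + soft - soft.detach()

# Gating the last conv layer's features: mask shape (N, C, 1, 1) broadcast over H, W.
feat = torch.randn(4, 8, 7, 7)                           # toy feature map
alpha = torch.zeros(8, 4, requires_grad=True)            # alpha in R^(C x N), as in the text
mask = channel_gate(alpha).t().unsqueeze(-1).unsqueeze(-1)
gated = feat * mask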
Dataset   | Pre-trained Model | Method                | W bit | A bit | Quant Acc (%) | Acc Drop (%) | Fine-tuning
----------|-------------------|-----------------------|-------|-------|---------------|--------------|------------
CIFAR-10  | ResNet-20 (0.27M) | ZeroQ [11]            | 4     | 4     | 79.30         | 14.73        | –
CIFAR-10  | ResNet-20 (0.27M) | Ours                  | 4     | 4     | 89.06         | 4.07         | –
CIFAR-10  | ResNet-20 (0.27M) | GDFQ [69]             | 4     | 4     | 90.25         | 3.78         | Yes
CIFAR-10  | ResNet-20 (0.27M) | Ours                  | 4     | 4     | 91.30         | 1.83         | Yes
CIFAR-10  | ResNet-44 (0.66M) | Knowledge Within [25] | 4     | 4     | 89.10         | 4.13         | –
CIFAR-10  | ResNet-44 (0.66M) | Ours                  | 4     | 4     | 91.46         | 2.92         | –
CIFAR-10  | ResNet-44 (0.66M) | Knowledge Within [25] | 4†    | 8     | 92.25         | 0.99         | –
CIFAR-10  | ResNet-44 (0.66M) | Ours                  | 4     | 8     | 93.57         | 0.83         | –
CIFAR-10  | WRN16-1 (0.17M)   | DFNQ [15]             | 4     | 8     | 88.91         | 2.06         | Yes
CIFAR-10  | WRN16-1 (0.17M)   | Ours                  | 4     | 4     | 89.00         | 2.38         | Yes
CIFAR-10  | WRN16-1 (0.17M)   | DFNQ [15]             | 4     | 8     | 86.29         | 4.68         | –
CIFAR-10  | WRN16-1 (0.17M)   | Ours                  | 4     | 4     | 87.59         | 3.79         | –
CIFAR-10  | WRN40-2 (2.24M)   | DFNQ [15]             | 4     | 8     | 94.22         | 0.55         | Yes
CIFAR-10  | WRN40-2 (2.24M)   | Ours                  | 4     | 4     | 94.81         | 0.37         | Yes
CIFAR-10  | WRN40-2 (2.24M)   | DFNQ [15]             | 4     | 8     | 93.14         | 1.63         | –
CIFAR-10  | WRN40-2 (2.24M)   | Ours                  | 4     | 4     | 94.06         | 1.09         | –
CIFAR-100 | ResNet-20 (0.28M) | ZeroQ [11]            | 4     | 4     | 45.20         | 25.13        | –
CIFAR-100 | ResNet-20 (0.28M) | Ours                  | 4     | 4     | 58.99         | 10.18        | –
CIFAR-100 | ResNet-20 (0.28M) | GDFQ [69]             | 4     | 4     | 63.58         | 6.75         | Yes
CIFAR-100 | ResNet-20 (0.28M) | Ours                  | 4     | 4     | 64.37         | 4.80         | Yes
CIFAR-100 | ResNet-18 (11.2M) | DFNQ [15]             | 4     | 8     | 75.15         | 2.17         | Yes
CIFAR-100 | ResNet-18 (11.2M) | Ours                  | 4     | 5     | 75.95         | 3.16         | Yes
CIFAR-100 | ResNet-18 (11.2M) | DFNQ [15]             | 4     | 8     | 71.02         | 6.30         | –
CIFAR-100 | ResNet-18 (11.2M) | Ours                  | 4     | 5     | 71.15         | 7.96         | –

Table 1: Results of zero-shot quantization methods on CIFAR-10/100. "W bit" denotes the weight quantization bitwidth and "A bit" the quantization bits for activations. "Fine-tuning" refers to re-training on the generated images using knowledge distillation. † indicates that the first and last layers are in 8-bit. We directly cite the best results reported in the original zero-shot quantization papers (ZeroQ 4-bit activations from [69]).

In our experiments, we do observe that networks trained only with random 256 × N / N × 256 cropping and flipping contribute to higher-quality images, but their accuracy is relatively lower than that of the official models. This finding is consistent with the policy in differential privacy [1]. Since we focus on generative modeling and network quantization, more ablation studies on this aspect are left to future work.

4.3. Quantization details

Low-bit activation quantization typically requires a calibration set to constrain the dynamic range of the intermediate layers to a finite set of fixed points. As shown in Table 5, it is crucial to collect images sampled from the same domain as the original dataset. To fully evaluate the effectiveness of GZNQ, we use synthetic images as the calibration set for sampling activations [39, 34] and BN statistics [59, 27]. Besides, we use the MSE quantizer [65, 27] for both weight and activation quantization, though it is sensitive to outliers. In all our experiments, we consider floating-point per-kernel scaling factors for the weights and a per-tensor scale for each layer's activation values. We keep a copy of the full-precision weights to accumulate gradients and then conduct quantization in the forward pass [34, 25]. Additionally, the bias terms in convolution and fully-connected layers, the gradients, and the Batch-Normalization layers are kept in floating point. We argue that advanced calibration methods [53, 67, 69] or quantization schemes [16, 14, 36] coupled with GZNQ may further improve the performance.

For our data-free training-aware quantization, i.e., fine-tuning, we follow the setting in [69, 15] and utilize vanilla Knowledge Distillation (KD) [29] between the original network and the compressed model. Since extra data augmentation can be a game-changer in fine-tuning, we use the common 4-pixel padding with 32 × 32 random cropping for CIFAR and the official PyTorch pre-processing for ImageNet, without bells and whistles. The initial learning rate for KD is 0.01; models are trained for 300/100 epochs on CIFAR/ImageNet, decayed every N/3 iterations with a multiplier of 0.1. The batch size is 128 in all experiments. We also follow [70, 69] and fix the batch normalization statistics during fine-tuning on ImageNet. All convolution and fully-connected layers are quantized to 4/6-bit unless otherwise specified, including the first and last layers. The synthesized CIFAR dataset consists of 50k images, and for ImageNet we generate roughly 100 images per category.
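As a concrete reading of the choices above (symmetric uniform quantization, an MSE-based clipping search, per-kernel weight scales, a per-tensor activation scale, and a straight-through estimator over a kept full-precision copy), here is a short sketch of our own; the grid-search range and step count are arbitrary illustration values, not the paper's.

import torch

def quantize(x: torch.Tensor, scale: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform quantization of x with the given scale."""
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def mse_scale(x: torch.Tensor, bits: int, steps: int = 50) -> torch.Tensor:
    """Pick the scale whose quantization error has minimum MSE (grid search)."""
    qmax = 2 ** (bits - 1) - 1
    best_scale, best_err = None, float("inf")
    max_abs = x.abs().max()
    for frac in torch.linspace(0.3, 1.0, steps):
        scale = (max_abs * frac) / qmax
        err = ((quantize(x, scale, bits) - x) ** 2).mean().item()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

def fake_quant_weights(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Per-kernel (per output channel) weight quantization with a straight-through
    estimator so gradients still reach the kept full-precision copy during fine-tuning."""
    w_q = torch.stack([quantize(w[c], mse_scale(w[c], bits), bits)
                       for c in range(w.size(0))])
    return w + (w_q - w).detach()   # forward: quantized values; backward: identity

# Activations use a single per-tensor scale, calibrated on the synthetic images.
act = torch.relu(torch.randn(16, 32, 8, 8))
act_q = quantize(act, mse_scale(act, bits=4), bits=4)
w = torch.randn(32, 16, 3, 3, requires_grad=True)
w_q = fake_quant_weights(w, bits=4)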
Pre-trained Model   | Method                | W bit | A bit | Quant Top-1 (%) | Top-1 Drop (%) | Real-data | Fine-tuning
--------------------|-----------------------|-------|-------|-----------------|----------------|-----------|------------
ResNet-50 (25.56M)  | DoReFa [75]           | 4     | 4     | 71.4            | 5.5            | Yes       | Yes
ResNet-50 (25.56M)  | Ours                  | 4     | 4     | 72.7            | 3.4            | –         | Yes
ResNet-50 (25.56M)  | OMSE [16]             | 4     | 32    | 67.4            | 8.6            | –         | –
ResNet-50 (25.56M)  | Ours                  | 4     | 6     | 68.1            | 8.1            | –         | –
ResNet-50 (25.56M)  | OCS [73]              | 6     | 6     | 74.8            | 1.3            | Yes       | –
ResNet-50 (25.56M)  | ACIQ [3]              | 6     | 6     | 74.3            | 1.3            | Yes       | –
ResNet-50 (25.56M)  | Ours                  | 6     | 6     | 75.5            | 0.6            | –         | –
ResNet-18 (11.69M)  | ZeroQ [11]            | 4     | 4     | 26.0            | 45.4           | –         | –
ResNet-18 (11.69M)  | GDFQ [69]             | 4     | 4     | 33.3            | 38.2           | Yes       | –
ResNet-18 (11.69M)  | Ours                  | 4     | 4     | 56.8            | 12.9           | –         | –
ResNet-18 (11.69M)  | Knowledge Within [25] | 4†    | 4     | 55.5            | 14.3           | –         | –
ResNet-18 (11.69M)  | Ours                  | 4†    | 4     | 58.9            | 10.9           | –         | –
ResNet-18 (11.69M)  | GDFQ [69]             | 4     | 4     | 60.6            | 10.9           | –         | Yes
ResNet-18 (11.69M)  | Ours                  | 4     | 4     | 64.5            | 5.30           | –         | Yes
ResNet-18 (11.69M)  | Integer-Only [36]     | 6     | 6     | 67.3            | 2.46           | Yes       | Yes
ResNet-18 (11.69M)  | DFQ [54]              | 6     | 6     | 66.3            | 3.46           | –         | –
ResNet-18 (11.69M)  | Ours                  | 6     | 6     | 69.0            | 0.78           | –         | –
MobileNetV2 (3.51M) | Knowledge Within [25] | 4‡    | 4     | 16.10           | 55.78          | –         | –
MobileNetV2 (3.51M) | Ours                  | 4     | 4     | 53.53           | 17.87          | –         | –
MobileNetV2 (3.51M) | Integer-Only [36]     | 6     | 6     | 70.90           | 0.85           | Yes       | Yes
MobileNetV2 (3.51M) | Ours                  | 6     | 6     | 71.12           | 0.63           | –         | Yes
MobileNetV2 (3.51M) | DFQ [54]              | 8     | 8     | 71.19           | 0.38           | –         | –
MobileNetV2 (3.51M) | Knowledge Within [25] | 8     | 8     | 71.32           | 0.56           | –         | –
MobileNetV2 (3.51M) | Ours                  | 8     | 8     | 71.38           | 0.36           | –         | –

Table 2: Quantization results on ImageNet. "Real-data" means using the original dataset as the calibration set to quantize activations or to fine-tune weights. "Fine-tuning" refers to re-training with KD on the generated images (if no real data is required) or label-based fine-tuning. † indicates that the first and last layers are in 8-bit. ‡ means 8-bit 1 × 1 convolution layers. We directly cite the best results reported in the original papers (ZeroQ from [69]).

BN  | CE-Loss | Pseudo-label | BN+ | Gating | Inception Score ↑
----|---------|--------------|-----|--------|-------------------
Yes | –       | –            | –   | –      | 7.4
Yes | Yes     | –            | –   | –      | 43.7
Yes | Yes     | Yes          | –   | –      | 74.0
Yes | Yes     | Yes          | Yes | –      | 80.6
Yes | Yes     | Yes          | Yes | Yes    | 84.7

Table 3: Impact of the proposed modules on GZNQ. "BN+" refers to Equation (9). Following GAN works [8, 72], we use the Inception Score (IS) to evaluate the visual fidelity of the generated images.

     | Quantization | Pruning   | Low-rank  | Ensemble (12) | Ensemble (13)
-----|--------------|-----------|-----------|---------------|---------------
IS ↑ | 71.9±3.85    | 73.0±2.47 | 81.3±2.16 | 84.7±1.76     | 82.7±2.90

Table 4: Ablation study on the effect of the model ensemble in the pseudo-label generator. IS stands for Inception Score.

Calibration Dataset | ResNet-20 (CIFAR-10) | WRN40-2 (CIFAR-10) | ResNet-20 (CIFAR-100) | ResNet-18 (CIFAR-100)
--------------------|----------------------|--------------------|-----------------------|----------------------
SVHN                | 24.81                | 50.11              | 5.43                  | 7.11
CIFAR-100           | 88.45                | 92.94              | 60.29                 | 76.63
CIFAR-10            | 90.06                | 94.01              | 57.29                 | 75.39
Ours                | 89.06                | 94.06              | 58.99                 | 71.15
FP32                | 93.13                | 95.18              | 69.17                 | 79.11

Table 5: Impact of using different calibration datasets for 4-bit weight and 4-bit activation post-training quantization (quantized model accuracy, %). Our synthesized dataset achieves performance comparable to the real images in most cases.

4.4. Comparisons on benchmarks

We compare different zero-shot quantization methods [11, 25, 15, 69] on CIFAR-10/100 and show the results in Table 1. Our method consistently outperforms other state-of-the-art approaches in 4-bit settings. Furthermore, benefiting from the high-quality generated images, the experimental results in Table 2 illustrate that, as the dataset gets larger, GZNQ still obtains notable improvements over baseline methods. Due to the different full-precision baseline accuracies reported in previous works, we also include the accuracy gap between the floating-point and quantized networks in our comparisons. We directly cite the results in the original papers to make a fair comparison, using the same architecture. In all experiments, we fine-tune the quantized models on the synthetic dataset generated by their corresponding full-precision networks.

We further compare our method with [11, 25, 12, 70] on the visual fidelity of the images generated on CIFAR-10 and ImageNet. Figures 6-8 show that GZNQ is able to generate images with high fidelity and resolution. We observe that GZNQ images are more realistic than those of other competitors (which appear like cartoon images). Following [70], we conduct a quantitative analysis of the image quality via the Inception Score (IS) [62]. Our approach surpasses previous works, which is consistent with the visualization results.
Method             | Resolution | GAN | Inception Score ↑
-------------------|------------|-----|-------------------
BigGAN-deep [8]    | 256        | Yes | 202.6
BigGAN [8]         | 256        | Yes | 178.0
SAGAN [72]         | 128        | Yes | 52.5
SNGAN [50]         | 128        | Yes | 35.3
GZNQ               | 224        | –   | 84.7±2.8
DeepInversion [70] | 224        | –   | 60.6
DeepDream [51]     | 224        | –   | 6.2
ZeroQ [11]         | 224        | –   | 2.8

Table 6: Inception Score (IS, higher is better) of various methods on ImageNet. The SNGAN score is reported in [63] and the DeepDream score is from [70]. The bottom four schemes are data-free and utilize an ImageNet pre-trained ResNet-50 to obtain the synthesized images.
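For reference, the Inception Score used in Tables 3, 4, and 6 is IS = exp(E_x[KL(p(y|x) || p(y))]), computed from the class posteriors of an ImageNet pre-trained Inception network [62]. The snippet below implements only the formula on given posteriors; running the classifier and the usual split-averaging are omitted.

import torch

def inception_score(probs: torch.Tensor) -> float:
    """probs: (num_images, num_classes) class posteriors p(y|x) from the
    Inception network; returns exp(mean_x KL(p(y|x) || p(y)))."""
    p_y = probs.mean(dim=0, keepdim=True)                 # marginal p(y)
    kl = (probs * (probs.clamp_min(1e-12).log()
                   - p_y.clamp_min(1e-12).log())).sum(dim=1)
    return float(kl.mean().exp())

# Toy example: sharper and more diverse posteriors give a higher score.
print(inception_score(torch.softmax(torch.randn(100, 1000) * 5, dim=1)))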

Figure 6: Synthetic samples generated by a CIFAR-10 pre-trained ResNet-34 at a 32 × 32 resolution: (a) Noise, (b) DeepDream [51], (c) DAFL [12], (d) DeepInversion [70], (e) GZNQ. We directly cite the best visualization results reported in [70].

Figure 7: ImageNet samples generated by our GZNQ ResNet-50 model at 224 × 224 resolution: bear, daisy, balloon, pizza, stoplight, stingray, quill, volcano.

5. Conclusions

We present generative modeling to describe the image synthesis process in zero-shot quantization. Different from data-driven VAE/GANs that estimate p(x) defined over real images X, we focus on matching the distribution of the mean and variance of the Batch Normalization layers in the absence of the original data. The proposed scheme further interprets the recent Batch Normalization matching loss and leads to high-fidelity images. Through extensive experiments, we have shown that GZNQ performs well on the challenging zero-shot quantization task. The generated images also serve as an attempt to visualize what a deep convolutional neural network expects to see in real images.

Figure 8: Synthetic samples generated by an ImageNet pre-trained ResNet-50 using different methods: (a) ZeroQ [11], (b) Knowledge Within [25], (c) DeepInversion [70], (d) GZNQ. We directly cite the best visualization results reported in the original papers.

References

[1] Martín Abadi, Andy Chu, Ian J. Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016.
[2] Shai Abramson, David Saad, and Emanuel Marom. Training a network with ternary weights using the CHIR algorithm. IEEE Trans. Neural Networks, 4(6):997–1000, 1993.
[3] Ron Banner, Yury Nahshan, Elad Hoffer, and Daniel Soudry. ACIQ: Analytical clipping for integer quantization of neural networks. CoRR, abs/1810.05723, 2018.
[4] Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. In NeurIPS, 2019.
[5] Shane T. Barratt and Rishi Sharma. A note on the inception score. CoRR, abs/1801.01973, 2018.
[6] Babak Ehteshami Bejnordi, Tijmen Blankevoort, and Max Welling. Batch-shaping for learning conditional channel gated networks. In ICLR, 2020.
[7] Kartikeya Bhardwaj, Naveen Suda, and Radu Marculescu. Dream distillation: A data-independent model compression framework. CoRR, abs/1905.07072, 2019.
[8] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[9] Cameron Buckner and James Garson. Connectionism. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2019 edition, 2019.
[10] Robert Burbidge, Matthew W. B. Trotter, Bernard F. Buxton, and Sean B. Holden. Drug design by machine learning: Support vector machines for pharmaceutical data analysis. Computers & Chemistry, 26(1):5–14, 2002.
[11] Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. ZeroQ: A novel zero shot quantization framework. In CVPR, 2020.
[12] Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. Data-free learning of student networks. In ICCV, 2019.
[13] Tzi-Dar Chiueh and Rodney M. Goodman. Learning algorithms for neural networks with ternary weights. Neural Networks, 1(Supplement-1):166–167, 1988.
[14] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT: Parameterized clipping activation for quantized neural networks. CoRR, abs/1805.06085, 2018.
[15] Yoojin Choi, Jihwan P. Choi, Mostafa El-Khamy, and Jungwon Lee. Data-free network quantization with adversarial knowledge distillation. In CVPR Workshops, 2020.
[16] Yoni Choukroun, Eli Kravchik, Fan Yang, and Pavel Kisilev. Low-bit quantization of neural networks for efficient inference. In ICCV Workshops, 2019.
[17] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NeurIPS, 2015.
[18] Tim Dettmers. 8-bit approximations for parallelism in deep learning. In ICLR, 2016.
[19] Ugljesa Djuric, Gelareh Zadeh, Kenneth Aldape, and Phedias Diamandis. Precision histology: How deep learning is poised to revitalize histomorphology for personalized cancer care. npj Precision Oncology, 1, 2017.
[20] Carl Doersch. Tutorial on variational autoencoders. CoRR, abs/1606.05908, 2016.
[21] Ishan P. Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative multi-adversarial networks. In ICLR, 2017.
[22] Alexander Finkelstein, Uri Almog, and Mark Grobman. Fighting quantization bias with bias. CoRR, abs/1906.03193, 2019.
[23] Tal Grossman. The CHIR algorithm for feed forward networks with binary weights. In NIPS, 1989.
[24] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, 2015.
[25] Matan Haroush, Itay Hubara, Elad Hoffer, and Daniel Soudry. The knowledge within: Methods for data-free model compression. In CVPR, 2020.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[27] Xiangyu He and Jian Cheng. Learning compression from limited unlabeled data. In ECCV, 2018.
[28] Xiangyu He, Zitao Mo, Ke Cheng, Weixiang Xu, Qinghao Hu, Peisong Wang, Qingshan Liu, and Jian Cheng. ProxyBNN: Learning binarized neural networks via proxy matrices. In ECCV, 2020.
[29] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
[30] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Q. Phung. MGAN: Training generative adversarial nets with multiple generators. In ICLR, 2018.
[31] Sara Hooker, Aaron Courville, Gregory Clark, Yann Dauphin, and Andrea Frome. What do compressed deep neural networks forget? arXiv preprint arXiv:1911.05248, 2019.
[32] Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. Characterising bias in compressed models. CoRR, abs/2010.03058, 2020.
[33] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In NeurIPS, 2016.
[34] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18:187:1–187:30, 2017.
[35] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[36] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, 2018.
[37] Sanjay Kariyappa, Atul Prakash, and Moinuddin K. Qureshi. MAZE: Data-free model stealing attack using zeroth-order gradient estimation. CoRR, abs/2005.03161, 2020.
[38] Hyungjun Kim, Kyungsu Kim, Jinseok Kim, and Jae-Joon Kim. BinaryDuo: Reducing gradient mismatch in binary activation network by coupling binary activations. In ICLR, 2020.
[39] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. CoRR, abs/1806.08342, 2018.
[40] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In NIPS, 1989.
[41] Fengfu Li and Bin Liu. Ternary weight networks. CoRR, abs/1605.04711, 2016.
[42] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In NeurIPS, 2017.
[43] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
[44] Raphael Gontijo Lopes, Stefano Fenu, and Thad Starner. Data-free knowledge distillation for deep neural networks. CoRR, abs/1710.07535, 2017.
[45] Christos Louizos, Matthias Reisser, Tijmen Blankevoort, Efstratios Gavves, and Max Welling. Relaxed quantization for discretized neural networks. In ICLR, 2019.
[46] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
[47] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.
[48] Brais Martínez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos. Training binary neural networks with real-to-binary convolutions. In ICLR, 2020.
[49] Paul Micaelli and Amos J. Storkey. Zero-shot knowledge transfer via adversarial belief matching. In NeurIPS, 2019.
[50] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
[51] A. Mordvintsev, Christopher Olah, and M. Tyka. Inceptionism: Going deeper into neural networks, 2015.
[52] Nelson Morgan et al. Experimental determination of precision requirements for back-propagation training of artificial neural networks. In Proc. Second Int'l Conf. on Microelectronics for Neural Networks, 1991.
[53] Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? Adaptive rounding for post-training quantization. CoRR, abs/2004.10568, 2020.
[54] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In ICCV, 2019.
[55] Gaurav Kumar Nayak, Konda Reddy Mopuri, Vaisakh Shaj, Venkatesh Babu Radhakrishnan, and Anirban Chakraborty. Zero-shot knowledge distillation in deep networks. In ICML, 2019.
[56] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. https://distill.pub/2020/circuits/zoom-in.
[57] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
[58] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
[59] Jorn W. T. Peters and Max Welling. Probabilistic binary neural networks. CoRR, abs/1809.03368, 2018.
[60] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
[61] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[62] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[63] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. How good is my GAN? In ECCV, 2018.
[64] Paul Smolensky. Grammar-based connectionist approaches to language. Cognitive Science, 23(4):589–613, 1999.
[65] Wonyong Sung, Sungho Shin, and Kyuyeon Hwang. Resiliency of deep neural networks under quantization. CoRR, abs/1511.06488, 2015.
[66] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Deep image prior. In CVPR, 2018.
[67] Peisong Wang, Xiangyu He, Qiang Chen, Anda Cheng, Qingshan Liu, and Jian Cheng. Unsupervised network quantization via fixed-point factorization. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[68] Yulong Wang, Hang Su, Bo Zhang, and Xiaolin Hu. Interpret neural networks by identifying critical data routing paths. In CVPR, 2018.
[69] Shoukai Xu, Haokun Li, Bohan Zhuang, Jing Liu, Jiezhang Cao, Chuangrun Liang, and Mingkui Tan. Generative low-bitwidth data free quantization. In ECCV, 2020.
[70] Hongxu Yin, Pavlo Molchanov, Jose M. Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K. Jha, and Jan Kautz. Dreaming to distill: Data-free knowledge transfer via DeepInversion. In CVPR, 2020.
[71] Jaemin Yoo, Minyong Cho, Taebum Kim, and U Kang. Knowledge extraction with no observable data. In NeurIPS, 2019.
[72] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.
[73] Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Christopher De Sa, and Zhiru Zhang. Improving neural network quantization without retraining using outlier channel splitting. In ICML, 2019.
[74] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. In ICLR, 2017.
[75] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016.
[76] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. In ICLR, 2017.
