Black Box Variational Inference

Page created by Wade Patterson
 
CONTINUE READING
Black Box Variational Inference

          Rajesh Ranganath                  Sean Gerrish                      David M. Blei
                       Princeton University, 35 Olden St., Princeton, NJ 08540
                           frajeshr,sgerrish,bleig@cs.princeton.edu

                      Abstract                                marize observations, draw conclusions about current
                                                              data, and make predictions about new data. Central
    Variational inference has become a widely                 to working with latent variable models is the problem
    used method to approximate posteriors in                  of computing the posterior distribution of the latent
    complex latent variables models. However,                 structure. For many interesting models, computing
    deriving a variational inference algorithm gen-           the posterior exactly is intractable: practitioners must
    erally requires significant model-specific anal-          resort to approximate methods.
    ysis. These efforts can hinder and deter us               One of the most widely used methods for approximate
    from quickly developing and exploring a vari-             posterior estimation is variational inference (Wain-
    ety of models for a problem at hand. In this              wright and Jordan, 2008; Jordan et al., 1999). Varia-
    paper, we present a “black box” variational               tional inference tries to find the member of a family of
    inference algorithm, one that can be quickly              simple probability distributions that is closest (in KL
    applied to many models with little additional             divergence) to the true posterior distribution.
    derivation. Our method is based on a stochas-
    tic optimization of the variational objective             For a specific class of models, those where the condi-
    where the noisy gradient is computed from                 tional distributions have a convenient form (and where
    Monte Carlo samples from the variational dis-             a convenient variational family exists), this optimiza-
    tribution. We develop a number of methods                 tion can be carried out with a closed-form coordinate
    to reduce the variance of the gradient, always            ascent algorithm (Ghahramani and Beal, 2001). For
    maintaining the criterion that we want to                 generic models and arbitrary variational families, how-
    avoid difficult model-based derivations. We               ever, there is no closed-form solution: computing the
    evaluate our method against the correspond-               required expectations becomes intractable. In these
    ing black box sampling based methods. We                  settings, practitioners have resorted to model-specific al-
    find that our method reaches better predictive            gorithms (Jaakkola and Jordan, 1996; Blei and Lafferty,
    likelihoods much faster than sampling meth-               2007; Braun and McAuliffe, 2007) or generic algorithms
    ods. Finally, we demonstrate that Black Box               that require model specific computations (Knowles and
    Variational Inference lets us easily explore a            Minka, 2011; Wang and Blei, 2013; Paisley et al., 2012).
    wide space of models by quickly constructing              Deriving these algorithms on a model-by-model basis
    and evaluating several models of longitudinal             is tedious work. This hinders us from rapidly exploring
    healthcare data.                                          modeling assumptions when solving applied problems,
                                                              and it makes variational methods on complicated dis-
                                                              tributions impractical for many practitioners. Our goal
1    Introduction                                             in this paper is to develop a “black box” variational
                                                              inference algorithm, a method that can be quickly ap-
Probabilistic models with latent variables have become        plied to almost any model and with little effort. Our
a mainstay in modern machine learning applications.           method allows practitioners to quickly design, apply,
With latent variables models, we posit a rich latent          and revise models of their data, without painstaking
structure that governs our observations, infer that struc-    derivations each time they want to adjust the model.
ture from large data sets, and use our inferences to sum-
                                                              Variational inference methods frame a posterior esti-
Appearing in Proceedings of the 17t h International Con-      mation problem as an optimization problem, where the
ference on Artificial Intelligence and Statistics (AISTATS)   parameters to be optimized adjust a variational “proxy”
2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copy-         distribution to be similar to the true posterior. Our
right 2014 by the authors.                                    method rewrites the gradient of that objective as the
Black Box Variational Inference

expectation of an easy-to-implement function f of the       have independently considered a similar procedure to
latent and observed variables, where the expectation        ours, where the gradient is construed as an expectation
is taken with respect to the variational distribution;      and the KL is optimized with stochastic optimization.
and we optimize that objective by sampling from the         They too include a term to reduce the variance, but
variational distribution, evaluating the function f , and   do not describe how to set it. We further innovate
forming the corresponding Monte Carlo estimate of           on their approach with Rao-Blackwellization, speci-
the gradient. We then use these noisy gradients in          fied control variates, adaptive learning rates, and data
a stochastic optimization algorithm to optimize the         subsampling. Salimans and Knowles (2012) provide a
variational parameters.                                     framework based on stochastic linear regression. Un-
                                                            like our approach, their method does not generalize
From the practitioner’s perspective, this method re-
                                                            to arbitrary approximating families and requires the
quires only that he or she write functions to evaluate
                                                            inversion of a large matrix that becomes impractical in
the model log-likelihood. The remaining calculations
                                                            high dimensional settings. Kingma and Welling (2013)
(properties of the variational distribution and evalu-
                                                            provide an alternative method for variational inference
ating the Monte Carlo estimate) are easily put into a
                                                            through a reparameterization of the variational distri-
library to share across models, which means our method
                                                            butions. In contrast to our approach, their algorithm is
can be quickly applied to new modeling settings.
                                                            limited to only continuous latent variables. Carbonetto
We will show that reducing the variance of the gradient     et al. (2009) present a stochastic optimization scheme
estimator is essential to the fast convergence of our       for moment estimation based on the specific form of
algorithm. We develop several strategies for controlling    the variational objective when both the model and
the variance. The first is based on Rao-Blackwellization    the approximating family are in the same exponential
(Casella and Robert, 1996), which exploits the factor-      family. This differs from our more general modeling
ization of the variational distribution. The second is      setting where latent variables may be outside of the
based on control variates (Ross, 2002; Paisley et al.,      exponential family. Finally, Paisley et al. (2012) use
2012), using the gradient of the log probability of the     Monte Carlo gradients for difficult terms in the varia-
variational distribution. We emphasize that these vari-     tional objective and also use control variates to reduce
ance reduction methods preserve our goal of black box       variance. However, theirs is not a black-box method.
inference because they do not require computations          Both the objective function and control variates they
specific to the model.                                      propose require model-specific derivations.
Finally, we show how to use recent innovations in vari-
ational inference and stochastic optimization to scale      2   Black Box Variational Inference
up and speed up our algorithm. First, we use adaptive
learning rates (Duchi et al., 2011) to set the step size    Variational inference transforms the problem of approx-
in the stochastic optimization. Second, we develop          imating a conditional distribution into an optimization
generic stochastic variational inference (Hoffman et al.,   problem (Jordan et al., 1999; Bishop, 2006; Wainwright
2013), where we additionally subsample from the data        and Jordan, 2008). The idea is to posit a simple family
to more cheaply compute noisy gradients. This inno-         of distributions over the latent variables and find the
vates on the algorithm of Hoffman et al. (2013), which      member of the family that is closest in KL divergence
requires closed form coordinate updates to compute          to the conditional distribution.
noisy natural gradients.
                                                            In a probabilistic model, let x be observations, z be
We demonstrate our method in two ways. First, we            latent variables, and  the free parameters of a varia-
compare our method against Metropolis-Hastings-in-          tional distribution q.z j /. Our goal is to approximate
Gibbs (Bishop, 2006), a sampling based technique that       p.z j x/ with the free parameter . In variational infer-
requires similar effort on the part of the practitioner.    ence we optimize the Evidence Lower BOund (ELBO),
We find our method reaches better predictive likeli-
hoods much faster than sampling methods. Second, we                  L./ , Eq .z/ Œlog p.x; z/   log q.z/:    (1)
use our method to quickly build and evaluate several
models of longitudinal patient data. This demonstrates      Maximizing the ELBO is equivalent to minimizing the
the ease with which we can now consider models gen-         KL divergence (Jordan et al., 1999; Bishop, 2006). Intu-
erally outside the realm of variational methods.            itively, the first term rewards variational distributions
                                                            that place high mass on configurations of the latent
                                                            variables that also explain the observations; the second
Related work. There have been several lines of work         term rewards variational distributions that are entropic,
that use sampling methods to approximate gradients          i.e., that maximize uncertainty by spreading their mass
in variational inference. Wingate and Weber (2013)          on many configurations.
Rajesh Ranganath, Sean Gerrish, David M. Blei

Practitioners derive variational algorithms to maximize       Algorithm 1 Black Box Variational Inference
the ELBO over the variational parameters by expand-             Input: data x, joint distribution p, mean field vari-
ing the expectation in Eq. 1 and then computing gradi-          ational family q.
ents to use in an optimization procedure. Closed form           Initialize  randomly, t D 1.
coordinate-ascent updates are available for condition-          repeat
ally conjugate exponential family models (Ghahramani              // Draw S samples from q
and Beal, 2001), where the distribution of each latent            for s D 1 to S do
variable given its Markov blanket falls in the same fam-             zŒs  q
ily as the prior, for a small set of variational families.        end for
However, these updates require analytic computation                = tth value of a Robbins Monro sequence
of various expectations for each new model, a problem              =  C  S1 SsD1 r log q.zŒs j /.log p.x; zŒs/
                                                                                P
which is exacerbated when the variational family falls            log q.zŒs j //
outside this small set. This leads to tedious bookkeep-           t Dt C1
ing and overhead for developing new models.                     until change of  is less than 0.01.
The expectation in Eq. 1 is with respect to a known
distribution whose parameter –  – is known. We
will therefore use stochastic optimization to maximize        be computed from samples from the variational poste-
the ELBO. In stochastic optimization, we maximize a           rior. To do this, we write the gradient of the ELBO
function using noisy estimates of its gradient (Robbins       (Eq. 1) as an expectation with respect to the variational
and Monro, 1951; Kushner and Yin, 1997; Bottou and            distribution,
LeCun, 2004). We will form the derivative of the objec-
tive as an expectation with respect to the variational          r L D Eq Œr log q.zj/.log p.x; z/         log q.zj//:
approximation and then sample from the variational                                                                       (2)
approximation to get noisy but unbiased gradients,
which we use to update our parameters. For each sam-          The derivation of Eq. 2 can be found in the appendix.
ple, our noisy gradient requires evaluating the possibly      Note that in statistics the gradient r log q.zj/ of the
unnormalized joint distribution of the observed and           log of a probability distribution is called the score func-
sampled variables, the variational distribution, and the      tion (Cox and Hinkley, 1979). The joint p.x; z/ can be
gradient of the log of the variational distribution. This     replaced by its unnormalized version (see the appendix
is a black box method in that the gradient of the log of      for details). For subsequent sections, any appearance
the variational distribution and sampling method can          of p.x; z/ may be replaced by an unnormalized version.
be derived once for each type of variational distribution
and reused for many models and applications.                  With this equation in hand, we compute noisy unbiased
                                                              gradients of the ELBO with Monte Carlo samples from
                                                              the variational distribution,
Stochastic optimization. Let us now review
stochastic optimization. Let f .x/ be a function to                       S
be maximized and h t .x/ be the realization of a ran-                  1X
                                                              r L          r log q.zs j/.log p.x; zs /     log q.zs j//;
dom variable H.x/ whose expectation is the gradient of                 S sD1
f .x/. Finally, let  t be a nonnegative scalar. Stochastic
                                                                                                     where zs  q.zj/:
optimization updates x at the tth iteration with
                                                                                                                    (3)
                 x tC1       x t C  t h t .x t /:
                                                              With Eq. 3, we can use stochastic optimization to
This converges to a maximum of f .x/ when  t , the           optimize the ELBO.
learning rate, follows the Robbins-Monro conditions,          The basic algorithm is summarized in Algorithm 1. We
                   P1                                         note that the score function and sampling algorithms
                    tD1      t    D      1                   depend only on the variational distribution, not the
                   P1         2
                         tD1  t    <     1:                  underlying model. Thus we can build up a collection of
                                                              these functions for various variational approximations
Because of its simplicity, stochastic optimization is         and reuse them in a package for a broad class of mod-
widely used in statistics and machine learning.               els. Further we did not make any assumptions about
                                                              the form of the model, only that the practitioner can
A noisy gradient of the ELBO. To optimize the                 compute the log of the joint p.x; zs /. This algorithm
ELBO with stochastic optimization, we need to de-             significantly reduces the effort needed to implement
velop an unbiased estimator of its gradient which can         variational inference in a wide variety of models.
Black Box Variational Inference

3     Controlling the Variance                                  respect to i as an iterated conditional expectation
                                                                which simplifies to
We can use Algorithm 1 to maximize the ELBO, but
the variance of the estimator of the gradient (under            r i L D
the Monte Carlo estimate in Eq. 3) can be too large               Eq.i / Œri log q.zi ji /.log pi .x; z.i / /     log q.zi ji //:
to be useful. In practice, the high variance gradients                                                                             (5)
would require very small steps which would lead to
slow convergence. We now show how to reduce this                The derivation of Eq. 5 can be found in the supplement.
variance in two ways, via Rao-Blackwellization and              This equation says that we can Rao-Blackwellize each
easy-to-implement control variates. We exploit the              component of the gradient with respect to variables
structure of our problem to use these methods in a              outside of the its Markov blanket without needing to
way that requires no model-specific derivations, which          compute model-specific conditional expectations.
preserves our goal of black-box variational inference.          We construct a Monte Carlo estimator for the gradient
                                                                of i using samples from the variational distribution,
3.1   Rao-Blackwellization
                                                                     S
                                                                  1X
Rao-Blackwellization (Casella and Robert, 1996) re-                    r log qi .zs ji /.log pi .x; zs /          log qi .zs ji //;
duces the variance of a random variable by replacing it           S sD1 i
with its conditional expectation with respect to a subset                                               where zs  q.i/ .zj/:
of the variables. Note that the conditional expectation                                                                     (6)
of a random variable is a random variable with respect
to the conditioning set. This generally requires analyt-        These noisy gradients can be used to maximize the
ically computing problem-specific integrals. Here we            ELBO. In our empirical study, Figure 2, we plot the
show how to Rao-Blackwellize the estimator for each             variance of this estimator along with that of Eq. 3.
component of the gradient without needing to compute
model-specific integrals.                                       3.2    Control Variates
In the simplest setting, Rao-Blackwellization replaces          As we saw above, variance reduction methods work
a function of two variables with its conditional expec-         by replacing the function whose expectation is being
tation. Consider two random variables, X and Y , and            approximated by Monte Carlo with another function
a function J.X; Y /. Our goal is to compute its expec-          that has the same expectation but smaller variance.
tation EŒJ.X; Y / with respect to the joint distribution       That is, to estimate Eq Œf  via Monte Carlo we com-
of X and Y .
                                                                pute the empirical average of fO where fO is chosen so
Define JO .X / D EŒJ.X; Y /jX , and note that EŒJO .X / D     Eq Œf  D Eq ŒfO and Varq Œf  > Varq ŒfO.
EŒJ.X; Y /. This means that JO .X / can be used in
                                                                A control variate (Ross, 2002) is a family of functions
place of J.X; Y / in a Monte Carlo approximation of
                                                                with equivalent expectation. Recently, control variates
EŒJ.X; Y /. The variance of JO .X / is
                                                                have been of interest for variational inference (Paisley
 Var.JO .X // D Var.J.X; Y //     EŒ.J.X; Y /    JO .X //2 :   et al., 2012) and for general optimization problems that
                                                                occur in machine learning (Wang et al., 2013).
Thus, JO .X / has lower variance than J.X; Y /.                 Consider a function h, which has a finite first moment,
We return to the problem of estimating the gradient             and a scalar a. Define fO to be
of L. Suppose there are n latent variables z1Wn and we
are using the mean-field variational family, where each                     fO.z/ , f .z/       a.h.z/      EŒh.z//:               (7)
random variable zi is independent and governed by its           This is a family of functions, indexed by a, and note
own variational distribution,                                   that Eq ŒfO D Eq Œf  as required. Given a particular
                                                                function h, we can choose a to minimize the variance
              q.z j / D niD1 q.zi j i /;
                         Q
                                                    (4)
                                                                of fO.
where 1Wn are the n variational parameters charac-             First we note that variance of fO can be written as
terizing the member of the variational family we seek.
Consider the i th component of the gradient. Let q.i / be             Var.fO/ D Var.f / C a2 Var.h/               2aCov.f; h/:
the distribution of variables in the model that depend
on the i th variable, i.e., the Markov blanket of zi and zi ;   This equation implies that good control variates have
and let pi .x; z.i / / be the terms in the joint that depend    high covariance with the function whose expectation is
on those variables. We can write the gradient with              being computed.
Rajesh Ranganath, Sean Gerrish, David M. Blei

Taking the derivative of Var.fO/ with respect to a and                   Algorithm 2 Black Box Variational Inference (II)
setting it equal to zero gives us the value of a that                      Input: data x, joint distribution p, mean field vari-
minimizes the variance,                                                    ational family q.
                                                                           Initialize 1Wn randomly, t D 1.
                 a D Cov.f; h/=Var.h/:                            (8)     repeat
                                                                             // Draw S samples from the variational ap-
                                                                             proximation
We now apply this method to Black Box Variational
                                                                             for s D 1 to S do
Inference. To maintain the generic nature of the algo-
                                                                                zŒs  q
rithm, we want to choose a control variate that only
                                                                             end for
depends on the variational distribution and for which
                                                                             for d D 1 to D do
we can easily compute its expectation. Meeting these
                                                                                for s D 1 to S do
criteria, we choose h to be the score function of the
                                                                                    fd Œs D rd log qi .zŒs j i /.log pi .x; zŒs/
variational approximation, r log q.z/, which always
                                                                                    log qi .zŒs j i //
has expectation zero. (See the appendix.)
                                                                                    hd Œs D rd log qi .zŒs j i /
We apply our control variates to estimate each entry of                         end for
                                                                                          O
the Rao-Blackwellized gradient. More formally, in our                           aOd D Cov.fO
                                                                                               d ;hd /
                                                                                                       , Estimate from a few samples
                                                                                           Var.h d/
above notation, the control variates for the d th entry                                            S
                                                                                rO d L , S1 sD1 fi Œs aOd hi Œs
                                                                                               P
of the gradient are
                                                                             end for
                                                                              = tth value of a Robbins Monro sequence
  fd .z/ D rd log q.zji /.log pi .x; z/             log qi .x; z//;
                                                                              D  C rO  L
  hd .z/ D rd log q.zji /;                                                 t Dt C1
                                                                           until change of  is less than 0.01.
when the d th entry belongs to the ith factor.
The optimal scalings for the control variates are given                  4     Extensions
by Eq. 8. We estimate the necessary variance and
covariances on a small number of samples. Given the
                                                                         We extend the main algorithm in two ways. First, we
estimated scalings aOd the control variate Monte Carlo
                                                                         address the difficulty of setting the step size sched-
estimate of the gradient using S samples is
                                                                         ule. Second, we address scalability by subsampling
                      S
                                                                         observations.
                  1   X
        rO d L ,           rd log qi .zs ji /
                  S   sD1                                                4.1    AdaGrad
                 .log pi .x; zs /      log qi .zs /      aOd /:    (9)
                                                                         One challenge with stochastic optimization techniques
                                                                         is setting the learning rate. Intuitively, we would like
This estimator uses both Rao-Blackwellization and                        the learning rate to be small when the variance of the
control variates. We show in the empirical study that                    gradient is large and vice-versa. Additionally, in prob-
this generic control variate further reduces the variance                lems like ours that have different scales1 , the learning
of the estimator.                                                        rate needs to be set small enough to handle the small-
                                                                         est scale. To address this issue, we use the AdaGrad
3.3   Black Box Variational Inference (II)                               algorithm (Duchi et al., 2011). Let G t be a matrix
                                                                         containing the sum across the first t iterations of the
Putting together the noisy gradient,              Rao-                   outer products of the gradient. AdaGrad defines the
Blackwellization, and control variates, we present                       learning rate as
Black Box Variational Inference (II). It takes samples
                                                                                                                   1=2
from the variational approximation to compute noisy                                             t D diag.G t /         :        (10)
gradients with Eq. 9. These noisy gradients are
then used in a stochastic optimization procedure to                      This is a per-component learning rate since diag.G t /
maximize the ELBO.                                                       has the same dimension as the gradient. Note that
                                                                         since AdaGrad only uses the diagonal of G t , those
We summarize the procedure in Algorithm 2. Algo-                         are the only elements we need to compute. AdaGrad
rithm 2 is easily used on many models. It only requires                  captures noise and varying length scales through the
samples from the variational distribution, computations                  square of the noisy gradient and reduces the number
about the variational distribution, and easy computa-
tions about the model.                                                       1 Probability   distributions have many parameterizations.
Black Box Variational Inference

of parameters to our algorithm from the standard two                 labs at each time step are sparse and that the time
parameter Robbins-Monro learning rate.                               between patient visits are highly irregular. The labs
                                                                     values are all positive, as the labs measure the amount
4.2     Stochastic Inference in Hierarchical                         of a particular quantity such as sodium concentration
        Bayesian Models                                              in the blood.
                                                                     Our modeling goal is to come up with a low dimen-
Stochastic optimization has also been used to scale
                                                                     sional summarization of patients’ labs at each of their
variational inference in hierarchical Bayesian models to
                                                                     visits. From this, we aim to to find latent factors that
massive data (Hoffman et al., 2013). The basic idea is
                                                                     summarize each visit as positive random variables. As
to subsample observations to compute noisy gradients.
                                                                     in medical data applications, we want our factors to
We can use a similar idea to scale our method.
                                                                     be latent indicators of patient health.
In a hierarchical Bayesian model, we have a hyper-
                                                                     We evaluate our model using predictive likelihood. To
parameter , global latent variables ˇ, local latent
                                                                     compute predictive likelihoods, we need an approximate
variables z1:::n , and observations x1:::n having the log
                                                                     posterior on both the global parameters and the per
joint distribution
                                                                     visit parameters. We use the approximate posterior on
    log p.x1:::n ;z1:::n ; ˇ/ D log p.ˇj/                           the global parameters and calculate the approximate
                     n
                                                                     posterior on the local parameters on 75% of the data in
                                                                     the test set. We then calculate the predictive likelihood
                     X
                 C          log p.zi jˇ/ C log p.xi jzi ; ˇ/: (11)
                     i D1
                                                                     on the other 25% of the data in the validation set using
                                                                     Monte Carlo samples from the approximate posterior.
This is the same definition as in Hoffman et al. (2013),
                                                                     We initialize randomly and choose the variational fami-
but they place further restrictions on the forms of the
                                                                     lies to be fully-factorized with gamma distributions for
distributions and the complete conditionals. Under
                                                                     positive variables and normals for real valued variables.
the mean field approximating family, applying Eq. 9
                                                                     We use both the AdaGrad and doubly stochastic ex-
to construct noisy gradients of the ELBO would re-
                                                                     tensions on top of our base algorithm. We use 1,000
quire iterating over every datapoint. Instead we can
                                                                     samples to estimate the gradient and 100 samples to
compute noisy gradients using a sampled observation
                                                                     estimate the control variate scaling. We set the batch
and samples from the variational distribution. The
                                                                     size at 25 for all our experiments.
derivation along with variance reductions can be found
in the supplement.
                                                                     5.2   Model
5     Empirical Study
                                                                     To meet our goals, we construct a Gamma-Normal
                                                                     time series (Gamma-Normal-TS) model. We model our
We use Black Box Variational Inference to quickly
                                                                     data using weights drawn from a Normal distribution
construct and evaluate several models on longitudinal
                                                                     and observations drawn from a Normal, allowing each
medical data. We demonstrate the effectiveness of our
                                                                     factor to both positively and negative affect each lab
variance reduction methods and compare the speed
                                                                     while letting factors represent lab measurements. The
and predictive likelihood of our algorithm to sampling
                                                                     generative process for this model with hyperparameters
based methods. We evaluate the various models using
                                                                     denoted with  is
predictive likelihood to demonstrate the ease at which
several models can be explored.
                                                                     Draw W  Normal.0; w /, an L  K matrix
                                                                     For each patient p: 1 to P
5.1     Longitudinal Medical Data                                     Draw op Normal.0; o /, a vector of L
                                                                      Draw xp1 = GammaE.x ; x /
Our data consist of longitudinal data from 976 patients
                                                                      Draw lp1 Normal.W xp1 C op ; l /, a vector of L.
(803 train + 173 test) from a clinic at CUMC who have
                                                                      For each visit v: 2 to vp
been diagnosed with chronic kidney disease. The data
                                                                        Draw xpv GammaE.xpv 1 ; x /
are obtained from CUMC under IRB protocol. These
                                                                        Draw lpv Normal.W xpv C op ; l /, a vector of L.
patients visited the clinic a total of 33K times. During
each visit, a subset, determined by the doctor, of 17
                                                                     We set w , o , x to be 1 and l to be .01. In our
measurements (labs) were collected.
                                                                     model, GammaE is the expectation/variance parame-
The data are observational and consist of measurements               terization of the (L-dimensional) gamma distribution.
(lab values) taken at the doctor’s discretion when the               (The mapping between this parameterization and the
patient is at a checkup. This means both that the                    more standard one can be found in the supplement.)
Rajesh Ranganath, Sean Gerrish, David M. Blei

                                                                              ditionals. For our proposal distribution we use the same
                        0
 Log Predictive Likelihood
                                                                              distributions as found in the previous section, with the
                                                                              mean equal to the value of the previous parameter.
                                                                              We compute predictive likelihoods using the posterior
                        −50

                                                              Algorithm       samples generated by the MCMC methods on held out
                                                               Gibbs          data in the test set.
                                                               Black Box VI
                                                                              We compared our method to Metropolis-Hastings inside
                        −100

                                                                              Gibbs on the Gamma-Normal-TS model. We used a
                                                                              fixed computational budget of 20 hours. Figure 1 plots
                                                                              predictive likelihood versus time on the held out set for
                        −150

                               0    5     10     15      20                   both methods. We found similar results with different
                                    Time (in hours)
                                                                              random initializations of both methods. Black Box
                                                                              Variational Inference gives better predictive likelihoods
Figure 1: Comparison between Metropolis-Hastings
                                                                              and gets them faster.3
within Gibbs and Black Box Variational Inference on
the Gamma-Normal-TS model. The x axis is time and
the y axis is the log predictive likelihood of the held                       5.4   Variance Reductions
out set. Black Box Variational Inference reaches better
predictive likelihoods faster than Gibbs sampling. Vari-                      We next studied how much variance is reduced by our
ational inference overfits slightly in the 20 hour period.                    variance reduction methods. In Figure 2, we plot the
The Gibbs sampler’s progress slows considerably after                         variance of various estimators of the gradient of the
5 hours.                                                                      variational approximation for a factor in the patient
                                                                              time-series versus iteration number. We compare the
                                                                              variance of the Monte Carlo gradient (Eq. 3) to that
Black Box Variational Inference allows us to make                             of the Rao-Blackwellized gradient (Eq. 6) and that
use of non-standard parameterizations for distributions                       of the gradient using both Rao-Blackwellization and
that are easier to reason about. This is an important                         control variates (Eq. 9). We allow the estimators with-
observation, as the standard set of families used in                          out control variates to leverage the samples used by
variational inference tend to be fairly limited. In this                      the control variate estimator to estimate the scalings.
case, the expectation parameterization of the gamma                           We found that Rao-Blackwellization reduces the vari-
distribution allows the previous visit factors to define                      ance by several orders of magnitude. Applying control
the expectation of the current visit factors. Finally, we                     variates reduces the variance further. This reduction
emphasize that coordinate ascent variational inference                        in variance drastically improves the speed at which
and Gibbs sampling are not available for this algorithm                       Black Box Variational Inference converges. In fact, in
because the required conditional distributions do not                         the time allotted, Algorithm 1—the algorithm without
have closed form.                                                             variance reductions—failed to make much progress.

5.3                            Sampling Methods                               5.5   Exploring Models

We compare Black Box Variational Inference to                                 We developed Black Box Variational Inference to make
a standard sampling based technique, Metropolis-                              it easier to quickly explore and fit many new models
Hastings (Bishop, 2006), that also only requires the                          to a data set. We demonstrate this by considering a
joint distribution.2                                                          sequence of three other models for our data: Gamma,
Metropolis-Hastings works by sampling from a proposal                         Gamma-TS, and Gamma-Normal.
distribution and accepting or rejecting the samples
based on the likelihood. Standard Metropolis-Hastings                         Gamma. We model the latent factors that summa-
can work poorly in high dimensional models. We find                           rize each visit in our models as positive random vari-
that it fails for the Gamma-Normal-TS model. Instead,                         ables; as noted above, we expect these to be indicative
we compare to a Gibbs sampling method that uses                               of patient health. The Gamma model is a positive-
Metropolis-Hastings to sample from the complete con-                          value factor model where all of the factors, weights,
                                                                              and observations have positive values. The generative
                     2 Methods
              that involve a bit more work such as Hamil-                     process for this model is
tonian Monte Carlo could work in this setting, but as our
technique only requires the joint distribution and could                         3 BlackBox Variational Inference also has better pre-
benefit from added analysis used in more complex methods,                     dictive mean-squared error on the labs than Gibbs style
we compare against a similar methods.                                         Metropolis-Hastings.
Black Box Variational Inference

                                                                     Model               Predictive Likelihood
        1e+12                                                     Gamma-Normal                   -31.2
                                                                 Gamma-Normal-TS                 -30.0
                                                                  Gamma-Gamma                     -146
 Variance

                                           Estimator             Gamma-Gamma-TS                   -132
        1e+09

                                               Basic
                                               RB           Table 1: A comparison between several models for our
                                               RB + CV
                                                            patient health dataset. We find that taking into account
                                                            the longitudinal nature of the data in the model leads
        1e+06

                                                            to a better fit. The Gamma models perform relatively
                                                            poorly. This is likely due to the fact that some labs
            0   1000       2000         3000                are negatively correlated as the Gamma models cannot
                   Iteration                                capture such relationships.

Figure 2: Variance comparison for the first component
of a random patient on the following estimators: Eq. 3,     Model Comparisons. Table 1 details our models
the Rao-Blackwellized estimator Eq. 6, and the Rao-         along with their predictive likelihoods. From this we
Blackwellized control variate estimator Eq. 9. We find      see that time helps in modeling our longitudinal health-
that Rao-Blackwellizing the naive estimator reduces         care data. We also see that the Gamma-Gamma mod-
the variance by several orders of magnitude. Adding         els perform poorly. This is likely because they cannot
control variates reduces the variance even further.         capture the negative correlations that exist between dif-
                                                            ferent medical labs. More importantly, by using Black
                                                            Box Variational Inference we were able to quickly fit
Draw W  Gamma.˛w ; ˇw /, an L  K matrix                   and explore a set of complicated non-conjugate mod-
For each patient p: 1 to P                                  els. Without a generic algorithm, approximating the
 Draw op Gamma.˛o ; ˇo /, a vector of L                    posterior of any of these models is a project in itself.
 For each visit v: 1 to vp
   Draw xpv Gamma.˛x ; ˇx /
   Draw lpv GammaE.W xpv C op ; o /, a vector of L.       6   Conclusion
We set all hyperparameters save o to be 1. As in the       We developed and studied Black Box Variational In-
previous model, o is set to 0.01.                          ference, a new algorithm for variational inference that
                                                            drastically reduces the analytic burden. Our main ap-
Gamma-TS. We can link the factors through time              proach is a stochastic optimization of the ELBO by
using the expectation parameterization of the gamma         sampling from the variational posterior to compute a
distribution. (Note this is harder with the usual natural   noisy gradient. Essential to its success are model-free
parameterization of the gamma.) This changes xpv to         variance reductions to reduce the variance of the noisy
be distributed as GammaE.xpv 1 ; v /. We draw xp1          gradient. Our method works wells on new models, while
as above. In this model, the expected values of the         requiring minimal analytic work by the practitioner.
factors at the next visit is the same as the value at
the current visit. This allows us to propagate patient      There are several natural directions for future improve-
states through time.                                        ments to this work. First, software libraries can be
                                                            created with score functions for a wide variety of vari-
Gamma-Normal. Similar to the above, we can                  ational families (each score function is simply the log
change the time-series Gamma-Normal-TS (studied             gradient of the variational distribution with respect to
in the previous section) to a simpler factor model. This    the variational parameters). Second, we believe that
is similar to the Gamma model, but a normal prior.          number of samples could be set dynamically. Finally,
                                                            carefully-selected samples from the variational distribu-
These combinations lead to a set of four models that        tion (e.g., with quasi-Monte Carlo methods) are likely
are all nonconjugate and for which standard varia-          to significantly decrease sampling variance.
tional techniques are difficult to apply. Our variational
inference method allows us to compute approximate
                                                            Acknowledgements We thank Jeremy Manning
posteriors for these models to determine which provides
                                                            and the reviewers for their helpful comments. RR
the best low dimensional latent representations.
                                                            is supported by an NDSEG fellowship. DMB is sup-
We determined convergence using change in log predic-       ported by NSF IIS-0745520, NSF IIS-1247664, NSF
tive likelihood on a validation set. We set the AdaGrad     IIS-1009542, ONR N00014-11-1-0651, Alfred P. Sloan
step to 0:5 for the Gamma models and 1 for the others.      foundation, and DARPA FA8750-14-2-0009.
Rajesh Ranganath, Sean Gerrish, David M. Blei

References                                                 H. Robbins and S. Monro. A stochastic approximation
                                                             method. The Annals of Mathematical Statistics, 22
C. Bishop. Pattern Recognition and Machine Learning.         (3):pp. 400–407, 1951.
  Springer New York., 2006.                                S. M. Ross. Simulation. Elsevier, 2002.
D. Blei and J. Lafferty. A correlated topic model of       T. Salimans and D Knowles. Fixed-form variational
  Science. Annals of Applied Statistics, 1(1):17–35,         approximation through stochastic linear regression.
  2007.                                                      ArXiv e-prints, August 2012.
L. Bottou and Y. LeCun. Large scale online learning. In    M. Wainwright and M. Jordan. Graphical models, ex-
  Advances in Neural Information Processing Systems,        ponential families, and variational inference. Founda-
  2004.                                                     tions and Trends in Machine Learning, 1(1–2):1–305,
M. Braun and J. McAuliffe. Variational inference for        2008.
 large-scale models of discrete choice. Journal of         C. Wang and D. Blei. Variational inference for noncon-
 American Statistical Association, 105(489), 2007.           jutate models. JMLR, 2013.
P. Carbonetto, M. King, and F. Hamze. A stochastic         C. Wang, X. Chen, A. Smola, and E. Xing. Variance
  approximation method for inference in probabilistic        reduction for stochastic gradient optimization. In
  graphical models. In Advances in Neural Information        Advances in Neural Information Processing Systems
  Processing Systems, 2009.                                  (NIPS), 2013.
G. Casella and C. Robert. Rao-blackwellisation of          D. Wingate and T Weber. Automated variational
  sampling schemes. Biometrika, 83(1):81–94, 1996.           inference in probabilistic programming. ArXiv e-
D. R. Cox and D.V. Hinkley. Theoretical Statistics.          prints, January 2013.
  Chapman and Hall, 1979.
J. Duchi, E. Hazan, and Y. Singer. Adaptive sub-
  gradient methods for online learning and stochastic
  optimization. J. Mach. Learn. Res., 12:2121–2159,
  July 2011. ISSN 1532-4435.
Z. Ghahramani and M. Beal. Propagation algorithms
  for variational Bayesian learning. In NIPS 13, pages
  507–513, 2001.
M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochas-
 tic variational inference. Journal of Machine Learn-
 ing Research, 14(1303–1347), 2013.
T. Jaakkola and M. Jordan. A variational approach
  to Bayesian logistic regression models and their ex-
  tensions. In International Workshop on Artificial
  Intelligence and Statistics, 1996.
M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul.
 Introduction to variational methods for graphical
 models. Machine Learning, 37:183–233, 1999.
D. Kingma and M. Welling. Auto-encoding variational
  bayes. ArXiv e-prints, December 2013.
D. Knowles and T. Minka. Non-conjugate variational
  message passing for multinomial and binary regres-
  sion. In Advances in Neural Information Processing
  Systems, 2011.
H. Kushner and G. Yin. Stochastic Approximation
  Algorithms and Applications. Springer New York,
  1997.
J. Paisley, D. Blei, and M. Jordan. Variational Bayesian
   inference with stochastic search. In International
   Conference on Machine Learning, 2012.
You can also read