Estimating DSGE Models with Stan

Page created by Charlotte Murray
 
CONTINUE READING
Estimating DSGE Models with Stan

                                Ben Goodrich∗ and Carlos Montes-Galdón†

1      Introduction

This paper is markedly incomplete, but we will show how relatively new Bayesian software called
Stan can be used to efficiently draw from the posterior distribution of challenging Dynamic Stochas-
tic General Equilibrium (DSGE) models and similarly challenging posterior distributions in eco-
nomics and other fields. Most attempts to estimate the parameters of a DSGE model have used a
random-walk Metropolis-Hastings approach, which we strongly believe is inadequate in the sense
that the Markov Chain does not mix well enough to confidently obtain reliable inferences.

There have been some recent attempts to impove the mixing of Metropolis-Hastings in DGSE
models (see [Chib and Ramamurthy(2010), Herbst(2011), Herbst and Schorfheide(2014)]). A few
of these have started to use the gradient and Hessian of the posterior distribution to make “better”
proposals in the sense that the proposals can be relatively far from the current state of the Markov
Chain but are still accepted with a relatively high probability. The most advanced implementation
of this idea to date is the Hamiltonian Monte Carlo algorithm implement in Stan. Stan has several
important features that make it appropriate even for difficult posterior distributions:

     • All the derivatives are handled by automatic (sometimes called algorithmic) differentiation,
        so the user only has to specify the kernel of a (log) posterior distribution and does not need
        to worry about symbolic derivatives at all. Automatic differentiation is much faster and more
        accurate than numerical differentiation and is feasible for any posterior distribution that is
        differentiable everywhere.

     • Stan automatically adjusts the tuning parameters for Hamiltonian Monte Carlo during the
        warmup period. These tuning parameters include the stepsize and the innovation variances
        in order to achieve a target acceptance probability. By default, the target acceptance prob-
        ability is 0.8, which is much higher than other Metropolis-Hastings algorithms, and yet the
        proposals are relatively far from the current state of the Markov Chain because they are gov-
        erned primarily by the curvature of the posterior distribution.
    ∗ Lecturer,   Columbia University
    † Researcher,   European Central Bank

                                                   1
2

    • Hamiltonian Monte Carlo output includes several additional quantities that make it easier
      to diagnose when the Markov Chain is not working well. These diagnostics are especially
      important in difficult problems like DSGE models.

We have written (and are in the process of cleaning up) MATLAB code that will write the text of
a Stan program given the solution to a DSGE model in symbolic form. We hope to eventually do
the same with Dynare++. These tools will hopefully make it easier to estimate DSGE models with
state of the art Markov Chain Monte Carlo algorithms. Stan is in the process of implementing Rie-
mannian Manifold Monte Carlo, which generalizes Hamiltonian Monte Carlo to utilize the second
derivatives of the (log) posterior distribution rather than merely the gradient. Riemannian Mani-
fold Monte Carlo may well be essential for estimating DSGE models whose posterior distributions
change to rapidly over the parameter space to be succesfully sampled by Hamiltonian Monte Carlo
schemes.

The outline of the paper is as follows. We briefly discuss the pertubation approach to DSGE models
that has been thoroughly explained by [Klein(2000)] or [Schmitt-Grohé and Uribe(2004)]. A major
stumbling block that we only recently overcame is implementing the approach of [Klein(2000)] or
[Schmitt-Grohé and Uribe(2004)] using real numbers only. We then discuss various frequentist and
Bayesian ways of estimating a linearized DSGE model and explain in some detail why we are opti-
mistic that Stan will sample from difficult posterior distributions more efficiently than Metropolis-
Hastings but with little to no additional burden on the researcher. The last section has an example
of a Real Business Cycle model that we can estimate with Stan and will be expanded to include
additional examples. There are no actual results yet, but our code has reached “proof of concept”
stage and we expect it to be in “beta” stage by early May.

2    DSGE models and their solution

In this paper, we are not trying to innovate the theory or the specification of DSGE models. Our
innovation pertains to the estimation of DSGE models. Nevertheless, in this section we briefly
explain a conventional approach to specifying and solving a DSGE model.

A DSGE model consists of a rich set of structural difference equations that describe an economy
in a coherent way, which considers economic agents as optimizing entities, who form their own
expectations about the future, and that takes into account market clearing conditions. In the past
years, computational advances have allowed economic researchers to build medium and large scale
DSGE models that have become the standard tool in policy analysis (especially in Central Banks).
The fact that solution and estimation methods of DSGE models have developed very fast in the
last years has helped to their rapid development. Moreover, compared to reduced form models,
DSGE models have proved to fit the data as well or better than Vector Autoregressions, they have
better forecasting power, and they provide a stylized theory about the economy which helps us to
understand better the interaction among different variables and the transmission mechanisms of
economic shocks.
3

However, finding the solution of the set of equations that form a DSGE model is not a trivial task.
Most of the difference equations are highly non linear, and it is not possible to find a closed form
solution for the dynamic system that they describe. Most of the literature considers linearization
methods to transform the non linear model to a linear one, and to find the solution in a relatively
easy way. Once the model is written as a linear approximation, then it is also possible to find a
procedure to estimate the deep parameters of the model that determine the equilibrium relation-
ships among the different variables using a likelihood approach. While it is possible to solve some
models using non linear methods (value function iteration), this approach is generally limited to
small scale models, and in principle a likelihood approach is not available to estimate them.

Following [DeJong and Dave(2011)], the DSGE model can be written a system of nonlinear differ-
ence equations that depend on a vector of parameters Θ,

                                               EΨ(zt+1 , zt ; Θ) = 0                                            (1)

where E stands for the expectation operator. The system has a steady state, Ψ(z̄; Θ) = 0, where z̄ is
a function of the deep parameters of the system. Defining wt = zt − z̄ (deviations of variables from
its steady state value), our aim is to find a linearised solution such that,

                                                     AEwt+1 = Bwt .                                             (2)

Following [Judd(1998)] or [Schmitt-Grohé and Uribe(2004)], we can find 2 using a Taylor series
expansion of the original non linear system around the steady state value1 ,

                                        ∂Ψ                   ∂Ψ
                              Ψ(z̄) +       (z̄)(zt − z̄) +       (z̄)(zt+1 − z̄) ≈ 0
                                        ∂zt                 ∂zt+1

and we can define
                                               ∂Ψ                     ∂Ψ
                                         A=         (z̄),      B=−        (z̄).
                                              ∂zt+1                   ∂zt
Finally, once the model has been linearised we want to find a representation of the dynamic behav-
ior of the variables as,
                                        wt+1 = F (Θ, w̄)wt + G (Θ, w̄)vt+1                                      (3)

where vt is a vector of exogenous structural disturbances in the model, and F (Θ, w̄), G (Θ, w̄) are
matrices formed by the deep parameters.

As in [Schmitt-Grohé and Uribe(2004)], we include two types of variables in a DSGE model: pre-
determined variables (xt2 ) and control variables (xt1 ). Then, we can partition 3 as,

                                               xt1     =   gx xt2
                                            xt2+1      = h x xt2 + Σvt+1                                        (4)
   1 We  show here a first order approximation.                  However,     higher order approximations can
be      performed        as      in      [Andreasen et al.(2013)Andreasen, Fernández-Villaverde, and Rubio-Ramírez],
[Fernández-Villaverde and Rubio-Ramírez(2007)] or [Schmitt-Grohé and Uribe(2004)]
4

In order to solve for gx and h x from the matrices A and B it is possible to follow the approach in
[Klein(2000)] or [Schmitt-Grohé and Uribe(2004)], who use a generalized Schur (QZ) decomposi-
tion of A and B in order to find the stable solution of the model.
As noted in [Klein(2000)], it is somewhat easier to use a complex Schur (QZ) decomposition of
A = QSZ and B = QTZ, where both Q and Z are unitary and both S and T are upper triangular,
and there is code in a variety of languages to do so that have been contributed by Chris Sims, Paul
Cline, Martín Uribe and Stephanie Schmitt-Grohé that has been incorporated and extended by the
Dynare team. However, the software we use to estimate DSGE models cannot handle derivatives
of complex functions and thus we are forced to use a real Schur (QZ) decomposition of B = QSZ
and A = QTZ, where both Q and Z are orthonormal but only T is upper triangular whereas S is
upper Hessenberg, meaning it may have some non-zero entries immediately below its diagonal.
[Klein(2000)] glosses over the implementation details of the real Schur decomposition so we will
state them here. There are many implementations of the real Schur decomposition that all trace
their roots to [Moler and Stewart(1973)]. We use the C++ implementation in Eigen, which supplies
a good portion of the supporting code for Stan. Eigen includes (although it is not exposed) a
method to rotate the decomposition with Givens rotations that preserve the orthogonality of Q and
Z such that any zeros along the diagonal of T are pushed to the lower right corner. This “zero-
chasing” scheme may be originally attributable to [Golub and Van Loan(2012)]. Zeros along the
diagonal of T imply that A is singular and thus chasing them to the lower-right corner makes it
possible to partition the decomposition as
                                    "               #                      "               #
                                        S11   S12                              T11   T12
                           B=Q                          Z   and   A=Q                          Z,
                                        0     S22                              0     T22

where T22 have have all zeros along its diagonal (but may have non-zero elements above the diag-
onal) and T11 has no zeros on its diagonal. If A−1 exists, our task would be more straightforward,
but in many DSGE models it does not. With some abuse of notation we use T −1 for the “inverse” of
                                            −1
T even when there is no unique solution to T22 T22 = I. As we shall see, T22 becomes irrelevant for
our purposes.
In other words, at least the upper-left corner of A−1 B = Z > T −1 Q> QSZ = Z > T −1 SZ is well-
defined and a diagonal matrix of generalized eigenvalues (λ) could be constructed from the di-
agonal blocks of T −1 S such that AVλ = BV and V is unitary. However, λ would, in general,
entail complex numbers that we need to avoid, so we convert to what Eigen calls a “pseudo eigen-
decomposition” where AV̇D = BV̇ and D is block diagonal with 1 × 1 or 2 × 2 blocks on its di-
agonal.2 We can then specify a permutation matrix P in Eigen, such that PP> = I, implying
V̇DV̇ > = V̇PP> DPP> V̇ > = V̈ D̈V̈ > and the diagonal blocks of D̈ = P> DP are increasing in the
absolute value of the generalized eigenvalues. The generalized eigenvalues in the bottom right
corner of D̈ may be infinite, but it is more important to classify the finite generalized eigenvalues
depending on whether they are less than or greater than unity in absolute value. We can repartition
    2 Technically, Eigen uses the phrase pseudo eigen-decomposition to refer to the standard eigenvalue problem where B = I

, but the internal Eigen code is was fairly easy to adapt to this generalized eigenvalue problem.
5

the decomposition as
                                                  "                    #
                              h               i       D̈11       0                   h               i
                          A       V̈1   V̈2                                  =   B       V̈1   V̈2       ,
                                                          0     D̈22

where all the eigenvalues of D̈11 are less than unity in absolute values and" all the# eigenvalues of
                                                                               V̈11
D̈22 are greater than unity (and possibly infinite) in absolute value. V̈1 =           can be further
                                                                               V̈12
                                        −1                 −1
partitioned such that h x = V̈11 D̈11 V̈11 and gx = V̈12 V̈11 in equation 4.

3    Ways of estimating DSGE models

DSGE models can generally be written as a state space form. Suppose that we have a set of observ-
able variables that have a theoretical counterpart in the model, yt . Note that many times we can
not observe all of the economic variables defined in the model (for example, total factor productiv-
ity which is a theoretical measure that does not have an empirical counterpart such as output or
consumption growth). Given the set of observable variables, we can write the state space form of
the model as,

                                        yt    = Swt + ε t
                                    w t +1    =           F (Θ, w̄)wt + G (Θ, w̄)vt+1

where S is a selection matrix that relates the variables in the model to the observed data, and ε t are
measurement errors. The first equation is referred as the measurement equation, and the second
one is the transition equation. The main goal of this section is to describe how to estimate Θ, the
diverse methods that have been used to date, and our proposal using Hamiltonian Monte Carlo
methods.

The first step is to construct the likelihood function for the data, p(y T |Θ). Once it is constructed,
the likelihood can be maximized using numerical methods or it can be combined with a prior
distribution for Θ, p(Θ), to form a posterior distribution p(Θ|y T ) that can be sampled from using
Markov Chain Monte Carlo (MCMC). However, building the likelihood function is not a trivial
task, since the latent states have to be integrated out. That is,

                                                      T
                          p(y T |Θ)       =       ∏ p ( y t | y t −1 | Θ )
                                                  t =1
                                                    T ˆ
                                          =       ∏           p(yt |wt , Θ) p(wt |yt−1 , Θ)dwt .             (5)
                                                  t =1

Under the assumption that both ε t and vt+1 follow a Gaussian distribution, the model is linear, and
6

the likelihood function can be evaluated recursively using the the Kalman Filter (see for example,
[Hamilton(1994)] or [Harvey(1990)])

When the DSGE model is linearised to an order higher than one, then the transition equation is no
longer linear, and therefore, the Kalman Filter can not be used to evaluate the conditional likelihood
5. In this case, the literature uses a Particle Filter to approximate the integrals (for a textbook
exposition, see for example [Fernández-Villaverde and Rubio-Ramírez(2007)]).

3.1   Frequentist approach

In the frequentist approach, the likelihood function 5 can be maximized numerically using any opti-
mization technique. However, this seemingly easy approach has several pitfalls, which are pointed
out in [Fernández-Villaverde(2010)]. The likelihood function has a substantial number of param-
eters in a medium or large-scale DSGE model, and therefore, it might be difficult to find a global
maximum among a considerable number of local maxima. The likelihood function can also be flat
or essentially flat in many areas of the parameter space, which raises additional computational dif-
ficulties and may preclude a unique estimator of the parameters. Moreover, the estimated standard
errors based on multivariate normal approximations may be an inaccurate measure of uncretainty
about the parameters. The observable macroeconomic time series are not especially long, and thus
the favorable asymptotic properties of maximum likelihood estimators may not hold. For these rea-
sons, the literature and central banks are increasingly turning to Bayesian teachniques to estimate
DSGE models.

3.2   Previous Bayesian approaches

We refer to [Herbst and Schorfheide(2014)], which explains the Bayesian approach to DSGE models
in book form. In general, the likelihood function is multiplied by a prior distribution over the
parameters, and Bayes’ Theorem implies a posterior distribution

                                      p(Θ|y T ) ∝ p(y T |Θ) p(Θ)

The prior distribution is usually chosen using economic theory or results from previously-estimated
models. The prior distribution helps to concentrate the posterior distribution in areas of the param-
eter space where there is a unique, stable solution to the DSGE model. For example, in a standard
New Keynesian model, the conduct of monetary policy is typically specified as a Taylor rule, so
that, in its simplest form, the nominal interest rate is a function of the inflation rate. In this sit-
uation, it is well known that the semielasticity of the interest rate of the inflation rate should be
greater than unity in order to bring about a unique, stable equilibrium in the dynamic model that
is consistent with the relatively stable monetary policy we observe.

In general, in economics and other fields, a (hybrid) Gibbs sampler is created by deriving the full-
conditional distributions of as many of the parameters as possible. The Gibbs sampler proceeds by
7

drawing from each full-conditional distribution in turn and falling back to some other sampling
technique to draw parameters whose full-conditional distributions are not known or is known but
is not easy to draw from. However, in the case of a DSGE model, F (Θ, w̄) and G (Θ, w̄) are highly
non-linear combinations of the structural parameters, Θ, and it is essentially impossible to derive
the full-conditional distribution of any parameter of the model.

Therefore, the majority of the Bayesian literature on DSGE models uses the Random Walk Metropo-
lis Hastings (RWMH) approach to explore the posterior distribution of Θ, which can be summa-
rized as follows. Suppose that we have performed j simulations. Then,

  1. Draw a vector of proposed parameters, Θ j using any of the methods outlined below.

      (a) Solve for the matrices of the linearised form of the model, given Θ j , and the implied
          steady state.
      (b) Evaluate the likelihood function, using the Kalman Filter if a first order Taylor expansion
          is used, or a Particle Filter otherwise.
       (c) Evaluate the prior distribution p(Θ j ), which when combined with the evaluated likeli-
          hood is proportional to the posterior distribution at Θ j

                                                      p(Θ j |y T )       q(Θ|Θ j )
  2. Accept the proposed draw Θ j with probability    p(Θ|y T )
                                                                     ×   q(Θ j |Θ)
                                                                                   ,   where q () is the density of
     the proposal distribution.

However, the RWMH algorithm only mitigates many of the problems that plague the maximum
likelihood estimator. The Markov Chain may become stuck in a flat area or local maximum of the
posterior distribution for an indefinite number of iterations and thus not fully explore the poste-
rior distribution in a remotely reasonable amount of time. In addition, the RWMH algorithm may
not mix throughout the parameter space at a fast enough rate for the Markov Chain Monte Carlo
Central Limit Theorem (MCMC-CLT) to hold, in which case it is difficult to characterize our un-
certainty over how closely the empirical means of functions of the Markov Chain adhere to the
corresponding analytical means (if it were even possible to derive them in closed-form). At the
extreme, a RWMH algorithm could easily not have converged to the posterior distribution when
the researcher stops to analyze the output.

[Chib and Ramamurthy(2010)] recognizes these problems with the standard RWMH and proposes
a variation of it. The researcher first specifies a number of subvectors that Θ can be partitioned
                                                                                                     SB
into. If there are B mutually exclusive and exhaustive blocks, then we can write Θ =                   b =1   Θb . The
difference in this “tailored” Metropolis-Hastings appoach is that at each iteration, the parameters
are randomly allocated into B blocks — without regard to the current state of the Markov Chain —
           j
and only Θb is proposed, leaving the other blocks at their current values. This process diminishes
the change that the Markov Chain will become stuck in a region of the parameter space for an
indefinite number of iterations. [Chib and Ramamurthy(2010)] reports that the autocorrelation of
the draws dissapates after 30 or 40 iterations for most of the 36 parameters in a DSGE model.
8

[Herbst(2011)] uses a blocking scheme where the parameters are allocated into B blocks based on
posterior correlations estimated during the burnin period and reports improved performance.
Other alternatives to RWMH use the information in the shape of the posterior distribution to gen-
erate better proposals for the parameters, rather than drawing them blindly from a multivariate
normal distribution. The Metropolis-Adjusted Langevin (MAL) approach utilizes the gradient and
Hessian (evaluated at the posterior mode) to shift the mean vector of the multivariate normal pro-
posal distribution. Another variant on this approach is to evaluate the Hessian at the current state
of the Markov Chain.

3.3     Our approach

[Herbst and Schorfheide(2014)] concludes its survey of Metropolis-Hastings based approaches by
referring to Hamiltonian and Riemannian Monte Carlo

        Preliminary trials with both of these algorithms suggested, that, yes, they represented
        a statistical improvement from the algorithms presented above. However, the compu-
        tational burden associated with them outweighed any gains. For example, computing
        the information matrix — a requirement for the Riemann-based method — of a DSGE is
        still too cumbersome to be used within an MH algorithm. Computational gains in these
        kinds of calculations, a subject of on-going research, might make them more feasible,
        and thus, preferred. (72)

In this subsection, we argue that Hamiltonian Monte Carlo (HMC) is feasible today using Stan and
Stan’s implementation of Riemannian Manifold Monte Carlo (RMMC) will likely become feasible
in 2015 or 2016.
Stan implements automatic differentiation to evaluate derivatives of the (log) posterior distribu-
tion.
This section borrows from the description of Stan in [Goodrich et al.(2012)Goodrich, Katznelson, and Wawro].
For a comprehensive review of HMC, see [Neal(2011)]. The theoretical justification for Stan is given
in [Hoffman and Gelman(2014)] and [Carpeter(2015)].
Physicists can model the evolution over time of a closed system given a particle’s position and
momentum using functions measuring its potential and kinetic energy. It turns out that the same
intuition and mathematics can be applied to sample from a posterior distribution via MCMC, which
is the characteristic that distinguishes HMC from other MCMC algorithms.
HMC augments the usual expresion for Bayes’ Theorem p ( Θ| data) ∝ p (Θ) f ( data| Θ) with an
auxiliary vector (ρ) of “momentum” variables — one for each of the parameters — to specify the
joint posterior distribution of Θ and ρ, which is p ( Θ, ρ| data) ∝ p (Θ, ρ) p ( data| Θ, ρ). These auxil-
iary variables are computationally important, but they are independent of Θ and the data, allowing
us to write p ( Θ, ρ| data) ∝ p (ρ) p (Θ) p ( data| Θ). The researcher must specify priors for the pa-
rameters Θ and ρ but unlike Gibbs sampling, it is not necessary to specify the priors in such a way
9

              Table 1: Hamiltonian Dynamics Concepts Mapped Between Physics and MCMC
                        Concept          Physics            Bayesian Statistics
                         Object          Particle              Parameters
                      Dimensions         2D / 3D           Any dimensionality
                        Position            q                         Θ
                      Momentum              p                         ρ
                    Potential Energy      U (q)            − ln [ p ( Θ, ρ| data)]
                     Kinetic Energy       K (p)      − ln [N (0, M)] w/ M diagonal

that the full-conditional distributions of the parameters are known in closed forms. This feature is
important in a DSGE context where it is impossible to derive full-conditional distributions of the
elements of Θ anyway.

The most convenient prior for the momentum variables is a multivariate normal with mean zero
and covariance matrix M. HMC algorithms typically specify a fixed M at the outset, which is
often M = mI for some chosen value of the tuning parameter, m. By default, Stan assumes M is
diagonal and the diagonal entries are tuned automatically during the warmup phase of the MCMC
algorithm so that the diagonal entries are approximately equal to the posterior variances of the
elements of Θ. Since ρ is independent of Θ a priori and does not enter the likelihood, the full-
conditional distribution of ρ is just its prior distribution, namely multivariate normal. Thus, ρ can
be updated with a Gibbs-step, which is why HMC is also sometimes taken as an abbreviation for
Hybrid Monte Carlo because Θ is updated in an entirely different way, conditional on ρ.

The relationship between Hamiltonian dynamics in physics and Hamiltonian dynamics in MCMC
is outlined in Table 1. Physics specifies a position vector (q) of a particle, a momentum vector (p)
for that particle, a function for the potential energy (U (q)) of the particle, and a function for the
kinetic energy (K (p)) of the particle. The Hamiltonian function, which measures total energy, is
typically additive, H (p, q) = U (q) + K (p), which leads to the following system of differential
equations, called the “Hamiltonian equations”:

                                   dqi        ∂H (q, p)   h     i
                                         =              = M −1 p ,
                                    dt           ∂t               i
                                   dpi          ∂H (q, p)    ∂U (q)
                                         =    −           =−        ,
                                   dt              ∂t          ∂qi

where M is interpreted as a “mass” matrix for the particle. These differential equations charac-
terize where the particle moves in continuous time (t), but it is typically necessary to make a
discrete-time approximation of this system in order to solve for the new position, momentum, po-
tential energy, and kinetic energy. HMC utilizes the same discrete-time approximation to construct
a MCMC algorithm, where the “potential energy” function is the negative log posterior distribu-
tion, U (Θ, ρ) = − ln [ p ( Θ, ρ| data)], and the “kinetic energy”
                                                            h       function    i is the negative log prior
                                                                 − 1 ρ > M −1 ρ
distribution for the momentum variables, K (ρ) = − ln c × e 2                    . For both of these distri-
butions, normalizing constants (c) can be included or ignored because they have no effect on the
dynamics.
10

Each iteration of the MCMC algorithm requires a solution to the Hamiltonian equations and con-
sists of three distinct steps. First, a proposed momentum vector ρ j is drawn from its multivariate
                                                                    

normal distribution. Second, a proposed location vector Θ j is drawn, given a realization of the
                                                              

proposed momentum vector that “pushes” the location of the parameters in accordance with the
gradient of the log posterior. Third, a Metropolis-Hastings test is conducted to determine whether
to accept the proposed momentum and location or to stay at the current location (Θ, ρ). There are
several variants of the HMC algorithm, which differ primarily in how they implement the second
step.

A simple implementation of the second step uses the so-called “leapfrogging” algorithm to solve
a discrete-time approximation of the Hamiltonian equations. The leapfrogging algorithm specifies
a positive integer for the number of steps ( L) in time and a positive stepsize (e). The leapfrog-
ging process presents three challenges for a HMC algorithm. First, it requires the gradient of the
negative log-posterior, which is often difficult or impossible to calculate in closed form. Numeric
gradients can be inaccurate and are computationally expensive when there are many parameters.
Stan overcomes this problem by utilizing reverse-mode automatic differentiation and C++ template
metaprogramming.3 Automatic differentiation exploits the fact that a software program only needs
to know a few rules for differentiation, and then a compiler can construct a (potentially large) ex-
pression tree that calculates the gradient via the chain rule. Thus, applied researchers do not need
to specify any derivatives to use Stan. Programmers also do not need to specify any derivatives to
contribute to Stan, although it is often possible to make substantial improvements in computation
time if some partial derivatives are known analytically and can be inserted into the expression tree
that evaluates to the gradient.

The second challenge is how to choose L in a HMC algorithm. If L is too small, Θ j will be too
close to Θ, and the Markov Chain will mix slowly through the parameter space. If L is too large,
a lot of computation time is wasted generating a proposal that is no more likely to be accepted.
Stan allows the user to specify L, but by default, Stan endogenizes L with an algorithm called No
U-Turn Sampling (NUTS), which is described in more detail in [Hoffman and Gelman(2014)]. The
intuition of NUTS is to move Θ j both “forward” and “backward” (i.e. ±e) , repeatedly doubling
the number of leapfrog steps until some proposal makes a U-turn (and, if not stopped, would start
moving roughly toward Θ again). Stan then utilizes a slice sampling mechanism to randomly select
some point along the path of proposals it has traced and decides whether to accept this proposal
with a Metropolis-Hastings acceptance test, although the proposal distribution is symmetric so
technically it is just a Metropolis test. The slice sampling mechanism adds a layer of complication
but preserves the essential properties of a Markov Chain.

The third challenge is how to choose e. As e ↓ 0, the discrete-time approximation of the solution
to the Hamiltonian equations approaches a continuous-time process and all proposed locations
will be accepted. However, if e is too small, Θ j will be too close to Θ, and the Markov Chain will
mix slowly. If e is too large, the approximation can become dislodged from the true Hamiltonian
dynamics and the resulting Θ j is unlikely to be accepted, although it is still a valid proposal. Stan
  3 Forward-mode   automatic differentiation will be used to implement Riemannian Manifold Monte Carlo.
11

allows the user to specify e at the outset, but by default, Stan utilizes the warmup period of the
Markov Chain to tune e. Essentially, Stan tries to tune e so that 80% of proposals are accepted by
default, which is optimal under certain assumptions and contrasts to an optimal acceptance rate of
23.4% in Random Walk Metropolis algorithms.

Thus, Stan makes it possible for users to simply define a log posterior distribution (with or without
normalizing constants) and then obtain samples from that posterior distribution. By default, M
and e are tuned automatically during the warmup phase of the Markov Chain and frozen there-
after. L is chosen endogenously at every iteration of the Markov Chain. Experts can specify M, e,
and L at the outset and possibly obtain a somewhat more efficient algorithm, but Stan’s defaults
seem to work reasonably well for a wide range of statistical problems. By relying on automatic
differentiation, Stan obviates the need for the user to define a function that evaluates the gradient
of the log posterior, and the computational cost of automatic differentiation is roughly proportional
to the cost of evaluating the log posterior.

There is one additional complication where the developers of Stan have expended considerable
effort to make life easier for the user. The above sketch of Stan works if the posterior has positive
mass throughout the parameter space. However, if a Markov Chain wanders into a region of the
parameter space with no mass, it would get stuck because the gradient is either zero or undefined.
Many statistical models are well-defined only in certain regions of the parameter space; e.g. vari-
ance parameters cannot be negative. There are two main approaches in the HMC literature to deal
with this problem. The first is to erect walls that define the admissible region of the parameter
space and “bounce” if the Markov Chain runs into a wall. Intuitively, this involves negating the
momentum, which is legal because doing so does not affect the multivariate normal density.

Stan has not implemented bouncing yet. Instead, Stan takes the second approach of reparameteriz-
ing Θ as a function of unbounded primitive parameters. For example, if a user defines σ as the stan-
dard deviation of the errors in a linear regression model, Stan automatically specifies σ = exp (eσ)
where eσ is an unbounded primitive parameter that is not even seen by the user. Technically, Stan
samples e
        σ but saves samples of σ for the user to analyze. This change-of-variables requires that
the posterior distribution be multiplied by the absolute value of the determinant of the Jacobian
for the transformation from the bounded parameter to the unbounded parameter, which is sim-
ply exp (eσ ) in this case. For a multivariate change-of-variables, the Jacobian determinant can be
tedious to define and / or costly to evaluate, but wherever possible Stan uses transformations that
yield triangular Jacobians, whose determinants are relatively easy to evaluate. In almost all cases
— including those discussed in this paper — users do not need to worry about derivates at all.

4    Examples

      The examples in this section are empirically incomplete. As of April 1, we have just
      gotten the real Schur (QZ) decomposition code to work (modulo a few patches that
      have not been upstreamed yet) with a personal fork of the Stan library. Our procedure
12

        does not crash (any more), but we have not yet had time to even preliminarily analyze
        the results. It appears as if it will be challenging to get DSGE models to converge within
        a few hours of computation time, but we have not even attempted to optimize our
        code yet. We will have a much more complete set of results by the May 8 conference
        presentation.

4.1     A Real Business Cycle (RBC) Model

In this subsection, we consider a small scalle RBC model, similar to the one in [Schmitt-Grohé and Uribe(2004)].
The model considers a representative agent that maximizes an infinite period expected utility func-
tion:

                                                           ∞           1− γ
                                                                       ct
                                              max E0       ∑      βt
                                                                       1−γ
                                                           t =0

where c stands for consumption, E is the expectation operator, and β, γ are two deep parameters
of the model, measuring the subjective value on the future by the agent and her risk aversion,
respectively.

The agent is constrained by the fact that,

                                         At kαt = ct + k t+1 − (1 − δ)k t

which is other way to put that total output per capita (yt = At kαt ) is equal to consumption and
investment per capita (k t+1 − (1 − δ)k t ). The variable k t represents the stock of capital, At is the
productivity level, α is a parameter of the model related to the use of capital in production and δ
is the depreciation rate of capital. In the model, we introduce stochastic exogenous technological
change as follows,

                                        At+1 − 1 = ρ( At − 1) + ηε t+1

with ε ∼ N (0, 1) and ρ an extra parameter to be estimated. The solution to the model consists on
the first order conditions of the maximization problem, plus the resource constraint, the definition
of output, consumption growth and output growth, and the exogenous process for productivity:

                                    −γ                 −γ
                                   ct        =    βEt ct+1 [αAt+1 kαt+−11 + 1 − δ]
                                  At kαt     = c t + k t +1 − (1 − δ ) k t
                                        yt   =    At kαt
13

                                ln At+1      = ρ ln At + σε t+1
                                                 ct
                                      gc     =
                                               c t −1
                                                 yt
                                      gy     =
                                               y t −1

Note that this set of equations can easily be cast in the form of equation 1. In this case, we can
define the vector of structural parameters of the model as Θ = {γ, β, α, δ, ρ, σ }

and the vector of variables as wt = { gyt , gct , yt , ct , k t , At }. The previous set of equations is easily
cast in the form of 1, and the steady state of the model is found as the solution of the system when
wt+1 = wt = w̄, given a vector of parameters. Then, we can find it as,

                                           Ā    = 1
                                                             1               ! 1−1 α
                                                             β   −1+δ
                                            k̄   =
                                                                     α
                                            c̄   = k̄α − δk̄
                                            ȳ   = k̄α
                                           ḡc   = 1
                                           ḡy   = 1

Given this steady state and a parameter vector, we can solve for the linearised version of the model
using the method described in section 2. Given a vector of observable variables, the state space
model can be written as before,

                                      yt     = Swt + ε t
                                  w t +1     =       F (Θ, w̄)wt + G (Θ, w̄)vt+1

In this case, we will use two variables as observed data, output growth and consumption per capita
                o , go ]). For the estimation we will use either artificial data or true quarterly data
growth (yt = [ gyt   ct
from the United States, spanning the period 1948Q1-2014Q4. With this two observable variables,
and using the identifiability test in [Iskrev(2010)], in principle all the parameters in Θ could be
identified using the priors that we will impose in the estimation of the model. Finally, in this case,
the selection matrix is given by,

                                                 "                                     #
                                                     1   0       0       0   0   0
                                       S=
                                                     0   1       0       0   0   0
14

4.2    More Examples Coming

References

[Andreasen et al.(2013)Andreasen, Fernández-Villaverde, and Rubio-Ramírez] Martin         M    An-
      dreasen, Jesús Fernández-Villaverde, and Juan Rubio-Ramírez.         The pruned state-space
      system for non-linear dsge models: Theory and empirical applications. Technical report,
      National Bureau of Economic Research, 2013.

[Carpeter(2015)] Bob et al. Carpeter. Stan: A probabalistic programming language. 2015.

[Chib and Ramamurthy(2010)] Siddhartha Chib and Srikanth Ramamurthy. Tailored randomized
      block mcmc methods with application to dsge models. Journal of Econometrics, 155(1):19–38,
      2010.

[DeJong and Dave(2011)] David N DeJong and Chetan Dave. Structural macroeconometrics. Prince-
      ton University Press, 2011.

[Fernández-Villaverde(2010)] Jesús Fernández-Villaverde. The econometrics of dsge models. SE-
      RIEs, 1(1-2):3–49, 2010.

[Fernández-Villaverde and Rubio-Ramírez(2007)] Jesús Fernández-Villaverde and Juan F Rubio-
      Ramírez. Estimating macroeconomic models: A likelihood approach. The Review of Economic
      Studies, 74(4):1059–1087, 2007.

[Golub and Van Loan(2012)] Gene H Golub and Charles F Van Loan. Matrix computations, vol-
      ume 3. JHU Press, 2012.

[Goodrich et al.(2012)Goodrich, Katznelson, and Wawro] Ben Goodrich, Ira Katznelson, and Greg
      Wawro. Designing quantitative historical social inquiry:an introduction to stan. 2012.

[Hamilton(1994)] James Douglas Hamilton. Time series analysis, volume 2. Princeton university
      press Princeton, 1994.

[Harvey(1990)] Andrew C Harvey. Forecasting, structural time series models and the Kalman filter.
      Cambridge university press, 1990.

[Herbst(2011)] Edward Herbst. Gradient and Hessian-based MCMC for DSGE Models. Unpublished
      manuscript, 2011.

[Herbst and Schorfheide(2014)] Edward Herbst and Frank Schorfheide.          Bayesian Estimation of
      DSGE Models. Book manuscript, 2014.

[Hoffman and Gelman(2014)] Matthew D Hoffman and Andrew Gelman. The no-u-turn sampler:
      Adaptively setting path lengths in hamiltonian monte carlo. The Journal of Machine Learning
      Research, 15(1):1593–1623, 2014.
15

[Iskrev(2010)] Nikolay Iskrev. Local identification in dsge models. Journal of Monetary Economics,
    57(2):189–202, 2010.

[Judd(1998)] Kenneth L Judd. Numerical methods in economics. MIT press, 1998.

[Klein(2000)] Paul Klein. Using the generalized schur form to solve a multivariate linear rational
    expectations model. Journal of Economic Dynamics and Control, 24(10):1405–1423, 2000.

[Moler and Stewart(1973)] C.B. Moler and G.W Stewart. An algorithm for generalized eigenvalue
    problems. SIAM Journal of Numerical Analysis, 10:241, 1973.

[Neal(2011)] Radford M Neal. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte
    Carlo, 2, 2011.

[Schmitt-Grohé and Uribe(2004)] Stephanie Schmitt-Grohé and Martın Uribe. Solving dynamic
    general equilibrium models using a second-order approximation to the policy function. Jour-
    nal of economic dynamics and control, 28(4):755–775, 2004.
You can also read