A loss discounting framework for model averaging and selection in time series models

Dawid Bernaciak
Statistical Science, University College London, Gower Street, London WC1E 6BT, U.K.
and
Jim E. Griffin
Statistical Science, University College London, Gower Street, London WC1E 6BT, U.K.

arXiv:2201.12045v1 [stat.ME] 28 Jan 2022

January 31, 2022

Abstract

We introduce a Loss Discounting Framework for forecast combination which generalises and combines Bayesian model synthesis and generalized Bayes methodologies. The framework allows large-scale model averaging/selection and is also suitable for handling sudden regime changes. This novel and simple model synthesis framework is compared to both established methodologies and state-of-the-art methods for a number of macroeconomic forecasting examples. We find that the proposed method offers an attractive, computationally efficient alternative to the benchmark methodologies and often outperforms more complex techniques.

Keywords: Bayesian, Density forecasting, Forecast combination, Forecast averaging, Model synthesis
1     Introduction
In recent years, the development and popularisation of econometric modelling and machine learning techniques, together with increasingly easy access to vast computational resources and data, has led to a proliferation of forecast models. These models yield either point forecasts or full forecast density functions, and originate from both established subject experts and newcomers armed with powerful software tools. This trend has been met with renewed academic interest in model selection, forecast density combination and synthesis, e.g., Stock and Watson (2004), Hendry and Clements (2004), Hall and Mitchell (2007), Raftery et al. (2010), Geweke and Amisano (2011), Waggoner and Zha (2012), Koop and Korobilis (2012), Billio et al. (2013), Del Negro et al. (2016), Yao et al. (2018), McAlinn and West (2019), and Diebold et al. (2022), to mention just a few.
    It has been shown that a combination or pooling of competing individual point forecasts can yield forecasts with a lower RMSE than a single model (Bates and Granger, 1969; Stock and Watson, 2004). Hendry and Clements (2004) suggested that forecast combination provides insurance in cases where the individual models are misspecified, poorly estimated or non-stationary, and may provide superior results. The performance of density forecasts is often evaluated using logarithmic scoring rules. Mitchell and Hall (2005) found, in an application to UK inflation forecasts, that their density combination may not lead to a higher logarithmic score, even when evaluated in-sample. This has led to interest in a range of competing methods for forecast density combination.
    Much modern work on pooling forecast densities follows the seminal work of Diebold (1991), who showed that Bayesian model averaging (BMA) may not be optimal under a logarithmic score when the set of models to be pooled is misspecified. For example, Hall and Mitchell (2007) proposed combining forecast densities by minimising the Kullback-Leibler divergence between the pooled forecast density and the true but unknown density. Their method leads to a time-invariant linear pool with weights chosen on the simplex to maximise the average logarithmic scoring rule.
    This idea of optimising the logarithmic scoring rule of the forecast combination has been developed in several directions. Geweke and Amisano (2011) introduced a Bayesian approach to estimating the weights of a time-invariant linear pool. This has been extended to time-varying weights by Waggoner and Zha (2012) (Markov-switching weights) and Del Negro et al. (2016) (dynamic linear pools). These approaches have been combined and extended by Billio et al. (2013), who used a sequential Monte Carlo method to approximate the filtering and predictive densities. An alternative direction involves directly adjusting the model weights of BMA to give better performance. Raftery et al. (2005) used BMA applied on a sliding window of N observations. The Dynamic Model Averaging (DMA) approach of Raftery et al. (2010) is another windowing method, based on a recursive algorithm with exponential discounting of Bayes factors controlled by a discount/forgetting/decay factor1. Del Negro et al. (2016) argued that this can be seen as an approximation to the method of Waggoner and Zha (2012). Koop and Korobilis (2012) used the idea of logarithmic score maximisation to find an optimal discount factor for DMA. Beckmann et al. (2020) applied
   1
     The terms discount, forgetting and decay factor are used interchangeably in the literature, and so they are in this research.

this idea to model selection and developed the Dynamic Model Learning (DML) method with an application to foreign exchange forecasting. Through the lens of Hall and Mitchell (2007), this method can be seen as an optimisation problem with respect to the forgetting factor, in which one minimises the Kullback-Leibler divergence between the forecasts combined via DMA and the true forecast density.
    Recently, McAlinn and West (2019) and McAlinn et al. (2020) proposed a broad theoretical framework called Bayesian Predictive Synthesis (BPS). In their work they used a latent factor regression model, cast as a Bayesian seemingly unrelated regression (SUR), to update latent agent states. In both works the BPS model outperformed the BMA benchmark as well as the optimal linear pool. They also showed that the majority of proposed Bayesian techniques can be classified as special cases of the BPS framework.
    Outside the formal Bayesian framework, Diebold et al. (2022) suggested using a simple average of the forecasts from a team of N (or fewer) forecasters, where N is set a priori. The team is selected by choosing the forecasters with the highest average logarithmic scores over the previous rw periods. This can be seen as a localised and simplified version of the method of Hall and Mitchell (2007).
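The best-N selection rule just described can be sketched in a few lines. This is a minimal illustration under our own conventions; the function name `best_n_average` and the array layout are assumptions, not the authors' code:

```python
import numpy as np

def best_n_average(log_scores, densities, n=3, rw=12):
    """Select the n forecasters with the highest average log score over the
    last rw periods and return the equal-weight average of their current
    forecast density values.

    log_scores : (T, K) realised log predictive scores of K forecasters
    densities  : (K,) current-period forecast density values per forecaster
    """
    recent = log_scores[-rw:]          # last rw periods only
    avg = recent.mean(axis=0)          # average log score per forecaster
    team = np.argsort(avg)[-n:]        # indices of the top-n forecasters
    return densities[team].mean()      # simple average over the team
```

Note that, unlike the optimal linear pool, the weights here are implicit: selected forecasters get weight 1/n and everyone else gets zero.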
    This paper develops a flexible and simple approach to combining a pool of forecast densities. We assume the presence of a decision maker who examines forecasts provided by experts in order to make a prediction or an informed decision. We generalise the DMA and DML techniques by introducing an approach which we call the Loss Discounting Framework (LDF). This allows more flexibility in the dynamics of the model weights while retaining the computational efficiency of DMA and DML. In a simulation study we show that model averaging using the LDF outperforms other benchmark methods and is more robust to the choice of hyperparameters than DMA and DML. We also consider three empirical studies. In the first two, which look at professional forecasts of inflation in the Eurozone and foreign exchange forecasting based on econometric fundamentals, LDF-based models outperform competing methods in problems with between 32 and 2048 forecasts per period. A third example considers US inflation forecasts in the presence of only four models and illustrates the limitations of our methodology.
    The paper is organised as follows. Section 2 presents some background which leads into
a description of the proposed methodology in Section 3. In Section 4, the performance
of the LDF approach is examined in a simulated example and applications to Eurozone
inflation forecasting using professional forecasts, and foreign exchange and US inflation
using time series models. We discuss the limitations of our approach and set out directions
for further research in Section 5. The code to reproduce our study is freely available via
the provided link2 .

2         Background
It is common in Bayesian analysis (Yao et al., 2018, and references therein) to distinguish
three types of model pool M = {M1 , M2 , · · · , MK }: M-closed – the true data generating
    2
        https://github.com/dbernaciak/ldf

process is described by one of the models in M but is unknown to researchers; M-complete
– the model for the true data generating process exists but is not in M, which is viewed
as a set of useful approximating models; M-open – the model for the true data generating
process is not in M and the true model cannot be constructed either in principle or due
to a lack of resources, expertise, etc.3 Model selection based on BMA converges to the true model only in the M-closed case (see, e.g., Diebold, 1991) and can perform poorly outside it.
    There are several reasons to believe that econometric problems are outside the M-
closed setting. Firstly, real-world forecasting applications often involve complex systems
and the model pool will, at best, include only approximations. In fact, one might argue that econometric modellers have an inherent belief that the models they propose provide a reasonable approximation to the data generating process even if certain process features
escape the capabilities of the supplied methodologies. Secondly, in many applications, the
data generating process is not constant in time (Del Negro et al., 2016) and may involve
regime changes and considerable model uncertainty. For example, in the foreign exchange
context, Bacchetta and Van Wincoop (2004) proposed the scapegoat theory suggesting that
investors display a rational confusion about the true source of exchange rate fluctuations.
If an exchange rate movement is affected by a factor which is unobservable or unknown,
investors may attribute this movement to some other observable macroeconomic funda-
mental variable. This induces regimes where different market observables might be more
or less important.
    These concerns motivate a model averaging framework that is both suitable for M-complete (or even M-open) situations and able to incorporate time-varying model weights. DMA has been shown to perform well in econometric applications. It uses forgetting factors to avoid the computational burden of the large-scale Markov chain Monte Carlo (MCMC) or sequential Monte Carlo runs associated with methods such as Waggoner and Zha (2012). Let πt|t−1,k be the probability that model k should be
used for forecasting at time t given information through time t − 1. Raftery et al. (2010)
introduce πt|t,k (which can be interpreted as the probability that k is the best model for
the data at time t) and pool the models based on the values of πt|t−1,k and the forgetting
factor, denoted by α, using the recursions
$$\pi_{t|t-1,k} = \frac{\pi_{t-1|t-1,k}^{\alpha} + c}{\sum_{l=1}^{K}\left(\pi_{t-1|t-1,l}^{\alpha} + c\right)}, \tag{2.1}$$
$$\pi_{t|t,k} = \frac{\pi_{t|t-1,k}\, p_k(y_t \mid y^{t-1})}{\sum_{l=1}^{K} \pi_{t|t-1,l}\, p_l(y_t \mid y^{t-1})}, \tag{2.2}$$

where pk (yt |y t−1 ) is the predictive likelihood at time t, given information available at time
t − 1 for model k (i.e., the likelihood calculated using the posterior predictive distribution
yt |y t−1 for model k given the actual realisation yt ) and c is a small positive number intro-
   3
     Clarke et al. (2013) give a slightly unusual example: the works of William Shakespeare as an M-open problem. The works (the data) have a true data generating process (William Shakespeare), but one can argue that it makes no sense to model the mechanism by which the data were generated.

duced to avoid model probabilities being driven to machine zero by aberrant observations4. The log-sum-exp trick is an alternative way of handling this numerical instability and would, at least in part, eliminate the need for the constant c. We provide more insight on this matter in Appendix C, but largely leave the role of this parameter to further research.
    The recursions in (2.1) and (2.2) provide a closed-form algorithm for calculating the probability that model k is the best model for forecasting at time t, given information from the previous time step t − 1. A model receives a higher score if it performed well in the recent past. The discount factor α controls the importance attached to the recent past. For example, if α = 0.7, the forecast performance 12 periods ago receives approximately 2% as much weight as the most recent observation, whereas if α = 0.9 this weight is as high as 31%. Lower values of α therefore produce faster model switching. In particular, α → 0 selects the model that performed best in the previous step, while α = 1 recovers standard BMA.
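A single step of the recursions (2.1)-(2.2) can be sketched as follows. This is a minimal illustration; the function name `dma_update` and the array conventions are our own, not the authors' implementation:

```python
import numpy as np

def dma_update(pi_prev, pred_lik, alpha=0.95, c=1e-10):
    """One step of the DMA recursions (2.1)-(2.2).

    pi_prev  : (K,) previous-step model probabilities pi_{t-1|t-1}
    pred_lik : (K,) realised predictive likelihoods p_k(y_t | y^{t-1})
    alpha    : forgetting factor; alpha = 1 recovers standard BMA
    c        : small constant guarding against machine-zero probabilities
    """
    w = pi_prev ** alpha + c        # discounted (flattened) prior weights
    pi_pred = w / w.sum()           # pi_{t|t-1}, eq. (2.1)
    post = pi_pred * pred_lik       # Bayes update with realised likelihoods
    pi_post = post / post.sum()     # pi_{t|t}, eq. (2.2)
    return pi_pred, pi_post
```

With `alpha=1` and `c=0` this is an ordinary Bayesian update; lowering `alpha` pulls the prior weights towards uniform, which is what produces the faster model switching discussed above.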
    DMA performed well in forecasting inflation and output growth in Del Negro et al. (2016), where it was comparable to the novel dynamic prediction pooling method proposed by those authors. It was subsequently extended and successfully used in econometric applications by Koop and Korobilis (2012), Koop and Korobilis (2013) and Beckmann et al. (2020). In the first two papers the authors test DMA for a few values of the discount factor α, whereas in the last paper the authors follow the recommendation of Raftery et al. (2010) to estimate the forgetting factor online in the context of Bayesian model selection.

2.1     Dynamic Model Learning
In order to formalise the notation which will be used in our framework, we give a closer overview of Dynamic Model Learning (DML) (Beckmann et al., 2020), which provides a way to optimally choose the discount factor α. The discounted predictive likelihood for model k at time t is defined as
$$\mathrm{DPL}_{t|t-1,k}(\alpha) = \prod_{i=1}^{t-1} p_k(y_{t-i} \mid y^{t-1-i})^{\alpha^i}, \tag{2.3}$$
where $p_k(y_t \mid y^{t-1})$ denotes the realised predictive likelihood of model k, so that the product runs over the events up to, and including, time t − 1.
   The values of the parameter α are restricted to a grid which we denote Sα. At time τ, the value of α is chosen as the one which has produced the model with the highest product of predictive likelihoods in the past. That is, at each time t we select the best model given the value of α:
$$k^*_{t,\alpha} = \arg\max_{k \in \{1,\dots,K\}} \mathrm{DPL}_{t|t-1,k}(\alpha), \qquad t \le \tau, \; \alpha \in S_\alpha. \tag{2.4}$$

   4
    Yusupova et al. (2019) note that this constant is present only in the original work of Raftery et al. (2010) and in the implementation of Koop and Korobilis (2012), but was subsequently dropped in later works, software packages and citations. They also note that this constant has a non-trivial and often critical effect on the dynamics of the weight changes. We comment on this aspect in Appendix B.

We select the optimal value of α, denoted $\alpha^*_{\tau|\tau-1}$, such that
$$\alpha^*_{\tau|\tau-1} = \arg\max_{\alpha \in S_\alpha} \prod_{t=1}^{\tau-1} p_{k^*_{t,\alpha}}(y_t \mid y^{t-1}) = \arg\max_{\alpha \in S_\alpha} \left\{ \sum_{t=1}^{\tau-1} \log p_{k^*_{t,\alpha}}(y_t \mid y^{t-1}) \right\}, \tag{2.5}$$
where $p_{k^*_{t,\alpha}}(y_t \mid y^{t-1})$ is the predictive likelihood of the best model at time t for a given α ∈ Sα. Then, at time τ, we select the single best model $k^*_\tau$ such that

$$k^*_{\tau|\tau-1} = \arg\max_{k \in \{1,\dots,K\}} \mathrm{DPL}_{\tau|\tau-1,k}(\alpha^*_{\tau|\tau-1}). \tag{2.6}$$

Therefore, at time τ, we select the model with the highest discounted predictive likelihood up to, and including, time τ − 1 under the optimal value $\alpha^*_{\tau|\tau-1}$, and forecast with $p_{k^*_\tau}(y_\tau \mid y^{\tau-1})$. We stress that in this setup the parameter α is not time-varying but merely time-dependent. Although not considered by Beckmann et al. (2020), the DML framework can be trivially extended to model averaging.
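The DML steps (2.3)-(2.6) can be sketched directly from the definitions. This is our own minimal illustration (the function name `dml_select` and the array layout are assumptions, not the authors' code):

```python
import numpy as np

def dml_select(log_lik, alphas, tau):
    """Sketch of DML: for each alpha on the grid, find the model with the
    highest discounted predictive likelihood at each past time (2.3)-(2.4),
    score each alpha by the sum of the realised log likelihoods of its
    chosen models (2.5), and select the final model under the winning
    alpha (2.6).

    log_lik : (T, K) realised log predictive likelihoods, rows indexed by time
    alphas  : grid S_alpha of candidate forgetting factors
    tau     : current time index; only data up to tau - 1 is used
    """
    best_alpha, best_score = alphas[0], -np.inf
    for a in alphas:
        score = 0.0
        for t in range(1, tau):
            i = np.arange(1, t)                              # offsets 1..t-1
            ldpl = ((a ** i)[:, None] * log_lik[t - i]).sum(axis=0)
            k_star = int(np.argmax(ldpl))                    # eq. (2.4)
            score += log_lik[t, k_star]                      # eq. (2.5)
        if score > best_score:
            best_score, best_alpha = score, a
    i = np.arange(1, tau)
    ldpl = ((best_alpha ** i)[:, None] * log_lik[tau - i]).sum(axis=0)
    return best_alpha, int(np.argmax(ldpl))                  # eq. (2.6)
```

The double loop is deliberately naive for readability; in practice the discounted sums can be updated recursively in O(K) per step, which is what makes these methods computationally attractive.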

3     Methodology
3.1    Loss Discounting Framework
In this section we outline our proposed Loss Discounting Framework (LDF), which can be used for both model averaging and model selection. This framework generalises and expands the DMA and DML techniques.
   Let us define the log-discounted predictive likelihood as
$$\mathrm{LDPL}_{t|t-1,k}(\theta) = \sum_{i=1}^{t-1} \delta^i \log p_k(y_{t-i} \mid y^{t-1-i}, \theta), \tag{3.1}$$

where k indexes a model and θ is a parameter or set of parameters. In DMA there is only one level of discounting, in which the pk(yt−i|y t−1−i, θ) are the different forecaster densities and δ corresponds to α. In this case, as in (2.1), these are passed through a softmax function to obtain weights. We could therefore denote DMA as LDF1s, where the superscript 1 indicates a single level of loss discounting and the subscript s indicates the use of the softmax function. In DML, as in DMA, the pk(yt−i|y t−1−i, θ) are the different forecaster densities and δ corresponds to α, but these are passed through an argmax function to select the best model. A second layer of discounting is then used to find an optimal value of the parameter α (in a similar way to (2.5) and (2.6), with δ = 1). In this case we could call DML LDF2a,a with δ = 1, where the (a, a) subscript indicates the use of the argmax function at both levels. In general, multiple model specifications are possible, and the user is free to select the number of discounting steps and the functions acting in each layer.
    The LDPL can be generalised to
$$\mathrm{LDPL}_{t|t-1,k}(\theta) = \sum_{i=1}^{t-1} \delta^i S_k(y_{t-i}, \theta),$$

where Sk is a score for predicting yt−i with parameters θ. This generalises the use of the logarithmic score. The use of scoring rules for Bayesian updating was pioneered by Bissiri et al. (2016) for parameters (rather than for inference about models in forecast combination) and is justified in an M-open or misspecified setting. Loaiza-Maya et al. (2021) extend this approach to econometric forecasting. Both consider sums which are equally weighted (i.e., δ = 1). Miller and Dunson (2019) provide a justification for using a powered version of the likelihood for misspecified models.
   A useful representation of the LDPL is
$$\mathrm{LDPL}_{t|t-1,k}(\theta) = M_t \bar{S}_{k,t}(\theta) = -M_t \bar{L}_{k,t}(\theta),$$
where $M_t = \frac{\delta}{1-\delta}\left(1 - \delta^{t-1}\right)$. The quantity $\bar{S}_{k,t}(\theta) = \frac{1}{M_t}\sum_{i=1}^{t-1} \delta^i S_k(y_{t-i}, \theta)$ can be interpreted as an estimator of the expected score at time t (equivalently, $\bar{L}_{k,t}(\theta)$ can be interpreted as an estimator of the expected loss at time t). $M_t$ is a measure of the relevance of the information
available at time t. This generalises the idea of choosing the model weights so that the simple average of the scores is maximised, as proposed in Hall and Mitchell (2007). Geweke and Amisano (2011) justify the simple average for cases where the data generating process is ergodic, which guarantees the existence of an optimal model pool. The use of an average can also be justified under the simpler condition that the scores are ergodic. We consider weighted averages to be more appropriate and robust than simple averages in time series problems (the simple average remains available by choosing δ = 1), because only the more recent information may be useful for predicting the expected score. A similar, more informal, argument is made by Billio et al. (2013) to justify taking an exponentially weighted moving average of forecast errors.
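The LDPL (3.1) and the normalising constant $M_t$ can be checked with a short sketch. Function names and the array layout are ours, chosen for illustration only:

```python
import numpy as np

def ldpl(log_scores_k, delta):
    """Log-discounted predictive likelihood, eq. (3.1):
    LDPL_{t|t-1,k} = sum_{i=1}^{t-1} delta^i * log p_k(y_{t-i} | y^{t-1-i}).

    log_scores_k : (t-1,) realised log scores of model k, oldest first,
                   so that the last entry corresponds to offset i = 1.
    """
    i = np.arange(1, len(log_scores_k) + 1)   # offsets from the current time
    return np.sum((delta ** i) * log_scores_k[::-1])

def m_t(delta, t):
    """M_t = delta/(1-delta) * (1 - delta^(t-1)): the sum of the discount
    weights, so that LDPL / M_t estimates the expected score."""
    return delta / (1 - delta) * (1 - delta ** (t - 1))
```

As a sanity check, for a constant score s the LDPL equals s times $M_t$, and as δ → 1 the LDPL approaches the equally weighted sum of Hall and Mitchell (2007).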

3.2    Two-Level Model Averaging within the Loss Discounting Framework
The LDF allows us to describe more general set-ups for discounting in forecast combination, such as models with two discounting levels. In the first model, the first LDPL layer is put through a softmax and the second through an argmax; this is denoted LDF2s,a. A second model replaces the argmax second layer with a softmax layer, and is denoted LDF2s,s.
    As described in Beckmann et al. (2020), to choose the parameter α we specify a grid of possible values. However, in these two specifications we aim to construct a model combination, as opposed to the model selection of DML. The model averaging procedure hinges on (2.1) and (2.2), i.e., the first layer of the LDPL is passed through the softmax function.
    In our methodology we make the parameter α time-varying and write αt for its value at time t. For each discount factor αt at time t, the corresponding average prediction is
$$\bar{p}(y_t \mid y^{t-1}, \alpha_t) = \sum_{l=1}^{K} \pi_{t|t-1,l}(\alpha_t)\, p_l(y_t \mid y^{t-1}). \tag{3.2}$$

   For LDF2s,a, to choose an optimal ατ at time τ (i.e., for the event at time τ, which has not yet happened), which we denote $\alpha^*_{\tau|\tau-1}$, we use an exponentially weighted sum of the realised log predictive likelihoods of the models' average:
$$\alpha^*_{\tau|\tau-1} = \arg\max_{\alpha_\tau \in S_\alpha} \left\{ \sum_{i=1}^{\tau-1} \delta^{i-1} \log \bar{p}(y_{\tau-i} \mid y^{\tau-1-i}, \alpha_\tau) \right\}, \tag{3.3}$$

where Sα is the set of allowable values for the parameter α. Finally, the model average to be used for the event at time τ is
$$\bar{p}(y_\tau \mid y^{\tau-1}, \alpha^*_{\tau|\tau-1}) = \sum_{l=1}^{K} \pi_{\tau|\tau-1,l}(\alpha^*_{\tau|\tau-1})\, p_l(y_\tau \mid y^{\tau-1}). \tag{3.4}$$

   For LDF2s,s, we calculate weights for each $\alpha_{\tau,j}$, $j = 1, \dots, |S_\alpha|$, at time τ as
$$w_{\alpha_{\tau,j}} = \frac{\sum_{i=1}^{\tau-1} \delta^{i-1} \log \bar{p}(y_{\tau-i} \mid y^{\tau-1-i}, \alpha_{\tau,j})}{\sum_{j=1}^{|S_\alpha|} \sum_{i=1}^{\tau-1} \delta^{i-1} \log \bar{p}(y_{\tau-i} \mid y^{\tau-1-i}, \alpha_{\tau,j})}, \tag{3.5}$$
and the model average to be used for the event at time τ is
$$\bar{p}(y_\tau \mid y^{\tau-1}, w_{\alpha_{\tau,j}}) = \sum_{j=1}^{|S_\alpha|} \sum_{l=1}^{K} w_{\alpha_{\tau,j}}\, \pi_{\tau|\tau-1,l}(\alpha_{\tau,j})\, p_l(y_\tau \mid y^{\tau-1}). \tag{3.6}$$

    Note the differences between (2.5) and (3.3). Firstly, the summation in (2.5) is over time, whereas in (3.3) it is over offsets from the time τ. This change in notation makes explicit that in (3.3) the maximisation is performed over a grid of values for ατ, i.e., in the summation over the index i, ατ is fixed. Secondly, in (3.3) the logarithmic scores are discounted exponentially by a factor δ, in line with the framework described in Subsection 3.1, and we choose the model or model combination based on the weighted loss. This introduces more flexibility in the dynamics of the model averaging by allowing the previous performance of the forgetting factors to be discounted gradually. The parameter δ might be set by an expert or calibrated on a calibration sample if the data sample is sufficiently large to permit robust estimation.
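Putting the pieces together, LDF2s,a can be sketched as follows. This is a minimal illustration under our own assumptions (function name, uniform initialisation and grid handling are ours, not the authors' implementation):

```python
import numpy as np

def ldf2_sa(pred_lik, alphas, delta, c=1e-10):
    """Sketch of LDF^2_{s,a}: a softmax (DMA) first layer run for every
    alpha on the grid, and an argmax second layer over the grid,
    following eqs. (3.2)-(3.4).

    pred_lik : (T, K) realised predictive likelihoods p_k(y_t | y^{t-1})
    alphas   : grid S_alpha of candidate forgetting factors
    delta    : second-layer discount factor in eq. (3.3)
    Returns (alpha*, first-layer weights for the next period).
    """
    T, K = pred_lik.shape
    pooled = np.zeros((T, len(alphas)))    # log pbar(y_t | y^{t-1}, alpha)
    next_weights = []
    for j, a in enumerate(alphas):
        pi = np.full(K, 1.0 / K)           # uniform starting probabilities
        for t in range(T):
            w = pi ** a + c                # flattened weights, eq. (2.1)
            pi_pred = w / w.sum()
            pooled[t, j] = np.log(pi_pred @ pred_lik[t])   # eq. (3.2)
            post = pi_pred * pred_lik[t]   # Bayes update, eq. (2.2)
            pi = post / post.sum()
        w = pi ** a + c
        next_weights.append(w / w.sum())   # pi_{tau|tau-1}(alpha)
    # second layer, eq. (3.3): discounted realised log scores per alpha
    disc = delta ** np.arange(T)           # delta^(i-1), i = 1, ..., T
    scores = (disc[:, None] * pooled[::-1]).sum(axis=0)
    j_star = int(np.argmax(scores))
    return alphas[j_star], next_weights[j_star]
```

Replacing the final argmax with a normalisation of `scores` would give the LDF2s,s variant of (3.5)-(3.6); either way the cost is linear in T, K and the grid size, which is the computational advantage claimed below.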
    In terms of computation time, our proposed algorithm is very fast as it relies only on simple additions and multiplications. This is an advantage over more sophisticated forecast combination methods when the time series is long and/or we would like to incorporate a large (usually greater than 10) number of forecasters.
    Finally, these two-level LDF methods can be trivially extended to the model selection use case. In such a setting, in order to select the optimal value of α, we would calculate an exponentially weighted average, with parameter δ, of the realised predictive log-likelihoods of the models that obtained the highest realised predictive likelihood for each possible α at previous times. This corresponds to putting both levels through argmax functions. As mentioned above, LDF2a,a is a generalised version of the DML of Beckmann et al. (2020), in which the authors implicitly set δ = 1, i.e., all past performances of the forgetting factors are equally weighted. In the limit δ → 0 we would choose the discount factor α which performed best in the latest period, disregarding all other history.

    So far we have considered a two-level model, but it is of course possible to construct models with more levels. For example, a three-level model would assume that the parameter δ is also time-varying and introduce another discount factor, say γ, to select the appropriate discount factor δ at each time. However, in this work we do not consider such models.

3.3     Interpretation
3.3.1   Momentum
The combination of the parameters α and δ can be intuitively associated with a momentum strategy, in which an agent or investor places a bet on the highest-performing assets. Under this interpretation, the discounted log-likelihood plays the role of the performance metric for 1) optimal model averaging and 2) optimal selection of the parameter α. The parameter α provides smoothing across time for model averaging (as in Raftery et al. (2010)), whereas δ provides the exponential decay of the performance metric on which we base the selection of α.
    Note that this method effectively averages the best-performing models, in contrast with, for example, linear pools or Diebold et al. (2022), where the methodology aims to find the best-performing average. Linear pools might choose models either to boost the overall combined volatility or to fatten the tails of the combination when the agent forecasts are, say, all Gaussian but the true DGP is Student-t. That, however, might create problems when the models that provided a needed boost in the tails misbehave at a future time period. Our proposed technique might help in some cases, for example when pruning outlier models before any averaging takes place is desirable. This induced model sparsity might also be useful when the model pool is large.
    Since BMA usually converges to a single model, the α-weighted BMA in our algorithm provides a natural way to select the top-performing models. The lower the parameter α, the more models have a material bearing on the final average. Similarly, the smaller the parameter δ, the shorter the window over which the performance of the forgetting factor is measured and the more dynamic the changes in the optimal α.
    An underlying assumption of BMA is that the true model is included in the model set and that the weights are constant over time. Under these assumptions it can be shown that, in the limit, BMA converges to the right model (i.e., its weight goes to 1). However, if the model set does not include the true model, i.e., the model set is misspecified, then BMA can fail, as pointed out in Diebold (1991). Our setup can be interpreted as a time-smoothed BMA that selects the right model, but only locally in time: even if no model is correct across all times, it is possible that at least some models are close to the right one at certain times. Our method also relaxes the constant-weight assumption by assuming that the optimal weights are constant only locally in time.

3.3.2    Markov switching model with time-varying transition matrix
Following the argument in Del Negro et al. (2016) to interpret DMA in terms of a Markov
switching model, our extension allows a time-varying transition matrix, i.e. Qt = (q(t)kl ).
The gradual forgetting of the performance of the discount factor α allows for a change
of optimal discount factor when the underlying changes in transition matrix are required.
However, we also show that our model outperforms the standard DMA model even when the
transition matrix is non-time-varying. This point will be further illustrated in Appendix A.

4       Examples
Our methodology is best suited for data with multiple regime switches with a potentially
time-varying transition matrix. As such, it is particularly useful for modelling data such as
inflation levels, interest or foreign exchange rates. We illustrate our model on a simulated
example and three real data examples. The supplementary materials for our examples are
given in appendices A, B, C, D and E.
    We compare examples of our LDF to several popular model averaging methodologies.
The approaches used are

    • Two-level LDF - 2 hyperparameters, i.e., δ, c;

    • BMA - 0 hyperparameters;

    • DMA - 2 hyperparameters, i.e., α, c;

    • BPS - 5 hyperparameters, i.e., β discount factor for state evolution matrices, δ dis-
      count factor for residual volatility, n0 prior number of degrees of freedom, s0 prior on
      BPS observation variance, R0 prior covariance matrix of BPS coefficients;

    • best N-average - 2 hyperparameters, i.e., N number of models, rolling window length
      rw.

   We evaluate the performance of the models by calculating the out-of-sample mean log
predictive score (MLS)
                     (1/(T − s)) ∑_{t=s+1}^{T} log p(y_t | y_1, . . . , y_{t−1}),
where y1 , . . . , ys are the observations for a calibration period and T is the total number of
observations.
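As a minimal sketch (assuming `log_pred[t]` already holds the one-step-ahead log predictive density log p(y_t | y_1, …, y_{t−1}), zero-indexed), the MLS is:

```python
import numpy as np

def mean_log_score(log_pred, s):
    """Out-of-sample mean log predictive score (MLS).

    log_pred[t] holds log p(y_t | y_1, ..., y_{t-1}); the first s
    observations form the calibration period and are excluded."""
    log_pred = np.asarray(log_pred)
    T = len(log_pred)
    return log_pred[s:].sum() / (T - s)
```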

4.1     Simulation study
The data generating process (DGP) of Diebold et al. (2022) is

                     y_t = µ_t + x_t + σ_y ε_t,    ε_t ∼ N(0, 1),                      (4.1)
                     x_t = φ_x x_{t−1} + σ_x v_t,   v_t ∼ N(0, 1),                      (4.2)

Figure 1: Simulation – The MLS versus values of δ for LDF and α for DMA in the x-axis.
The error bars correspond to the standard deviation of MLS over 10 runs.

where y_t is the variable to be forecast, x_t is the long-run component of y_t, and µ_t is the
time-varying level (set to 0 in Diebold et al. (2022)). The error terms are all i.i.d. and
uncorrelated. It is assumed that the data generating process is known to each forecaster
apart from the level component µ_t. Each individual forecaster k models x_t with noise and
applies a different level η_k to y_t:

                     z_{kt} = x_t + σ_{tk} ν_{kt},     ν_{kt} ∼ N(0, 1),                (4.3)
                     ỹ_{kt} = η_k + z_{kt} + σ_y ε_t,   ε_t ∼ N(0, 1).                 (4.4)

Notice that the individual forecasters’ levels are not time varying. This emulates a situation
where forecasters have access to different sets of information and/or models which might
guide a different choice of level. It also opens a possibility that we are in the M-complete
or M-open setting where no forecaster is right at all times.
   In contrast to Diebold et al. (2022), we allow the variable yt to have multiple regime
switches. The settings are as follows: φx = 0.9, σx = 0.3, σy = 0.3, σtk = 0.1 ∀k, K = 20,
T = 2001, ηk = −2 + 0.2105(k − 1), k = 1, . . . , K and finally:
        µ_t = 0,   for t ∈ [0, 49] ∪ [200, 399] ∪ [800, 849] ∪ [970, 979]
                           ∪ [1000, 1049] ∪ [1600, 1650] ∪ [1700, 2001];
        µ_t = 1,   for t ∈ [100, 150] ∪ [900, 949] ∪ [960, 969] ∪ [990, 999]
                           ∪ [1050, 1099] ∪ [1200, 1599] ∪ [1700, 1749];
        µ_t = −1,  otherwise.

More examples are discussed in Appendix A. For LDF we set Sα = {1.0, 0.99, 0.95, 0.9, 0.8,
0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.001} and c = 10⁻²⁰, similarly to Koop and Korobilis (2012).
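A sketch of this simulation is given below; the regime schedule in `mu(t)` is deliberately abbreviated and illustrative rather than the full piecewise definition above, and the function interface is our own:

```python
import numpy as np

def simulate_dgp(T=2001, phi_x=0.9, sigma_x=0.3, sigma_y=0.3,
                 sigma_agent=0.1, K=20, seed=0):
    """Simulate the regime-switching DGP of (4.1)-(4.2) together with the
    K forecasters' predictive means implied by (4.3)-(4.4)."""
    rng = np.random.default_rng(seed)

    def mu(t):  # abbreviated regime schedule (illustrative only)
        if t < 50 or 200 <= t < 400:
            return 0.0
        if 100 <= t < 151 or 1200 <= t < 1600:
            return 1.0
        return -1.0

    eta = -2 + 0.2105 * np.arange(K)      # forecaster-specific levels eta_k
    x = 0.0
    y = np.empty(T)
    z = np.empty((T, K))
    for t in range(T):
        x = phi_x * x + sigma_x * rng.normal()       # long-run component (4.2)
        y[t] = mu(t) + x + sigma_y * rng.normal()    # observed series (4.1)
        z[t] = x + sigma_agent * rng.normal(size=K)  # noisy x seen by agents (4.3)
    means = eta + z                                  # forecaster means eta_k + z_kt (4.4)
    return y, means
```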
    The standard Bayesian model averaging (MLS = -4.34) fared poorly since it quickly
converged to the wrong model. The BPS (MLS = -0.73) of McAlinn and West (2019) with


Figure 2: Simulation – True data generating process mean and mean predicted level ac-
cording to LDF2s,s .

Figure 3: Simulation – Comparison of the α parameter path for the LDF2s,a model with
δ = 0.95 versus δ = 1. We observe more dynamic adaptation of the parameter α when we set
δ = 0.95.

normal agent predictive densities5 performed better than BMA but struggled to adjust
quickly to the regime changes, which resulted in low log-scores at the change points. The
N-average method of Diebold et al. (2022) (we chose a rolling window of 5 observations,
which performed best) achieved an MLS of -0.52 for N = 3 and N = 4, outperforming BMA
and BPS and performing similarly to the standard DMA method of Raftery et al. (2010). Our
proposed 2-level methods, however, performed best for a range of values of δ, especially
LDF2s,s. Figure 1 shows that the performance of LDF2s,s dominates for any δ < 1; LDF2s,a
performs slightly worse but still appears to be a better choice than DMA, which is a
single-layer model. Moreover, the performance of the two-level LDF models is robust across
a wide range of values of δ, which is not the case for the values of α in DMA.
   In Figure 2 we present how the synthesised forecast level of LDF2s,s adjusts to the
mean levels implied by the DGP. We can see that the model is very reactive, with the mean
predicted level following the true DGP mean closely with only a small time lag.
  5
    We used the original set of parameters proposed by the authors of the paper (adjusting β = 0.95 and
δ = 0.95 to get better results) but changed the prior variance to match the σ_y parameter. The model
was run for 5000 MCMC paths with a 3000-path burn-in period. We performed only 1 run due to the
computational cost.
    Figure 3 shows how the parameter α changes dynamically under LDF2s,a with δ = 0.95. It
is close to 1 in periods of stability and closer to 0 in times of abrupt change. In comparison,
for δ = 1 the parameter α is rather stable, oscillating around 0.6. As mentioned before,
this variation in the parameter α can be beneficial: the lower the α, the more models
are taken into consideration and the more uncertainty the final outcome can reflect.
Additionally, a lower α facilitates quickly re-weighting the models to adapt to a new
regime, whereas in times of stability it may be better to narrow the meaningful forecasts
down to a smaller group by increasing α. This illustrates how the two-level model provides
useful flexibility in the discount factor.

4.2    ECB Survey of Professional Forecasters
For the first empirical study we analyse inflation density forecasts from the European
Central Bank (ECB) Survey of Professional Forecasters.6 The data has been sampled quarterly
since 1999; participants are surveyed in January, April, July and October. To make
our results comparable with those in Diebold et al. (2022), we select the surveys starting
in 1999 Q1 and ending in 2019 Q3, which constitutes 83 forecasts. The forecasts target
the percentage change in the Harmonised Index of Consumer Prices (HICP) for the year
following the forecast (one year ahead forecasts), available from The Federal Reserve Bank
of St. Louis website7 . The data preparation routine closely follows the procedure described
in Diebold et al. (2022) and is detailed in Appendix C.
    In order to evaluate our proposed methodology we calculate the log-scores for 1-year-
ahead Eurozone inflation density forecasts. The burn-in/calibration sample spans Q1 1999
to Q4 2000. The scores are compared using 75 quarters between Q1 2001 and Q3 2019.


Figure 4: Professional Forecasters – The MLS versus values of δ for LDF and α for DMA
in the x-axis.
  6
     https://www.ecb.europa.eu/stats/ecb_surveys/survey_of_professional_forecasters/html/
index.en.html
   7
     https://fred.stlouisfed.org/series/CP0000EZ19M086NEST

We used the same methods as in the simulated example, except that we did not use the BPS
method of McAlinn and West (2019), since it would require extra work to adjust the model
to handle the multinomial predictive distributions of the agents' forecast models. We note
that the LDF2s,a model specification outperforms the other models and marks a noticeable
improvement over the specification with δ = 1 (which would correspond to a model averaging
version of DML).
    In Figure 4 we plot the mean log-scores against the parameters δ and α for the LDF
specifications and DMA respectively. The picture is similar to the one in the simulation
study: the 2-level LDF models show greater robustness to the hyperparameter choice. In
this case LDF2s,a performs better (MLS = -1.88 for δ = 0.7) than both DMA (MLS = -1.90
for α = 0.7) and LDF2s,s (MLS = -1.92 for δ = 1). Finally, we also checked that our results
do not depend on the value of the parameter c in (2.1). Based on these results, the simple
average (MLS = -2.00) is still an attractive option for forecast synthesis, performing
almost on par with the best-N forecasters methodology (N = 7, MLS = -1.96). We also see
that model combination is always better than any individual model, with the exception of
BMA (MLS = -2.16). The original DMA approach generally performs better than the best-N
average.
    In Figure 5 we compare the forecast predictive densities of the simple average method-
ology and LDF2s,a with δ = 0.7. We notice that our model assigns higher probabilities to
the forecasters who predict that year-on-year (YOY) inflation will increase by around 2%
in periods of lower inflation volatility, and displays more dynamic changes from the onset
of the Global Financial Crisis in 2007. In particular, we see that in the deflationary
period in 2008 and 2009, our methodology provided forecasts that showed greater uncertainty
about the outcomes than the simple average of forecasters' predictions.

4.3     Foreign Exchange Forecasts
Exchange rate forecasting is a popular, important and actively researched topic in eco-
nomics and finance (see Rossi, 2013, for a comprehensive review). Beckmann et al. (2020)
consider forecast combination using DML for a pool of Time-Varying Parameter Bayesian
Vector Autoregressive (TVP-BVAR) models with different subsets of economic fundamen-
tals. We will follow closely the set up of Beckmann et al. (2020). An outline of the model
is given in Appendix D.1, additional details regarding the model parameters are provided
in Appendix D.2, and the full information can be found in Beckmann et al. (2020)8 .
    We use a set of G10 currencies to evaluate the model performance, namely: Australian
dollar (AUD), the Canadian dollar (CAD), the euro (EUR), the Japanese yen (JPY), the
New Zealand dollar (NZD), the Norwegian krone (NOK), the Swedish krona (SEK), the
Swiss franc (CHF), the pound (GBP) and the US dollar (USD). All currencies are expressed
in terms of the amount of dollars per unit of a foreign currency, i.e. the domestic price of
a foreign currency.
   8
    Following Koop and Korobilis (2013) we adopt an Exponentially Weighted Moving Average (EWMA)
estimator for the measurement covariance matrix to avoid the need for posterior simulation for
multivariate stochastic volatility. This differs from Beckmann et al. (2020), who use the
approximation derived by Triantafyllopoulos (2011).
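A RiskMetrics-style EWMA covariance update of this kind can be sketched as follows; the decay value and the identity initialisation are illustrative assumptions, not the exact settings used here:

```python
import numpy as np

def ewma_covariance(residuals, lam=0.96, init=None):
    """RiskMetrics-style EWMA estimate of a time-varying covariance matrix:
    Sigma_t = lam * Sigma_{t-1} + (1 - lam) * e_t e_t'."""
    T, n = residuals.shape
    sigma = np.eye(n) if init is None else init.copy()
    out = np.empty((T, n, n))
    for t in range(T):
        e = residuals[t]
        sigma = lam * sigma + (1 - lam) * np.outer(e, e)  # rank-one update
        out[t] = sigma
    return out
```

Each update keeps the estimate symmetric positive definite, which is why it avoids posterior simulation of multivariate stochastic volatility.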

Figure 5: Forecast densities obtained by taking an average.
    • Long-short interest rate difference - the difference between the 10-year benchmark
      government yield and the 1-month deposit rate.

   • Stock growth - monthly return on the main stock index of each of the G10 curren-
     cies/countries.

   • Gold price - monthly change in the gold price.

The interest rate difference and stock growth factors were previously used by Wright (2008),
although there the author used annual rather than monthly stock growth. In this research
we also introduce the change in the gold price as a predictor. The gold price acts as an
exogenous, non-asset-specific factor, whereas UIP and INT DIFF are asset specific.
    All collected and analysed economic data is monthly (as of month ends)10 . In line with
issues highlighted by Rossi (2013), the data underwent scrutiny in terms of any adjustments,
revisions and timing issues that could introduce information not available to an investor at
the time of decision making. At this point we note that the data between 1989 and 1990
is of worse quality. The reason for this is twofold. First, some of the financial instruments
were not well-developed or liquid enough to be reliably sourced from data providers (for
example, short-term deposit rates or 10-year government bond benchmarks are not available
for all the currencies in question at the beginning of the sample). Secondly, certain
currencies and stock indices only became available in the 1990s (e.g. the EUR exchange
rate). The details
concerning the data sources and any proxies used are presented in the data Appendix D.6.
The data is standardised based on the mean and standard deviation calibrated to an initial
training period of 10 years.
    To understand the effects of pool size, we consider a small pool (which consists of the
32 models based on UIP only with time-constant parameters) and a larger pool (which
consists of 2048 models including all possible subsets of the fundamentals). An exhaustive
list of model parameter settings is outlined in Appendix D.2. The smaller pool allows a
comparison to the N-average method of Diebold et al. (2022) and BPS method of McAlinn
and West (2019) which are computationally costly when the number of models is large.
    Setting the parameter δ < 1 improves both the model averaging as well as model
selection procedures for both model pools for two-level LDF models. The best scores in
model averaging/selection were achieved for LDF2s,s specification with δ = 0.8 for the large
model pool and δ = 0.9 for the small model pool. Model averaging scored better than
model selection and the benefit of using a two-level LDF method appears to be greater for
the bigger model pool.
    Interestingly, in the larger pool, the EWMA Random Walk (RW) model with decay factor
0.97 was not the best of all the models considered (MLS = 21.77), but it performed almost
on par with the a posteriori best model (MLS = 21.78), which indicates that even with a
big pool of models it is hard to find one that outperforms the random walk.
the appropriate cross-currency curves. However, we assume that the difference between the deposit rates
in two countries provides a good proxy for the interest rate differential.
  10
     If month end data was not available, it was substituted with the beginning of the month data or
monthly average. These substitutions were unavoidable for some of the data in the 1980s. See Appendix
D.6 for more details.

                                                  16
Small pool                                                         Large pool

                                                                     

                                                                     
 0/6

                                                                    0/6
                                                                     

                      LDFs,2 s                                                   LDFs,2 s
                              LDFs,2 a                                                           LDFs,2 a
                      '0$                                                     '0$
                                                                        

                                                                     

                                                                     
 0/6

                                                                    0/6

                                                                     

                      LDFa,2 s                                                   LDFa,2 s
                              LDFa,2 a                                                           LDFa,2 a
                      LDFa1                                                      LDFa1
                                                                        

Figure 6: FX – MLS versus values of δ for LDF and α for DMA in the x-axis for the small
and large model pool. The upper plots show the cases of model averaging whereas the
lower plots show model selection.

However, both the single-level and the two-level LDF based model selection methods can
provide better performance than using the random walk. We note that the proposed LDF2a,a
methodology improved upon the DML method of Beckmann et al. (2020), which, as we recall,
is LDF2a,a with δ = 1, in model selection for both the large and the small model pools.
    Model averaging/selection using LDF2s,s /LDF2a,s outperforms all other benchmark
methods for both the large and the small model pools. This indicates that LDF can perform
well for quick and simple model averaging/selection when a large pool of models is
available, as well as demonstrate competitive performance against more sophisticated
methods on medium-sized model pools.
    For the small universe of 32 models we benchmark the LDF methods against the 4-model
average with a 20-period rolling window (the best performing N-average method), BPS and
the simple average. For multivariate normal BPS we set the prior for the joint covariance
matrix s0 to a diagonal matrix with 7.7% annual volatility for all currencies, the matrix
R0 to a diagonal matrix with entries equal to 0.001, and δ = 0.95, β = 0.99; the remaining
settings for this example are as in McAlinn et al. (2020)11. We see that for model averaging
the LDF2s,s method with δ = 0.9 performs best, followed by the other two-level LDF
specifications. The BPS method (MLS = 21.6) did not perform well here.
    In Figure 6 we show how the average log-score changes with the choice of hyperpa-
rameters for LDF based model averaging and DMA, for both the small and the large pool
of models. We show that LDF provides not only better performance at the optimal choice
of the hyperparameter but also more robustness with respect to that choice, visible as a
relatively smaller gradient around the maximum value of the mean score, whereas DMA is
more sensitive to the choice of α, which has a single, narrow peak.12

4.4     US Inflation Forecasts
The third study considers an example of McAlinn and West (2019), which involves fore-
casting the quarterly US inflation rate between 1961/Q1 and 2014/Q4. Here, the inflation
rate corresponds to the annual percentage change in a chain-weighted GDP price index.
There are four competing models: M1 includes the one-period lagged inflation rate; M2
includes the one-, two- and three-period lagged inflation, interest and unemployment rates;
M3 includes the one-, two- and three-period lagged inflation rate only; and M4 includes the
one-period lagged inflation, interest and unemployment rates. All four models provide
Student-t distributed forecasts with around 20 degrees of freedom.
    The distinguishing features of this example are the small number of models and the
existence of time periods when none of the models, or model combinations lying on the
simplex, provide an accurate mean forecast. In this example we see the limitation of the
LDF and other simplex-based methodologies, which are unable to correct for forecasting
biases if bias-corrected models are not explicitly available in the pool.
    The BPS method (MLS = 0.06) dominates all other methodologies since it allows model
combinations that do not adhere to the simplex. In fact, there were six dates in the
evaluation period where the mean of the BPS synthesised model was greater than the maximum
of the underlying models' means. The ability to go beyond the simplex proved to be one of
the key factors in its superior performance.
    The next most effective method was the N-model average of Diebold et al. (2022), which
for N = 2 and N = 3 had an MLS equal to -0.01 and performed better than the best single
model (M2 , MLS = -0.02). For N = 2, out of the 100 evaluation points, the algorithm
selected the pair (M0 , M1 ) 35 times, the pair (M2 , M3 ) 49 times and the pair (M1 , M3 )
16 times. On the other hand, both the 2-level LDF model averaging and DMA methods did not
work very well in this example, although they improved upon picking just a single model.
The poor performance of 2-level LDF and DMA can mostly be attributed to the highly dynamic
nature of these methods, which sometimes attached too much weight to a single model that
then scored poorly.
  11
    We tried different values for R0 and β and we report the best result achieved.
  12
    In Appendix D.3 we show that with a dense grid of allowable values for α and δ the points in figure 6
become smooth curves.


Figure 7: US inflation – The MLS versus values of δ for LDF and α for DMA in the x-axis.

5     Discussion
This paper contributes to the model averaging and selection literature by introducing a Loss
Discounting Framework which encompasses Dynamic Model Averaging first presented by
Raftery et al. (2010), generalises Dynamic Model Learning (Beckmann et al., 2020) and
introduces additional model averaging or selection specifications. The methodology offers
extra flexibility which can lead to better forecast scores and yield results which are less
sensitive to the choice of hyperparameters. It also empowers users to choose the model
specification in terms of number of levels of discounting layers which is suitable for the
problem at hand.
    We show that our proposed methodology performs well in both the simulation study
and the empirical examples based on the inflation forecasts surveyed by the ECB and the
exchange rate forecasts based on vector autoregressive models. We find that the LDF can be
a good choice when: the number of forecasters is fairly large (depending on hardware
availability, usually more than around 10) and sophisticated methods become burdensome;
we want only a small number of hyperparameters to calibrate; we suspect that we are in the
M-complete/open setting and different models might be optimal at different times, but there
is no consistent bias to be eliminated across all models; or we believe that scoring
forecasters on a joint predictive density or joint utility basis is reasonable.
    The LDF is by no means a panacea for model synthesis and, as we have seen in the
empirical studies, different model synthesis methods may work better in different settings.
With that in mind, we believe that the methodology we put forward can be considered an
easy-to-implement and easy-to-compute, yet well-performing, alternative formulation for
model averaging and model selection.
    There are multiple open avenues to explore. Many current forecast combination meth-
ods described in the literature assume that the pool of forecasters does not change over
time (see e.g. Diebold et al., 2022; McAlinn and West, 2019; Raftery et al., 2010). In
some situations this is a substantial limitation; for example, in the Survey of Professional
Forecasters, many forecasters only provide forecasts for some of the quarters.

Let us first consider the situation of a new agent being added to the existing pool of
forecasters. The existing forecasters already have a track record of forecasts and corre-
sponding scores. A new forecaster could be included with an initial weight, which could
fairly easily be achieved in the LDF by considering a few initial scores. It is not clear
what this weight should be, especially in more formal methodologies which relax the simplex
restriction, such as McAlinn and West (2019). Similarly, forecasters may drop out completely,
or for some quarters, before providing new forecasts. Again, in general, it is hard to know
how to weight these forecasters. The LDF provides a rationale: we should use an estimate
of that forecaster's score at the time a forecast is made. This is a time series prediction
problem and can be approached using standard methods.
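As an illustrative sketch of this idea, a missing forecaster's score could be estimated by exponentially smoothing its observed scores; the smoothing constant and the default score for a brand-new forecaster are our own assumptions:

```python
def predicted_score(score_history, delta=0.9, default=-2.0):
    """Estimate the log score an absent or returning forecaster would have
    received, via simple exponential smoothing of its observed scores.
    An empty history falls back to a prior default score."""
    if not score_history:
        return default
    est = score_history[0]
    for s in score_history[1:]:
        est = delta * est + (1 - delta) * s  # recent scores count for (1 - delta)
    return est
```

Any standard time series forecasting method could replace the smoother; the point is only that the gap is filled with a predicted score rather than an ad hoc weight.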
    As mentioned before we use joint predictive log-likelihood as a statistical measure of
out-of-sample forecasting performance. It gives an indication of how likely the realisation
of the modelled variable was conditional on the model parameters. The logarithmic scoring
rule is strictly proper but it severely penalises low probability events and hence it is sensitive
to tail or extreme cases, see Gneiting and Raftery (2007). A different proper scoring rule
could be used if deemed appropriate for a specific use case.
    Furthermore, since the scoring function is often based on the joint forecast probability
density function, our methodology is not well suited to borrowing strength from forecasters
who might be good at forecasting some variables but not others. This is partially due to
the fact that our methodology does not consider any dependency structure between the expert
models and the weighting is solely performance based. An extension that takes the agent
interdependencies into consideration would be of considerable interest.

References
Abbate, A. and M. G. Marcellino (2018). Point, interval and density forecasts of exchange
 rates with time-varying parameter models. Journal of the Royal Statistical Society, Series
 A 181, 155–179.

Bacchetta, P. and E. Van Wincoop (2004). A scapegoat model of exchange-rate fluctuations.
  American Economic Review 94 (2), 114–118.

Bates, J. M. and C. W. J. Granger (1969). The combination of forecasts. Journal of the
  Operational Research Society 20, 451–468.

Beckmann, J., G. Koop, D. Korobilis, and R. A. Schüssler (2020). Exchange rate pre-
  dictability and dynamic Bayesian learning. Journal of Applied Econometrics 35, 410–
  421.

Billio, M., R. Casarin, F. Ravazzolo, and H. K. Van Dijk (2013). Time-varying combinations
  of predictive densities using nonlinear filtering. Journal of Econometrics 177, 213–232.

Bissiri, P. G., C. C. Holmes, and S. G. Walker (2016). A general framework for updating
  belief distributions. Journal of the Royal Statistical Society, Series B 78, 1103–1130.

Clarke, J. L., B. Clarke, C.-W. Yu, et al. (2013). Prediction in M-complete Problems with
  Limited Sample Size. Bayesian Analysis 8, 647–690.

Del Negro, M., R. B. Hasegawa, and F. Schorfheide (2016). Dynamic prediction pools: An
  investigation of financial frictions and forecasting performance. Journal of Economet-
  rics 192, 391–405.

Della Corte, P. and I. Tsiakas (2012). Statistical and Economic Methods for Evaluating
  Exchange Rate Predictability, Chapter 8, pp. 221–263. John Wiley & Sons, Ltd.

Diebold, F. X. (1991). A note on Bayesian forecast combination procedures. In Economic
  Structural Change, pp. 225–232. Springer.

Diebold, F. X., M. Shin, and B. Zhang (2022). On the Aggregation of Probability As-
  sessments: Regularized Mixtures of Predictive Densities for Eurozone Inflation and Real
  Interest Rates.

Geweke, J. and G. Amisano (2011). Optimal prediction pools. Journal of Econometrics 164,
 130–141.

Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and
 estimation. Journal of the American Statistical Association 102, 359–378.

Hall, S. G. and J. Mitchell (2007). Combining density forecasts. International Journal of
  Forecasting 23, 1–13.

Hendry, D. F. and M. P. Clements (2004). Pooling of forecasts. The Econometrics Jour-
  nal 7, 1–31.

Koop, G. and D. Korobilis (2012). Forecasting inflation using dynamic model averaging.
 International Economic Review 53, 867–886.

Koop, G. and D. Korobilis (2013). Large time-varying parameter VARs. Journal of Econo-
 metrics 177, 185–198.

Kouwenberg, R., A. Markiewicz, R. Verhoeks, and R. C. J. Zwinkels (2017). Model uncer-
 tainty and exchange rate forecasting. Journal of Financial and Quantitative Analysis 52,
 341–363.

Loaiza-Maya, R., G. M. Martin, and D. T. Frazier (2021). Focused Bayesian prediction.
  Journal of Applied Econometrics 36, 517–543.

McAlinn, K., K. A. Aastveit, J. Nakajima, and M. West (2020). Multivariate Bayesian
 predictive synthesis in macroeconomic forecasting. Journal of the American Statistical
 Association 115, 1092–1110.

McAlinn, K. and M. West (2019). Dynamic Bayesian predictive synthesis in time series
 forecasting. Journal of Econometrics 210, 155–169.

Miller, J. W. and D. B. Dunson (2019). Robust Bayesian inference via coarsening. Journal
 of the American Statistical Association 114, 1113–1125.

Mitchell, J. and S. G. Hall (2005). Evaluating, comparing and combining density forecasts
 using the KLIC with an application to the Bank of England and NIESR ‘fan’ charts of
 inflation. Oxford Bulletin of Economics and Statistics 67, 995–1033.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski (2005). Using Bayesian
  model averaging to calibrate forecast ensembles. Monthly Weather Review 133, 1155–
  1174.

Raftery, A. E., M. Kárný, and P. Ettler (2010). Online prediction under model uncertainty
  via dynamic model averaging: Application to a cold rolling mill. Technometrics 52,
  52–66.

J. P. Morgan/Reuters (1996). RiskMetrics technical document. Technical report, J. P.
  Morgan/Reuters.

Rossi, B. (2013). Exchange rate predictability. Journal of Economic Literature 51, 1063–
  1119.

Stock, J. H. and M. W. Watson (2004). Combination forecasts of output growth in a
  seven-country data set. Journal of Forecasting 23, 405–430.

Triantafyllopoulos, K. (2011). Time-varying vector autoregressive models with stochastic
  volatility. Journal of Applied Statistics 38, 369–382.

Waggoner, D. F. and T. Zha (2012). Confronting model misspecification in macroeco-
 nomics. Journal of Econometrics 171, 167–184.

Wright, J. H. (2008). Bayesian model averaging and exchange rate forecasts. Journal of
 Econometrics 146, 329–341.

Yao, Y., A. Vehtari, D. Simpson, and A. Gelman (2018). Using stacking to average
  Bayesian predictive distributions (with discussion). Bayesian Analysis 13, 917–1007.

Yusupova, A., N. G. Pavlidis, and E. G. Pavlidis (2019). Adaptive Dynamic Model Aver-
  aging with an Application to House Price Forecasting.

 Model                                 MLS    σ[MLS]   Σ log(p)   σ[Σ log(p)]
 BMA                                  -4.34    0.05    -8601.11     107.41
 BPS                                  -0.73    N/A*    -1444.08     N/A*

 Best N-average, rolling window = 5
 N = 1                                -0.71    0.02    -1404.45      45.73
 N = 2                                -0.55    0.03    -1080.86      52.25
 N = 3                                -0.52    0.03    -1031.68      49.74
 N = 4                                -0.52    0.02    -1033.21      33.46
 N = 5                                -0.54    0.02    -1063.02      31.84
 N = 6                                -0.57    0.01    -1134.41      27.37

 LDF_{s2,s}
 δ = 1.00                             -0.49    0.02     -973.02      34.45
 δ = 0.95                             -0.43    0.02     -848.97      40.46
 δ = 0.90                             -0.42    0.02     -832.41      39.40
 δ = 0.80                             -0.42    0.02     -822.47      35.85
 δ = 0.70                             -0.42    0.02     -824.96      35.05
 δ = 0.60                             -0.42    0.02     -830.30      34.40

 LDF_{s2,a}
 δ = 1.00                             -0.50    0.02    -1000.88      35.00
 δ = 0.95                             -0.46    0.02     -919.52      45.95
 δ = 0.90                             -0.47    0.03     -932.42      52.32
 δ = 0.80                             -0.48    0.03     -955.35      56.16
 δ = 0.70                             -0.49    0.03     -966.49      57.96
 δ = 0.60                             -0.49    0.04     -978.41      58.88

 Dynamic Model Averaging
 α = 1.00                             -0.80    0.03    -1585.42      50.03
 α = 0.95                             -0.70    0.02    -1395.40      46.96
 α = 0.90                             -0.63    0.02    -1237.18      49.68
 α = 0.80                             -0.54    0.02    -1063.35      43.70
 α = 0.70                             -0.50    0.02     -993.32      38.20
 α = 0.60                             -0.49    0.02     -970.13      34.55

Table 1: Predictive log-likelihood for our simulated example averaged over $R = 10$ runs
and the associated standard deviation. We denote the average of the mean log score by
$\overline{MLS} = \frac{1}{TR}\sum_r\sum_t \log(p)$, the average of the cumulative log
score by $\overline{\sum \log(p)} = \frac{1}{R}\sum_r\sum_t \log(p)$, and by $\sigma(\cdot)$
the corresponding standard deviation of the quantity in question. We can see that our
proposed model outperforms all other methods.
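To make the table's summary statistics concrete, the following minimal sketch computes $\overline{MLS}$, $\overline{\sum \log(p)}$ and their standard deviations from an $R \times T$ array of log predictive scores. The array `logp` here is randomly generated placeholder data, not output from the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)
R, T = 10, 2001                       # runs and time points, as in the simulation
logp = rng.normal(-0.5, 0.1, (R, T))  # placeholder log predictive scores log p_t^(r)

mls_per_run = logp.mean(axis=1)       # mean log score for each run r
cum_per_run = logp.sum(axis=1)        # cumulative log score for each run r

mls_bar = mls_per_run.mean()          # (1/(T R)) sum_r sum_t log(p)
cum_bar = cum_per_run.mean()          # (1/R) sum_r sum_t log(p)
sigma_mls = mls_per_run.std(ddof=1)   # sigma[MLS] across the R runs
sigma_cum = cum_per_run.std(ddof=1)   # sigma[sum log(p)] across the R runs
```

Since every run has the same length $T$, $\overline{\sum \log(p)} = T \cdot \overline{MLS}$, which is a useful sanity check on the two columns of the table.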

A      Simulation study - supplementary material
In this appendix we provide additional details for the simulation example of Section 4.1,
as well as the results of additional experiments performed to check the robustness and
persistence of the results presented. The additional experiments consider:

    • time-constant Markov switching levels

    • time-varying Markov switching levels
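Both experiments draw the level $\mu_t$ from a discrete-state Markov chain. As a rough, hypothetical sketch of how such a regime path could be simulated (using the persistent three-state transition matrix from Appendix A.2; the initial state and random seed are arbitrary choices, and the rest of the data-generating process is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)

# Three states {-1, 0, 1} for the level mu_t, with a strongly
# persistent transition matrix (0.99 probability of staying put).
states = np.array([-1.0, 0.0, 1.0])
Q = np.array([[0.990, 0.005, 0.005],
              [0.005, 0.990, 0.005],
              [0.005, 0.005, 0.990]])

T = 2001
idx = np.empty(T, dtype=int)
idx[0] = 1                                   # start in state mu_0 = 0 (arbitrary)
for t in range(1, T):
    idx[t] = rng.choice(3, p=Q[idx[t - 1]])  # transition from the previous state
mu = states[idx]                             # simulated regime path
```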

A.1     Simulation study - additional results
In Table 1 we present the full set of results from the simulation study in Section 4.1. In
addition to the MSE, we also provide the sums of log scores, which show that the two-level
LDF models dominate on this performance metric as well.

A.2     Time-constant Markov switching model
In this experiment we adopt the same set-up as in Section 4.1 but we set the Markov
transition matrix for $\mu_t$ to
$$Q = \begin{pmatrix} 0.990 & 0.005 & 0.005 \\ 0.005 & 0.990 & 0.005 \\ 0.005 & 0.005 & 0.990 \end{pmatrix}$$
for the three states $\{-1, 0, 1\}$. The remaining parameters are set to the same values as
in Section 4.1, namely, $\phi_x = 0.9$, $\sigma_x = 0.3$, $\sigma_y = 0.3$,
$\sigma_{tk} = 0.1\ \forall k$, $K = 20$, $T = 2001$. We compare our 2-level method to
