A loss discounting framework for model averaging and selection in time series models

arXiv:2201.12045v1 [stat.ME] 28 Jan 2022

Dawid Bernaciak
Statistical Science, University College London, Gower Street, London WC1E 6BT, U.K.
and
Jim E. Griffin
Statistical Science, University College London, Gower Street, London WC1E 6BT, U.K.

January 31, 2022

Abstract

We introduce a Loss Discounting Framework for forecast combination which generalises and combines Bayesian model synthesis and generalized Bayes methodologies. The framework allows large-scale model averaging/selection and is also suitable for handling sudden regime changes. This novel and simple model synthesis framework is compared to both established methodologies and state-of-the-art methods for a number of macroeconomic forecasting examples. We find that the proposed method offers an attractive, computationally efficient alternative to the benchmark methodologies and often outperforms more complex techniques.

Keywords: Bayesian, Density forecasting, Forecast combination, Forecast averaging, Model synthesis
1 Introduction

In recent years, the development and popularisation of econometric modelling and machine learning techniques, as well as increasingly easy access to vast computational resources and data, has led to a proliferation of forecast models. These models yield either point forecasts or full forecast density functions, and originate from both established subject experts and newcomers armed with powerful software tools. This trend has been met with renewed academic interest in model selection, forecast density combination and synthesis, e.g., Stock and Watson (2004), Hendry and Clements (2004), Hall and Mitchell (2007), Raftery et al. (2010), Geweke and Amisano (2011), Waggoner and Zha (2012), Koop and Korobilis (2012), Billio et al. (2013), Del Negro et al. (2016), Yao et al. (2018), McAlinn and West (2019), and Diebold et al. (2022), to mention just a few. It has been shown that a combination or pooling of competing individual point forecasts can yield forecasts with a lower RMSE than a single model (Bates and Granger, 1969; Stock and Watson, 2004). Hendry and Clements (2004) suggested that forecast combination provides insurance in the case when the individual models are misspecified, poorly estimated or non-stationary, and may provide superior results. The performance of density forecasts is often evaluated using logarithmic scoring rules. Mitchell and Hall (2005) found that their density forecast combination may not lead to a higher logarithmic score, even if evaluated in-sample, in an application to UK inflation forecasts. This has led to interest in a range of competing methods for forecast density combination. Much modern work on pooling forecast densities follows the seminal work of Diebold (1991), who showed that Bayesian model averaging (BMA) may not be optimal under a logarithmic scoring rule when the set of models to be pooled is misspecified.
For example, Hall and Mitchell (2007) proposed to combine forecast densities by minimising the Kullback-Leibler divergence between the pooled forecast density and the true but unknown density. Their method leads to a time-invariant linear pool with weights chosen on the simplex to maximise the average logarithmic scoring rule. This idea of optimising the logarithmic scoring rule of the forecast combination has been developed in several directions. Geweke and Amisano (2011) introduced a Bayesian approach to estimating the weights of a time-invariant linear pool. This has been extended to time-varying weights by Waggoner and Zha (2012) (Markov switching weights) and Del Negro et al. (2016) (dynamic linear pools). These approaches have been combined and extended by Billio et al. (2013), who used a sequential Monte Carlo method to approximate the filtering and predictive densities. An alternative direction involves directly adjusting the model weights of BMA to give better performance. Raftery et al. (2005) used BMA applied on a sliding window of N observations. The Dynamic Model Averaging (DMA) approach of Raftery et al. (2010) is another windowing method, based on a recursive algorithm with exponential discounting of Bayes factors by a discount/forgetting/decay factor.¹

¹ The terms discount/forgetting/decay factor are used interchangeably in the literature, and so they are in this research.

Del Negro et al. (2016) argued that this can be seen as an approximation to the method of Waggoner and Zha (2012). Koop and Korobilis (2012) used the idea of logarithmic score maximisation to find an optimal discount factor for DMA. Beckmann et al. (2020) applied
this idea to model selection and developed the Dynamic Model Learning (DML) method, with an application to foreign exchange forecasting. Through the lens of the research of Hall and Mitchell (2007), this method can be seen as an optimisation problem with respect to the forgetting factor, where one tries to minimise the Kullback-Leibler divergence between the forecasts combined via DMA and the true forecast density.

Recently, McAlinn and West (2019) and McAlinn et al. (2020) proposed a broad theoretical framework called Bayesian Predictive Synthesis (BPS). In their work they used a latent factor regression model, cast as a Bayesian seemingly unrelated regression (SUR), to update latent agent states. In both works the BPS model outperformed the BMA benchmark as well as the optimal linear pool. They also show that the majority of proposed Bayesian techniques can be classified as special cases of the BPS framework put forward in their research. Outside the formal Bayesian framework, Diebold et al. (2022) suggested using a simple average of the forecasts from a team of N (or fewer) forecasters, where N is set a priori. The team is selected by choosing the forecasters with the highest average logarithmic scores in the previous rw periods. This can be seen as a localised and simplified version of the method of Hall and Mitchell (2007).

This paper develops a flexible and simple approach to combining a pool of forecast densities. We assume the presence of a decision maker who examines forecasts provided by experts to make a prediction or an informed decision. We generalize the DMA and DML techniques by introducing an approach which we call the Loss Discounting Framework (LDF). This leads to more flexibility in the dynamics of the model weights but retains the computational efficiency of DMA and DML. In a simulation study we show that model averaging using LDF outperforms other benchmark methods and is more robust to the choice of hyperparameters than DMA and DML.
We also consider three empirical studies. In the first two studies, which look at professional forecasts of inflation in the Eurozone and foreign exchange forecasting based on econometric fundamentals, LDF-based models outperform competing methods in problems with between 32 and 2048 forecasts per period. A third example considers US inflation forecasts in the presence of only four models and illustrates the limitations of our methodology.

The paper is organised as follows. Section 2 presents some background which leads into a description of the proposed methodology in Section 3. In Section 4, the performance of the LDF approach is examined in a simulated example and in applications to Eurozone inflation forecasting using professional forecasts, and to foreign exchange and US inflation using time series models. We discuss the limitations of our approach and set out directions for further research in Section 5. The code to reproduce our study is freely available via the provided link.²

² https://github.com/dbernaciak/ldf

2 Background

It is common in Bayesian analysis (Yao et al., 2018, and references therein) to distinguish three types of model pool M = {M_1, M_2, ..., M_K}: M-closed - the true data generating
process is described by one of the models in M but is unknown to researchers; M-complete - the model for the true data generating process exists but is not in M, which is viewed as a set of useful approximating models; M-open - the model for the true data generating process is not in M and the true model cannot be constructed either in principle or due to a lack of resources, expertise, etc.³ Model selection based on BMA only converges to the true model in the M-closed case (see, e.g., Diebold, 1991) and can perform poorly outside the M-closed case.

There are several reasons to believe that econometric problems are outside the M-closed setting. Firstly, real-world forecasting applications often involve complex systems, and the model pool will only include approximations at best. In fact, one might argue that econometric modellers have an inherent belief that the models they propose provide a reasonable approximation to the data generating process even if certain process features escape the capabilities of the supplied methodologies. Secondly, in many applications, the data generating process is not constant in time (Del Negro et al., 2016) and may involve regime changes and considerable model uncertainty. For example, in the foreign exchange context, Bacchetta and Van Wincoop (2004) proposed the scapegoat theory, suggesting that investors display a rational confusion about the true source of exchange rate fluctuations. If an exchange rate movement is affected by a factor which is unobservable or unknown, investors may attribute this movement to some other observable macroeconomic fundamental variable. This induces regimes where different market observables might be more or less important.

These concerns motivate a model averaging framework that is both suitable for M-complete (or even M-open) situations and provides a way to incorporate time-varying model weights. DMA has been shown to perform well in econometric applications.
It utilises forgetting factors to alleviate the computational burden of the large-scale Markov chain Monte Carlo (MCMC) or sequential Monte Carlo calculations associated with methods such as Waggoner and Zha (2012). Let π_{t|t-1,k} be the probability that model k should be used for forecasting at time t given information through time t-1. Raftery et al. (2010) introduce π_{t|t,k} (which can be interpreted as the probability that k is the best model for the data at time t) and pool the models based on the values of π_{t|t-1,k} and the forgetting factor, denoted by α, using the recursions

    π_{t|t-1,k} = (π_{t-1|t-1,k}^α + c) / Σ_{l=1}^K (π_{t-1|t-1,l}^α + c),    (2.1)

    π_{t|t,k} = π_{t|t-1,k} p_k(y_t | y^{t-1}) / Σ_{l=1}^K π_{t|t-1,l} p_l(y_t | y^{t-1}),    (2.2)

where p_k(y_t | y^{t-1}) is the predictive likelihood at time t, given information available at time t-1, for model k (i.e., the likelihood calculated using the posterior predictive distribution y_t | y^{t-1} for model k given the actual realisation y_t), and c is a small positive number introduced to avoid model probabilities being brought to machine zero by aberrant observations.⁴ The log-sum-exp trick is an alternative way of handling this numerical instability which would, at least in part, eliminate the need for the constant c. We provide more insight on this matter in Appendix C. However, we largely leave the role of this parameter to further research.

³ Clarke et al. (2013) give a slightly unusual example of the works of William Shakespeare as an M-open problem. The works (data) have a true data generating process (William Shakespeare), but one can argue that it makes no sense to model the mechanism by which the data was generated.

The recursions in (2.1) and (2.2) amount to a closed-form algorithm to calculate the probability that model k is the best model given information from the previous time step t-1, for forecasting at time t. A model receives a higher score if it performed well in the recent past. The discount factor α controls the importance that one attaches to the recent past. For example, if α = 0.7, the forecast performance 12 periods ago receives approximately 2% as much weight as the most recent observation. However, if α = 0.9, this weight is as high as 31%. Therefore, lower values of α are responsible for faster model switching. In particular, α → 0 would cause the selection of the model that performed best in the previous step, and α = 1 recovers standard BMA.

DMA performed well in forecasting inflation and output growth, as demonstrated in Del Negro et al. (2016). There, DMA performed comparably to the novel dynamic prediction pooling method proposed by the authors. It was subsequently expanded and successfully used in econometric applications by Koop and Korobilis (2012), Koop and Korobilis (2013) and Beckmann et al. (2020). In the first two papers the authors test DMA for a few values of the discount factor α, whereas in the latest paper the authors follow the recommendation of Raftery et al. (2010) to estimate the forgetting factor online in the context of Bayesian model selection.
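To make the recursions concrete, the following is a minimal sketch of one DMA step in Python; the function name and the example numbers are illustrative, not from the paper.

```python
import numpy as np

def dma_update(pi_prev, pred_lik, alpha=0.9, c=1e-20):
    """One step of the DMA recursions (2.1)-(2.2).

    pi_prev  : pi_{t-1|t-1,k}, previous filtered model probabilities
    pred_lik : p_k(y_t | y^{t-1}), realised predictive likelihoods
    alpha    : forgetting factor (alpha = 1 recovers standard BMA)
    c        : small constant guarding against machine zero
    """
    num = pi_prev ** alpha + c
    pi_pred = num / num.sum()          # prediction step (2.1)
    post = pi_pred * pred_lik
    pi_filt = post / post.sum()        # update step (2.2)
    return pi_pred, pi_filt

# A model that scored well in the most recent period gains weight quickly.
pi_pred, pi_filt = dma_update(np.array([0.5, 0.3, 0.2]),
                              np.array([0.1, 0.9, 0.4]), alpha=0.7)
```

With α = 0.7 the exponentiation in (2.1) flattens the previous probabilities before the likelihood update, which is what allows the weights to switch between models faster than standard BMA.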
2.1 Dynamic Model Learning

In order to formalise the notation which will be used in our framework, we give a closer overview of Dynamic Model Learning (DML) (Beckmann et al., 2020), which provides a way to optimally choose the discount factor α. The discounted predictive likelihood for a model k at time t is defined as

    DPL_{t|t-1,k}(α) = ∏_{i=1}^{t-1} p_k(y_{t-i} | y^{t-1-i})^{α^i},    (2.3)

where p_k(y_t | y^{t-1}) is the realised predictive likelihood of model k, so that DPL_{t|t-1,k}(α) discounts, with factor α, the events up to, and including, time t-1. The values of the parameter α are restricted to a grid of values which we denote S_α. At time τ the value of α is chosen as the one which has produced the model with the highest product of predictive likelihoods in the past. So, at each time t we select the best model given the value α:

    k*_{t,α} = argmax_{k ∈ 1,...,K} DPL_{t|t-1,k}(α),  t ≤ τ, α ∈ S_α.    (2.4)

⁴ Yusupova et al. (2019) note that this constant is only present in the original work by Raftery et al. (2010) and then in the implementation by Koop and Korobilis (2012), but is subsequently dropped in further works, software packages and citations. They also note that this constant has a non-trivial and often critical effect on the dynamics of the weight changes. We comment on this aspect in Appendix B.
We select the optimal value of α, denoted α*_{τ|τ-1}, such that

    α*_{τ|τ-1} = argmax_{α ∈ S_α} ∏_{t=1}^{τ-1} p_{k*_{t,α}}(y_t | y^{t-1}) = argmax_{α ∈ S_α} Σ_{t=1}^{τ-1} log p_{k*_{t,α}}(y_t | y^{t-1}),    (2.5)

where p_{k*_{t,α}}(y_t | y^{t-1}) is the predictive likelihood of the best model at time t for a given α ∈ S_α. Then, at time τ, we select the single best model k*_τ such that

    k*_{τ|τ-1} = argmax_{k ∈ 1,...,K} DPL_{τ|τ-1,k}(α*_τ).    (2.6)

Therefore, at time τ, we select the model that has the highest discounted predictive likelihood up to, and including, time τ-1 using the optimal value α*_τ, i.e. p_{k*_τ}(y_τ | y^{τ-1}). We stress that in this set-up the parameter α is not time-varying but merely time-dependent. Although not considered by Beckmann et al. (2020), the DML framework can be trivially extended to model averaging.

3 Methodology

3.1 Loss Discounting Framework

In this section we outline our proposed loss discounting framework (LDF), which can be used for both model averaging and model selection. This framework generalises and expands the DMA and DML techniques. Let us define the log-discounted predictive likelihood as

    LDPL_{t|t-1,k}(θ) = Σ_{i=1}^{t-1} δ^i log p_k(y_{t-i} | y^{t-1-i}, θ),    (3.1)

where k indexes a model and θ is a parameter or a set of parameters. In DMA we only have one level of discounting, where the p_k(y_{t-i} | y^{t-1-i}, θ) are the different forecaster densities (δ corresponds to α). In this case, similarly to (2.1), these are passed through a softmax function to get weights. Therefore, we could denote DMA as LDF¹_s, where the superscript 1 indicates a single level of loss discounting and the subscript s indicates the use of the softmax function. In DML, as in DMA, the p_k(y_{t-i} | y^{t-1-i}, θ)'s are the different forecaster densities and δ corresponds to α, but these are passed through an argmax function to select the best model. Then a second layer of discounting is used to find an optimal value of the parameter α (in a similar way to (2.5) and (2.6) with δ = 1).
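The DML selection in (2.3)-(2.6) can be sketched on a precomputed history of realised log predictive likelihoods; the function name and array layout below are illustrative assumptions.

```python
import numpy as np

def dml_select(logp, alpha_grid):
    """Sketch of the DML selection (2.3)-(2.6) on a (T, K) array with
    logp[t, k] = log p_k(y_t | y^{t-1})."""
    T, K = logp.shape
    best_alpha, best_score = alpha_grid[0], -np.inf
    for alpha in alpha_grid:
        score = 0.0
        for t in range(1, T):
            # log DPL_{t|t-1,k}(alpha) = sum_i alpha^i log p_k(y_{t-i}|y^{t-1-i})
            w = alpha ** np.arange(1, t + 1)
            ldpl = w[::-1] @ logp[:t]            # aligns alpha^i with logp[t-i]
            k_best = int(ldpl.argmax())          # (2.4): best model given alpha
            score += logp[t, k_best]             # summand of (2.5)
        if score > best_score:                   # (2.5): optimal alpha
            best_alpha, best_score = alpha, score
    w = best_alpha ** np.arange(1, T + 1)
    k_final = int((w[::-1] @ logp).argmax())     # (2.6): single best model
    return best_alpha, k_final

# Toy history: model 0 is consistently more accurate than model 1.
logp_hist = np.log(np.array([[0.9, 0.2], [0.8, 0.3], [0.85, 0.25], [0.9, 0.2]]))
alpha_star, k_star = dml_select(logp_hist, [0.5, 0.9, 1.0])
```

Note the two nested layers: an argmax over models for each α, then an argmax over the grid S_α, which is exactly the LDF²_{a,a} structure discussed next.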
In this case we could call DML LDF²_{a,a} with δ = 1. The (a, a) subscript indicates the use of the argmax function at both levels. In general, multiple model specifications are possible, where the user is free to select the number of discounting steps and the functions acting in each layer. The LDPL can be generalised to

    LDPL_{t|t-1,k}(θ) = Σ_{i=1}^{t-1} δ^i S_k(y_{t-i}, θ),
where S_k is a score for predicting y_{t-i} with parameters θ. This generalises the use of logarithmic scoring. The use of scoring rules for Bayesian updating of parameters (rather than inference about models in forecast combination) was pioneered by Bissiri et al. (2016) and is justified in an M-open or misspecified setting. Loaiza-Maya et al. (2021) extend this approach to econometric forecasting. They both consider sums which are equally weighted (i.e. δ = 1). Miller and Dunson (2019) provide a justification for using a powered version of the likelihood of misspecified models.

A useful representation of the LDPL is

    LDPL_{t|t-1,k}(θ) = M_t S_{k,t}(θ) = -M_t L_{k,t}(θ),

where M_t = δ/(1-δ) (1 - δ^{t-1}). The variable S_{k,t}(θ) = (1/M_t) Σ_{i=1}^{t-1} δ^i S_k(y_{t-i}, θ) can be interpreted as an estimator of the expected score at time t (equivalently, L_{k,t}(θ) can be interpreted as an estimator of the expected loss at time t). M_t is a measure of the relevance of information available at time t. This generalises the idea of maximising the model weights so that the simple average of the scores is maximised, as proposed in Hall and Mitchell (2007). Geweke and Amisano (2011) justify the simple average for cases when the data generating process is ergodic, which guarantees the existence of an optimal model pool. The use of an average can also be justified under the simpler condition that the scores are ergodic. We consider weighted averages more appropriate and robust than simple averages in time series problems (the simple average is still available by choosing δ = 1), for the reason that only more recent information may be useful to predict the expected score. A similar, more informal, argument is made by Billio et al. (2013) to justify taking an exponentially weighted moving average of forecast errors.
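The factorisation above follows because M_t is the geometric sum Σ_{i=1}^{t-1} δ^i, so S_{k,t} is a weighted average of past scores. A small numerical check (with arbitrary simulated scores) illustrates this:

```python
import numpy as np

# Check the factorisation LDPL_{t|t-1,k} = M_t * S_{k,t}:
# M_t = delta/(1-delta) * (1 - delta^(t-1)) equals the sum of the discount
# weights delta^i for i = 1, ..., t-1, so S_{k,t} is a weighted average of
# the past scores. The simulated scores below are arbitrary.
rng = np.random.default_rng(0)
delta, t = 0.9, 50
scores = rng.normal(size=t - 1)          # S_k(y_{t-i}, theta), i = 1, ..., t-1
w = delta ** np.arange(1, t)             # discount weights delta^i
ldpl = w @ scores                        # the generalised LDPL defined above
M_t = delta / (1 - delta) * (1 - delta ** (t - 1))
S_kt = ldpl / M_t                        # estimator of the expected score
assert np.isclose(M_t, w.sum())          # M_t is the total discount weight
assert np.isclose(ldpl, M_t * S_kt)
```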
3.2 Two-Level Model Averaging within the Loss Discounting Framework

The LDF allows us to describe more general set-ups for discounting in forecast combination, such as models with two discounting levels. In the first model, the first LDPL layer is put through softmax and the second through argmax. This is denoted LDF²_{s,a}. A second model replaces the argmax second layer with a softmax layer, and is denoted LDF²_{s,s}. As described in Beckmann et al. (2020), to choose the parameter α we specify a grid of possible values. However, in these two specifications we aim to construct a model combination, as opposed to the model selection in DML. The model averaging procedure hinges on (2.1) and (2.2), i.e. the first layer of LDPL is passed through the softmax function. In our methodology we make the parameter α time-varying and introduce the notation α_t to denote the value of this parameter at time t. For each discount factor α_t at time t one obtains the corresponding average prediction:

    p̄(y_t | y^{t-1}, α_t) = Σ_{l=1}^K π_{t|t-1,l}(α_t) p_l(y_t | y^{t-1}).    (3.2)

For LDF²_{s,a}, to choose an optimal α_τ at time τ (i.e. for the event at time τ, which has not yet happened), which we denote α*_{τ|τ-1}, we use an exponentially weighted sum of
the realised log predictive likelihoods of the models' average:

    α*_{τ|τ-1} = argmax_{α_τ ∈ S_α} Σ_{i=1}^{τ-1} δ^{i-1} log p̄(y_{τ-i} | y^{τ-1-i}, α_τ),    (3.3)

where S_α is the set of allowable values for the parameter α. Finally, the model average to be used for the event at time τ can be calculated as

    p̄(y_τ | y^{τ-1}, α*_{τ|τ-1}) = Σ_{l=1}^K π_{τ|τ-1,l}(α*_{τ|τ-1}) p_l(y_τ | y^{τ-1}).    (3.4)

For LDF²_{s,s}, we calculate weights for each α_{τ,j}, for j = 1, ..., |S_α|, at time τ as

    w_{α_{τ,j}} = [Σ_{i=1}^{τ-1} δ^{i-1} log p̄(y_{τ-i} | y^{τ-1-i}, α_{τ,j})] / [Σ_{j=1}^{|S_α|} Σ_{i=1}^{τ-1} δ^{i-1} log p̄(y_{τ-i} | y^{τ-1-i}, α_{τ,j})],    (3.5)

and the model average to be used for the event at time τ can be calculated as

    p̄(y_τ | y^{τ-1}, w_{α_{τ,j}}) = Σ_{j=1}^{|S_α|} w_{α_{τ,j}} Σ_{l=1}^K π_{τ|τ-1,l}(α_{τ,j}) p_l(y_τ | y^{τ-1}).    (3.6)

Note the difference between (2.5) and (3.3). Firstly, the summation in (2.5) is over time, whereas in (3.3) it is over the offsets from the time τ. This change in notation makes it explicit that in equation (3.3) the maximisation is performed over a grid of values for α_τ, i.e., in the summation over the index i, α_τ is fixed. Secondly, in equation (3.3) the logarithmic scores are discounted exponentially by a factor of δ, in line with the framework described in subsection 3.1, and we choose the model/model combination based on the weighted loss. This introduces more flexibility in the dynamics of the model averaging by allowing the possibility to gradually discount the previous performance of the forgetting factors. The parameter δ might be set by an expert or calculated on a calibration sample if the data sample is sufficiently large to permit robust estimation. In terms of computation time our proposed algorithm is very fast, as it relies only on simple additions and multiplications. This is an advantage over more sophisticated forecast combination methods when the time series is long and/or we would like to incorporate a large (usually greater than 10) number of forecasters.
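The LDF²_{s,a} procedure in (3.3)-(3.4) can be sketched as follows; this is a minimal illustration, with the function name, array layout and toy numbers as assumptions rather than the paper's implementation.

```python
import numpy as np

def ldf2_sa(logp, alpha_grid, delta, c=1e-20):
    """Sketch of LDF^2_{s,a}: a DMA (softmax) first layer run for each alpha
    on the grid, with alpha selected by the delta-discounted realised log
    scores of each DMA combination, as in (3.3)-(3.4).
    logp[t, k] = log p_k(y_t | y^{t-1}) on a (T, K) history."""
    T, K = logp.shape
    lik = np.exp(logp)
    disc = delta ** np.arange(T)[::-1]       # delta^{i-1}, i = 1 most recent
    best_score, best_alpha, best_w = -np.inf, None, None
    for alpha in alpha_grid:
        pi = np.full(K, 1.0 / K)             # initial model probabilities
        logpbar = np.empty(T)
        for t in range(T):
            num = pi ** alpha + c
            pi_pred = num / num.sum()                  # (2.1)
            logpbar[t] = np.log(pi_pred @ lik[t])      # log pbar(y_t|., alpha)
            post = pi_pred * lik[t]
            pi = post / post.sum()                     # (2.2)
        score = disc @ logpbar                         # objective in (3.3)
        nxt = pi ** alpha + c
        if score > best_score:
            best_score, best_alpha, best_w = score, alpha, nxt / nxt.sum()
    return best_alpha, best_w                # pi_{T|T-1,k}(alpha*) for (3.4)

# Toy history in which model 1 is consistently the better forecaster.
logp_hist = np.tile(np.log([0.1, 0.8]), (30, 1))
alpha_star, weights = ldf2_sa(logp_hist, [0.5, 0.9, 1.0], delta=0.9)
```

The returned weights are then plugged into (3.4) against the next-period forecast densities; the LDF²_{s,s} variant would instead retain all of the per-α combinations and weight them via (3.5)-(3.6).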
Finally, these two-level LDF methods can be trivially extended to the model selection use case. In such a setting, in order to select the optimal value of α we would calculate an exponentially weighted average, with parameter δ, of the realised predictive log-likelihoods of the models that obtained the highest realised predictive likelihood for each possible α at previous times. This corresponds to the setting where we put both levels through argmax functions. As mentioned before, LDF²_{a,a} is a generalised version of the DML presented in Beckmann et al. (2020), where implicitly the authors suggest δ = 1, i.e., all past performances of the forgetting factors are equally weighted. In the limit δ → 0 we would choose the discount factor α which performed best in the latest run, disregarding any other history.
In this part we considered a two-level model, but it would of course be possible to construct models with more levels. For example, a three-level model would assume that the parameter δ is also time-varying and introduce another discount factor, e.g. γ, which would help to select the right discount factor δ at each time. However, in this work we do not consider such models.

3.3 Interpretation

3.3.1 Momentum

The combination of the parameters α and δ might be intuitively associated with a momentum strategy, where an agent/investor places a bet on the highest-performing assets. Under this interpretation, the discounted log-likelihood plays the role of the performance metric for 1) optimal model averaging; 2) optimal selection of the parameter α. The parameter α provides the smoothing across time for model averaging (as in Raftery et al. (2010)), whereas δ provides the exponential decay of the performance metric on which we base the selection of α.

Note that this method effectively averages the best-performing models, which is in contrast with, for example, linear pools or Diebold et al. (2022), where the methodology aims to find the best-performing average. Linear pools might choose models to either boost the overall combined volatility or fatten the tails of the combination when the agent forecasts are, for example, all Gaussian but the true DGP is, say, Student-t. That, however, might create problems when the models that provided the needed boost in the tails misbehave at a future time period. Our proposed technique might help in some cases, for example, when pruning outlier models before any averaging takes place is desirable. This induced model sparsity might also be useful when the model pool is large. Since BMA usually converges to a single model, the α-weighted BMA in our algorithm provides a natural way to select the top-performing models. The lower the parameter α, the more models will have a material bearing on the final average.
Similarly, the smaller the parameter δ, the shorter the window length for the forgetting factor performance measurement, and the more dynamic the changes in the optimal α.

An underlying assumption of the BMA technique is that the true model is included in the model set and the weights are constant over time. Under these assumptions it can be shown that, in the limit, BMA converges (i.e. the weight goes to 1) to the right model. However, if the model set does not include the true model, i.e., the model set is misspecified, then BMA fails to average, as pointed out in Diebold (1991). Our set-up can also be interpreted in terms of a time-smoothed BMA selecting the right model, but only locally in time. Even if no model is correct across all times, BMA might be able to indicate the right model, or set of models, in a set-up where the process characteristics change in time. That is, if no model is right across all times, it is possible that at least some models are close to the right one at certain times. Our method also provides a remedy for the constant-weight assumption by assuming the optimal weights are constant only locally in time.
3.3.2 Markov switching model with time-varying transition matrix

Following the argument in Del Negro et al. (2016) that interprets DMA in terms of a Markov switching model, our extension allows a time-varying transition matrix, i.e. Q_t = (q_{kl}(t)). The gradual forgetting of the performance of the discount factor α allows for a change of the optimal discount factor when changes in the underlying transition matrix require it. However, we also show that our model outperforms the standard DMA model even when the transition matrix is not time-varying. This point is further illustrated in Appendix A.

4 Examples

Our methodology is best suited for data with multiple regime switches and a potentially time-varying transition matrix. As such, it is particularly useful for modelling data such as inflation levels, interest or foreign exchange rates. We illustrate our model on a simulated example and three real data examples. The supplementary materials for our examples are given in Appendices A, B, C, D and E. We compare examples of our LDF to several popular model averaging methodologies. The approaches used are:

• Two-level LDF - 2 hyperparameters, i.e., δ, c;
• BMA - 0 hyperparameters;
• DMA - 2 hyperparameters, i.e., α, c;
• BPS - 5 hyperparameters, i.e., β, the discount factor for the state evolution matrices; δ, the discount factor for the residual volatility; n_0, the prior number of degrees of freedom; s_0, the prior on the BPS observation variance; R_0, the prior covariance matrix of the BPS coefficients;
• best N-average - 2 hyperparameters, i.e., N, the number of models, and rw, the rolling window length.

We evaluate the performance of the models by calculating the out-of-sample mean log predictive score (MLS)

    MLS = (1/(T-s)) Σ_{t=s+1}^T log p(y_t | y_1, ..., y_{t-1}),

where y_1, ..., y_s are the observations for a calibration period and T is the total number of observations.

4.1 Simulation study

The data generating process (DGP) of Diebold et al.
(2022) is

    y_t = μ_t + x_t + σ_y ε_t,  ε_t ~ N(0, 1),    (4.1)
    x_t = φ_x x_{t-1} + σ_x v_t,  v_t ~ N(0, 1),    (4.2)
Figure 1: Simulation - The MLS versus values of δ for LDF and α for DMA on the x-axis. The error bars correspond to the standard deviation of the MLS over 10 runs.

where y_t is the variable to be forecast, x_t is the long-run component of y_t, and μ_t is the time-varying level (in Diebold et al. (2022) set to 0). The error terms are all i.i.d. and uncorrelated. It is assumed that the data generating process is known to each forecaster apart from the level component μ_t. Each individual forecaster k models x_t with noise and applies a different level η_k to y_t:

    z_{kt} = x_t + σ_{tk} ν_{kt},  ν_{kt} ~ N(0, 1),    (4.3)
    ỹ_{kt} = η_k + z_{kt} + σ_y ε_t,  ε_t ~ N(0, 1).    (4.4)

Notice that the individual forecasters' levels are not time-varying. This emulates a situation where forecasters have access to different sets of information and/or models which might guide a different choice of level. It also opens the possibility that we are in the M-complete or M-open setting, where no forecaster is right at all times. In contrast to Diebold et al. (2022), we allow the variable y_t to have multiple regime switches. The settings are as follows: φ_x = 0.9, σ_x = 0.3, σ_y = 0.3, σ_{tk} = 0.1 ∀k, K = 20, T = 2001, η_k = -2 + 0.2105(k - 1), k = 1, ..., K, and finally:

    μ_t = 0 for t ∈ [0, 49] ∪ [200, 399] ∪ [800, 849] ∪ [970, 979] ∪ [1000, 1049] ∪ [1600, 1650] ∪ [1700, 2001];
    μ_t = 1 for t ∈ [100, 150] ∪ [900, 949] ∪ [960, 969] ∪ [990, 999] ∪ [1050, 1099] ∪ [1200, 1599] ∪ [1700, 1749];
    μ_t = -1 otherwise.

More examples are discussed in Appendix A. For LDF we set S_α = {1.0, 0.99, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.001} and c = 10^{-20}, similarly to Koop and Korobilis (2012).

The standard Bayesian model averaging (MLS = -4.34) fared poorly since it quickly converged to the wrong model. The BPS (MLS = -0.73) of McAlinn and West (2019) with
normal agent predictive densities⁵ performed better than BMA but struggled to quickly adjust to the regime changes, which resulted in low log-scores at the change points. The N-average method of Diebold et al. (2022) performed better (we chose a rolling window of 5 observations, which performed best), with an MLS of -0.52 for N = 3 and N = 4, than BMA and BPS, and similarly to the standard DMA method of Raftery et al. (2010). Our proposed two-level methods, however, performed best for a range of values of the parameter δ, especially LDF²_{s,s}. In Figure 1 we show that the performance of LDF²_{s,s} dominates for any δ < 1; LDF²_{s,a} performs a bit worse but still seems to be a better choice than DMA, which is a single-layer model. What is more, the performance of the two-level LDF models is robust across a wide range of values of δ, which is not the case for the values of α in DMA.

⁵ We used the original set of parameters (adjusted to β = 0.95 and δ = 0.95 to get better results) as proposed by the authors of the paper, but adjusted the prior variance to match the σ_y parameter. The model was run for 5000 MCMC paths with a 3000-path burn-in period. We performed only 1 run due to computational cost.

Figure 2: Simulation - True data generating process mean and mean predicted level according to LDF²_{s,s}.

Figure 3: Simulation - Comparison of the α parameters for the LDF²_{s,a} model with δ = 0.95 versus δ = 1. We observe a more dynamic adaptation of the parameter α when we set δ = 0.95.

In Figure 2 we present how the synthesised agent forecast level of LDF²_{s,s} adjusts to the mean levels implied by the DGP. We can see that the model is very reactive to the mean
predicted level, following the true DGP mean closely with only a small time lag. Figure 3 shows how the parameter α dynamically changes using LDF²_{s,a} with δ = 0.95. It is close to 1 in periods of stability and closer to 0 in times of abrupt changes. In comparison, for δ = 1 the parameter α is rather stable, oscillating around 0.6. As mentioned before, this variation in the parameter α might be beneficial since, the lower the parameter α, the more models will be taken into consideration and the final outcome might show more uncertainty. Additionally, a lower parameter α facilitates the ability to quickly re-weight the models to adapt to the new regime, whereas in times of stability it might be better to narrow down the meaningful forecasts to a smaller group by increasing the parameter α. This illustrates how the two-level model provides useful flexibility in the discount factor.

4.2 ECB Survey of Professional Forecasters

For the first empirical study we analyse inflation density forecasts from the European Central Bank (ECB) Survey of Professional Forecasters.⁶ This data has been sampled quarterly since 1999; the participants are surveyed in January, April, July and October. To make our results comparable with those in Diebold et al. (2022), we select the surveys starting in 1999 Q1 and ending in 2019 Q3, which constitutes 83 forecasts. The forecasts target the percentage change in the Harmonised Index of Consumer Prices (HICP) for the year following the forecast (one-year-ahead forecasts), available from the Federal Reserve Bank of St. Louis website.⁷ The data preparation routine closely follows the procedure described in Diebold et al. (2022) and is detailed in Appendix C.

In order to evaluate our proposed methodology we calculate the log-scores for 1-year-ahead Eurozone inflation density forecasts. The burn-in/calibration sample spans Q1 1999 to Q4 2000. The scores are compared using the 75 quarters between Q1 2001 and Q3 2019.
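The out-of-sample MLS criterion from Section 4 can be computed directly from the realised log predictive scores; a minimal sketch (function name and toy numbers are illustrative):

```python
import numpy as np

def mean_log_score(logp, s):
    """Out-of-sample mean log predictive score: the average of
    log p(y_t | y_1, ..., y_{t-1}) over t = s+1, ..., T, where the first
    s observations form the burn-in/calibration sample."""
    return np.asarray(logp, dtype=float)[s:].mean()

# Toy usage: 8 realised log predictive scores, 3 calibration periods.
mls = mean_log_score([-2.1, -1.9, -2.0, -1.8, -1.7, -1.9, -2.0, -1.6], s=3)
```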
Figure 4: Professional Forecasters – The MLS versus values of δ for LDF and α for DMA on the x-axis.
6 https://www.ecb.europa.eu/stats/ecb_surveys/survey_of_professional_forecasters/html/index.en.html
7 https://fred.stlouisfed.org/series/CP0000EZ19M086NEST
We used the same methods as in the simulated example, except that we did not use the BPS method of McAlinn and West (2019) since it would require extra work to adjust the model to handle the multinomial predictive distributions of the agents' forecast models. We note that the LDF2s,a model specification outperforms the other models and marks a noticeable improvement over the specification with δ = 1 (which would correspond to a model averaging version of DML). In Figure 4 we plot the mean log-scores against the parameters δ and α for the LDF specifications and DMA respectively. The picture is similar to the one we presented in the simulation study: the two-level LDF models show greater robustness to the hyperparameter choice. In this case LDF2s,a performs better (MLS = -1.88 for δ = 0.7) than both DMA (MLS = -1.90 for α = 0.7) and LDF2s,s (MLS = -1.92 for δ = 1). Finally, we also checked that our results do not depend on the value of the parameter c in (2.1). Based on these results, the simple average (MLS = -2.00) is still an attractive option for forecast synthesis, performing almost on par with the best-N forecasters methodology (N = 7, MLS = -1.96). We also see that model combination is always, bar BMA (MLS = -2.16), better than any individual model. The original DMA approach generally performs better than the best-N average. In Figure 5 we compare the forecast predictive densities of the simple average methodology and LDF2s,a with δ = 0.7. We notice that our model assigns higher probabilities to the forecasters who predict that year-on-year (YOY) inflation increases by around 2% in periods of lower inflation volatility, and displays more dynamic changes from the beginning of the aftermath of the Global Financial Crisis in 2007. In particular, we see that in the deflationary period in 2008 and 2009, our methodology provides forecasts that showed greater uncertainty about the outcomes than the simple average of forecasters' predictions.
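For reference, the log score of a combined forecast in this setting is simply the log of the weighted mixture density at the realised outcome, and the reported MLS averages it over the evaluation quarters. A self-contained sketch for normal agent densities follows; the Gaussian form and the example numbers are assumptions for illustration (the survey forecasts themselves are histogram-based):

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of a normal distribution at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def pool_log_score(y, mus, sigmas, weights):
    """Log predictive score of a linear opinion pool of normal agent
    densities, evaluated at the realised outcome y."""
    density = sum(w * normal_pdf(y, m, s)
                  for w, m, s in zip(weights, mus, sigmas))
    return math.log(density)

# Two hypothetical forecasters for YOY inflation; outcome turns out to be 2.0.
mus, sigmas = [1.8, 2.4], [0.3, 0.3]
simple = pool_log_score(2.0, mus, sigmas, [0.5, 0.5])   # equal-weight pool
tilted = pool_log_score(2.0, mus, sigmas, [0.9, 0.1])   # weight on better agent
```

Here the pool tilted toward the better-calibrated agent scores higher at this outcome, which is the effect the score-based weighting exploits.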
4.3 Foreign Exchange Forecasts

Exchange rate forecasting is a popular, important and actively researched topic in economics and finance (see Rossi, 2013, for a comprehensive review). Beckmann et al. (2020) consider forecast combination using DML for a pool of Time-Varying Parameter Bayesian Vector Autoregressive (TVP-BVAR) models with different subsets of economic fundamentals. We closely follow the setup of Beckmann et al. (2020). An outline of the model is given in Appendix D.1, additional details regarding the model parameters are provided in Appendix D.2, and the full information can be found in Beckmann et al. (2020)8. We use a set of G10 currencies to evaluate the model performance, namely: the Australian dollar (AUD), the Canadian dollar (CAD), the euro (EUR), the Japanese yen (JPY), the New Zealand dollar (NZD), the Norwegian krone (NOK), the Swedish krona (SEK), the Swiss franc (CHF), the pound sterling (GBP) and the US dollar (USD). All currencies are expressed in terms of the amount of dollars per unit of a foreign currency, i.e. the domestic price of a foreign currency.
8 Following Koop and Korobilis (2013) we adopt an Exponentially Weighted Moving Average (EWMA) estimator for the measurement covariance matrix to avoid the need for posterior simulation of multivariate stochastic volatility. This differs from Beckmann et al. (2020), who use the approximation derived by Triantafyllopoulos (2011).
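The EWMA covariance estimator mentioned in footnote 8 can be sketched with the standard recursion S_t = λ S_{t-1} + (1 − λ) e_t e_t'; the initialisation from the first observation's outer product is an assumption here, and Koop and Korobilis (2013) give the full treatment:

```python
def ewma_covariance(residuals, decay=0.97):
    """EWMA estimate of a measurement covariance matrix.

    residuals: list of k-dimensional residual vectors, oldest first.
    Recursion: S_t = decay * S_{t-1} + (1 - decay) * e_t e_t'.
    Sketch only; the initial matrix is the outer product of the first
    observation, which is an illustrative assumption.
    """
    e0 = residuals[0]
    k = len(e0)
    S = [[e0[i] * e0[j] for j in range(k)] for i in range(k)]
    for e in residuals[1:]:
        for i in range(k):
            for j in range(k):
                S[i][j] = decay * S[i][j] + (1.0 - decay) * e[i] * e[j]
    return S
```

A high decay factor such as 0.97 makes the estimate evolve slowly, which is why this is a cheap stand-in for full multivariate stochastic volatility.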
Figure 5: Forecast densities obtained by taking an average.
• Long-short interest rate difference - the difference between the 10-year benchmark government yield and the 1-month deposit rate.
• Stock growth - the monthly return on the main stock index of each of the G10 currencies/countries.
• Gold price - the monthly change in the gold price.

The interest rate difference and stock growth factors were previously used by Wright (2008), although there the author used annual rather than monthly stock growth. In this research we also introduce the change in the gold price as a predictor. The gold price acts as an exogenous, non-asset-specific factor, whereas UIP and INT DIFF are asset specific. All collected and analysed economic data is monthly (as of month ends)10. In line with issues highlighted by Rossi (2013), the data underwent scrutiny in terms of any adjustments, revisions and timing issues that could introduce information not available to an investor at the time of decision making. At this point we note that the data between 1989 and 1990 is of worse quality. The reason for this is twofold. First, some of the financial instruments were not well-developed or liquid enough to be reliably sourced from data providers (for example, short-term deposit rates or 10-year government bond benchmarks are not available for all currencies in question at the beginning of the sample). Secondly, certain currencies and stock indices only became available in the 1990s (e.g. the EUR exchange rate). The details concerning the data sources and any proxies used are presented in the data Appendix D.6. The data is standardised based on the mean and standard deviation calibrated to an initial training period of 10 years. To understand the effects of the pool size, we consider a small pool (which consists of the 32 models based on UIP only and time-constant parameters), and a larger pool (which consists of 2048 models including all possible subsets of the fundamentals). An exhaustive list of model parameter settings is outlined in Appendix D.2.
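The standardisation step described above can be sketched as follows; calibrating the mean and standard deviation on the initial training window only means later observations carry no look-ahead information. The function name and interface are illustrative:

```python
import math

def standardise(series, train_n):
    """Standardise a series using mean/std calibrated on the first
    train_n observations only, so no look-ahead bias is introduced.
    Illustrative sketch of the text's procedure."""
    train = series[:train_n]
    mu = sum(train) / train_n
    var = sum((x - mu) ** 2 for x in train) / (train_n - 1)  # sample variance
    sd = math.sqrt(var)
    return [(x - mu) / sd for x in series]

# 10 years of monthly data would correspond to train_n = 120 in this setting.
z = standardise([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], train_n=3)
```

Observations after the training window can then drift outside the usual standardised range, which is intentional: the investor at decision time knew only the calibration sample.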
The smaller pool allows a comparison to the N-average method of Diebold et al. (2022) and the BPS method of McAlinn and West (2019), which are computationally costly when the number of models is large. Setting the parameter δ < 1 improves both the model averaging and the model selection procedures of the two-level LDF models for both model pools. The best scores in model averaging/selection were achieved for the LDF2s,s specification with δ = 0.8 for the large model pool and δ = 0.9 for the small model pool. Model averaging scored better than model selection, and the benefit of using a two-level LDF method appears to be greater for the bigger model pool. Interestingly, in the larger pool, the EWMA Random Walk (RW) model with decay factor 0.97 was not the best of all models considered (MLS = 21.77) but performed almost on par with the a posteriori best model (MLS = 21.78), which indicates that even in a big pool of models it is hard to find one that outperforms the random walk.
the appropriate cross-currency curves. However, we assume that the difference between the deposit rates in two countries provides a good proxy for the interest rate differential.
10 If month-end data was not available, it was substituted with beginning-of-month data or the monthly average. These substitutions were unavoidable for some of the data in the 1980s. See Appendix D.6 for more details.
Figure 6: FX – MLS versus values of δ for LDF and α for DMA on the x-axis for the small and large model pools. The upper plots show the cases of model averaging, whereas the lower plots show model selection.

However, both the single-level and the two-level LDF based model selection methods can provide better performance than using the random walk. We note that the proposed LDF2a,a methodology improved upon the DML method of Beckmann et al. (2020), which, as we recall, is LDF2a,a with δ = 1, in model selection for both the large and the small model pool. Model averaging/selection using LDF2s,s/LDF2a,s outperforms all other benchmark methods for both the large and the small model pools. This indicates that LDF can perform well for quick and simple model averaging/selection when a large pool of models is available, as well as demonstrate competitive performance against more sophisticated methods with medium-sized model pools. For the small universe of 32 models we benchmark the LDF methods against the 4-model average with a 20-period rolling window (the best performing N-day method), BPS and the simple average. For multivariate normal BPS we set the prior for the joint covariance matrix s0 to a diagonal matrix with 7.7% annual volatility for all currencies and the matrix R0 to a diagonal matrix with entries equal to 0.001, with δ = 0.95 and β = 0.99; the remaining settings for this
example are as in McAlinn et al. (2020)11. We see that for model averaging the LDF2s,s method with δ = 0.9 performs best, followed by the other two-level LDF specifications. The BPS method (MLS = 21.6) did not perform well here. In Figure 6 we show how the average log-score changes with the choice of hyperparameters for LDF based model averaging and DMA methods with both the small and the large pool of models. We show that LDF provides not only better performance in terms of the optimal choice of the hyperparameter but also more robustness with respect to that choice, which is visible as a relatively smaller gradient around the maximum value of the mean score, whereas DMA is more sensitive to the choice of α, with a single, narrow peak.12

4.4 US Inflation Forecasts

The third study considers an example of McAlinn and West (2019), which involves forecasting the quarterly US inflation rate between 1961/Q1 and 2014/Q4. Here, the inflation rate corresponds to the annual percentage change in a chain-weighted GDP price index. There are four competing models: M1 includes the one period lagged inflation rate; M2 includes the period one, two and three lagged inflation, interest and unemployment rates; M3 includes the period one, two and three lagged inflation rate only; and M4 includes the period one lagged inflation, interest and unemployment rates. All four models provide Student-t distributed forecasts with around 20 degrees of freedom. The distinguishing features of this example are the small number of models and the existence of time periods when none of the models or model combinations lying on the simplex provide an accurate mean forecast. In this example we will see the limitation of the LDF and other simplex-based methodologies, which are unable to correct for forecasting biases if bias-corrected models are not explicitly available in the pool. The BPS method (MLS = 0.06) dominates all other methodologies since it allows for model combinations not adhering to the simplex.
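Since the agent forecasts here are Student-t distributed, each model's log predictive score can be evaluated with the standard location-scale Student-t log density. This is the generic formula, not code from the paper:

```python
import math

def student_t_logpdf(y, mu, scale, df):
    """Log density of a location-scale Student-t forecast distribution.

    Used to score a predictive distribution at the realised outcome y;
    standard textbook formula, shown here for illustration only.
    """
    z = (y - mu) / scale
    return (math.lgamma((df + 1.0) / 2.0) - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1.0) / 2.0 * math.log(1.0 + z * z / df))

# With around 20 degrees of freedom, as in the text, tails are slightly
# heavier than the normal, so extreme outcomes are penalised less severely.
score = student_t_logpdf(3.1, 2.8, 0.4, 20.0)
```

For large degrees of freedom this converges to the normal log density, so the Student-t form mainly matters in exactly the turbulent periods discussed here.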
In fact, there were six dates in the evaluation period where the mean of the BPS synthesised model was greater than the maximum of the underlying models. The ability to go beyond the simplex proved to be one of the key factors in the superior performance. The next most effective method was the N-model average of Diebold et al. (2022), which for N = 2 and N = 3 had an MLS equal to -0.01 and provided better performance than the best single model (M2, MLS = -0.02). For N = 2, out of the 100 evaluation points, the algorithm selected the pair (M0, M1) 35 times, the pair (M2, M3) 49 times and the pair (M1, M3) 16 times. On the other hand, both the 2-level LDF model averaging and DMA methods did not work very well in this example but improved upon picking just a single model. The poor performance of 2-level LDF and DMA could mostly be attributed to the highly dynamic nature of these methods, which sometimes attached too much weight to a single model that would score poorly.
11 We tried different values for R0 and β and we report the best result achieved.
12 In Appendix D.3 we show that with a dense grid of allowable values for α and δ the points in Figure 6 become smooth curves.
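The N-model average selection rule discussed above can be sketched as follows: at each date, place equal weight on the N models with the best rolling-window scores and zero weight elsewhere. The function name and interface are illustrative assumptions:

```python
def best_n_average_weights(window_scores, n):
    """Equal weights on the n models with the best rolling-window scores.

    window_scores: per-model log scores summed over the rolling window.
    Sketch of the N-average selection rule, not the authors' implementation.
    """
    order = sorted(range(len(window_scores)),
                   key=lambda i: window_scores[i], reverse=True)
    top = set(order[:n])
    return [1.0 / n if i in top else 0.0 for i in range(len(window_scores))]

# Four models; the two with the highest window scores share the weight.
print(best_n_average_weights([-10.0, -2.0, -5.0, -3.0], n=2))
# → [0.0, 0.5, 0.0, 0.5]
```

Because the selected set is recomputed every period, the chosen pair can change frequently, which matches the pair counts reported above.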
Figure 7: US inflation – The MLS versus values of δ for LDF and α for DMA on the x-axis.

5 Discussion

This paper contributes to the model averaging and selection literature by introducing a Loss Discounting Framework which encompasses Dynamic Model Averaging, first presented by Raftery et al. (2010), generalises Dynamic Model Learning (Beckmann et al., 2020) and introduces additional model averaging or selection specifications. The methodology offers extra flexibility which can lead to better forecast scores and yields results which are less sensitive to the choice of hyperparameters. It also empowers users to choose a model specification, in terms of the number of discounting layers, which is suitable for the problem at hand. We show that our proposed methodology performs well in both the simulation study and the empirical examples based on the inflation forecasts surveyed by the ECB and the exchange rate forecasts based on vector autoregressive models. We find that the LDF can be a good choice when: the number of forecasters is fairly large (depending on hardware availability, usually around 10) and sophisticated methods become burdensome; we want to have only a small number of hyperparameters to calibrate; we suspect that we are in the M-complete/open setting and different models might be optimal at different times but there is no consistent bias to be eliminated across all models; or we believe that scoring forecasters on a joint predictive density or joint utility basis is reasonable. The LDF is by no means a panacea for model synthesis and, as we have seen in the empirical studies, various model synthesis methods work better in different settings. With that in mind, we believe that the methodology we put forward can be considered an easy-to-implement, computationally cheap, yet well-performing alternative formulation for model averaging and model selection. There are multiple open avenues to explore.
Many current forecast combination methods described in the literature assume that the pool of forecasters does not change over time (see e.g. Diebold et al., 2022; McAlinn and West, 2019; Raftery et al., 2010). In some situations this is a substantial limitation; for example, in the Survey of Professional Forecasters, many forecasters only provide forecasts for some of the quarters.
Let us first consider the situation of a new agent being added to the existing pool of forecasters. The existing forecasters already have a track record of forecasts and corresponding scores. A new forecaster could be included with an initial weight; this can be fairly easily achieved in the LDF by considering a few initial scores. It is less clear what this weight should be in more formal methodologies which relax the simplex restriction, like McAlinn and West (2019). Similarly, forecasters may drop out completely or for some quarters before providing new forecasts. Again, in general, it is hard to know how to weight these forecasters. The LDF provides a rationale: we should use an estimate of that forecaster's score when a forecast is made. This is a time series prediction problem and can be approached using standard methods. As mentioned before, we use the joint predictive log-likelihood as a statistical measure of out-of-sample forecasting performance. It gives an indication of how likely the realisation of the modelled variable was conditional on the model parameters. The logarithmic scoring rule is strictly proper but it severely penalises low-probability events and hence is sensitive to tail or extreme cases; see Gneiting and Raftery (2007). A different proper scoring rule could be used if deemed appropriate for a specific use case. Furthermore, since the scoring function is often based on the joint forecast probability density function, our methodology is not best suited to borrow strength from forecasters who might be good at forecasting one or more variables but not the others. This is partially because our methodology does not consider any dependency structure between expert models and the weighting is solely performance based. An extension introducing a way to take the agent interdependencies into consideration would be of considerable interest.

References

Abbate, A. and M. G. Marcellino (2018).
Point, interval and density forecasts of exchange rates with time-varying parameter models. Journal of the Royal Statistical Society, Series A 181, 155–179.

Bacchetta, P. and E. Van Wincoop (2004). A scapegoat model of exchange-rate fluctuations. American Economic Review 94 (2), 114–118.

Bates, J. M. and C. W. J. Granger (1969). The combination of forecasts. Journal of the Operational Research Society 20, 451–468.

Beckmann, J., G. Koop, D. Korobilis, and R. A. Schüssler (2020). Exchange rate predictability and dynamic Bayesian learning. Journal of Applied Econometrics 35, 410–421.

Billio, M., R. Casarin, F. Ravazzolo, and H. K. Van Dijk (2013). Time-varying combinations of predictive densities using nonlinear filtering. Journal of Econometrics 177, 213–232.

Bissiri, P. G., C. C. Holmes, and S. G. Walker (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society, Series B 78, 1103–1130.
Clarke, J. L., B. Clarke, and C.-W. Yu (2013). Prediction in M-complete problems with limited sample size. Bayesian Analysis 8, 647–690.

Del Negro, M., R. B. Hasegawa, and F. Schorfheide (2016). Dynamic prediction pools: An investigation of financial frictions and forecasting performance. Journal of Econometrics 192, 391–405.

Della Corte, P. and I. Tsiakas (2012). Statistical and Economic Methods for Evaluating Exchange Rate Predictability, Chapter 8, pp. 221–263. John Wiley & Sons, Ltd.

Diebold, F. X. (1991). A note on Bayesian forecast combination procedures. In Economic Structural Change, pp. 225–232. Springer.

Diebold, F. X., M. Shin, and B. Zhang (2022). On the aggregation of probability assessments: Regularized mixtures of predictive densities for Eurozone inflation and real interest rates.

Geweke, J. and G. Amisano (2011). Optimal prediction pools. Journal of Econometrics 164, 130–141.

Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378.

Hall, S. G. and J. Mitchell (2007). Combining density forecasts. International Journal of Forecasting 23, 1–13.

Hendry, D. F. and M. P. Clements (2004). Pooling of forecasts. The Econometrics Journal 7, 1–31.

Koop, G. and D. Korobilis (2012). Forecasting inflation using dynamic model averaging. International Economic Review 53, 867–886.

Koop, G. and D. Korobilis (2013). Large time-varying parameter VARs. Journal of Econometrics 177, 185–198.

Kouwenberg, R., A. Markiewicz, R. Verhoeks, and R. C. J. Zwinkels (2017). Model uncertainty and exchange rate forecasting. Journal of Financial and Quantitative Analysis 52, 341–363.

Loaiza-Maya, R., G. M. Martin, and D. T. Frazier (2021). Focused Bayesian prediction. Journal of Applied Econometrics 36, 517–543.

McAlinn, K., K. A. Aastveit, J. Nakajima, and M. West (2020). Multivariate Bayesian predictive synthesis in macroeconomic forecasting.
Journal of the American Statistical Association 115, 1092–1110.

McAlinn, K. and M. West (2019). Dynamic Bayesian predictive synthesis in time series forecasting. Journal of Econometrics 210, 155–169.
Miller, J. W. and D. B. Dunson (2019). Robust Bayesian inference via coarsening. Journal of the American Statistical Association 114, 1113–1125.

Mitchell, J. and S. G. Hall (2005). Evaluating, comparing and combining density forecasts using the KLIC with an application to the Bank of England and NIESR 'fan' charts of inflation. Oxford Bulletin of Economics and Statistics 67, 995–1033.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski (2005). Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review 133, 1155–1174.

Raftery, A. E., M. Kárný, and P. Ettler (2010). Online prediction under model uncertainty via dynamic model averaging: Application to a cold rolling mill. Technometrics 52, 52–66.

J.P. Morgan/Reuters (1996). RiskMetrics technical document. Technical report, J.P. Morgan/Reuters.

Rossi, B. (2013). Exchange rate predictability. Journal of Economic Literature 51, 1063–1119.

Stock, J. H. and M. W. Watson (2004). Combination forecasts of output growth in a seven-country data set. Journal of Forecasting 23, 405–430.

Triantafyllopoulos, K. (2011). Time-varying vector autoregressive models with stochastic volatility. Journal of Applied Statistics 38, 369–382.

Waggoner, D. F. and T. Zha (2012). Confronting model misspecification in macroeconomics. Journal of Econometrics 171, 167–184.

Wright, J. H. (2008). Bayesian model averaging and exchange rate forecasts. Journal of Econometrics 146, 329–341.

Yao, Y., A. Vehtari, D. Simpson, and A. Gelman (2018). Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis 13, 917–1007.

Yusupova, A., N. G. Pavlidis, and E. G. Pavlidis (2019). Adaptive dynamic model averaging with an application to house price forecasting.
Model                        MLS     σ[MLS]   Σ log(p)   σ[Σ log(p)]
BMA                         -4.34     0.05    -8601.11      107.41
BPS                         -0.73     N/A*    -1444.08      N/A*

Best N-average, rolling window = 5
N = 1                       -0.71     0.02    -1404.45       45.73
N = 2                       -0.55     0.03    -1080.86       52.25
N = 3                       -0.52     0.03    -1031.68       49.74
N = 4                       -0.52     0.02    -1033.21       33.46
N = 5                       -0.54     0.02    -1063.02       31.84
N = 6                       -0.57     0.01    -1134.41       27.37

Dynamic Model Averaging
α = 1.00                    -0.80     0.03    -1585.42       50.03
α = 0.95                    -0.70     0.02    -1395.40       46.96
α = 0.90                    -0.63     0.02    -1237.18       49.68
α = 0.80                    -0.54     0.02    -1063.35       43.70
α = 0.70                    -0.50     0.02     -993.32       38.20
α = 0.60                    -0.49     0.02     -970.13       34.55

LDF2s,a
δ = 1.00                    -0.50     0.02    -1000.88       35.00
δ = 0.95                    -0.46     0.02     -919.52       45.95
δ = 0.90                    -0.47     0.03     -932.42       52.32
δ = 0.80                    -0.48     0.03     -955.35       56.16
δ = 0.70                    -0.49     0.03     -966.49       57.96
δ = 0.60                    -0.49     0.04     -978.41       58.88

LDF2s,s
δ = 1.00                    -0.49     0.02     -973.02       34.45
δ = 0.95                    -0.43     0.02     -848.97       40.46
δ = 0.90                    -0.42     0.02     -832.41       39.40
δ = 0.80                    -0.42     0.02     -822.47       35.85
δ = 0.70                    -0.42     0.02     -824.96       35.05
δ = 0.60                    -0.42     0.02     -830.30       34.40

Table 1: Predictive log-likelihood for our simulated example averaged over R = 10 runs and the associated standard deviations. We denote the average of the mean log score by MLS = (1/(TR)) Σ_r Σ_t log(p_{r,t}), the average of the cumulative log score by Σ log(p) = (1/R) Σ_r Σ_t log(p_{r,t}), and by σ(·) the corresponding standard deviation of the quantity in question. We can see that our proposed model outperforms all other methods.

A Simulation study - supplementary material

In this appendix we provide additional details corresponding to the simulation example from Section 4.1 as well as the results of additional experiments performed in order to check the robustness and persistence of the results presented:

• time-constant Markov switching levels
• time-varying Markov switching levels

A.1 Simulation study - additional results

In Table 1 we present the full set of results from the simulation study in Section 4.1.
Apart from the MLS, we also provide the sums of log-scores, which show that the two-level LDF models also dominate in this performance metric.

A.2 Time-constant Markov switching model

In this experiment we adopt the same set up as in Section 4.1 but we set the Markov transition matrix for µt, over the three states {−1, 0, 1}, to

Q = [ 0.990  0.005  0.005
      0.005  0.990  0.005
      0.005  0.005  0.990 ]

and the rest of the parameters to the same values as in Section 4.1, namely, φx = 0.9, σx = 0.3, σy = 0.3, σtk = 0.1 ∀k, K = 20, T = 2001. We compare our 2-level method to
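The time-constant Markov-switching setup above can be reproduced in outline as follows; only the switching mean path µt is sketched, with the observation layer omitted, and the sampling routine is an illustrative assumption:

```python
import random

def simulate_markov_mean(T, Q, states, seed=0):
    """Simulate a Markov chain of mean levels mu_t.

    Q[i][j] is the probability of moving from state i to state j;
    states = [-1, 0, 1] as in the appendix. Sketch only: the AR
    observation layer with phi_x, sigma_x, sigma_y is not included.
    """
    rng = random.Random(seed)
    i = rng.randrange(len(states))        # arbitrary initial state
    path = []
    for _ in range(T):
        path.append(states[i])
        u, c = rng.random(), 0.0
        for j, p in enumerate(Q[i]):      # inverse-CDF step for the next state
            c += p
            if u < c:
                i = j
                break
    return path

Q = [[0.990, 0.005, 0.005],
     [0.005, 0.990, 0.005],
     [0.005, 0.005, 0.990]]
mu = simulate_markov_mean(2001, Q, [-1, 0, 1])
```

With persistence 0.990 the mean level switches only about once every hundred periods, producing the long, stable regimes punctuated by abrupt changes that the discounting methods are designed to track.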