Presenting the Probabilities of Different Effect Sizes: Towards a Better Understanding and Communication of Statistical Uncertainty

Akisato Suzuki
School of Politics and International Relations
University College Dublin
Belfield, Dublin 4, Ireland
akisato.suzuki@gmail.com
ORCID: 0000-0003-3691-0236

arXiv:2008.07478v2 [stat.AP] 2 Sep 2021

Working paper
(August 12, 2021)

Abstract

How should social scientists understand and communicate the uncertainty of statistically estimated causal effects? It is well known that the conventional significance-vs.-insignificance approach is associated with misunderstandings and misuses. Behavioral research suggests people understand uncertainty more appropriately on a numerical, continuous scale than on a verbal, discrete scale. Motivated by this background, I propose presenting the probabilities of different effect sizes. Probability is an intuitive, continuous measure of uncertainty. It allows researchers to better understand and communicate the uncertainty of statistically estimated effects. In addition, my approach needs no decision threshold for an uncertainty measure or an effect size, unlike the conventional approaches, allowing researchers to be agnostic about a decision threshold such as p < 5% and a justification for it. I apply my approach to a previous social scientific study, showing that it enables richer inference than the significance-vs.-insignificance approach taken by the original study. The accompanying R package makes my approach easy to implement.

                                          Keywords – Bayesian, inference, p-value, uncertainty

1     Introduction
How should social scientists understand and communicate the uncertainty of statistically
estimated causal effects? They usually use a specific decision threshold of a p-value or
confidence/credible interval and conclude whether a causal effect is significant or not,
meaning it is not (practically) null (Gross 2015; Kruschke 2018). While convenient for making categorical decisions, the significance-vs.-insignificance dichotomy leads us to overlook the full nuance of the statistical measure of uncertainty. This is because uncertainty is a degree of confidence and, therefore, a continuous quantity. The dichotomy is
also associated with problems such as p-hacking, publication bias, and seeing statistical
insignificance as evidence for the null hypothesis (e.g., see Amrhein, Greenland, and Mc-
Shane 2019; Esarey and Wu 2016; Gerber and Malhotra 2008; McShane and Gal 2016,
2017; Simonsohn, Nelson, and Simmons 2014). While these cited articles discuss these
problems stemming from Frequentist Null Hypothesis Significance Testing, Bayesian
inference is equally susceptible to the same issues if a specific decision threshold is used
to interpret a posterior distribution.
    Behavioral research suggests both researchers and decision makers understand uncer-
tainty more appropriately if it is presented as a numerical, continuous scale rather than
as a verbal, discrete scale (Budescu et al. 2014; Friedman et al. 2018; Jenkins, Harris,
and R. M. Lark 2018; McShane and Gal 2016, 2017; Mislavsky and Gaertig 2019). If
researchers are interested in estimating a causal effect, the natural quantity of uncer-
tainty is the probability of an effect, i.e., the probability that a causal factor affects the
outcome of interest. Probability is an intuitive quantity and is used in everyday life, for
example, in a weather forecast for the chance of rain. The probability of an effect can be
computed from a posterior distribution estimated by Bayesian statistics (or, if appropriate, from a pseudo-Bayesian confidence distribution; see Wood 2019).
    The standard ways to summarize a posterior are the plot of a probability density function, the probability of a one-sided hypothesis, and a credible interval. These standard
ways have drawbacks. A probability density plot is not the best way for readers to accurately
compute a probability mass for a specific range of parameter values (Kay et al. 2016).
The probability of a one-sided hypothesis needs a decision threshold for the effect size beyond which the probability is computed (typically, the probability of an effect being
greater/smaller than zero). A credible interval also needs a decision threshold for the level
of probability based on which the interval is computed (typically, the 95% level). From
a decision-theoretic perspective, the use of a decision threshold demands a justification
based on a context-specific utility function (Berger 1985; Kruschke 2018, 276–78; Lakens
et al. 2018, 170; Mudge et al. 2012). Yet, social scientific research often deals with cases too heterogeneous (e.g., all democracies) to be covered by a single utility function, and may want to be agnostic about it.
    Motivated by this background, I propose presenting the uncertainty of statistically estimated causal effects as the probabilities of different effect sizes. More specifically, it
is the plot of a complementary cumulative distribution, where the probability is presented
for an effect being greater than different effect sizes (here, “greater” is meant in absolute
terms: a greater positive value than zero or some positive value, or a greater negative
value than zero or some negative value). In this way, it is unnecessary for researchers
to use any decision threshold for the “significance,” “confidence,” or “credible” level of
uncertainty or for the effect size beyond which the effect is considered practically relevant.
This means researchers can be agnostic about a decision threshold and a justification for

that. In my approach, researchers play the role of an information provider and present different effect sizes and their associated probabilities as such. My approach applies regardless of the type of causal effect estimate (population average, sample average, individual, etc.).
    The positive implications of my approach can be summarized as follows. First, my
approach could help social scientists avoid the dichotomy of significance vs. insignifi-
cance and present statistical uncertainty regarding causal effects as such: a step also
recommended by Gelman and Carlin (2017). This could enable social scientists to better
understand and communicate the continuous nature of the uncertainty of statistically
estimated causal effects. I demonstrate this point by applying my approach to a previous
social scientific study.
    Second, as a result of social scientists using my approach, decision makers could
evaluate whether to use a treatment or not, in light of their own utility functions. The
conventional thresholds such as p < 5% could produce statistical insignificance for a
treatment effect because of a small sample or effect size, even if the true effect were non-
zero. Then, decision makers might believe it is evidence for no effect and therefore decide
not to use the treatment. This would be a lost opportunity, however, if the probability
of the beneficial treatment effect were actually high enough (if not as high as 95%) for
these decision makers to use the treatment and accept the risk of failure.
    Finally, my approach could help mitigate p-hacking and the publication bias, as re-
searchers could feel less need to report only a particular model out of many that produces
an uncertainty measure below a certain threshold. It might be argued that my approach
will also allow theoretically or methodologically questionable research to claim credibility based on a decent, if not high, probability (e.g., 70%) that some factor affects an outcome. Yet, the threshold of p < 5% has not prevented questionable research from
being published, as the replication crisis suggests. A stricter threshold of a p-value (e.g.,
Benjamin et al. 2017) may reduce, but does not eliminate, the chance of questionable
research satisfying the threshold. A stricter threshold also has a side effect: it would
increase false negatives and, as a result, the publication bias. In short, the important
thing is to evaluate the credibility of research not only based on an uncertainty measure
such as a p-value and probability but also on the entire research design and theoretical
arguments: “No single index should substitute for scientific reasoning” (Wasserstein and
Lazar 2016, 132). While further consideration and research are necessary to understand what the best practice is for presenting and interpreting statistical uncertainty to maximize scientific integrity, I hope my approach contributes to this discussion. In this article, I
assume statistically estimated causal effects are not based on questionable research and
the model assumptions are plausible.
    The rest of the article further explains the motivation for, and the detail of, my
approach, and then applies it to a previous social scientific study. All statistical analysis
was done in R (R Core Team 2021). The accompanying R package makes my approach
easy to implement (see the section “Supplemental Materials”).

2    Motivation
Behavioral research suggests the use of probability as a numerical, continuous scale of
uncertainty has merits for both decision makers and researchers, compared to a verbal,
discrete scale. In communicating the uncertainty of predictions, a numerical scale more accurately conveys the degree of uncertainty researchers intend to communicate (Budescu
et al. 2014; Jenkins, Harris, and R. M. Lark 2018; Mandel and Irwin 2021); improves
decision makers’ forecasting (Fernandes et al. 2018; Friedman et al. 2018); mitigates
the public perception that science is unreliable when the prediction of unlikely events
fails (Jenkins, Harris, and R. Murray Lark 2019); and helps people aggregate the dif-
ferent sources of uncertainty information in a mathematically consistent way (Mislavsky
and Gaertig 2019). Even researchers familiar with quantitative methods often misinterpret statistical insignificance as evidence for the null effect, whereas presenting a numerical probability corrects such dichotomous thinking (McShane and Gal 2016). My approach
builds on these insights into understanding and communicating uncertainty, adding to a
toolkit for researchers.
    Social scientists usually turn the continuous measures of the uncertainty of statisti-
cally estimated causal effects, such as a p-value and a posterior, into the dichotomy of
significance and insignificance using a decision threshold. Such a dichotomy results in misunderstandings and misuses of uncertainty measures (e.g., see Amrhein, Greenland,
and McShane 2019; Esarey and Wu 2016; Gerber and Malhotra 2008; McShane and Gal
2016, 2017; Simonsohn, Nelson, and Simmons 2014). p-hacking is the practice of searching for a model that produces a statistically significant effect to increase the chance of publication, and it results in a greater likelihood of false positives being published (Simonsohn, Nelson, and Simmons 2014). If only statistically significant effects are to be published, our
knowledge based on published studies is biased – the publication bias (e.g., Esarey and
Wu 2016; Gerber and Malhotra 2008; Simonsohn, Nelson, and Simmons 2014). Mean-
while, statistical insignificance is often mistaken as evidence for no effect (McShane and
Gal 2016, 2017). For example, given a decision threshold of 95%, both 94% probabil-
ity and 1% probability are categorized as statistically insignificant, although the former
presents much smaller uncertainty than the latter. If a study failed to find statistical
significance for a beneficial causal effect because of the lack of access to a large sample
or because of the true effect size being small, it could be a lost opportunity for decision
makers. It might be argued that a small effect size is practically irrelevant, but this is not so by definition; it depends on context. For example, if a treatment had a small but beneficial effect and also were cheap, it could be useful for decision makers.
    A decision threshold is not always a bad idea. For example, it may sometimes be nec-
essary for experts to suggest a way for non-experts to make a yes/no decision (Kruschke
2018, 271). In such a case, a decision threshold may be justified by a utility function
tailored to a particular case (Kruschke 2018, 276; Lakens et al. 2018, 170; Mudge et
al. 2012).
    However, social scientific research often examines cases that are too heterogeneous to identify a single utility function that applies to every case, and may want to be agnostic about
it. This may be one of the reasons why researchers usually resort to the conventional
threshold such as whether a point estimate has a p-value of less than 5%, or whether a
95% confidence/credible interval does not include zero. Yet, the conventional threshold
(or any other threshold) is unlikely to be universally optimal for decision making, exactly
because how small the uncertainty of a causal effect should be depends on a utility
function, which varies across decision-making contexts (Berger 1985; Kruschke 2018, 278;
Lakens et al. 2018, 170; Mudge et al. 2012).
    Even if one adopted the view that research should be free from context-specific utility
and should use an “objective” criterion universally to detect a causal effect, p < 5%
would not be such a criterion. This is because universally requiring a fixed p-value shifts subjectivity from the choice of a decision threshold for a p-value to that for a sample size,
meaning that statistical significance can be obtained by choosing a large enough sample
size subjectively (Mudge et al. 2012). If “anything that plausibly could have an effect will
not have an effect that is exactly zero” in social science (Gelman 2011, 961), a non-zero
effect is by definition “detectable” as a sample size increases.
    While some propose, as an alternative to the conventional threshold, that we evaluate
whether an interval estimate excludes not only zero but also practically null values (Gross
2015; Kruschke 2018), this requires researchers to justify why a certain range of values
should be considered practically null, which still depends on a utility function (Kruschke
2018, 276–78). My approach avoids this problem, because it does not require the use
of any decision threshold either for an uncertainty measure or for an effect size. Using
a posterior distribution, it simply presents the probabilities of different effect sizes as
such. My approach allows researchers to play a role of an information provider for
the uncertainty of statistically estimated causal effects, and let decision makers use this
information in light of their own utility functions.
    The standard ways to summarize a posterior distribution are a probability density
plot, the probability of a one-sided hypothesis, and a credible interval. My approach is
different from these in the following respects. A probability density plot shows the full
distribution of parameter values. Yet, it is difficult for readers to accurately compute a
probability mass for a specific range of parameter values, just by looking at a probability
density plot (Kay et al. 2016).
    Perhaps for this reason, researchers sometimes report, together with a density plot, either (1) the probability mass for parameter values greater/smaller than a particular value (usually zero), i.e., the probability of a one-sided hypothesis, or (2) a credible interval.
However, these two approaches also have drawbacks. In the case of the probability of a
one-sided hypothesis, a specific value needs to be defined as the decision threshold for
the effect size beyond which the probability is computed. Therefore, it raises the aforementioned problem, i.e., demanding a justification for that particular threshold based
on a utility function. In the case of a credible interval, we must decide what level of
probability is used to define an interval, and what values are practically null (Kruschke
2018). This also brings the problem of demanding a justification for particular thresh-
olds, one for the level of probability and the other for practically null values, based on
a utility function. In addition, when a credible interval includes conflicting values (e.g.,
both significant positive and significant negative values), it can only imply inconclusive
information (Kruschke 2018), although the degree of uncertainty actually differs depending on what portion of the interval covers these conflicting values (e.g., 5% of the interval vs. 50% of the interval).
    Experimental studies by Allen et al. (2014), Edwards et al. (2012), and Fernandes
et al. (2018) suggest the plot of a complementary cumulative distribution, as used in my approach explained in the next section, is one of the best ways to present uncertainty
(although their findings are about the uncertainty of predictions rather than that of causal
effects). The plot of a complementary cumulative distribution presents the probability
of a random variable taking a value greater (in absolute terms) than some specific value.
Formally: P(X > x), where X is a random variable and x is a particular value of X.
Allen et al. (2014) and Edwards et al. (2012) find the plot of a complementary cumulative
distribution is effective both for probability estimation accuracy and for making a cor-
rect decision based on the probability estimation, even under time pressure (Edwards et
al. 2012) or cognitive load (Allen et al. 2014). Note that Allen et al. (2014) and Edwards et al. (2012) indicate a complementary cumulative distribution is unsuitable for accurately estimating the mean of a distribution. If the quantities of interest included the mean of a posterior distribution, one could report it alongside the plot. Fernandes et al. (2018) find the
plot of a complementary cumulative distribution is one of the best formats for optimal
decision making over repeated decision-making contexts, while a probability density plot,
a one-sided hypothesis, and a credible interval are not.
    I am not arguing that my proposed approach should be used, or is ideal, for reporting the uncertainty of statistically estimated causal effects in every case. For example, if one were
interested in the Frequentist coverage of a fixed parameter value, she would rather use
a confidence interval over repeated sampling. Note that repeated sampling for the same
study is usually uncommon in social science; uncertainty is usually specific to a single
study.
    It is possible that there is no universally best way to present uncertainty (Visschers
et al. 2009). It is also possible that “the presentation format hardly has an impact on
people in situations where they have time, motivation, and cognitive capacity to process
information systematically” (Visschers et al. 2009, 284; but also see Suzuki 2021). Yet,
even those people would be unable to evaluate uncertainty properly if the presentation
provided only limited information because of its focus on a specific decision threshold
without giving any justification for that threshold. For example, if only a 95% credible
interval were presented, it would be difficult to see what the range of the effect sizes
would be if a decision maker were interested in a different probability level (e.g., 85%).
My approach conveys the uncertainty of statistically estimated causal effects without any
decision threshold.

3     The Proposed Approach
My approach utilizes a posterior distribution estimated by Bayesian statistics (for Bayesian
statistics, see for example Gelman et al. 2013; Gill 2015; Kruschke 2015; McElreath 2015).
A posterior, denoted as p(θ | D, M), is the probability distribution of a parameter, θ, given data, D, and model assumptions, M, such as a functional form, an identification strategy,
and a prior distribution of θ. Here, let us assume θ is a parameter for the effect of a causal
factor.
    Using a posterior distribution, we can compute the probabilities of different effect
sizes. More specifically, we can compute the probability that a causal factor has an
effect greater (in absolute terms) than some effect size. Formally: P(θ > θ̃+ | D, M) for the positive values of θ, where θ̃+ is zero or some positive value; P(θ < θ̃− | D, M) for the negative values of θ, where θ̃− is zero or some negative value. If we compute this probability while changing θ̃ up to theoretical or practical limits (e.g., up to the theoretical bounds of a posterior or up to the minimum/maximum in posterior samples), it results in a complementary cumulative distribution.
    As is often the case, a posterior distribution might include both positive and negative values. In such a case, we should compute two complementary cumulative distribution functions, one for the positive values of θ, i.e., P(θ > θ̃+ | D, M), and the other for the negative values of θ, i.e., P(θ < θ̃− | D, M). Whether it is plausible to suspect the effect can be either positive or negative depends on theory. If only one direction of the effect is theoretically plausible (e.g., see VanderWeele and Robins 2010), this should be reflected in the prior of θ, e.g., by putting a bound on the range of the parameter.
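    To make this computation concrete, the following is a minimal R sketch, assuming only that posterior samples of θ are available as a numeric vector; the object names and the simulated draws are illustrative and are not taken from the accompanying ccdfpost package.

    ## Minimal sketch: complementary cumulative probabilities from posterior samples.
    ## "draws" stands in for posterior samples of the effect parameter theta;
    ## here they are simulated for illustration only.
    set.seed(1)
    draws <- rnorm(10000, mean = 1, sd = 1)

    ## Candidate effect sizes, from zero up to the largest positive draw and
    ## down to the smallest negative draw.
    pos_grid <- seq(0, max(draws), length.out = 100)
    neg_grid <- seq(0, min(draws), length.out = 100)

    ## P(theta > x | D, M): the share of draws above x, for each positive effect size x.
    p_pos <- sapply(pos_grid, function(x) mean(draws > x))
    ## P(theta < x | D, M): the share of draws below x, for each negative effect size x.
    p_neg <- sapply(neg_grid, function(x) mean(draws < x))

    ## Plot both complementary cumulative distributions; the negative branch is
    ## shown in absolute terms so the two curves share one x-axis.
    plot(pos_grid, p_pos, type = "l", ylim = c(0, 1),
         xlab = "Minimum predicted change in the outcome", ylab = "Probability")
    lines(abs(neg_grid), p_neg, lty = 2)
    legend("topright", legend = c("P(theta > x)", "P(theta < x)"), lty = c(1, 2))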

Figure 1: Example of presenting a posterior as a complementary cumulative distribution plot.

    Figure 1 is an example of my approach. I use a distribution of 10,000 draws from a normal distribution with a mean of 1 and a standard deviation of 1. Let us assume these draws are those from a posterior distribution, or “posterior samples,” for a coefficient in a linear regression, so that the values represent a change in an outcome variable. The x-axis is different effect sizes, or different values of minimum predicted changes in the outcome; the y-axis is the probability of an effect being greater than a minimum predicted change. The y-axis attaches the qualifier “near” to 0% and 100%, because the normal distribution is unbounded and, therefore, the plot of the posterior samples cannot represent the exact 0% and 100% probability.
    From the figure, it is possible to see the probability of an effect being greater (in absolute terms) than a certain effect size. For example, we can say: “The effect is expected to increase the outcome by more than zero points with a probability of approximately 84%.” It is also clear that the positive effect is much more probable than the negative effect: 16% probability for θ < 0 versus 84% probability for θ > 0. Finally, with a little more attention, we can also read the probability of a range of the effect size. For example, we can compute the probability that the effect increases the outcome by more than one point and up to three points, as P(θ > 1) − P(θ > 3) ≈ .49 − .02 = .47. This computation is useful if an effect can be “too large,” such that a treatment becomes counterproductive (e.g., the effect of a diet on weight loss).
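    The probabilities quoted above can be read directly off the posterior samples; a brief sketch, assuming simulated draws as in Figure 1 (the exact values vary slightly with the random seed):

    ## Probabilities discussed for Figure 1, computed from simulated posterior samples.
    set.seed(1)
    draws <- rnorm(10000, mean = 1, sd = 1)

    mean(draws > 0)                    # P(theta > 0): roughly .84
    mean(draws < 0)                    # P(theta < 0): roughly .16
    mean(draws > 1) - mean(draws > 3)  # P(theta > 1) - P(theta > 3): roughly .47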
    For comparison, I also present the standard ways to summarize a posterior, using the
same 10,000 draws as in Figure 1: a probability density plot (Figure 2), a credible interval,
and one-sided hypotheses (both in Table 1). The probability density plot gives an overall
impression that positive values are more probable than negative values. However, it is difficult to compute the exact probability that the effect is greater than, say, 1. Unlike in my approach, the y-axis is density rather than probability, and density is not an intuitive quantity.
    The credible interval is computed based on the conventional decision threshold of the 95% credible level. It includes both negative and positive values and, therefore, the conventional approach leads either to the conclusion that the effect is not statistically significant, or to the conclusion that there is no conclusive evidence for the effect (Gross 2015; Kruschke 2018). The one-sided hypotheses are computed based on the decision threshold of the null effect, i.e., P(θ > 0) and P(θ < 0), as commonly done. Because of this threshold, the information about the uncertainty is much more limited than what my approach presents in Figure 1.

          Figure 2: Example of presenting a posterior as a probability density plot.

                   Mean     95% Credible Interval    P(θ > 0)    P(θ < 0)
                   0.98         [−1.01, 2.92]          0.84        0.16

 Table 1: Example of presenting a posterior as a credible interval or as one-sided hypotheses.

    Most importantly, the use of these decision thresholds requires researchers to justify why these thresholds should be used, rather than, say, the 94% credible interval or P(θ > 0.1) and P(θ < −0.1). This problem does not apply to my approach. My approach can be considered a generalized way to use one-sided hypotheses. It graphically summarizes all one-sided hypotheses (up to a practically computable point), using different effect sizes as different thresholds beyond which the probability of an effect is computed.
    I note three caveats about my approach. First, it takes up more space than presenting
regression tables, the common presentation style in social science. Therefore, if we wanted
to present all regression coefficients using my approach, it would be too cumbersome.
Yet, the purpose of quantitative causal research in social science is usually to estimate
the effect size of a causal factor or two, and the remaining regressors are controls to enable
the identification of the causal effect(s) of interest. Indeed, it is often hard to interpret all
regression coefficients causally because of complex causal mechanisms (Keele, Stevenson,
and Elwert 2020). When researchers use matching instead of regression, they typically
report only the effect of a treatment variable (Keele, Stevenson, and Elwert 2020, 1–2),
although the identification strategy is the same – selection on observables (Keele 2015,
321–22). Thus, if researchers are interested in estimating causal effects, they usually need
to report only one or two causal factors. If so, it is not a problem even if my proposed
approach takes more space than regression tables.
    Second, if the effect size is scaled in a nonintuitive measure (such as a log odds ratio in logistic regression), researchers will need to convert it to an intuitive scale to make my approach work best (for detail, see Sarma and Kay 2020). For example, in the case of logistic regression, researchers can express an effect size as a difference in the predicted probability of an outcome variable.
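    As an illustration of that conversion, the following sketch turns hypothetical posterior draws from a logistic regression (an intercept and a treatment coefficient on the log-odds scale) into a difference in predicted probabilities; the draws and object names are assumptions for illustration, not estimates from any model in this article.

    ## Hypothetical posterior draws on the log-odds scale: an intercept (b0) and a
    ## treatment coefficient (b1). In practice these would come from a fitted model.
    set.seed(1)
    b0 <- rnorm(10000, mean = -0.5, sd = 0.20)
    b1 <- rnorm(10000, mean = 0.4, sd = 0.25)

    ## Convert each draw to predicted probabilities with and without the treatment,
    ## and take the difference: an effect size on an intuitive probability scale.
    effect_draws <- plogis(b0 + b1) - plogis(b0)

    ## The proposed plot can then be built on this scale; for example, the probability
    ## that the treatment raises the outcome probability by more than 5 percentage points.
    mean(effect_draws > 0.05)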
    Third, as all models are wrong, Bayesian models are also wrong; they are simplifications of reality. A posterior distribution is conditional on the data used and the model assumptions. It cannot be a reliable estimate if the data used are inadequate for the purpose of research (e.g., a sample being unrepresentative of the target population or collected
with measurement errors), and/or if one or more of the model assumptions are implau-
sible (which is usually the case). Moreover, in practice a posterior usually needs to be
computed by a Markov chain Monte Carlo (MCMC) method, and there is no guarantee
that the resulting posterior samples precisely mirror the true posterior. Therefore, the
estimated probability of an effect should not be considered the “perfect” measure of uncertainty. For example, even if the estimated probability of an effect being greater than zero is 100% given the model and the computational method, it should NOT be interpreted as certainty about the effect in practice.
    Given these limitations, the same principle applies to my proposed approach as to
the p-value: “No single index should substitute for scientific reasoning” (Wasserstein and
Lazar 2016, 132). What matters is not the “trueness” of a model but the “usefulness”
of a model. My approach makes a model more useful for evaluating and communicating the uncertainty of statistically estimated causal effects than the conventional approaches do, in the following respects. First, it uses the probability of an effect as an intuitive quantity of uncertainty, for a better understanding and communication. Second, it does not
require any decision thresholds for uncertainty measures or effect sizes. Therefore, it
allows researchers to be agnostic about a utility function required to justify such decision
thresholds, and to be an information provider presenting the probabilities of different
effect sizes as such.

4    Application
I exemplify my proposed approach by applying it to a previous social scientific study
(Huff and Kruszewska 2016) and using its dataset (Huff and Kruszewska 2015). Huff and
Kruszewska (2016) collected a nationally representative sample of 2,000 Polish adults and
ran an experiment on whether more violent methods of protest by an opposition group increase or decrease public support for the government negotiating with the group. Specifically, I
focus on the two analyses in Figure 4 of Huff and Kruszewska (2016), which present (1)
the effect of an opposition group using bombing in comparison to occupation, and (2)
the effect of an opposition group using occupation in comparison to demonstrations, on
the attitude of the experiment participants towards tax policy in favor of the opposition
group. The two treatment variables are measured dichotomously, while the attitude towards tax policy, the dependent variable, is measured on a 100-point scale. Huff and Kruszewska (2016) use a separate linear regression for each treatment variable to estimate its average effect. The model is Y = β0 + β1D + ε, where Y is the dependent variable, D is the treatment variable, β0 is the constant, β1 captures the size of the average causal effect, and ε is the error term.
    In the application, I convert the model to the following equivalent Bayesian linear
regression model:

                            yi ∼ Normal(µ, σ),
                            µ = β0 + β1di,
                            β0 ∼ Normal(µβ0 = 50, σβ0 = 20),
                            β1 ∼ Normal(µβ1 = 0, σβ1 = 5),
                            σ ∼ Exponential(rate = 0.5),

                                            Mean β1    95% Credible Interval      N
            bombing vs. occupation           −2.49         [−5.48, 0.49]         996
         occupation vs. demonstration        −0.53         [−3.31, 2.32]         985

        Table 2: Results using the 95% credible interval. R̂ ≈ 1.00 for all parameters.

where yi and di are respectively the outcome Y and the treatment D for an individual i.
For the quantity of interest β1, I use a weakly informative prior of β1 ∼ Normal(µβ1 = 0, σβ1 = 5). This prior reflects the point that the original study presents no prior belief in favor of the negative or positive average effect of a more violent method by an opposition group on public support, as it hypothesizes both effects as plausible. I also use a weakly informative prior for the constant term, β0 ∼ Normal(µβ0 = 50, σβ0 = 20), which means that, without the treatment, respondents are expected to have a neutral attitude on average, but the baseline attitude of each individual may well vary. σ ∼ Exponential(rate = 0.5) is a weakly informative prior for the model standard deviation parameter, implying any stochastic factor is unlikely to change the predicted outcome value by more than 10 points. I use four MCMC chains; each chain runs 10,000 iterations, the first 1,000 of which are discarded as warmup. The MCMC algorithm used is Stan (Stan Development Team 2019), implemented via the rstanarm package version 2.21.1 (Goodrich et al. 2020) on R version 4.1.0 (R Core Team 2021). The R̂ was approximately 1.00 for every estimated parameter, suggesting the models did not fail to converge. The effective sample size was at least 15,000 for every estimated parameter.
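    A sketch of how such a model could be fit and summarized with rstanarm follows; the data frame and variable names (sim, support, treat) are placeholders standing in for the replication data rather than the original analysis code, and the priors mirror the specification above.

    library(rstanarm)

    ## Placeholder data standing in for the replication dataset: "support" is the
    ## 100-point attitude towards tax policy and "treat" the dichotomous treatment.
    set.seed(1)
    sim <- data.frame(treat = rbinom(996, 1, 0.5),
                      support = round(runif(996, 0, 100)))

    ## Bayesian linear regression with the weakly informative priors described above;
    ## four chains, 10,000 iterations each, the first 1,000 discarded as warmup.
    fit <- stan_glm(support ~ treat, data = sim, family = gaussian(),
                    prior_intercept = normal(50, 20),  # beta_0 ~ Normal(50, 20)
                    prior = normal(0, 5),              # beta_1 ~ Normal(0, 5)
                    prior_aux = exponential(0.5),      # sigma ~ Exponential(0.5)
                    chains = 4, iter = 10000, warmup = 1000, seed = 1)

    ## Posterior draws of the treatment coefficient, the kind of conventional summaries
    ## shown in Table 2, and the probability of a negative effect, P(beta_1 < 0 | D, M).
    beta1_draws <- as.matrix(fit)[, "treat"]
    mean(beta1_draws)
    quantile(beta1_draws, c(0.025, 0.975))
    mean(beta1_draws < 0)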
    Table 2 presents the results in a conventional way: the posterior mean of β1 and
the 95% credible interval for the model examining the effect of bombing in comparison
to occupation, and those for the model examining the effect of occupation in comparison
to demonstrations. While the mean is a negative value in both models, the 95% credible
interval includes not only negative values but also zero and positive values. This means
the typical interval approach (e.g., Gross 2015; Kruschke 2018) would lead us to simply
conclude there is no conclusive evidence for the average effect.
    Figure 3 uses my approach for the effect of bombing in comparison to occupation.
It enables richer inference than the above conventional approach. If we focus on the
probability of β1 < 0, for example, the model expects that if an opposition group uses bombing instead of occupation, it should reduce public support for tax policy in favor of the group by more than 0 points, with a probability of approximately 95%. Thus, the
negative effect of bombing is much more likely (95% probability) than the positive effect
(5% probability).
    The original conclusion was that the effect was not statistically significant at p <
5%, the threshold set at the pre-registration stage (Huff and Kruszewska 2016, 1794–
1795). However, they put a post hoc caveat that if the threshold of statistical significance
had been set at 10%, the effect would have been regarded as statistically significant
(Huff and Kruszewska 2016, 1795). This interpretation is inconsistent with both the Fisherian paradigm of p-values and the Neyman-Pearson paradigm of p-values (Lew
2012). According to the Fisherian paradigm, the preset threshold of statistical significance
is unnecessary, because a p-value in this paradigm is a local measure and not a global
false positive rate – the rate of false positives over repeated sampling from the same
data distribution (Lew 2012, 1562–63). An exact p-value should be interpreted as such
– although it is difficult to make intuitive sense of, because it is not the probability of an effect but the probability of obtaining data as extreme as, or more extreme than, those that are observed, given that the null hypothesis is true (Lew 2012, 1560; Wasserstein and Lazar 2016, 131). According to the Neyman-Pearson paradigm, no post hoc adjustment to the preset threshold of statistical significance should be made, because a p-value in this paradigm is used as a global false positive rate and not as a local measure (Lew 2012, 1562–63). The estimates from my approach are more straightforward to interpret. We need no a priori decision threshold for a posterior to determine significance, and can evaluate the probability of an effect post hoc (Kruschke 2015, 328).

                 Figure 3: Effect of bombing in comparison to occupation.

               Figure 4: Effect of occupation in comparison to demonstration.

    Figure 4 uses my approach for the effect of occupation in comparison to demon-
strations. The model expects that if an opposition group uses occupation instead of
demonstrations, it should reduce public support for tax policy in favor of the group by
more than 0 points, with a probability of approximately 64%. This suggests weak ev-
idence, rather than no evidence, for the negative effect of occupation, meaning that the
negative effect is more likely (64% probability) than the positive effect (36% probabil-
ity). Meanwhile, the original conclusion was simply that the effect was not statistically
significant at p < 5% (Huff and Kruszewska 2016, 1777, 1795).
    In short, my approach presents richer information about the uncertainty of the statistically estimated causal effects than the significance-vs.-insignificance approach taken by the original study. It helps researchers better understand and communicate the uncertainty of the average causal effects of violent protest methods.

5    Conclusion
I have proposed an alternative approach for social scientists to present the uncertainty of statistically estimated causal effects: presenting the probabilities of different effect sizes via the plot of a complementary cumulative distribution function. Unlike the conventional
significance-vs.-insignificance approach and the standard ways to summarize a posterior
distribution, my approach does not require any preset decision threshold for the “signifi-
cance,” “confidence,” or “credible” level of uncertainty or for the effect size beyond which
the effect is considered practically relevant. It therefore allows researchers to be agnostic about a decision threshold and a justification for it. In my approach, researchers play the role of an information provider and present different effect sizes and their associated probabilities as such. I have shown through the application to the previous study that my approach presents richer information about the uncertainty of statistically estimated causal effects than the conventional significance-vs.-insignificance approach, supporting a better understanding and communication of the uncertainty.
    My approach has implications for problems in the current (social) scientific practices,
such as p-hacking, the publication bias, and seeing statistical insignificance as evidence
for the null effect (Amrhein, Greenland, and McShane 2019; Esarey and Wu 2016; Ger-
ber and Malhotra 2008; McShane and Gal 2016, 2017; Simonsohn, Nelson, and Simmons
2014). First, if the uncertainty of statistically estimated causal effects were reported
by my approach, both researchers and non-experts would be able to understand it in-
tuitively, as probability is used as a continuous measure of uncertainty in everyday life
(e.g., a weather forecast for the chance of rain). This could help mitigate the dichotomous
thinking of there being an effect or no effect, commonly associated with the significance-
vs.-insignificance approach. Second, if research outlets such as journals accepted my
approach as a way to present the uncertainty of statistically estimated causal effects,
researchers could feel less need to report only a model that produces an uncertainty mea-
sure below a certain threshold, such as p < 5%. This could help address the problem of
p-hacking and the publication bias.
    Presenting the uncertainty of statistically estimated causal effects as such has impli-
cations for decision makers as well. If decision makers had access to the probabilities of
different effect sizes, they could use this information in light of their own utility func-
tions and decide whether to use a treatment or not. It is possible that even when the conventional threshold of the 95% credible level produced statistical insignificance for a treatment effect, decision makers could have a utility function that leads them to prefer using the treatment over doing nothing, given a level of probability that does not reach the conventional threshold but that they see as “high enough.” It is also possible that even when a causal effect reaches the conventional threshold of the 95% credible level, decision makers could have a utility function under which the probability is not high enough (e.g., it must be 99% rather than 95%), leading them to prefer doing nothing over using the treatment.
    I acknowledge my proposed approach may not suit everyone’s needs. Yet, I hope this
article provides a useful insight into how to understand and communicate the uncertainty
of statistically estimated causal effects, and the accompanying R package helps other
researchers to implement my approach without difficulty.

Acknowledgments
I would like to thank Johan A. Elkink, Jeff Gill, Zbigniew Truchlewski, Alexandru Moise,
and participants in the 2019 PSA Political Methodology Group Annual Conference, the
2019 EPSA Annual Conference, and seminars at Dublin City University and University
College Dublin, for their helpful comments. I would like to acknowledge the receipt of
funding from the Irish Research Council (the grant number: GOIPD/2018/328) for the
development of this work. The views expressed are my own unless otherwise stated, and
do not necessarily represent those of the institutes/organizations to which I am/have
been related.

Supplemental Materials
The R code to reproduce the results in this article and the R package to implement my approach (“ccdfpost”) are available on my website at https://akisatosuzuki.github.io.

References
Allen, Pamela M., John A. Edwards, Frank J. Snyder, Kevin A. Makinson, and David M.
    Hamby. 2014. “The Effect of Cognitive Load on Decision Making with Graphically
    Displayed Uncertainty Information.” Risk Analysis 34 (8): 1495–1505.
Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. “Retire Statistical Sig-
   nificance.” Nature 567:305–307.
Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E.-J. Wa-
    genmakers, Richard Berk, Kenneth A. Bollen, et al. 2017. “Redefine Statistical Sig-
    nificance.” Nature Human Behaviour 2 (January): 6–10.
Berger, James O. 1985. Statistical Decision Theory and Bayesian Analysis. 2nd ed. New
    York, NY: Springer.
Budescu, David V., Han-Hui Por, Stephen B. Broomell, and Michael Smithson. 2014.
   “The Interpretation of IPCC Probabilistic Statements around the World.” Nature
   Climate Change 4 (6): 508–512.
Edwards, John A., Frank J. Snyder, Pamela M. Allen, Kevin A. Makinson, and David M.
   Hamby. 2012. “Decision Making for Risk Management: A Comparison of Graphical
   Methods for Presenting Quantitative Uncertainty.” Risk Analysis 32 (12): 2055–
   2070.
Esarey, Justin, and Ahra Wu. 2016. “Measuring the Effects of Publication Bias in Political
    Science.” Research & Politics 3 (3): 1–9.
Fernandes, Michael, Logan Walls, Sean Munson, Jessica Hullman, and Matthew Kay.
    2018. “Uncertainty Displays Using Quantile Dotplots or CDFs Improve Transit
    Decision-Making.” In Proceedings of the 2018 CHI Conference on Human Factors
    in Computing Systems, 1–12. https://doi.org/10.1145/3173574.3173718.

Friedman, Jeffrey A., Joshua D. Baker, Barbara A. Mellers, Philip E. Tetlock, and Richard
     Zeckhauser. 2018. “The Value of Precision in Probability Assessment: Evidence from
     a Large-Scale Geopolitical Forecasting Tournament.” International Studies Quarterly
     62 (2): 410–422.
Gelman, Andrew. 2011. “Causality and Statistical Learning.” American Journal of Soci-
    ology 117 (3): 955–966.
Gelman, Andrew, and John Carlin. 2017. “Some Natural Solutions to the P-Value Com-
    munication Problem—and Why They Won’t Work.” Journal of the American Sta-
    tistical Association 112 (519): 899–901.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and
    Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Boca Raton, FL: CRC
    Press.
Gerber, Alan, and Neil Malhotra. 2008. “Do Statistical Reporting Standards Affect What
    Is Published? Publication Bias in Two Leading Political Science Journals.” Quarterly
    Journal of Political Science 3 (3): 313–326.
Gill, Jeff. 2015. Bayesian Methods: A Social and Behavioral Sciences Approach. 3rd ed.
     Boca Raton, FL: CRC Press.
Goodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2020. “rstanarm: Bayesian
    Applied Regression Modeling via Stan.” R package version 2.21.1. https://mc-stan.org/rstanarm.
Gross, Justin H. 2015. “Testing What Matters (If You Must Test at All): A Context-
    Driven Approach to Substantive and Statistical Significance.” American Journal of
    Political Science 59 (3): 775–788.
Huff, Connor, and Dominika Kruszewska. 2015. “Replication Data for: Banners, Barri-
    cades, and Bombs: The Tactical Choices of Social Movements and Public Opinion,
    V1.” Harvard Dataverse. Accessed April 17, 2019. https://doi.org/10.7910/DVN/
    DMMQ4L.
———. 2016. “Banners, Barricades, and Bombs: The Tactical Choices of Social Move-
    ments and Public Opinion.” Comparative Political Studies 49 (13): 1774–1808.
Jenkins, Sarah C., Adam J. L. Harris, and R. M. Lark. 2018. “Understanding ‘Unlikely
    (20% Likelihood)’ or ‘20% Likelihood (Unlikely)’ Outcomes: The Robustness of the
    Extremity Effect.” Journal of Behavioral Decision Making 31 (4): 572–586.
Jenkins, Sarah C., Adam J. L. Harris, and R. Murray Lark. 2019. “When Unlikely Out-
    comes Occur: The Role of Communication Format in Maintaining Communicator
    Credibility.” Journal of Risk Research 22 (5): 537–554.
Kay, Matthew, Tara Kola, Jessica R. Hullman, and Sean A. Munson. 2016. “When (Ish)
    Is My Bus? User-Centered Visualizations of Uncertainty in Everyday, Mobile Pre-
    dictive Systems.” In Proceedings of the 2016 CHI Conference on Human Factors in
    Computing Systems, 5092–5103. https://doi.org/10.1145/2858036.2858558.
Keele, Luke. 2015. “The Statistics of Causal Inference: A View from Political Methodol-
    ogy.” Political Analysis 23 (3): 313–335.

Keele, Luke, Randolph T. Stevenson, and Felix Elwert. 2020. “The Causal Interpretation
    of Estimated Associations in Regression Models.” Political Science Research and
    Methods 8 (1): 1–13.
Kruschke, John K. 2015. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and
    Stan. London: Academic Press.
———. 2018. “Rejecting or Accepting Parameter Values in Bayesian Estimation.” Ad-
    vances in Methods and Practices in Psychological Science 1 (2): 270–280.
Lakens, Daniel, Federico G. Adolfi, Casper J. Albers, Farid Anvari, Matthew A. J. Apps,
    Shlomo E. Argamon, Thom Baguley, et al. 2018. “Justify Your Alpha.” Nature Hu-
    man Behaviour 2:168–171.
Lew, Michael J. 2012. “Bad Statistical Practice in Pharmacology (and Other Basic
    Biomedical Disciplines): You Probably Don’t Know P.” British Journal of Phar-
    macology 166 (5): 1559–1567.
Mandel, David R., and Daniel Irwin. 2021. “Facilitating Sender-Receiver Agreement in
   Communicated Probabilities: Is It Best to Use Words, Numbers or Both?” Judgment
   and Decision Making 16 (2): 363–393.
McElreath, Richard. 2015. Statistical Rethinking: A Bayesian Course with Examples in
   R and Stan. Boca Raton, FL: CRC Press.
McShane, Blakeley B., and David Gal. 2016. “Blinding Us to the Obvious? The Effect
   of Statistical Training on the Evaluation of Evidence.” Management Science 62 (6):
   1707–1718.
———. 2017. “Statistical Significance and the Dichotomization of Evidence.” Journal of
    the American Statistical Association 112 (519): 885–895.
Mislavsky, Robert, and Celia Gaertig. 2019. “Combining Probability Forecasts: 60% and
    60% Is 60%, but Likely and Likely Is Very Likely.” SSRN, https://ssrn.com/abstract=3454796.
Mudge, Joseph F., Leanne F. Baker, Christopher B. Edge, and Jeff E. Houlahan. 2012.
   “Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests.”
   PLoS ONE 7 (2): 1–7.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna,
   Austria: R Foundation for Statistical Computing. https://www.R-project.org.
Sarma, Abhraneel, and Matthew Kay. 2020. “Prior Setting in Practice: Strategies and
    Rationales Used in Choosing Prior Distributions for Bayesian Analysis.” In Proceed-
    ings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–12.
    https://doi.org/10.1145/3313831.3376377.
Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. 2014. “P-Curve: A Key to the
    File-Drawer.” Journal of Experimental Psychology: General 143 (2): 534–547.
Stan Development Team. 2019. Stan Reference Manual, Version 2.27. https://mc-stan.org/docs/2_27/reference-manual/index.html.

Suzuki, Akisato. 2021. “Which Type of Statistical Uncertainty Helps Evidence-Based Pol-
    icymaking? An Insight from a Survey Experiment in Ireland.” arXiv: 2108.05100v1
    [stat.OT]. https://arxiv.org/abs/2108.05100.
VanderWeele, Tyler J., and James M. Robins. 2010. “Signed Directed Acyclic Graphs
    for Causal Inference.” Journal of the Royal Statistical Society Series B (Statistical
    Methodology) 72 (1): 111–127.
Visschers, Vivianne H.M., Ree M. Meertens, Wim W.F. Passchier, and Nanne N.K. De
    Vries. 2009. “Probability Information in Risk Communication: A Review of the Re-
    search Literature.” Risk Analysis 29 (2): 267–287.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values:
   Context, Process, and Purpose.” The American Statistician 70 (2): 129–133.
Wood, Michael. 2019. “Simple Methods for Estimating Confidence Levels, or Tentative
   Probabilities, for Hypotheses Instead of p Values.” Methodological Innovations 12
   (1): 1–9.
