The Rules of the Game Called Psychological Science

 
CONTINUE READING
Perspectives on Psychological Science

The Rules of the Game Called Psychological                                                                                7(6) 543­–554
                                                                                                                          © The Author(s) 2012
                                                                                                                          Reprints and permission:
Science                                                                                                                   sagepub.com/journalsPermissions.nav
                                                                                                                          DOI: 10.1177/1745691612459060
                                                                                                                          http://pps.sagepub.com

Marjan Bakker1, Annette van Dijk1, and Jelte M. Wicherts2
1
University of Amsterdam, The Netherlands, and 2Tilburg University, The Netherlands

Abstract
If science were a game, a dominant rule would probably be to collect results that are statistically significant. Several reviews
of the psychological literature have shown that around 96% of papers involving the use of null hypothesis significance testing
report significant outcomes for their main results but that the typical studies are insufficiently powerful for such a track
record. We explain this paradox by showing that the use of several small underpowered samples often represents a more
efficient research strategy (in terms of finding p < .05) than does the use of one larger (more powerful) sample. Publication
bias and the most efficient strategy lead to inflated effects and high rates of false positives, especially when researchers also
resorted to questionable research practices, such as adding participants after intermediate testing. We provide simulations
that highlight the severity of such biases in meta-analyses. We consider 13 meta-analyses covering 281 primary studies in
various fields of psychology and find indications of biases and/or an excess of significant results in seven. These results
highlight the need for sufficiently powerful replications and changes in journal policies.

Keywords
replication, sample size, power, publication bias, false positives

In many ways, science resembles a game (Mahoney, 1976). It                       introduced. We assess these problems in 13 psychological
involves rules (not cheating), individual players (researchers),                 meta-analyses and discuss solutions.
competing teams (paradigms), arbiters (reviewers and edi-
tors), and the winning of points (publications) and trophies
(professorships, grants, and awards). Just like many games,                      Authors Are Lucky!
science also involves the laws of chance. This is so specifi-                    It has long been argued that the combined outcomes of NHST
cally because many results are obtained by null hypothesis                       in the scientific literature are too good to be true (Fanelli,
significance testing (NHST; Kline, 2004). Notwithstanding                        2010; Fiedler, 2011; Ioannidis, 2008a; Sterling, 1959; Vul,
the criticism it has received (Cohen, 1990, 1994; Gigerenzer,                    Harris, Winkielman, & Pashler, 2009). Fanelli (2010) docu-
2004; Kruschke, 2011; Meehl, 1978; Nickerson, 2000; Roze-                        mented that over 80% of scientific publications in various sci-
boom, 1960; Wagenmakers, 2007; Wetzels et al., 2011), NHST                       ences report positive results and that the psychological
continues to be the main method of statistical inference in                      literature shows the highest prevalence of positive outcomes.
many fields. In NHST, the researcher defines a null hypothesis                   Sterling (1959) and Sterling, Rosenbaum, and Weinkam
of no effect (H0) and then determines the chance of finding at                   (1995) showed that in 97% (in 1958) and 96% (in 1986–1987)
least the observed effect given that this null hypothesis is true.               of psychological studies involving the use of NHST, H0 was
If this collected chance (or p value) is lower than a predeter-                  rejected at α = .05. Although it should be noted that psycho-
mined threshold (typically .05), the result is called significant.               logical papers report a host of test results (Maxwell, 2004), the
A significant result will increase the possibility of publishing a               abundance of positive outcomes is striking because effect
result (Mahoney, 1977). If science were a game, winning                          sizes (ESs) in psychology are typically not large enough to be
would entail writing the most interesting publications by gath-                  detected by the relatively small samples used in most studies
ering many p values below .05.                                                   (i.e., studies are often underpowered; Cohen, 1990).
    In this article, we discuss the replication crisis in psychol-
ogy in terms of the strategic behaviors of researchers in their
quest for significant outcomes in NHST. In line with previous
                                                                                 Corresponding Author:
work (Ioannidis, 2005, 2008b; Simmons, Nelson, & Simon-                          Marjan Bakker, Department of Psychology, Psychological Methods, University
shon, 2011), we present the results of simulations in the con-                   of Amsterdam, Weesperplein 4, 1018 XA Amsterdam, The Netherlands
text of meta-analysis to highlight the potential biases thus                     E-mail: M.Bakker1@uva.nl

                                                   Downloaded from pps.sagepub.com by guest on May 17, 2015
544		                                                                                                                                                       Bakker et al.

    The power of statistical tests depends on the nominal sig-                             major U.S. university, the percentage of unpublished studies in
nificance level (typically .05), the sample size, and the under-                           psychology is estimated to be at least 50% (Cooper, DeNeve,
lying ES, such as Cohen’s d for between-group mean                                         & Charlton, 1997; Coursol & Wagner, 1986; Shadish, Doherty,
comparisons. According to Marszalek, Barber, Kohlhart, and                                 & Montgomery, 1989), but the problem goes beyond wide-
Holmes (2011), the median total sample size in four represen-                              spread failure to publish. Statistical textbooks advise the use
tative psychological journals (Journal of Abnormal Psychol-                                of formal a priori power estimates, but in a recent sample of
ogy, Journal of Applied Psychology, Journal of Experimental                                psychological papers with NHST (Bakker & Wicherts, 2011),
Psychology: Human Perception and Performance, and Devel-                                   only 11% referred to power as a rationale for the choice of
opmental Psychology) was 40. This finding is corroborated by                               sample size or design. Although power estimates can also be
Wetzels et al. (2011), who found a median cell size of 24 in                               done informally, the typical study in psychology appears to be
both between- and within-subjects designs in their large sam-                              underpowered. If a study’s power equals 0.50, the chance to
ple of t tests from Psychonomic Bulletin & Review and Jour-                                find a significant result equals that of correctly predicting
nal of Experimental Psychology: Learning, Memory and                                       “heads” in a coin flip. The number of “heads” presented in the
Cognition. The average ES found in meta-analyses in psychol-                               psychological literature (and in other literatures) suggests a
ogy is around d = 0.50 (Anderson, Lindsay, & Bushman, 1999;                                problem. Although one author has explained this by claiming
Hall, 1998; Lipsey & Wilson, 1993; Meyer et al., 2001; Rich-                               that researchers are psychic (Bones, 2012), we think that they
ard, Bond, & Stokes-Zoota, 2003; Tett, Meyer, & Roese,                                     just act strategically.
1994), which might be an overestimation of the typical ES
given the biases we discuss below. Nevertheless, the typical
power in our field will average around 0.35 in a two indepen-                              A Dozen Replications
dent samples comparison, if we assume an ES of d = 0.50 and                                The common lack of power is well illustrated by studies of the
a total sample size of 40. This low power in common psycho-                                (positive) association between infants’ habituation to a given
logical research raises the possibility of a file drawer (Rosen-                           stimulus and their later cognitive ability (IQ). One often-cited
thal, 1979) containing studies with negative or inconclusive                               meta-analysis (McCall & Carriger, 1993) collated 12 studies
results. Publication bias can have dire consequences, as illus-                            of the correlation between measures of habituation during
trated recently by clear failures to replicate medical findings                            children’s first year of life and IQ as measured between 1 and
(Begley & Ellis, 2012; Prinz, Schlange & Asadullah, 2011).                                 8 years of age. In the funnel plot (Light & Pillemer, 1984) of
On the basis of surveys of researchers and a study of the fate                             Figure 1, these 12 Fisher-transformed (normalized) correla-
of studies approved by the institutional review board of a                                 tions are plotted against the inverse of the standard error (SE)

                                                            Distribution Under H0                         Distribution Under HA
                                                            for N = 50                                    for N=50
                                                       10

                                                               p.05                  p
The Rules of the Game                                                                                                                         545

in each study. The SE depends on sample size and equals                        larger study. The use of multiple small studies rather than a
1/N–3. The white area represents the 95% confidence inter-                    larger one gives the researcher the opportunity to make small
val (or area) under H0 = 0, and study outcomes that fall in the                alterations to the research design and provides ample opportu-
grey area are significant at α = .05 (two-tailed). The straight                nity for capitalizing on chance.
dotted line represents the estimated underlying ES from a                         Simmons et al. (2011) illustrated how easy it is to inflate
fixed effect meta-analysis (Zr = .41, which corresponds to r =                 Type I error rates when researchers employ hidden degrees of
.39 and d = .85) and the curved dotted lines represent the 95%                 freedom in their analyses and design of studies (e.g., selecting
confidence interval around that estimate of the alternative                    the most desirable outcomes, letting the sample size depend on
hypothesis (HA; study outcomes invariably fell in this interval                results of significance tests). John, Loewenstein, and Prelec
and so they appear to be homogeneous; Q = 6.74, DF = 11,                       (2012) surveyed over 2,000 psychological researchers and
p = .820). The upward narrowing of both 95% confidence                         found a majority of them to admit to use at least some of these
intervals in the funnel reflects larger power to reject H0 in                  questionable research practices (QRPs). For instance, the
large samples. On the right hand side, we depicted the power                   majority admitted to having ever failed to report all of the
to reject H0 given the sample size (SE level) as implied by the                dependent measures in a study. Forty-eight percent admitted to
estimated underlying ES of .41. For instance, on the top of                    having only report studies that “worked” (which we take to
Figure 1, we depicted the distributions under H0 and HA for                    imply p < .05), whereas 57% acknowledged to having used
N = 50 (1/SE = 6.86), which corresponds to a power of .807.                    sequential testing (cf. Wagenmakers, 2007) in their work.
As we go down the funnel, SEs become larger and so out-                        Such practices lead to inflated ESs and increased false positive
comes should deviate more strongly from the estimate of HA.                    rates (Ioannidis, 2005, 2008b; Simmons et al., 2011).
It is noteworthy that (a) all but three of the studies have a                     Suppose psychology were a game in which players have to
power below .80, (b) the correlation differed significantly                    gather a significant result in a particular direction. Players
from zero in all but one study, and (c) the study outcomes are                 have resources to gather data from N participants and can
clearly not evenly distributed in the right- and left-hand side of             choose between these options:
the funnel associated with HA. In fact, the two largest studies
showed the weakest link between infant cognition and later
IQ, whereas the smaller studies all lie on the right-hand side                      Strategy 1. Perform one large study (with N as the
the funnel (i.e., in the grey area where p < .05). Such funnel                         sample size) with sufficient power and publish it.
plot asymmetry is awkward and can be tested (Sterne & Egger,                        Strategy 2. Perform one large study and use some of the
2005) by regressing outcomes on sample sizes (or SEs) across                           QRPs most popular in psychology (John et al.,
the 12 studies: Z = 2.24, p = .025. The median sample size of                          2012). These QRPs may be performed sequentially
these studies was 25 and their typical (median) power equaled                          until a significant result is found:
.488. Under the assumption that studies are independent, the
expected number of significant findings on the basis of this                             a. Test a second dependent variable that is correlated
power analysis (i.e., the sum of power values) is 6.71, and so a                            with the primary dependent variable (for which
positive outcome in 11 out of such 12 underpowered studies is                               John et al. found a 65% admittance rate)
unlikely. Ioannidis and Trikalinos (2007) proposed a straight-                           b. Add 10 subjects (sequential testing; 57% admit-
forward χ2 test for such an excess of significant findings: χ2                              tance rate)
(DF = 1) = 6.21, p = .013. So this meta-analysis shows the                               c. Remove outliers (|Z > 2|) and rerun analysis (41%
typical signs of publication bias and results that are too good                             admittance rate)
to be true (Francis, 2012b). One possible explanation is that
studies with nonsignificant or lower correlations were missing                      Strategy 3. Perform, at most, five small studies each
from the meta-analysis. In addition, research with infants is                          with (N/5) as sample size. Players may stop data col-
not easy and seldom are these analyses carved in stone. Statis-                        lection when they find a significant result in the
tical choices concerning exclusion of data points, outliers, and                       expected direction and only publish the desired result
operationalization of dependent variables require choices that                         (the other studies are denoted “failed”; 48% admit-
are often arbitrary and so provide researchers degrees of free-                        tance rate).
dom (Simmons et al., 2011) in their analyses.                                       Strategy 4. Perform, at most, five small studies and
                                                                                       apply the QRPs described above in each of these
                                                                                       small studies if the need arises. Players may report
Playing the Game Strategically                                                         only the first study that “worked.”
The excess of significant findings may partly be explained by
researchers’ exploitation of these degrees of freedom in their                    Strategies 3 and 4 imply publication bias in the traditional
pursuit of significant outcomes (Fiedler, 2011; Wicherts, Bak-                 sense, whereas Strategies 2 and 4 relate to the analysis of the
ker, & Molenaar, 2011) and by the fact that it is easier to find               data. So what is the winning strategy? We simulated data (see
a significant effect in multiple small studies rather than one                 the online Appendix at http://pps.sagepub.com/supplemental

                                                 Downloaded from pps.sagepub.com by guest on May 17, 2015
546		                                                                                                                                                   Bakker et al.

for details) on the basis of sample sizes and ESs that are typi-                     least one significant finding (p < .05) under these four strate-
cal for psychology and found a clear answer. The left panel of                       gies. Note that we simulated one-sided results (i.e., directional
Figure 2 gives the proportion of researchers who gather at                           hypotheses) but employed two-sided tests, which should be

                          Chance of Getting at Least 1 Significant Result                                      ES Bias; Small N = 10
                                          Small N = 10
                        1.0

                                                                                            1.2
                                                                                                        1 Large Study
                                                                                                        1 Large Study with QRPs
                                                                                                        5 Small Studies

                                                                                            1.0
                                                                                                        5 Small Studies with QRPs
                        0.8

                                                                                            0.8
                        0.6
           Proportion

                                                                                     Bias
                                                                                            0.6
                        0.4

                                                                                            0.4
                                                       1 Large Study
                        0.2

                                                       1 Large Study with QRPs

                                                                                            0.2
                                                       5 Small Studies
                                                       5 Small Studies with QRPs
                        0.0

                                                                                            0.0
                              0.0   0.2     0.4          0.6        0.8        1.0                0.0       0.2        0.4          0.6   0.8       1.0
                                                  ES                                                                         ES

                          Chance of Getting at Least 1 Significant Result                                      ES Bias; Small N = 20
                                          Small N = 20
                        1.0

                                                                                            1.2

                                                                                                        1 Large Study
                                                                                                        1 Large Study with QRPs
                                                                                                        5 Small Studies
                                                                                            1.0

                                                                                                        5 Small Studies with QRPs
                        0.8

                                                                                            0.8
                        0.6
           Proportion

                                                                                     Bias
                                                                                            0.6
                        0.4

                                                                                            0.4

                                                       1 Large Study
                        0.2

                                                       1 Large Study with QRPs
                                                                                            0.2

                                                       5 Small Studies
                                                       5 Small Studies with QRPs
                        0.0

                                                                                            0.0

                              0.0   0.2     0.4          0.6        0.8        1.0                0.0       0.2        0.4          0.6   0.8       1.0
                                                  ES                                                                         ES

                          Chance of Getting at Least 1 Significant Result                                      ES Bias; Small N = 40
                                          Small N = 40
                        1.0

                                                                                            1.2

                                                                                                        1 Large Study
                                                                                                        1 Large Study with QRPs
                                                                                                        5 Small Studies
                                                                                            1.0

                                                                                                        5 Small Studies with QRPs
                        0.8

                                                                                            0.8
                        0.6
           Proportion

                                                                                     Bias
                                                                                            0.6
                        0.4

                                                                                            0.4

                                                       1 Large Study
                        0.2

                                                       1 Large Study with QRPs
                                                                                            0.2

                                                       5 Small Studies
                                                       5 Small Studies with QRPs
                        0.0

                                                                                            0.0

                              0.0   0.2     0.4          0.6        0.8        1.0                0.0       0.2        0.4          0.6   0.8       1.0
                                                  ES                                                                         ES
           Fig. 2. Results of simulations showing the optimal strategy for players (in terms of probability of finding at least one significant
           result; left column) and the bias in estimated ESs (right column) under the four strategies described in the text under a range of
           genuine ESs (d = 0 to 1). N represents sample sizes for small studies, whereas the larger sample size equals 5*N within each row.
           Results are one-sided (i.e., directional hypotheses),    butpps.sagepub.com
                                                       Downloaded from   tests are two-sided        and
                                                                                       by guest on May 17,should
                                                                                                           2015  be significant in the expected direction.
The Rules of the Game                                                                                                                        547

significant in the expected direction. So the (combined) Type I                effects and applied Sterne and Egger’s (2005) test of funnel
error rate (i.e., when ES = 0) is .025 for the large study and 1               plot asymmetry and Ioannidis and Trikalonis’ (2007) test for
− (1 − .025)^5 = .119 for the five small studies.                              an excess of significant findings. Both methods are described
   The upshot of the simulation results is that when the true ES               above and functioned well in the simulation (see the online
is small executing multiple small and underpowered studies                     Appendix at http://pps.sagepub.com/supplemental and Fig. 4),
represents the optimal strategy for individual players to generate             although it is important to note that they are sensitive to actual
a p value of less than .05. Furthermore, the use of QRPs pays                  heterogeneity of the underlying ESs (Ioannidis & Trikalinos,
off, especially with small samples and ESs. Besides, many play-                2007; Sterne & Egger, 2005). For instance, funnel plot asym-
ers need not even perform all five studies—for example, with a                 metry may arise if smaller studies tap on stronger underlying
cell size of 20 and an ES of 0.5 they need on average 2.58 stud-               effects because they are done in relatively more controlled set-
ies (expected total N = 103) without QRPs and 1.14 studies                     tings. Therefore, these tests are best applied to relatively
(expected total N = 66) with QRPs. Combined with the selection                 homogenous sets of studies as defined in the realm of
of significant results through publication decisions, these strate-            meta-analysis.
gies may explain why so many psychological researchers con-                        To gather a representative sample of sets of psychological
tinue to run underpowered studies yet almost always report                     studies that concern the same phenomenon or at least highly
significant results.                                                           similar phenomena, we retrieved from PsycARTICLES all
                                                                               108 peer-reviewed articles published in 2011 that contained
                                                                               the strings “research synthesis,” “systematic review,” or
Get Serious!                                                                   “meta-anal*” in title and/or abstract. Subsequently, we ran-
Science is not a game. The optimal strategies described above                  domly selected 11 useful meta-analyses (10% of the total). We
lead to inflated Type I errors of up to 40% and inflate genuine                only included meta-analyses that reported the ESs and stan-
effects. We calculated the bias in our simulation as the differ-               dard errors (or sample sizes) of primary studies. From each
ence between the average estimated ES and the true ES.                         meta-analysis, we retrieved the subset (as selected by the
Results are presented in the right panel of Figure 2. With large               authors of the meta-analyses) of at least 10 primary studies
samples (Strategy 1), there is no systematic bias. For large                   that was the most homogenous subset in terms of Higgins’ I2.
samples with QRPs (Strategy 2), the bias goes to zero with a                   We assumed that the meta-analysts employed rigorous inclu-
larger true ES, which is to be expected because of the larger                  sion and exclusion criteria and that they correctly determined
power under this scenario (QRPs are not required for “win-                     ESs, and we feel confident that the primary studies in each of
ning”). The bias is large with multiple small studies (Strategies              the fields are sufficiently comparable to be considered
3 and 4). In the typical psychological study (cell size 20 and                 replications.
true ES = .5), the biases with and without QRPs are 0.327 and                      The selected (subsets from) meta-analyses are given in
0.158, respectively. With smaller cell sizes and an ES of .5, the              Table 1 together with tests for homogeneity, excess of signifi-
bias can be as large as 0.913. When multiple small studies are                 cant findings, and funnel plot asymmetry. The average impact
combined with QRPs (Strategy 4), the bias is large for small                   factor of the journals in which the meta-analyses appeared
true ES, but decreases with larger true ES, possibly because of                was 4 (see the online Appendix at http://pps.sagepub.com/
the adding of subjects (in Step b). Even those who ignore                      supplemental for full references). Meta-analyses were from
p values of individual studies will find inflated ESs in the psy-              clinical, counseling, educational, evolutionary, developmen-
chological literature if a sufficient number of researchers play               tal, family, and industrial/organizational psychology. The
strategically, which indeed many psychological researchers                     medians of the sample sizes align with those found in the
appear to do (John et al., 2012).                                              wider literature, although the median ES (d = .37) was slightly
   Our field lacks clear codes of conduct considering the use                  lower than d = .50, as described earlier.
of these analytic strategies (Sterba, 2006) and many reviewers                     Figure 3 depicts the funnel plots of the 11 meta-analyses.
tend not to accept p values above .05 (Mahoney, 1977), pos-                    Tests for funnel plot asymmetry (with an α of .10 as suggested
sibly because this presents an easy heuristic (Hoekstra, Finch,                by various authors; Ioannidis & Trikalinos, 2007) were signifi-
Kiers, & Johnson, 2006; Nelson, Rosenthal, & Rosnow, 1986).                    cant in four instances (36%). In three instances, we found signs
Sole studies are seldom definitive but even knowledgeable                      of an excess of significant results (27% at α = .10). These results
researchers tend to underestimate the randomness associated                    replicate earlier indicators of the prevalence of funnel plot
with small samples (Tversky & Kahneman, 1971). At the end                      asymmetry in 99 psychological meta-analyses (Ferguson &
of the day it is all about replication.                                        Brannick, 2012; Levine, Asada, & Carpenter, 2009) and the
                                                                               finding of an excess of significant results in four areas of psy-
                                                                               chological research (Francis, 2012a, 2012b, in press).
Another 250+ Replications                                                          To get a feel for the likelihood of biases in the actual meta-
A set of reasonably similar replications can be used to deter-                 analyses, we simulated results for 16 meta-analyses with
mine robustness of findings and to study signs of the use of the               100 studies each. These meta-analyses are presented in Fig-
strategies described above. In our simulations, we used fixed                  ure 4 (see the online Appendix at http://pps.sagepub.com/

                                                 Downloaded from pps.sagepub.com by guest on May 17, 2015
548
                                                           Table 1. References, Median Sample Sizes, Mean Estimates, Homogeneity Tests, Tests of the Excess of Significant Findings, and Funnel Plot Asymmetry of 13 Meta-Analyses.

                                                           Reference                    Subgroup             ES (d)          (SE)         k           N       Med. N          Q (p)             I2        Exp.        Obs.         χ2 (p)         Regtest (p)
                                                           Alfieri et al. (2011)    Enhanced              0.20              (.042)        24        2350     61           39.35 (.018)         41.6       3.85          8       5.34 (.021)        0.64 (.519)
                                                                                      discovery:
                                                                                      Children
                                                           Benish et al. (2011)     All                   0.37              (.067)        21         933     39          16.28 (.699)           0         4.72          5       0.02 (.884)       1.48 (.140)
                                                           Berry et al. (2011)      Self-other            0.34 (0.70)       (.017)        21        3502     146        147.29 (
The Rules of the Game                                                                                                                                                                                                                                                                             549

                               12                                Alfieri et al. 2011                                                                                    Benish et al. 2011                                                                   Berry et al. 2011

                                                                                                                                                                                                                                         25
                                                          Est ES=0.2                                                                                              Est ES=0.37                                                                         Est ES=0.34
   Inverse of Standard Error

                                                                                                                Inverse of Standard Error

                                                                                                                                                                                                             Inverse of Standard Error
                                                          Q(23)=39.35, p=0.018                                                                                    Q(20)=16.28, p=0.699                                                                Q(20)=147.29, p
550
                                                                                                                           1                                                                   2                                                                           3                                                                               4
                                                                                                              1 Large Study;                                           1 Large Study With QRPs;                                                              5 Small Studies;                                                 5 Small Studies With QRPs;
                                                                                                               True ES = 0                                                    True ES = 0                                                                      True ES = 0                                                            True ES = 0
                                                                                                        Est ES = −0.039                                                      Est ES = 0.054                                                              Est ES = 0.345                                                                 Est ES = 0.479
                                                                                                        I−Chi(1)=2.69,                                                       I−Chi(1)=1.212,                                                             I−Chi(1)=1.144, p=0.285                                                        I−Chi(1)=0.001, p=0.97

                                                                                              10 12
                                                                                                                                                                   10 12
                                                                                                         p=0.101                                                              p=0.271                                                                    Z=3.821, p
Fig. 4 (continued)

                                                                                                            1 Large Study;                                               1 Large Study With QRPs;                                              5 Small Studies;                                          5 Small Studies With QRPs;
                                                                                                            True ES = 0.5                                                      True ES = 0.5                                                    True ES = 0.5                                                  True ES = 0.5
                                                                                                      Est ES = 0.497                                                       Est ES = 0.509                                                  Est ES = 0.787                                                     Est ES = 0.779

                                                                                                                                                                                                                                      7
                                                                                                                                                                                                                                                                                                         7
                                                                                                      I−Chi(1)=2.163, p=0.141                                              I−Chi(1)=11.11, p
552		                                                                                                                                  Bakker et al.

supplemental for details), based on the four strategies as                    what they looked for in early phases of research, while in the
described above (Column 1 through 4) and four levels of true                  confirmatory phases, they should conduct studies that are pos-
ES (d = 0.0, 0.2, 0.5, or 0.8; Rows A through D). As can                      sibly preregistered, sufficiently powerful, and analyzed in
be seen, the pattern of results in the habituation–IQ studies                 ways that are explicated in advance. The ideal paper then is
(Fig. 1) looks highly similar to results simulated under Strat-               not one with one or a few small studies with p values just
egy 4 and an ES greater than 0 (Fig. 4; Panels B4 and C4). The                below .05, but one in which all small pilot studies are reported
funnel plot from Alfieri et al. (2011; educational psychology)                in a meta-analytic summary and tested for homogeneity and/or
resembles that from Strategy 2 under a small ES (Panel B2). In                moderation and in which one major study lends clear support.
Woodley’s meta-analysis (2011; evolutionary psychology),                      Small and underpowered studies may lead to biases of differ-
the overall effect is close to zero and Strategy 2 appears to be              ent kinds (Ledgerwood & Sherman, 2012) and some have
at play (Panel A2). In Farber and Doodlin’s meta-analysis                     even argued to simply exclude them from meta-analyses
(2011; psychotherapy works better with positive regard), sam-                 (Kraemer, Gardner, Brooks, & Yesavage, 1998).
ples sizes were small and there is an indication of the use of                   We found indications of bias in nearly half of the psycho-
Strategy 3 (Panel B3). In Hallion and Ruscio’s meta-analysis                  logical research lines we scrutinized. The ambition of players
(2011), the effects of cognitive bias modification on stress and              in the game of science does not always sit well with the goal
anxiety appear small and based on too many underpowered                       of the scientific enterprise. Optimal strategies for individual
studies (Panel A3 or B3). Correlations between relationship                   researchers introduce biases that we can only counter by
conflict and hostility of partners as studied by Woodin (2011)                improving the rules of the game. The arbiters in the game
appear to be substantial but may also be inflated by publica-                 (peer reviewers and editors) are in an ideal position to do so.
tion bias and the use of small samples (Panel C3). In these
research lines, additional studies with larger sample sizes are               Declaration of Conflicting Interests
clearly welcome.                                                              The authors declared that they had no conflicts of interest with
   We also included in Figure 3 and Table 1 a recent meta-                    respect to their authorship or the publication of this article.
analysis on the predictive validity of the Implicit Association
Test (IAT). The subset of studies that concerned racial discrim-              Funding
ination is another example of an excess of significant results                The preparation of this article was supported by Netherlands
and funnel plot asymmetry. The results from the 32 studies                    Organization for Scientific Research (NWO) Grants 400-08-214 and
collated by Greenwald, Poehlman, Uhlman, and Banaij (2009)                    016-125-385.
are based on small samples sizes, considerable freedom in the
analysis, and a high degree of faddism, all of which may con-                 References
spire to bring about inflated effects (Ioannidis, 2005). Further              References marked with an asterisk indicate studies included in the
studies with larger sample sizes should be added to the data-                 study sample
base to accurately determine IAT’s validity.                                  *Alfieri, L., Brooks, P. J., Aldrich, N. J., & Tenenbaum, H. R. (2011).
                                                                                  Does discovery-based instruction enhance learning? Journal of
                                                                                  Educational Psychology, 103, 1–18. doi: 10.1037/a0021017
Improving the Game                                                            Anderson, C. A., Lindsay, J. J., & Bushman, B. J. (1999). Research in
Without any clear rules concerning the use (and documenta-                        the psychological laboratory. Current Directions in Psychologi-
tion) of multiple small studies and QRPs, strategic behaviors                     cal Science, 8, 3–9. doi:10.1111/1467-8721.00002
by researchers can lead literatures astray. The best way to sep-              Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statisti-
arate the wheat from the chaff in psychology is to (a) end the                    cal results in psychology journals. Behavior Research Methods,
pretense that small studies are definitive, (b) improve report-                   43, 666–678. doi:10.3758/s13428-011-0089-5
ing standards, (c) start considering and publishing nonsignifi-               Begley, C. G., & Ellis, L. M. (2012). Raise standards for preclinical
cant results, and (d) introduce a distinction between exploratory                 cancer research. Nature, 483, 531–533. doi:10.1038/483531a
and confirmatory studies into journal policies (Wagenmakers,                  *Benish, S. G., Quintana, S., & Wampold, B. E. (2011). Culturally
Wetzels, Borsboom, & van der Maas, 2011). The prevalence                          adapted psychotherapy and the legitimacy of myth: A direct-
of underpowered studies in psychological research hints at the                    comparison meta-analysis. Journal of Counseling Psychology,
common use of Strategies 3 and 4 in which researchers con-                        58, 279–289. doi: 10.1037/a0023626
duct a series of small studies up to the point that one turns out             *Berry, C. M., Carpenter, N. C., & Barratt, C. L. (2011). Do other-
to be significant. Sample sizes should be based on a priori                       reports of counterproductive work behavior provide an incremental
power analyses that take into account the potential inflation of                  contribution over self-reports? A meta-analytic comparison. Jour-
effects in earlier small studies. Later replications of published                 nal of Applied Psychology, 97, 613–636. doi: 10.1037/a0026739
results and/or of statistical outcomes will be facilitated by                 Bones, A. K. (2012). We knew the future all along: Scientific a priori
including experimental material, raw data, and computer code                      hypothesizing is much more accurate than other forms of pre-
(in the case of nonstandard analyses) as online supplements. In                   cognition. Perspectives on Psychological Science, 7, 307–309.
our view, researchers should be open about having not found                       doi:10.1177/1745691612441216

                                                Downloaded from pps.sagepub.com by guest on May 17, 2015
The Rules of the Game                                                                                                                                  553

*Card, N. A., Bosch, L., Casper, D. M., Wiggs, C. B., Hawkins, S.,                Ioannidis, J. P. A. (2005). Why most published research find-
    Schlomer, G. L., & Borden, L. M. (2011). A meta-analytic review                   ings are false. PLoS Medicine, 2, e124. doi:10.1371/journal.
    of internalizing, externalizing, and academic adjustment among                    pmed.0020124
    children of deployed military service members. Journal of Fam-                Ioannidis, J. P. A. (2008a). Effect of formal statistical significance on
    ily Psychology, 25, 508–520. doi:10.1037/a0024395                                 the credibility of observational associations. American Journal of
Cohen, J. (1990). Things I have learned (thus far). American Psy-                     Epidemiology, 168, 374–383. doi:10.1093/aje/kwn156
    chologist, 45, 1304–1312.                                                     Ioannidis, J. P. A. (2008b). Why most discovered true associa-
Cohen, J. (1994). The earth is round (p < .05). American Psycholo-                    tions are inflated. Epidemiology, 19, 640–648. doi:10.1097/
    gist, 49, 997–1003.                                                               EDE.0b013e31818131e7
Cooper, H., DeNeve, K., & Charlton, K. (1997). Finding the missing                Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test
    science: The fate of studies submitted for review by a human sub-                 for an excess of significant findings. Clinical Trials, 4, 245–253.
    jects committee. Psychological Methods, 2, 447–452.                               doi:10.1177/1740774507079441
Coursol, A., & Wagner, E. E. (1986). Effect of positive findings on               John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the
    submission and acceptance rates: A note on meta-analysis bias.                    prevalence of questionable research practices with incentives for
    Professional Psychology-Research and Practice, 17, 136–137.                       truth-telling. Psychological Science. Advance online publication.
Fanelli, D. (2010). “Positive” results increase down the hierar-                      doi:10.1177/0956797611430953
    chy of the sciences. PLoS One, 5, e10068. doi:10.1371/journal                 Kline, R. B. (2004). Beyond significance testing: Reforming data
    .pone.0010068                                                                     analysis methods in behavioral research. Washington, DC:
*Farber, B. A., & Doolin, E. M. (2011). Positive regard. Psychother-                  American Psychological Association.
    apy, 48, 58–64. doi: 10.1037/a0022141                                         Kraemer, H. C., Gardner, C., Brooks, J. O., & Yesavage, J. A. (1998).
Ferguson, C. J., & Brannick, M. T. (2012). Publication bias in psy-                   Advantages of excluding underpowered studies in meta-analysis:
    chological science: Prevalence, methods for identifying and                       Inclusionist versus exclusionist viewpoints. Psychological Meth-
    controlling, and implications for the use of meta-analyses. Psy-                  ods, 3, 23–31.
    chological Methods, 17, 120–128. doi:10.1037/a0024445                         Kruschke, J. K. (2011). Bayesian assessment of null values via
Fiedler, K. (2011). Voodoo correlations are everywhere—Not only                       parameter estimation and model comparison. Perspectives on
    in neuroscience. Perspectives on Psychological Science, 6, 163–                   Psychological Science, 6, 299–312.
    171. doi:10.1177/1745691611400237                                             Ledgerwood, A., & Sherman, J. W. (2012). Short, sweet, and
Francis, G. (2012a). The same old new look: Publication bias in a                     problematic? The rise of the short report in psychological
    study of wishful seeing. i-Perception, 3, 176–178. doi:10.1068/                   science. Perspectives on Psychological Science, 7, 60–66.
    i0519ic                                                                           doi:10.1177/1745691611427304
Francis, G. (2012b). Too good to be true: Publication bias in two prom-           Levine, T. R., Asada, K. J., & Carpenter, C. (2009). Sample sizes
    inent studies from experimental psychology. Psychonomic Bulletin                  and effect sizes are negatively correlated in meta-analyses: Evi-
    & Review, 19, 151–156. doi:10.3758/s13423-012-0227-9                              dence and implications of a publication bias against non-sig-
Francis, G. (in press). Publication bias in “red, rank, and romance in                nificant findings. Communication Monographs, 76, 286–302.
    women viewing men” by Elliot et al. (2010). Journal of Experi-                    doi:10.1080/03637750903074685
    mental Psychology: General.                                                   Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Eco-                 reviewing research. Cambridge, MA: Harvard University Press.
    nomics, 33, 587–606.                                                          Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychologi-
*Green, D., & Rosenfeld, B. (2011). Evaluating the gold standard: A                   cal, educational, and behavioral treatment: Confirmation from
    review and meta-analysis of the Structured Interview of Reported                  meta-analysis. American Psychologist, 48, 1181–1209.
    Symptoms. Psychological Assessment, 23, 95–107. doi: 10.1037/                 *Lucassen, N., Tharner, A., Van IJzendoorn, M. H., Bakermans-
    a0021149                                                                          Kranenburg, M. J., Volling, B. L., Verhulst, F. C., . . . Tiemeier, H.
Greenwald, A. G., Poehlman, T. A., Uhlman, E. L., & Banaij, M. R.                     (2011). The association between paternal sensitivity and infant-
    (2009). Understanding and using the implicit association test III.                father attachment security: A meta-analysis of three decades
    Meta-analysis of predictive validity. Journal of Personality and                  of research. Journal of Family Psychology, 25, 986–992. doi:
    Social Psychology, 97, 17–41. doi:10.1037/a0015575                                10.1037/a0025855
Hall, J. A. (1998). How big are nonverbal sex differences? In D. J.               Mahoney, M. J. (1976). Scientist as subject: The psychological
    Canary & K. Dindia (Eds.), Sex differences and similarities in                    imperative. Cambridge, MA: Ballinger.
    communication (pp. 155–177). Mahwah, NJ: Erlbaum.                             Mahoney, M. J. (1977). Publication prejudices: An experimental
*Hallion, L. S., & Ruscio, A. M. (2011). A meta-analysis of the effect                study of confirmatory bias in the peer review system. Cognitive
    of cognitive bias modification on anxiety and depression. Psy-                    Therapy and Research, 1, 161–175.
    chological Bulletin, 137, 940–958. doi: 10.1037/a0024355                      Marszalek, J. M., Barber, C., Kohlhart, J., & Holmes, C. B. (2011).
Hoekstra, R., Finch, S., Kiers, H. A. L., & Johnson, A. (2006). Prob-                 Sample size in psychological research over the past 30 years.
    ability as certainty: Dichotomous thinking and the misuse of p                    Perceptual & Motor Skills, 112, 331–348. doi:10.2466/03.11.pms
    values. Psychonomic Bulletin & Review, 13, 1033–1037.                             .112.2.331-348

                                                    Downloaded from pps.sagepub.com by guest on May 17, 2015
554		                                                                                                                                       Bakker et al.

Maxwell, S. E. (2004). The persistence of underpowered studies in                 Sterling, T. D. (1959). Publication decisions and their possible effects
    psychological research: Causes, consequences, and remedies. Psy-                  on inferences drawn from tests of significance—Or vice versa.
    chological Methods, 9, 147–163. doi:10.1037/1082-989x.9.2.147                     Journal of the American Statistical Association, 54, 30–34.
McCall, R. B., & Carriger, M. S. (1993). A meta-analysis of infant                Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Pub-
    habituation and recognition memory performance as predictors                      lication decisions revisited: The effect of the outcome of statis-
    of later IQ. Child Development, 64, 57–79.                                        tical tests on the decision to publish and vice versa. American
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl,               Statistician, 49, 108–112. Retrieved from http://www.jstor.org/
    Sir Ronald, and the slow progress of soft psychology. Journal of                  stable/2684823
    Consulting and Clinical Psychology, 46, 806–834.                              Sterne, J. A. C., & Egger, M. (2005). Regression methods to detect
Meyer, G. J., Finn, S. E., Eyde, L. D., Kay, G. G., Moreland, K.                      publication and other bias in meta-analysis. In H. R. Rothstein,
    L., Dies, R. R., . . . Reed, G. M. (2001). Psychological test-                    A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-
    ing and psychological assessment: A review of evidence and                        analysis: Prevention, assessment and adjustments (pp. 99–110).
    issues. American Psychologist, 56, 128–156. doi:10.1037/0003-                     New York, NY: Wiley.
    066X.56.2.128                                                                 Tett, R. P., Meyer, J. P., & Roese, N. J. (1994). Applications of meta-
*Mol, S. E., & Bus, A. G. (2011). To read or not to read: A meta-                     analysis: 1987-1992. International Review of Industrial and
    analysis of print exposure from infancy to early adulthood. Psy-                  Organizational Psychology, 9, 71–112.
    chological Bulletin, 137, 267–296. doi: 10.1037/a0021890                      Tversky, A., & Kahneman, D. (1971). Belief in the law of small
Nelson, N., Rosenthal, R., & Rosnow, R. L. (1986). Interpretation of                  numbers. Psychological Bulletin, 76, 105–110. doi: 10.1037/
    significance levels and effect sizes by psychological researchers.                h0031322
    American Psychologist, 41, 1299–1301.                                         Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly
Nickerson, R. S. (2000). Null hypothesis significance testing:                        high correlations in fMRI studies of emotion, personality, and
    A review of an old and continuing controversy. Psychological                      social cognition. Perspectives on Psychological Science, 4, 274–
    Methods, 5, 241–301.                                                              290. doi:10.1111/j.1745-6924.2009.01125.x
Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not:               Wagenmakers, E. J. (2007). A practical solution to the pervasive
    How much can we rely on published data on potential drug tar-                     problems of p values Psychonomic Bulletin & Review, 14, 779–
    gets? Nature Reviews Drug Discovery, 10, 712–713. doi:10.1038/                    804.
    nrd3439-c1                                                                    Wagenmakers, E. J., Wetzels, R., Borsboom, D., & van der Maas,
Richard, F. D., Bond, C. F., & Stokes-Zoota, J. J. (2003). One hun-                   H. L. J. (2011). Why psychologists must change the way they
    dred years of social psychology quantitatively described. Review                  analyze their data: The case of psi: Comment on Bem (2011).
    of General Psychology, 7, 331–363.                                                Journal of Personality and Social Psychology, 100, 426–432.
Rosenthal, R. (1979). The file drawer problem and tolerance for null              Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., &
    results. Psychological Bulletin, 86, 638–641.                                     Wagenmakers, E. J. (2011). Statistical evidence in experimental
Rozeboom, W. W. (1960). The fallacy of the null hypothesis signifi-                   psychology: An empirical comparison using 855 t tests. Perspec-
    cance test. Psychological Bulletin, 57, 416–428.                                  tives on Psychological Science, 6, 291–298.
Shadish, W. R., Doherty, M., & Montgomery, L. M. (1989). How                      Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness
    many studies are in the file drawer? An estimate from the family                  to share research data is related to the strength of the evidence
    marital psychotherapy literature. Clinical Psychology Review, 9,                  and the quality of reporting of statistical results. PLoS ONE, 6,
    589–603.                                                                          e26828. doi:10.1371/journal.pone.0026828
Simmons, J. P., Nelson, L. D., & Simonshon, U. (2011). False-pos-                 *Woodin, E. M. (2011). A two-dimensional approach to relationship
    itive psychology: Undisclosed flexibility in data collection and                  conflict: Meta-analytic findings. Journal of Family Psychology,
    analysis allows presenting anything as significant. Psychological                 25, 325–335. doi: 10.1037/a0023791
    Science, 22, 1359–1366. doi:10.1177/0956797611417632                          *Woodley, M. A. (2011). The cognitive differentiation-integration
Sterba, S. K. (2006). Misconduct in the analysis and reporting of                     effort hypothesis: A synthesis between the fitness indicator and
    data: Bridging methodological and ethical agendas for change.                     life history models of human intelligence. Review of General
    Ethics & Behavior, 16, 305–318.                                                   Psychology, 15, 228–245. doi: 10.1037/a0024348

                                                    Downloaded from pps.sagepub.com by guest on May 17, 2015
You can also read