Skill, Luck and Hot Hands on the PGA Tour

 

                                                June 21, 2005

                   Robert A. Connolly                             Richard J. Rendleman, Jr.
                Kenan-Flagler Business School                       Kenan-Flagler Business School
                  CB3490, McColl Building                              CB3490, McColl Building
                     UNC - Chapel Hill                                     UNC - Chapel Hill
                 Chapel Hill, NC 27599-3490                          Chapel Hill, NC 27599-3490
                   (919) 962-0053 (phone)                               (919) 962-3188 (phone)
                    (919) 962-5539 (fax)                                  (919) 962-2068 (fax)
                  connollr@bschool.unc.edu                           richard_rendleman@unc.edu

We thank Tom Gresik and seminar participants at the University of Notre Dame for helpful comments. We provide
special thanks to Carl Ackermann and David Ravenscraft who provided significant input and assistance during the early
stages of this study. We also thank Mustafa Gultekin for computational assistance and Yuedong Wang and Douglas
Bates for assistance in formulating, programming and testing portions of our estimation procedures. Please direct all
correspondence to Richard Rendleman.

                                          1. INTRODUCTION
        Like all sports, outcomes in golf involve elements of both skill and luck. Perhaps the highest
level of skill in golf is displayed on the PGA Tour. Even among these highly skilled players,
however, a small portion of each 18-hole score can be attributed to luck, or what players and
commentators often refer to as “good and bad breaks.”
        The purpose of our work is to determine the extent to which skill and luck combine to
determine 18-hole scores in PGA Tour events. We are also interested in the question of whether
PGA players experience “hot or cold hands,” or runs of exceptionally good or bad scores, in relation
to those predicted by their statistically-estimated skill levels.
      From a psychological standpoint, understanding the extent to which luck plays a role in
determining 18-hole golf scores is important. Clearly, a player would not want to make swing
changes to correct abnormally high scores that were due, primarily, to bad luck. Similarly, a player
who shoots a low round should not get discouraged if he cannot sustain that level of play, especially
if good luck was the primary reason for his score. At the same time, it is important for a player to
know whether his general (mean) skill level is improving or deteriorating over time and to understand
that deviations from past scoring patterns may not be due to luck alone.
        From a policy standpoint, it seems reasonable that the PGA Tour and other professional golf
tournament organizations should attempt to minimize the role of luck in determining tournament
outcomes and qualification for play in Tour events. Ideally, the PGA Tour should be comprised of
the most highly-skilled players. Also, the Tour should strive to conduct tournaments that reward the
most highly skilled players rather than those who experience the greatest luck.
        In some cases, luck in a round of golf can easily be identified. In the final round of the 2002
Bay Hill Invitational, David Duval’s approach shot to the 16th hole hit the pin and bounced back into
a water hazard fronting the green. Duval took a nine on the hole. Few would argue that Duval’s
score of nine was due to bad judgment or a sudden change in his skill level, and it is highly unlikely
that Duval made swing changes or changes to his general approach to the game to correct the type of
problem he incurred on the 16th hole.
        In contrast, consider the good fortune experienced by Craig Perks in the final round of the
2002 Players Championship, when he chipped in on holes 16 and 18 and sank a 28-foot putt for

birdie on the 17th hole en route to victory. Was Perks’ victory, his first on the PGA Tour, a reflection
of exceptional ability or four consecutive days of good luck? The fact that his best season since 2002
placed him 146th on the PGA Tour’s Official Money List suggests that luck may have been the
overriding factor. Similarly, as of June 2005, Ben Curtis has had only one top-10 finish
since winning the 2003 British Open. At this point, few would argue that Curtis’s victory was
anything other than a fluke.
       In many situations, specific occurrences of good and bad luck may not be as obvious as in the
examples above. Luck simply occurs in small quantities as part of the game. Even players as highly
skilled as Tiger Woods and Vijay Singh cannot produce perfect swings on every shot. As a result, a
certain human element of luck is introduced to a shot even before contact with the ball is made. (An
excellent article on the role of luck in golf, with a focus on how the ultimate outcome of the duel
between Tiger Woods and Chris DiMarco in the 2005 Masters was determined as much by luck as by
skill, is provided in Jenkins (2005).)
       Our work draws on the rich literature in sports statistics. Klaassen and Magnus (2001) model
the probability of winning a tennis point as a function of player quality, context (situational)
variables, and a random component. They note that failing to account for quality differences will
create pseudo-dependence in scoring outcomes, because winning reflects, in part, player quality, and
player quality is generally persistent. Parameter estimates from their dynamic, panel random effects
model using four years of data from Wimbledon matches suggest that the iid hypothesis for tennis
points is a good approximation, provided there are controls for player quality. The most important
advantage of their approach from our perspective is that it supports a decomposition of actual
performance into two parts: an expected score based on skill and an unexpected residual portion.
       In this paper, we decompose individual golfer’s scores into skill-based and unexpected
components by using Wang’s (1998) smoothing spline model to estimate each player’s mean skill
level as a function of time while simultaneously estimating the correlation in the random error
structure of fitted player scores and the relative difficulty of each round. The fitted spline values of
this model provide estimates of expected player scores. We define luck as deviations from these
estimated scores, whether positive or negative, and explore its statistical properties.
       Our tests show that after adjusting a player’s score for his general skill level and the relative
difficulty of the course on the day a round is played, the average (and median) standard deviation of
residual golf scores on the PGA Tour is approximately 2.7 strokes per 18-hole round, ranging

between 2.1 and 3.5 strokes per round per player. We find clear evidence of positive first-order
autocorrelation in the error structure for over 13 percent of the golfers and, therefore, conclude that a
significant number of PGA Tour players experience hot and cold hands. We also apply some
traditional hot hands tests to the autocorrelated residuals and come to similar conclusions. However,
after removing the effects of first-order autocorrelation from the residual scores, we find little
additional evidence of hot hands. This suggests that any statistical study of sporting performance
should estimate skill dynamics and deviations from normal performance while simultaneously
accounting for the relative difficulty of the task and the potential autocorrelation in unexpected
performance outcomes.
        The remainder of the paper is organized as follows. In Section 2 we describe our data and
criteria for including players in our statistical samples. In Section 3 we present the results of a
number of fixed-effects model specifications to identify the variables that are important in predicting
player performance. In Section 4 we formulate and test the cubic spline-based model for estimating
player skill. This model becomes the basis for our analysis of player skill, luck and hot hands
summarized in Section 5. A final section provides a summary and concluding comments.

                                                2. DATA
        We have collected individual 18-hole scores for every player in every stroke-play PGA Tour
event for years 1998-2001 for a total of 76,456 scores distributed among 1,405 players. Our data
include all stroke play events for which participants receive credit for earning official PGA Tour
money, even though some of the events, including all four “majors,” are not actually run by the PGA
Tour. The data were collected, primarily, from www.pgatour.com, www.golfweek.com,
www.golfonline.com, www.golfnews.augustachronicle.com, www.insidetheropes.com, and
www.golftoday.com. When we were unable to obtain all necessary data from these sources, we
checked national and local newspapers, and in some instances, contacted tournament headquarters
directly.
        Our data cover scores of players who made and missed cuts. (Although there are a few
exceptions, after the second round of a typical PGA Tour event, the field is reduced to the 70 players,
including ties, with the lowest total scores after the first two rounds.) The data also include scores for
players who withdrew from tournaments and who were disqualified; as long as we have a score, we
use it. We also gathered data on where each round was played. This is especially important for

tournaments such as the Bob Hope Chrysler Classic and the AT&T Pebble Beach National Pro-Am,
which are played on more than one course.
       The great majority of the players represented in the sample are not regular members of the
PGA Tour. Nine players in the sample recorded only one 18-hole score over the 1998-2001 period,
and 565 players recorded only two scores. Most of these 565 players qualified to play in a single US
Open, British Open or PGA Championship, missed the cut after the first two rounds and subsequently
played in no other PGA Tour events.
       As illustrated in the top panel of Figure 1, 1,069 players, 76.1 percent of the players in the
sample, recorded 50 or fewer 18-hole scores. As illustrated in the bottom panel, this same group
recorded 5,895 scores, which amounts to only 7.7 percent of the total number of scores in the sample.
1,162 players, representing 82.7 percent of the players in the sample, recorded 100 or fewer scores.
The total number of scores recorded by this group was 12,849, or 16.8 percent of the sample. The
greatest number of scores recorded by a single player was 457 recorded by Fred Funk.
       The players who recorded 50 or fewer scores represent a mix of established senior players
such as Hale Irwin (19 scores), Jim Thorpe (18) and Larry Nelson (17), relatively inactive middle-
aged players including Bobby Clampett (16), Jerry Pate (16) and Seve Ballesteros (22), “old-timers”
such as Arnold Palmer (32), Raymond Floyd (24) and Gary Player (18), up-and-coming stars
including Adam Scott (32) and Chad Campbell (20), established European Tour players such as
Andrew Coltart (49), Niclas Fasth (48) and Robert Karlsson (46), and a large number of players of
whom most readers of this paper, including those who follow the PGA Tour, have probably never heard.
Clearly, this group, which accounts for 76.1 percent of the players in the sample, but only 7.7 percent
of the scores, is not representative of typical PGA Tour participants. Therefore, including them in the
sample could cause distortions in estimating the statistical properties of golf scores on the Tour.
       In estimating the skill levels of individual players, we employ the smoothing spline model of
Wang (1998), which adjusts for correlation in random errors. Simulations by Wang indicate that a
sample of 50 observations is probably too small and that approximately 100 or more observations are
required to obtain dependable cubic-spline-based mean estimates. After
examining player names and the number of rounds recorded by each player, we have concluded that a
sample of players who recorded more than 90 scores is reasonably homogeneous and likely to meet
the minimum sample size requirements of the cubic spline methodology.
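The more-than-90-scores restriction amounts to a simple filter on the data. The sketch below is a minimal Python illustration with fabricated player records (the player names and score counts are invented, not taken from the sample):

```python
def restrict_sample(scores_by_player, min_scores=91):
    """Keep only players who recorded at least `min_scores` 18-hole
    scores -- i.e., more than 90, the cutoff discussed above."""
    return {player: s for player, s in scores_by_player.items()
            if len(s) >= min_scores}

# Fabricated records: only the first player clears the 90-score cutoff.
data = {
    "Player A": [71] * 120,   # 120 recorded rounds
    "Player B": [70] * 40,    # 40 recorded rounds
}
kept = restrict_sample(data)
```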

                                   3. MODEL IDENTIFICATION
       The model used in our tests, based on Wang’s smoothing spline, is computationally intensive
and requires approximately a full day to run on a 3.19 GHz PC operating under Windows XP.
Because of its computational requirements, we employ two simpler models to identify the relevant
variables and appropriate model form for our tests. In model 1, we employ a fixed-effects multiple
regression to predict a player’s 18-hole score for a given round as a function of the level of skill
displayed by the player throughout the entire 1998-2001 period and the relative difficulty of each
round. In model 2 each player’s skill level is allowed to change (approximately) by calendar year,
but otherwise, the model is identical to model 1. We show that including approximate calendar year-
based time dependency significantly improves the predictive power of the model. We also estimate
various versions of model 2 to determine whether alternative model specifications might be
warranted.
       Using model 2 we show that the best way to estimate the relative difficulty of a given golf
round is through a round-course interaction dummy variable. Although most tournaments on the
PGA Tour are played on a single course, several are played on more than one course. Using round-
course interactions to predict the relative difficulty of each round is significantly more powerful than
employing a dummy for the round alone (without interacting with the course), the course alone
(without interacting with the round), or the tournament alone (without interacting with either the
course or the round). We also show that individual player performance is not affected by the courses
on which he plays.

3.1. Model 1
Let $s_{i,j}$ denote the 18-hole score of player $i$ (with $i = 1, \ldots, n$, alphabetically) in PGA Tour
round-course interaction $j$ (with $j = 1, \ldots, m$). A round-course interaction is defined as the interaction
between a regular 18-hole round of play in a specific tournament and the course on which the round
is played. For most tournaments, only one course is used and, therefore, there is only one such
interaction per round. However, over the sample period, 25 of 182 tournaments were played on more
than one course. For example, in the Bob Hope Chrysler Classic, the first four rounds are played on
four different courses using a rotation that assigns each tournament participant to each of the four
courses over the first four days. A cut is made after the fourth round, and a final round is played the

fifth day on a single course. Thus, the Bob Hope tournament consists of 17 round-course interactions
– four for each of the first four days of play and one additional interaction for the fifth and final day.
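Counting round-course interactions is simple arithmetic: one interaction per course in play on each day. A one-line sketch (the per-day course counts for the Bob Hope Chrysler Classic come from the text):

```python
def round_course_interactions(courses_per_round):
    """Total round-course interactions in a tournament, given the
    number of courses in play on each day."""
    return sum(courses_per_round)

# Bob Hope Chrysler Classic: four courses on each of the first four
# days, a single course on the final day -- 17 interactions in all.
bob_hope = round_course_interactions([4, 4, 4, 4, 1])
```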
        According to model 1, the predicted score for player $i$ in connection with round-course
interaction $j$ is determined by the following fixed-effects multiple regression model:

$$ s_{i,j} = \alpha + \sum_{k=2}^{n} \beta_k p_k + \sum_{c=2}^{m} \gamma_c r_c + \varepsilon_{i,j} \qquad (1) $$

        In (1), $p_k$ is a dummy variable that takes on a value of 1 if player $i = k$ and zero otherwise,
$r_c$ is a dummy that takes on a value of 1 if round-course interaction $j = c$ and zero otherwise, and
$\varepsilon_{i,j}$ is an error term with $E(\varepsilon_{i,j}) = 0$. In the model, the first round of the 1998 Mercedes
Championships, played on a single course, is round-course interaction $j = 1$. Therefore, the
regression intercept, $\alpha$, represents the expected score of the first player ($k = 1$) in the first round of
the 1998 Mercedes Championships, $\beta_k$ represents the differential amount that player $k > 1$ would be
expected to shoot in the first round of the 1998 Mercedes, and $\gamma_c$ denotes the additional amount that
all players, including the first player, would be expected to shoot in connection with any other round-
course combination. The only restriction placed on the data is that a player must have recorded at
least four 18-hole scores, the minimum number required for the regression to be of full rank, to be
included in the regression estimate. With this restriction, the data include 75,054 scores for 810
players recorded over a possible 848 round-course interactions, although no single player participated
in more than 457 rounds.
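The structure of regression (1) can be illustrated with a toy dummy-variable fit. This is a sketch only: two players, two round-course interactions, and fabricated scores, with player 1 in round-course interaction 1 absorbed by the intercept, as in the text.

```python
import numpy as np

# Design matrix columns: intercept, p_2 (player-2 dummy), r_2
# (round-course-2 dummy). The baseline (player 1, round-course 1)
# is absorbed by the intercept, as in equation (1).
X = np.array([
    [1, 0, 0],   # player 1, round-course 1
    [1, 0, 1],   # player 1, round-course 2
    [1, 1, 0],   # player 2, round-course 1
    [1, 1, 1],   # player 2, round-course 2
], dtype=float)
s = np.array([70.0, 73.0, 72.0, 75.0])  # fabricated 18-hole scores

coef, *_ = np.linalg.lstsq(X, s, rcond=None)
alpha, beta_2, gamma_2 = coef
# alpha:   expected score of player 1 in round-course 1
# beta_2:  player 2's differential relative to player 1
# gamma_2: extra strokes all players need on round-course 2
```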
        Berry (2001) employs a similarly-constructed random effects model that takes into account
the skill level of each player and the intrinsic difficulty of each round to measure the performance of
Tiger Woods relative to others on the PGA Tour. However, Berry’s model does not take account of
the potential for a given round to be played on more than one course.
        It should be noted that when estimating round-course interaction coefficients, no specific
information about course conditions, adverse weather, etc. is taken into account. Nevertheless, if
such conditions combine to produce abnormal scores in a given 18-hole round, the effects of these
conditions should be reflected in the estimated coefficients. Under model 1, the highest round-course
interaction coefficient is associated with the portion of the third round of the 1999 AT&T Pebble Beach
National Pro-Am played on the Pebble Beach course. For this particular round-course interaction,
wind and rain combined to make the famed course extremely difficult and forced tournament officials

to cancel the event during the fourth round due to unplayable conditions. (Our final analysis shows
that the Pebble Beach course played 5.8 strokes more difficult on the third day than on the first two
days.)

3.2. Model 2
         In the previous model, a player’s skill coefficient is assumed to be constant throughout the
entire 1998-2001 period. In model 2, a player’s skill coefficient is allowed to vary through time on
an approximate calendar year basis.
         In principle, we wish to estimate a single regression equation with separate skill coefficients
estimated for each player in each calendar year. However, experiments with a model of this form
reveal that a number of players whose scores are included in the estimation of model 1 would have to
be removed from the original sample for the regression to be of full rank.
         To overcome the potential problem of singularity, while maintaining the same sample
employed in the estimation of model 1, we employ the following method for allowing an individual
player’s skill coefficient to vary through time. If a full calendar year has passed, at least 25 scores
were used in the estimation of the previous skill coefficient for the same player, and at least 25
additional scores were recorded for the same player, then a new incremental skill coefficient for that
player is estimated. For players who participated actively in all four years of the sample, this
procedure results in the estimation of different mean skill levels in each of the four years. Although
the 25-score criterion is somewhat arbitrary, it is not critical, since the model we will actually use for
estimating player scores is based on Wang’s smoothing spline methodology and does not use the 25-
score criterion.
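The 25-score rule can be sketched as a segmentation of a player's rounds into skill-coefficient periods. The function below is a simplified illustration, not the authors' code: it approximates "a full calendar year has passed" by a change in calendar year and takes only the year of each round as input.

```python
def skill_periods(round_years, min_scores=25):
    """Assign each round to a skill-coefficient period. A new period
    starts only when (i) the calendar year has advanced past the
    year the current period began, (ii) the current period already
    holds at least `min_scores` rounds, and (iii) at least
    `min_scores` rounds remain to estimate the new coefficient."""
    periods = []
    period, start_year, count = 1, round_years[0], 0
    for idx, year in enumerate(round_years):
        remaining = len(round_years) - idx
        if (year > start_year and count >= min_scores
                and remaining >= min_scores):
            period += 1
            start_year, count = year, 0
        periods.append(period)
        count += 1
    return periods

# A player with 30 rounds in 1998 and 30 in 1999 gets two periods;
# with only 10 rounds in 1999, a second period is never opened.
two = skill_periods([1998] * 30 + [1999] * 30)
one = skill_periods([1998] * 30 + [1999] * 10)
```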
        As before, let $s_{i,j}$ denote the 18-hole score of player $i$ in connection with PGA Tour round-
course interaction $j$. Define $p_{k,t(k)}$ as a dummy variable that takes on a value of 1 if player $i = k$ and
the player-specific (approximate) calendar year-based time period equals or exceeds $t(k) = 1, \ldots, T(k)$,
and zero otherwise, where $T(k)$ denotes the total number of player-specific (approximate) calendar
year-based time periods attributable to player $k$. Thus, if $i = k$, $p_{k,1} = 1$ in all time periods, $p_{k,2} = 1$
in periods 2 through $T(k)$ but zero otherwise, $p_{k,3} = 1$ in periods 3 through $T(k)$ but zero otherwise,

and $p_{k,4} = 1$ in period 4 only. Finally, let $\beta_{k,t(k)}$ denote the value of incremental skill coefficient
$t(k)$ for player $k$. Then, the basic version of model 2, denoted as 2.1, is specified as follows:

$$ s_{i,j} = \alpha + \sum_{t(k)=2}^{T(k)} \beta_{k=1,t(k)}\, p_{k=1,t(k)} + \sum_{k=2}^{n} \sum_{t(k)=1}^{T(k)} \beta_{k,t(k)}\, p_{k,t(k)} + \sum_{c=2}^{m} \gamma_c r_c + \varepsilon_{i,j} \qquad (2.1) $$

        As in model 1, the first player in the sample is player $k = 1$, and the first round of the 1998
Mercedes Championships is round-course combination $j = 1$. Therefore, the regression intercept, $\alpha$,
represents the first player's expected score in the first round of the 1998 Mercedes Championships.
The term $\sum_{t(k)=2}^{T(k)} \beta_{k=1,t(k)}\, p_{k=1,t(k)}$ picks up potential incremental changes in skill for the first player starting
in his second estimation period, typically 1999. In all forms of model 2, the skill coefficient $\beta_{k,1}$ is
estimated in each of the $T(k)$ estimation periods for all players $k > 1$. Starting in the second period,
the coefficient $\beta_{k,2}$ is estimated for all players $k = 1, \ldots, n$ for whom $T(k) > 1$, and so on.

        To determine the best way to estimate the relative difficulty of each round, we estimate three
additional forms of model 2, denoted as models 2.2 through 2.4, respectively. In model 2.2 we
substitute 728 round dummies for the 848 round-course interaction dummies in model 2.1. We
define a “round” as a round of play in a specific tournament, regardless of whether the round is
played on more than one course. Therefore, this specification ignores differences in scores for the
same round that are played on different courses. In model 2.3 we substitute 182 tournament dummies
for the round-course interaction dummies in model 2.1. Each tournament dummy denotes a particular
tournament played in a given year. For example, there are four tournament dummies associated with
the Masters, played in each of the four years 1998-2001. According to this specification, the relative
difficulty of each round is determined by the tournament and does not vary from day to day within
the tournament or over multiple courses that might be used for the tournament. Finally, in model 2.4
we substitute 77 course dummies, which indicate the courses on which the rounds are played, for the
round-course interaction dummies in model 2.1. As such, this specification assumes that the relative
difficulty of a round is determined strictly by the course on which the round is played and that the
difficulty of the course does not change from day to day within a tournament or even from year to
year. Table 1 summarizes the performance of these models using both a full sample of players
(75,054 observations) and a sample restricted to the players who recorded more than 90 scores
(64,364 observations).

        Using the full sample, the adjusted $R^2$ is 0.2996 for model 1 and 0.3077 for model 2.1. The
F-statistic for testing the difference between model 2.1 and model 1 is 2.44, which, given the large
sample size, is significant at any normal testing level. This indicates that there is added value to
including (approximate) calendar year-based time variation in mean player skill levels. F-statistics
for testing model 2.1 against alternative specifications using dummies for rounds, tournaments and
courses, rather than for round-course interactions, are 5.66, 9.66 and 10.67, respectively, indicating
that model 2.1, which uses round-course interactions, is superior. Very similar results are obtained
for the same tests using the restricted sample. It should be noted that Berry’s (2001) model for
predicting player scores is equivalent to our model 1 using random effects for actual rounds rather
than fixed effects for round-course interactions. The results in Table 1 indicate that including
calendar-year-based time variation in player skill levels and adjusting scores for the relative difficulty
of round-course interactions is a superior model specification.
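The nested-model comparisons reported above use standard F-statistics. A minimal sketch of the computation (the sums of squares below are fabricated round numbers chosen only to show the mechanics, not the paper's actual values):

```python
def nested_f(rss_restricted, rss_full, q, df_full):
    """F-statistic for a restricted model nested inside a full model:
    q = number of restrictions imposed, df_full = residual degrees
    of freedom of the full model."""
    return ((rss_restricted - rss_full) / q) / (rss_full / df_full)

# Fabricated magnitudes, for illustration only.
f_stat = nested_f(rss_restricted=510_000.0, rss_full=500_000.0,
                  q=800, df_full=72_000)
```

With sample sizes in the tens of thousands, even an F-statistic this small is significant at conventional levels, which is the pattern the text describes.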
       We also attempted to estimate a fifth version of model 2 to test whether there is added value
to including interactions among players and courses. With the full sample of players, the number of
coefficients to be estimated increases from 2,224 using model 2.1 to 17,201 when player-course
interactions are included. With the sample restricted to only those players that competed in more
than 90 rounds, the number of coefficients increases from 848 to 13,383. Unfortunately, the
regressions are of such large scale that we were unable to obtain sufficient resources to estimate them
on the university’s statistical mainframe computer. Therefore, we address the issue of player-course
interactions by including course dummies in random effects analyses of individual player residual
scores from our final spline-based estimation model. These tests, described in Section 4.1, show no
evidence of significant player-course interaction effects. Also, tests of player performance in
adjacent rounds of the same tournament, summarized in Table 5 and discussed in Section 5, provide
additional evidence of no significant effects.

                         4. THE SPLINE-BASED ESTIMATION MODEL
       Based on the previous analysis, we conclude that the best way to adjust player scores for the
relative difficulty of each round is to use round-course interactions rather than rounds (ignoring
courses), courses (ignoring rounds) or tournaments (ignoring both rounds and courses). We also
conclude that our estimation model should reflect time-dependent estimates of mean player skill.
However, we recognize that allowing player skill to change only by calendar year is somewhat

arbitrary. Therefore, in model 3 we employ a more general specification of time dependency in mean
player skill that does not require skill changes to occur only at the turn of the year or at any other pre-
specified points in time.
        In model 3 we estimate each player’s mean skill level as a function of “golf time” using the
restricted maximum likelihood (REML) version of Wang's (1998) smoothing spline model, which
adjusts for correlated random errors. Player k’s “golf time” counts successive competitive PGA Tour
rounds of golf played by player k regardless of how these rounds are sequenced in actual calendar
time. Inasmuch as there are likely to be gaps in actual time between some adjacent points in golf
time, it is unlikely that random errors around individual player spline fits follow higher-order
autoregressive processes. At the same time, however, we do not want to rule out the possibility that
the random errors may be correlated.
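The "golf time" indexing can be sketched directly: consecutive integers assigned to the round-course interactions in which a player recorded a score, in order. (The interaction numbers 6, 7 and 16 below match the example used in the text.)

```python
def golf_time(interactions):
    """Map a player's round-course interactions (in chronological
    order) to consecutive golf-time indices 1, 2, 3, ..."""
    return {j: g for g, j in enumerate(interactions, start=1)}

# A player whose first three scores came in round-course
# interactions 6, 7 and 16.
mapping = golf_time([6, 7, 16])
```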
        Model 3, which uses Wang's REML model for estimating mean individual player skill as a
function of golf time, is formulated as follows. Define player $k$'s golf time as the sequence
$g(k) = 1, \ldots, G(k)$, and let $g_k(j)$ denote the mapping of the sequence $j = 1, \ldots, m$ to the sequence
$g(k) = 1, \ldots, G(k)$ such that the sequence $g_k(j)$ represents the "golf time" of all round-course
interactions, $j$, for which player $k$ recorded an 18-hole score. For example, assume that player $k$'s
first three scores were recorded in connection with round-course interactions 6, 7, and 16. Then
$g_k(6) = 1$, $g_k(7) = 2$, and $g_k(16) = 3$. With this mapping, model 3 becomes:

$$ s_{i,j} = \sum_{k=1}^{n} \left( f_k[g_k(j)] + \theta_k[g_k(j)] \right) p_k + \sum_{c=2}^{m} (\gamma_c + \xi_c)\, r_c + \kappa_{i,j} $$

$$ \quad\;\; = \sum_{k=1}^{n} \left( f_k[g_k(j)] + \theta_k[g_k(j)] \right) p_k + \sum_{c=2}^{m} \omega_c r_c + \kappa_{i,j}, $$

with $\omega_c = \gamma_c + \xi_c$ and $E(\kappa_{i,j}) = 0$. In model 3, $f_k(g_k[j])$ is the smoothing spline function applied
to player $k$'s golf scores over golf time $g(k) = 1, \ldots, G(k)$. $\theta_k(g_k[j])$ is the random error associated
with the spline fit for player $k$ evaluated with respect to round-course interaction $j$ as mapped into
golf time $g_k(j)$, with $\theta_k = \left( \theta_k[1], \ldots, \theta_k[G(k)] \right)' \sim N\!\left( 0, \sigma_k^2 W_k^{-1} \right)$ and $\sigma_k^2$ unknown. Note that the
intercept is absorbed by the $f_k(g_k[j])$ terms.

        It is convenient to re-express $\theta_k(g_k[j])$ as $\theta_k(g_k[j]) = \varphi_k(g_k[j]) + \eta_k(g_k[j])$, where
$\varphi_k(g_k[j])$ represents the predicted error associated with player $k$'s spline fit as of golf time $g_k(j)$
as a function of past errors, $\theta_k(1), \ldots, \theta_k(g_k[j]-1)$, and $\eta_k(g_k[j])$ denotes the remaining residual
error. Substituting $\theta_k(g_k[j]) = \varphi_k(g_k[j]) + \eta_k(g_k[j])$, model 3 becomes:

$$ s_{i,j} = \sum_{k=1}^{n} \left( f_k[g_k(j)] + \varphi_k[g_k(j)] + \eta_k[g_k(j)] \right) p_k + \sum_{c=2}^{m} \omega_c r_c + \kappa_{i,j} \qquad (3) $$

         It is important to recognize that in estimating model 3, the autocorrelation in random errors
about each spline fit is not removed. Instead, the spline fits and the autocorrelation parameters
associated with the random error around the spline fits are estimated simultaneously.

        Model 3 represents a generalized additive model with random effects $\eta = (\eta_1, \ldots, \eta_n)'$ and
$\xi = (\xi_1, \ldots, \xi_m)'$ associated with players $i = 1, \ldots, n$ and round-course interactions $j = 1, \ldots, m$,
respectively. Although the smoothing spline functions allow for general autocorrelation structures in
$\theta_i$, we assume that the $\theta_i$ follow player-specific AR(1) processes. The methodology for estimating
model 3 is described in the appendix.

4.1. General Properties of Model 3
         Throughout the remainder of the paper, frequent reference is made to two different types of
residual errors from the spline fits of model 3. To avoid confusion, we define the two residual error
types here. In model 3, θi(gi[j]) is the total residual error associated with spline fit i as of golf
time gi(j). ϕi(gi[j]) represents the predicted error associated with player i's spline fit as of golf
time gi(j), computed as a function of past errors θi(1), …, θi(gi[j] − 1), and ηi(gi[j]) represents the
component of the error that is not predictable from past residual errors, so that
θi(gi[j]) = ϕi(gi[j]) + ηi(gi[j]). We assume that the error θi(gi[j]) follows an AR(1) process with
first-order autocorrelation coefficient φi. The spline fit fi and correlation φi are estimated
simultaneously using Wang's (1998) smoothing spline methodology. For some of our tests we focus
on the residual errors ηi(gi[j]), which we refer to as η errors. If the assumed AR(1) process properly
captures the correlation structure of residual player scores, η errors should represent white noise.
For other tests, we focus on the autocorrelated residual errors θi(gi[j]), which we refer to as
θ errors.
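The decomposition of θ errors into ϕ and η components can be sketched as follows. This is an illustrative Python sketch, not the authors' code; the simulated residuals and the value φ = 0.3 are hypothetical.

```python
import numpy as np

def decompose_theta(theta, phi):
    """Split total residuals (theta errors) into the AR(1)-predicted
    component (phi errors) and the unpredictable innovations (eta errors).

    Under an AR(1) process, the predicted error at golf time t is
    phi * theta[t-1]; the eta error is the remainder.
    """
    theta = np.asarray(theta, dtype=float)
    predicted = np.empty_like(theta)
    predicted[0] = 0.0                  # no prior error before the first round
    predicted[1:] = phi * theta[:-1]
    eta = theta - predicted
    return predicted, eta

# Hypothetical example: simulated residual scores for one player, phi = 0.3
rng = np.random.default_rng(0)
theta = np.empty(100)
theta[0] = rng.normal()
for t in range(1, 100):
    theta[t] = 0.3 * theta[t - 1] + rng.normal()

predicted, eta = decompose_theta(theta, 0.3)
# By construction, theta = predicted + eta, and the eta series recovers
# the white-noise innovations of the simulated AR(1) process.
```

If the AR(1) assumption holds, the η series produced this way should show no remaining serial correlation, which is exactly what the diagnostics below check.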

        In formulating model 3, we assume that an AR(1) process describes the residual error
structure about the cubic spline fit for each player. If this is a valid assumption, the η errors for each
player should be serially uncorrelated. However, if this is not a valid assumption, there may be
additional autocorrelation in the η errors. To check for possible model misspecification in
connection with the AR(1) assumption, we estimated autocorrelation coefficients of order 1-5 on the
η errors of each player. In no instance were any estimated coefficients statistically significant at the
5 percent level. In addition, we computed the Ljung-Box (1978) Q statistic associated with the η
errors for each player for lags equal to the minimum of 10 or 5 percent of the number of rounds
played. Ljung (1986) suggests that no more than 10 lags should be used for the test and Burns (2002)
suggests that the number of lags should not exceed 5 percent of the length of the series. Only 6 of
253 Q statistics were significant at the 5 percent level. As a final diagnostic, we ran random effects
models relating θ and η errors to course dummy variables for each of the 253 players, assigning a
dummy to a course if a player played the course at least twice, the minimum required for a model to
be of full rank. None of the F-tests associated with the 253 random effects tests were significant at
the 5 percent level for either type of error. Therefore, we conclude that the autocorrelation properties
of the spline function have been properly specified and that it is unnecessary to include player-course
interactions in the estimation model.
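The Ljung-Box diagnostic used above can be sketched directly from its definition. This is an illustrative Python implementation under the lag rule stated in the text (the minimum of 10 and 5 percent of the number of rounds played); the series and the round count of 120 are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def ljung_box_q(x, n_lags):
    """Ljung-Box (1978) portmanteau test for serial correlation.

    Q = n(n+2) * sum_{k=1}^{h} r_k^2 / (n - k), compared with a
    chi-square distribution with h degrees of freedom.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    q = 0.0
    for k in range(1, n_lags + 1):
        r_k = np.sum(xc[k:] * xc[:-k]) / denom   # lag-k sample autocorrelation
        q += r_k ** 2 / (n - k)
    q *= n * (n + 2)
    p_value = chi2.sf(q, df=n_lags)
    return q, p_value

# Lag rule from the text: min(10, 5 percent of the number of rounds played)
rounds_played = 120                       # hypothetical player
h = min(10, int(0.05 * rounds_played))    # 6 lags here
rng = np.random.default_rng(1)
q, p = ljung_box_q(rng.normal(size=rounds_played), h)
```

A large p-value is consistent with white-noise η errors; a strongly autocorrelated series would produce a large Q and a p-value near zero.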
        Figure 2 shows plots of spline fits for four selected players. All figures in the upper panel
show 18-hole scores reduced by the effects of round-course interactions (connected by jagged lines)
along with corresponding spline fits (smooth lines). The same spline fits (smooth lines) are shown in
the lower panel along with predicted scores adjusted for round-course interactions (connected by
jagged lines), computed as a function of prior residual errors, θi ( gi [ j ]) , and the first-order

autocorrelation coefficient, φi , estimated in connection with the spline fit. The scale of each plot is
different. Therefore, the spline fits in the lower panel appear as stretched versions of the
corresponding fits in the upper panel and visually, plots for the four players are not directly
comparable.

        The two plots for Chris Smith reflect 10.7067 degrees of freedom used in connection with the
estimate of his spline function, the largest degrees of freedom value among the 253 players. In
contrast, the spline fit for Tiger Woods, estimated with 2.0011 degrees of freedom, is more typical;
almost 71 percent of the 253 spline fits use 2.25 or fewer degrees of freedom and have the same
general appearance as that estimated for Woods. The fit for Ian Leggatt reflects φ = −0.2791 , the
most negative first-order autocorrelation coefficient estimated in connection with the 253 splines.
Also, the standard deviation of the θ residual errors around the spline fit for Leggatt is 2.14 strokes
per round, the lowest among the 253 players. Interestingly, after adjusting for prior θ errors and
first-order autocorrelation, φ , the predicted score for Bob Estes, shown as the jagged line in the lower
panel, is the lowest among all 253 players at the end of the 1998-2001 sample period.
        Figure 3 shows six histograms that help to summarize the 253 cubic spline-based mean player
skill functions. The first histogram shows the degrees of freedom for the various spline fits. Three of
the fits have exactly two degrees of freedom, while the degrees of freedom for 179, or 71 percent, of
the splines are less than 2.25. This implies that a large majority of the spline fits are essentially linear
functions of time, such as that illustrated in Figure 2 for Tiger Woods. (It should be noted that for
each spline, an additional degree of freedom, not accounted for in the histograms, is used up in
estimating the AR(1) correlation structure of the residuals.) Fifty-three of the spline fits have three or
more degrees of freedom, implying a departure from linearity. Twelve of the splines have five or
more degrees of freedom, and as noted earlier, the largest number of degrees of freedom is 10.71 for
Chris Smith.
        The histogram in the lower left-hand panel of Figure 3 shows the distribution of first-order
autocorrelation coefficients estimated in connection with the 253 splines. As illustrated in the lower
left-hand panel, the residual θ errors of 158, or 62 percent of the spline fits, are positively correlated.
        We employ the bootstrap method to test the significance of individual player spline fits
against alternative specifications of player skill. To maintain consistency in our testing methods, we
apply the same set of bootstrap samples to test the significance levels of the first-order
autocorrelation coefficients estimated in connection with each fit. All bootstrap tests are based on
balanced sampling of 40 samples per player. Wang and Wahba (1995) describe how the bootstrap
method can be used in connection with smoothing splines that are estimated without taking into
account the autocorrelation in residual errors. We modify the method outlined in Wang and Wahba
so that the bootstrap samples are based on η residuals which adjust predicted scores for
autocorrelation in prior θ residuals. Forty bootstrap samples is the minimum necessary to estimate
two-sided 95 percent confidence intervals. Although 40 samples is well below the number required
to estimate precise confidence intervals for each individual player, a total of 40 × 253 = 10,120
bootstrap samples are taken over all 253 players, requiring over three days of computation time using
a 3.19 GHz computer running under Windows XP. The 10,120 total bootstrap samples should be
more than sufficient to draw general inferences about statistical significance within the overall
population of PGA Tour players.
       After sorting the correlation coefficients computed via the 40 bootstrap samples from lowest
to highest, the number significantly negative is the number of players for which the 39th correlation
coefficient is negative, and the number significantly positive is the number of players for which the
second correlation coefficient is positive. We find that six of 95 negative autocorrelation coefficients
are significant at the 5 percent level while 32 of 158 positive coefficients are significant. Although
the implications of positive autocorrelation are developed in more detail later, based on the number of
players displaying significant positive autocorrelation, it is clear that a substantial number of players
exhibit hot and cold hands in their golfing performance over the sample period.
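The order-statistic decision rule described above can be sketched as follows. This is an illustrative Python sketch of the rule as stated in the text; the bootstrap estimates below are hypothetical, not actual estimates from the paper.

```python
import numpy as np

def bootstrap_sign_significance(boot_estimates):
    """Classify a player's autocorrelation coefficient using the rule in
    the text: sort the 40 bootstrap estimates; the coefficient is
    significantly negative if the 39th-smallest value is still negative,
    and significantly positive if the 2nd-smallest value is already
    positive (roughly a two-sided 95 percent interval from 40 samples).
    """
    s = np.sort(np.asarray(boot_estimates, dtype=float))
    if s[38] < 0:          # 39th of 40 (0-indexed position 38)
        return "significantly negative"
    if s[1] > 0:           # 2nd of 40 (0-indexed position 1)
        return "significantly positive"
    return "not significant"

# Hypothetical bootstrap estimates for three players:
mostly_positive = np.concatenate([[-0.02, -0.01], np.linspace(0.05, 0.40, 38)])
assert bootstrap_sign_significance(mostly_positive) == "not significant"
assert bootstrap_sign_significance(np.linspace(0.05, 0.40, 40)) == "significantly positive"
assert bootstrap_sign_significance(np.linspace(-0.40, -0.05, 40)) == "significantly negative"
```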
        The four remaining histograms in Figure 3 summarize bootstrap tests of the 253 cubic spline
fits against the following alternative methods of estimating a player’s 18-hole score (after subtracting
the effects of round-course interactions):
           1. The player’s mean score.
           2. The player’s mean score in each approximate calendar year period as defined by
              model 2.
           3. The player’s score estimated as a linear function of time.
           4. The player’s score estimated as a quadratic function of time.

        For each of the four tests we form a test statistic,
ζ̂ = RSE(alt)/(G − df_alt) − RSE(spline)/(G − df_spline − 1), where G is the number of 18-hole
scores for a given player, df_alt and df_spline are the numbers of degrees of freedom associated with
the estimation of the alternative and spline models, respectively, and RSE(alt) and RSE(spline) are
the total residual squared errors from the alternative and spline models.

We subtract 1 in the denominator of the second term to account for the additional degree of freedom
associated with the estimation of the first-order autocorrelation coefficient. For the purposes of this
test, RSE ( spline ) is based on η errors, since these errors reflect the complete information set
derived from each spline fit. The test statistic ζˆ is suggested by Efron and Tibshirani (1998, pp.
190-192 and p. 200 [problem 14.12]) for testing the predictive power of an estimation model
formulated with two different sets of independent variables. The test statistic is computed for each of
40 bootstrap samples per player. In a one-sided test of whether the spline fit is a superior
specification to the alternative model, ζˆ should be positive in 38 or more of the 40 bootstrap
samples.
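The statistic and the one-sided bootstrap criterion can be sketched as follows. This is an illustrative Python sketch; the residual sums of squares, score count, and degrees of freedom in the example are made-up numbers, not values from the paper.

```python
import numpy as np

def zeta_hat(rse_alt, rse_spline, n_scores, df_alt, df_spline):
    """Efron-Tibshirani-style comparison of two fits:
    zeta = RSE(alt)/(G - df_alt) - RSE(spline)/(G - df_spline - 1).
    The extra -1 charges the spline model for estimating the AR(1)
    coefficient. A positive zeta favors the spline specification.
    """
    return rse_alt / (n_scores - df_alt) - rse_spline / (n_scores - df_spline - 1)

def spline_superior(zetas):
    """One-sided 5 percent test over 40 bootstrap samples: the spline
    beats the alternative if zeta > 0 in 38 or more of the samples."""
    zetas = np.asarray(zetas, dtype=float)
    return int(np.sum(zetas > 0)) >= 38

# Hypothetical single-sample example (illustrative numbers only):
z = zeta_hat(rse_alt=900.0, rse_spline=800.0, n_scores=100, df_alt=1, df_spline=2.2)
# 900/99 vs 800/96.8: the spline's smaller mean squared error yields z > 0
assert z > 0
```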
       The first line of Table 2 shows that the spline model is significantly superior (at the 5 percent
level) to the player’s mean score for approximately 30 percent of the 253 players in the sample. It is
significantly superior to a linear time trend for 8 percent of the players, to the mean score computed
for each (approximate) calendar year for 5 percent, and to a quadratic time trend for 5 percent of the
players. The second line of the table shows that the percentage of players for which the alternative
model is generally superior (but not statistically superior) is 30 percent and 17 percent for the linear
and quadratic time trends, respectively, and only 8 percent and 4 percent for the mean score and mean
score per (approximate) calendar year. (For a given player, the alternative model is generally
superior if ζˆ is positive for less than 20 of 40 bootstrap samples.)
       Although the spline model is superior to the alternative models at the 5 percent level for a
relatively small percentage of players, it “beats” the alternative models a much larger proportion of
the time than could be predicted by chance. Therefore, we conclude that the spline model is superior
to any of the four alternative specifications.

                                    5. PLAYER PERFORMANCE
5.1 Skill
       Based on bootstrap sampling, the preceding analysis shows that approximately 30 percent of
cubic spline-based player score estimates are significantly superior at the 5 percent level to estimates
based solely on mean scores. Moreover, the mean squared residual error for each of the original
cubic spline fits, measured as RSE(spline)/(G − df_spline − 1), is less than the mean squared
residual about the mean, measured as RSE(mean)/(G − 1), for 70 percent of the players in the
sample. These results provide strong
evidence that the skill levels of PGA Tour players change through time. For many players such as
Tiger Woods, the relationship between the player’s average skill level and time is well approximated
by a linear time trend, after adjusting for autocorrelation in residual errors. For others, such as Chris
Smith, Ian Leggatt and Bob Estes, the relationship between mean player skill and time is more
complex and cannot easily be modeled by a simple parametric-based time relationship.
       The player-specific cubic spline functions can provide point estimates of expected scores,
adjusted for round-course interactions, at the end of the 1998-2001 sample period, and, therefore, can
be used to rank the players at the end of 2001. Table 3 provides a summary of the best players
among the sample of 253 as of the end of 2001 based on the cubic spline point estimates shown in
column 1 of the table. Column 2 shows estimates of player scores after adjusting for autocorrelation
in θ residual errors around each spline fit. The values in column 1 can be thought of as estimates of
mean player skill at the end of the sample period. In contrast, the values in column 2 can be thought
of as estimates of each player’s last score as a function of his ending mean skill level and the
correlation in random errors about prior mean skill estimates. The Official World Golf Ranking is
also shown for each player as of 11/04/01, the ranking date that corresponds to the end of the official
2001 PGA Tour season. The World Golf Ranking is based on a player’s most recent rolling two
years of performance, with points awarded based on position of finish in qualifying worldwide golf
events on nine different tours. The ranking does not reflect actual player scores. Details of the
ranking methodology can be found in the “about” section of the Official World Golf Ranking web
site, www.officialworldgolfranking.com.
       Despite being based on two entirely different criteria, the two ranking methods produce
similar lists of players. Mike Weir is the only player in the top 10 of the Official World Golf
Ranking (number 10) as of 11/04/01 whose name does not appear in Table 3 (he is actually 21st
according to our ranking), and only five players listed in Table 3 were not ranked among the Official
World Golf Ranking’s top 20.
       Among the 20 players listed in Table 3, the predicted score for Bob Estes, adjusted for
autocorrelation in residual errors (column 2), is the lowest. Although it may come as a surprise that a
player with as little name recognition as Estes would be predicted to shoot the lowest score at the end
of 2001, the plots of Estes’ spline function in Figure 2 show why. Estes exhibited marked
improvement over the last quarter of the sample period, and the improvement was sufficiently
pronounced that his estimated spline function picked it up.
        Table 4 lists the 10 players showing the greatest improvement and the 10 showing the greatest
deterioration in skill over the 1998-2001 sample period, based on differences between beginning and
ending spline-based estimates of player scores adjusted for round-course interactions. Cameron Beckman was the most
improved player, improving by 3.36 strokes from the beginning of 1998 to the end of 2001. Chris
DiMarco and Mike Weir, relatively unknown in 1998 but now recognized among the world’s elite
players, were the fifth and sixth most improved players over the sample period. Among the 10
golfers whose predicted scores went up the most were Lanny Wadkins (8.18 strokes, born 1949),
Fuzzy Zoeller (2.99 strokes, 1951), Keith Fergus (2.85 strokes, 1954), Bobby Wadkins (2.74 strokes,
1951), Craig Stadler (2.65 strokes, 1953), Tom Watson (2.64 strokes, 1949), and David Edwards
(2.32 strokes, 1956). Clearly, deterioration in player skill appears to be a function of age, with a
substantial amount of deterioration occurring during a golfer’s mid to late 40’s. But despite the
natural deterioration that occurs with age, 133 of the 253 players in the sample actually improved
from the beginning of the sample period to the end.

5.2 Luck
       To what extent does luck play a role in determining success or failure on the PGA Tour?
Based on the premise that our model for predicting 18-hole scores is correct, luck represents
deviations between actual and predicted 18-hole scores, either positive or negative, that are not
sustainable from one round to the next. Even if a golfer plays at an unusually high skill level in a
given round and cannot point to any specific instances of good luck to explain his performance, we
would consider him to have been lucky if his high skill level cannot be sustained. In this view, luck
is a random variable with temporal independence.
       5.2.1 Average Residual Scores in Adjacent Rounds. In the analysis that follows, we test
whether deviations between actual and predicted 18-hole scores are sustainable from one round to the
next by examining whether players who record exceptionally good or bad scores in one round of a
tournament can be expected to continue their exceptional performance in subsequent rounds of the
same tournament. For this test we examine the performance of players in all adjacent tournament
rounds that are not divided by a cut. A typical PGA Tour tournament involves four rounds of play
with a cut being made after the second round. Therefore, for a typical tournament, we compare
performance between rounds 1 and 2 and also between rounds 3 and 4. We do not compare
performance between rounds 2 and 3, however, because approximately half of the players who
participate in round 2 are cut and do not continue for a third round. For the Bob Hope Chrysler
Classic, which involves five rounds of play with a cut being made after the fourth round, we compare
performance between rounds 1 and 2, rounds 2 and 3 and rounds 3 and 4 but do not compare
performance between rounds 4 and 5. A similar procedure is employed for other tournaments that do
not employ a cut after the second round.
       For the purposes of this test, we sort all θ and η residual scores in the first of each pair of
qualifying rounds and place each player into one of 20 categories of approximately equal size based
on the ranking of residuals in the first of each qualifying two-round pair. Within each sort category
we compute the average residual score in both the first and second of the two adjacent rounds. Table
5 summarizes the results of this test for both θ and η residuals.
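The sort-and-compare procedure can be sketched as follows. This is an illustrative Python sketch under the assumption of no carry-over (second-round residuals independent of the first); the simulated residuals and their 2.7-stroke standard deviation are hypothetical, chosen only to mimic the scale reported later in the paper.

```python
import numpy as np

def adjacent_round_test(first, second, n_groups=20):
    """Sort first-round residuals into n_groups of roughly equal size,
    then report, per group, the mean residual in each round and the
    slope of an OLS regression of the second residual on the first.
    Persistence (a 'hot hand') would show up as positive slopes and
    second-round means sharing the sign of the first-round means.
    """
    order = np.argsort(first)
    groups = np.array_split(order, n_groups)
    results = []
    for idx in groups:
        x, y = first[idx], second[idx]
        slope = np.polyfit(x, y, 1)[0]   # OLS slope of second on first
        results.append((x.mean(), y.mean(), slope))
    return results

# Simulated round pairs with no carry-over between rounds
rng = np.random.default_rng(2)
first = rng.normal(0.0, 2.7, size=4000)
second = rng.normal(0.0, 2.7, size=4000)
res = adjacent_round_test(first, second)
# With pure luck, second-round group means cluster near zero even though
# first-round group means span several strokes across the 20 categories.
```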
       The section of the table that summarizes the analysis for θ residuals shows a very slight
tendency for scores in the 20 sort categories to carry over from one round to the next. In sort
categories 1 – 10, the average first-round θ residual score is negative, and the average second-round
θ residual is also negative in each of these categories. For example, on average players in sort
category 1 have a first-round θ residual score of -5.184. This means that on average, players in the
first group scored 5.184 strokes better (lower) than predicted by model 3. If a portion of the 5.184
strokes represents a change in skill for these same players, or an advantage due to some players
having a competitive advantage on the course they played in the first round, rather than random good
luck, scores in the next round should continue to be lower than predicted. Otherwise, scores of
players in the first sort group should revert back to normal. In the next adjacent round (not divided
by a cut), the average θ residual for these same players is –0.097. This same pattern, negative
average residual first-round scores followed by negative average residual second-round scores, is
present throughout the first ten sort categories. However, all of the average second-round residuals
are very close to zero, with the largest, in absolute value, being 0.135 strokes for sort category 2.
       To shed further light on the relationship between residual scores in adjacent rounds, we run a
simple least squares regression within each sort group in which the second residual in each pair is
regressed against the first. If players within a group tend to continue with similar performance, this
regression coefficient should be positive and significant. However, as shown in Table 5, there is no
discernible pattern among the regression coefficients within the first ten sort groups, and none of the
regressions is significant.
        Among sort groups 11 – 20, all average first-round θ residuals are positive, and seven of ten
average second-round residuals are positive. But except in sort categories 19 and 20, with average
second-round residuals of 0.219 and 0.328, respectively, the average second-round residuals would
be of little if any significance in golf.
        One possible explanation for the different pattern of average residuals in categories 19 and 20,
and the significant positive regression coefficient in sort group 20, is that players who have
performed poorly prior to a cut may take more chances in an attempt to make the cut. If riskier play
tends to be accompanied by higher scores, this pattern should emerge. An alternative explanation is
that many players in the last two sort groups may consider it a foregone conclusion that they will miss
the cut and, perhaps, do not give their best efforts in the second of the two adjacent rounds. Although
the two explanations are quite different, both involve players changing their normal method of play
when the second of the two rounds is followed by a cut.
        We can test for this tendency by separating the average residual scores into those for which
the second of the two adjacent rounds is immediately followed by a cut and those for which a cut
does not occur after the second of the two rounds. Although not shown in the table, for all sort
groups, except groups 19 and 20, there is little difference in performance when a cut occurs after the
second of two qualifying adjacent rounds (panel 2) and when it does not (panel 3). However, for sort
categories 19 and 20, the average θ residuals in the second of the two adjacent rounds are
0.381 and 0.406, respectively, when a cut occurs immediately after the second of the two rounds,
and –0.021 and 0.212, respectively, when a cut does not follow the second round. Moreover, the
regression of second-round residuals against first-round residuals is only significant for sort category
20 when the second of the two rounds is followed by a cut. Taken together, this evidence suggests
that tournament participants who perform exceptionally poorly in the earliest round(s) may change
their method of play in the round before a cut to reflect the high probability of being cut. Otherwise,
the threat of being cut does not appear to affect player performance.
        It should be noted that if a player’s exceptionally good (or bad) score as measured by his θ
residual in the first of two adjacent rounds is due to an advantage (or disadvantage) in playing a
particular course, the advantage should carry over into the next round, provided the next round is
played on the same course. Since the large majority of adjacent rounds summarized in Table 5
involve play on the same course, the fact that the average second round residual in each sort category
is essentially zero provides additional evidence that player-course interactions do not play a
significant role in determining 18-hole scores on the PGA Tour.
       Table 5 also summarizes the results of identical tests of η residuals. Unless the AR(1) model
does not adequately capture the correlation structure of player scores around their respective spline
fits and/or there are significant player-course interaction effects, η residuals should represent white
noise uncorrelated from one round to the next. Table 5 shows that regardless of the first-round sort
category, average residual scores in the second of two adjacent rounds are very close to zero, and that
there is no discernible relationship between the signs of average first-round scores and those of
adjacent second-round scores. As with θ residuals, we run a least squares regression within each
sort group in which the second residual in each pair is regressed against the first. Only one
coefficient is significant at the 5 percent level – that for sort group 4. Moreover, the fact that the
average second-round residual in group 4 is so close to zero (-0.029) indicates that any tendency for
the direction of abnormal performance to persist in this sort group is not accompanied by a level of
performance that would make much difference in determining a player’s 18-hole score. Therefore, we
conclude that significant differences between actual and predicted scores are due, primarily, to luck,
and that these differences cannot be expected to persist.
       5.2.2 How Much Luck Does it Take to Win a PGA Tour Event? It is interesting to consider
how much luck it takes to win a PGA Tour event or to at least finish among the top players. Table 6
summarizes the actual scores and θ residual scores on a round-by-round basis for the top 40 finishers
in the 2001 Players Championship. We focus on θ residual scores rather than η residuals so that we
can determine the extent to which each player’s scores deviate from their time-dependent mean
values. The pattern of residual scores exhibited in this table is typical of those for all 182
tournaments. Among the top 40 finishers, the total residual score for almost all players is negative;
only Phil Mickelson, Colin Montgomerie and Steve Flesch have positive total residual scores. Thus,
to have won this tournament, and almost all others in the sample, not only must one have played
better than normal, but one must have also played sufficiently well (or with sufficient luck) to
overcome the collective good luck of many other participants in the same event.
       Over all 182 tournaments, the average total θ residual score for winners and first-place ties
was –9.92 strokes, with the total residual ranging from –2.14 strokes for Tiger Woods in the 2000
AT&T Pebble Beach Pro-Am to –23.48 strokes for Mark Calcavecchia in the 2001 Phoenix Open.
(Similarly, the average η residual was –9.80 strokes.) Table 7 summarizes the highest 20 total θ
residual scores per tournament for winners and first place ties. It is noteworthy that no player won a
tournament after recording a positive total θ residual score. Although not shown in the table, Tiger
Woods’ total η residual score for the 2000 AT&T Pebble Beach event was +0.32 strokes, but this is
the only positive total residual using either the θ or η measure. It is also noteworthy that Woods’
name appears 11 times in Table 7 and that all the other players on the list (Phil Mickelson, Sergio
Garcia, David Duval, and Davis Love III) are among the world’s most recognizable players. Thus,
over the 1998-2001 period, only Tiger, and perhaps a handful of other top players, were able to win
tournaments without experiencing exceptionally good luck.
       5.2.3 Standard Deviation of Residual Scores. Figure 4 summarizes the standard deviation of
θ residual errors among all 253 players in the sample. The range of standard deviations is 2.14 to
3.45 strokes per round, with a median of 2.69 strokes. John Daly and Phil Mickelson, both well-
known for their aggressive play and propensities to take risks, have the third and 15th highest standard
deviations, respectively. Ian Leggatt has the lowest standard deviation. Chris Riley and Jeff Sluman,
both known as very conservative players, have the second and fifth lowest deviations, respectively.
       It is interesting to consider whether average scores and standard deviations of θ residual
errors are correlated. A least squares regression of standard deviations against the mean of each
player’s spline-based estimate of skill over the entire sample period yields:
expected score = 68.22 + 1.14 × standard deviation , with adjusted R2 = 0.067, F = 19.02 and p-value
= 0.00002. Thus, there is a tendency for greater variation in player scores to lead to slightly higher
average scores.
       5.2.4 Effect of Round-Course Interactions. Figure 5 summarizes the distribution of 848
random round-course interaction coefficients estimated in connection with model 3. The coefficients
range in value from –3.924 to 6.946, implying almost an 11-stroke difference between the relative
difficulty of the most difficult and easiest rounds played on the Tour during the 1998-2001 period.
       Over this period, 25 tournaments were played on more than one course. In multiple course
tournaments, players are (more or less) randomly assigned to a group that rotates among all courses
used for the tournament. By the time each rotation is completed, all players will have played the
same courses. At that time a cut is made, and participants who survive the cut finish the tournament
on the same course. Although every attempt is made to set up the courses so that they play with
approximately the same level of difficulty in each round, tournament officials cannot control the
weather and, therefore, there is no guarantee that the rotation assignments will all play with the same
approximate levels of difficulty.
       Figure 6 shows the distribution of the difference in the sum of round-course interaction
coefficients for the easiest and most difficult rotations for the 25 tournaments that were played on
more than one course. For seven of 25 tournaments, the difference was less than 0.50 strokes. Thus,
on average the difference for these tournaments was sufficiently small that a player’s total score
should have been the same regardless of the course rotation to which he was assigned. At the
extreme, there was a 5.45 stroke differential between the relative difficulties of the easiest and most
difficult rotation assignments in the 1999 AT&T Pebble Beach Pro-Am. Within this tournament, two
of six possible rotations played with round-course interaction coefficients that totaled 10.41 and
10.23 strokes, while the total for the remaining four rotations fell between 4.96 and 5.72 strokes. The
two difficult rotations involved playing the famed Pebble Beach course on the third day of the
tournament. Described as one of the nastiest days since the tournament was started in 1947
(www.golftoday.co.uk/tours/tours99/pebblebeach/round3report.html), the adverse weather conditions
had a much greater effect on scores recorded on the Pebble Beach course than on the other two
courses, Spyglass and Poppy Hills. According to David Duval (www.golftoday.co.uk), "This is the
stuff we stopped playing in last year. … It's the type of day you don't want, for the sole reason that
luck becomes a big factor.''
       It is interesting that the top nine finishers in this tournament all played one of the four easiest
rotations, as did four of the five players who tied for 10th place. Among the top 20 finishers, only two
played one of the two difficult rotations. Clearly, for this particular tournament Duval was right. The
luck of the draw had more to do with determining the top money winners than the actual skill
exhibited by the players.

5.3 Hot and Cold Hands
        Hot and cold hands represent the tendency for abnormally good and poor performance to
persist over time; this tendency has been the focus of a number of statistical studies of sports. While
some had argued for the presence of a hot hand in basketball, Gilovich, Vallone and Tversky (1985) argued that
this belief might exist even though actual shooting was consistent with a random process of
hits/misses (wins/losses). In other words, people may find systematic patterns in what is actually
random data. Larkey, Smith, and Kadane (1989) disputed this finding. Wardrop (1995) notes that