CAN STATISTICAL MODELS BEAT BENCHMARK PREDICTIONS BASED ON RANKING IN TENNIS?

 Submitted by
 William Svensson

A thesis submitted to the Department of Statistics in partial fulfillment
 of the requirements for a one-year Master of Arts
 degree in Statistics in the Faculty of Social Sciences

 Supervisor
 Lars Forsberg

 Spring, 2021
ABSTRACT

The aim of this thesis is to beat a benchmark prediction of 64.58 percent based on player
rankings on the ATP tour in tennis; that is, the player with the better rank in a tennis
match is deemed the winner. Three statistical models are used: logistic regression, random
forest and XGBoost. The data cover the period between the years 2000 and 2020 and contain
over 60 000 observations with 49 variables each. After the data were prepared, new variables
were created and the differences between the two players in each match were taken, and all
three statistical models outperformed the benchmark prediction. All three models reached an
accuracy around 66 percent, with the logistic regression performing best at 66.52 percent.
The most important variables overall for the models were the win rate on different surfaces,
the total win rate and the rank.

Keywords: Logistic Regression, Random Forest, XGBoost, ATP tour
Contents

1 Introduction
2 Theory
   2.1 Logistic regression
   2.2 Random Forest
   2.3 XGBoost
   2.4 Min-max normalization
   2.5 Sensitivity and specificity
3 Methodology
   3.1 Data
   3.2 Data Preparation
4 Results
   4.1 Logistic regression
   4.2 Random Forest
   4.3 XGBoost
5 Discussion
6 Further research
1 Introduction

Statistics and sports are two things that have gone hand in hand for a long time. Managers,
coaches and players have long used statistics to try to figure out an opponent's strengths
and weaknesses or to work on their own game. Movies based on true stories have even been
made about this, such as the Oscar-nominated Moneyball from 2011, which follows Oakland
Athletics general manager Billy Beane as he tries to assemble the best team using
sabermetrics.

The betting world is growing, and fantasy leagues, where a team gets points based on the
performance and statistics of its players, are growing in all sports as well. Sports
statistics have thus become an important part of the life of a sports fan or betting
professional. Even though this thesis does not take any odds into account, it could still
serve as support for betting or for becoming a better fantasy tennis manager.

The aim of this thesis is to try to beat a benchmark prediction on the ATP tour, the men's
highest-ranked tour in tennis, with statistical models. The benchmark prediction is based on
the two opposing players' ranks, with the better-ranked player deemed the winner. The
statistical models that will be used to try to beat this benchmark are logistic regression,
random forest and XGBoost. After the data is cleaned and unwanted matches are removed, the
benchmark prediction correctly classifies 64.58 percent of matches.
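A minimal sketch of this benchmark rule, assuming a pandas DataFrame `matches` holding the
cleaned data with the rank columns from the source data (a lower rank number is a better
rank):

```python
import pandas as pd

def benchmark_accuracy(matches: pd.DataFrame) -> float:
    # Each row records the actual winner, so the benchmark is correct
    # whenever the winner had the better (numerically lower) rank.
    correct = (matches["winner_rank"] < matches["loser_rank"]).sum()
    return correct / len(matches)
```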

Using statistical models to predict the outcome of tennis matches is nothing new. Clarke
and Dyte (2000) used players' rankings and a logistic regression model to predict the
outcome of the 1998 Wimbledon championship. Del Corral and Prieto-Rodríguez (2010) used
probit models based on players' past performances and physical characteristics to
investigate their role in predicting the outcomes of tennis matches.

The aim of this paper is not only to try to beat the benchmark prediction but also to find
out which variables the statistical models deem important for predicting the winner of a
tennis match. Therefore, it will not try to predict a specific tournament or a specific
year; instead, the models will be trained and tested on randomly chosen matches from the
period between the years 2000 and 2020.

The choice of statistical models was based on the logistic regression being easy to use and
one of the most commonly used models for this kind of problem. Random forest and XGBoost
are also two popular models for classification and have the advantage of reporting, after
classification, the importance of each variable for the result.

2 Theory

2.1 Logistic regression

The aim of the logistic regression is to model the posterior probabilities of the K classes
via linear functions in the input variables x, while ensuring that they sum to one and
remain in [0, 1] (Hastie, Tibshirani and Friedman, 2009).

The equation for the conditional probability, $p$, is as follows:

$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)}$$

where $\beta_0, \beta_1, \dots, \beta_p$ denote the unknown parameters that need to be
estimated and $x_1, x_2, \dots, x_p$ denote the input variables.

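As a minimal illustration of the model, here is a sketch of fitting a logistic regression
with scikit-learn; the placeholder arrays stand in for the input variables and the binary
response used later in the thesis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 5))            # placeholder input variables x_1..x_p
y = rng.integers(0, 2, 100)         # placeholder binary response

model = LogisticRegression()
model.fit(X, y)
p = model.predict_proba(X)[:, 1]    # estimated conditional probabilities p
```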
2.2 Random Forest

The idea behind the random forest comes from the theory behind bagging. Many decision trees
are built, each from a fixed number of randomly chosen variables from the data. This is
done to make the decision trees as uncorrelated with each other as possible, and in the end
the predictions are aggregated over all decision trees (Breiman, 2001).

“A random forest is a classifier consisting of a collection of tree-structured classifiers
$\{h(\mathbf{x}, \Theta_k),\ k = 1, \dots\}$ where the $\{\Theta_k\}$ are independent
identically distributed random vectors and each tree casts a unit vote for the most popular
class at input $\mathbf{x}$” (Breiman, 2001, p. 6).

There are no specific criteria when it comes to choosing the number of trees that the
random forest should use; however, there is a threshold beyond which adding more trees does
not give any significant gain (Oshiro, Perez and Baranauskas, 2012). The number of
variables, m, used in each decision tree should be less than the total number of variables,
p, included in the final data, i.e., m ≤ p. The rule of thumb is $m = \sqrt{p}$, but m can
go as low as 1 (Hastie, Tibshirani and Friedman, 2009).

One advantage that the random forest model has, compared to similar models, is that it can
show the importance of every variable in the classification process. The variable
importance is measured using mean decrease accuracy and mean decrease gini. Mean decrease
accuracy is computed from the out-of-bag error, using the observations that are not part of
the decision tree at hand. The more the accuracy decreases when a variable is left out of
the decision tree, the higher the mean decrease accuracy will be for that variable, and the
higher the value, the more important the variable.

Mean decrease gini, on the other hand, measures the node impurity when the decision tree is
built. It looks at how deep into the tree the variable is used to classify correctly. Just
as for mean decrease accuracy, a higher value in mean decrease gini indicates a more
important variable (Han, Guo and Yu, 2016).
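A sketch of both importance measures, under the assumption that a train/test split
(`X_train`, `y_train`, `X_test`, `y_test`) already exists from the data preparation step;
scikit-learn's impurity importance and permutation importance play the roles of mean
decrease gini and mean decrease accuracy:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# max_features="sqrt" follows the m = sqrt(p) rule of thumb; 500 trees is
# an assumed value past the point where more trees stop helping.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", oob_score=True)
rf.fit(X_train, y_train)

gini_importance = rf.feature_importances_            # mean decrease in impurity
perm = permutation_importance(rf, X_test, y_test)    # accuracy-based importance
accuracy_importance = perm.importances_mean
```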

2.3 XGBoost

XGBoost, which is short for eXtreme Gradient Boosting, is an efficient and scalable
implementation of gradient boosting (Chen and He, 2021).

It is a decision tree-based ensemble method, i.e., it combines the predictive power of
multiple learners, but the result is a single model. Just as the random forest model, it
can output the importance of each variable.
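A corresponding sketch for XGBoost, again assuming the train/test split from the data
preparation step; the hyperparameter values here are illustrative assumptions, not the
tuned values used for the results:

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4)
xgb.fit(X_train, y_train)

# Gain-based importance of each variable, as visualized in Figure 4.2.
importance = xgb.get_booster().get_score(importance_type="gain")
```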

2.4 Min-max normalization

In order to avoid the models giving more importance to variables with a larger range,
normalization is used to transform all variables to a range between 0 and 1 while keeping
the shapes of their distributions the same.

Min-max normalization will be used to normalize the data before starting with the models.
The formula for the min-max normalization is:

$$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
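Applied column-wise this is a one-liner, shown here as a sketch on an assumed numeric
feature DataFrame `df` (equivalent to scikit-learn's MinMaxScaler):

```python
def min_max_normalize(df):
    # Rescale every column to [0, 1] without changing its distribution shape.
    return (df - df.min()) / (df.max() - df.min())
```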

2.5 Sensitivity and specificity

The models will be measured first and foremost on their accuracy, but the sensitivity and
specificity will also be taken into account. In this thesis, sensitivity is the proportion
of true losses that are predicted as losses, and specificity is the proportion of true wins
that are predicted as wins.
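A sketch of these metrics computed from a confusion matrix, with class 0 denoting a loss
and class 1 a win for Player 1, mirroring the convention of the result tables where losses
are treated as the positive class:

```python
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred):
    # With labels [0, 1], ravel() returns counts in the order tn, fp, fn, tp,
    # where class 1 (a win) plays the role of the "positive" class.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tn / (tn + fp)   # predicted losses among true losses
    specificity = tp / (tp + fn)   # predicted wins among true wins
    return accuracy, sensitivity, specificity
```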

3 Methodology
3.1 Data

The data is obtained from Jeff Sackmann's GitHub repository named “tennis_atp” (Sackmann,
2021). It contains data from several levels of the men's game, including challengers,
futures, doubles and the ATP tour. For the ATP tour there is data going back to 1968, but
this thesis uses all ATP tour matches from the years 2000-2020. The choice of years was
based on there being fewer missing values during the past few decades compared to earlier
ones. Tennis has also changed a lot over these decades: the style of play has changed, the
surfaces have become slower, the rackets have been upgraded and new technology such as
Hawk-Eye has been introduced into the game.

The original data from the ATP tour matches ranging from the years 2000-2020 consists of
63 136 matches with 49 variables each. First, there are variables describing the
tournament, such as where it was played, on what surface, how many players took part and
what kind of tournament it was. Then there are variables describing the winner of the
match, such as his name, age, height, country, rank and whether he is left- or
right-handed, and the same information is given for the loser. Finally, there are variables
describing the match stats for the winner and the loser: how many aces a player hit during
the match, double faults, serve points, first and second serves won, and the number of
breakpoints faced and saved during the match, among others.
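A sketch of assembling these matches, assuming a local copy of the repository where the ATP
files are named per year (atp_matches_<year>.csv), as in the source:

```python
import pandas as pd

files = [f"atp_matches_{year}.csv" for year in range(2000, 2021)]
matches = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(matches.shape)   # about (63136, 49) before any filtering
```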

3.2 Data Preparation

Some tournaments, such as the Davis Cup, the Olympics and the ATP tour finals, have a
different format than all the other tournaments and are excluded from the data. Looking at
missing values, the data has 552 missing values spread over 18 different variables, all
belonging to the exact same matches. Due to the lack of information from these matches,
they are also excluded from the data. The variable “winner_rank” has 25 missing values and
“loser_rank” has 141, and as the benchmark prediction is based on these variables, those
matches are excluded as well. This leaves the data with 55 598 observations, which is the
final number of matches used in this paper.
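These exclusions might look as follows; the tourney_level codes for the special-format
events (“D” for Davis Cup, “F” for the Tour Finals, “O” for the Olympics) are an assumption
about how the events are coded in the data:

```python
# Drop special-format events and matches lacking rank information.
matches = matches[~matches["tourney_level"].isin(["D", "F", "O"])]
matches = matches.dropna(subset=["winner_rank", "loser_rank"])
```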

The variables “winner_seed” and “loser_seed” have around 54 and 75 percent missing values,
respectively. The reason is that a seed is only given to the top players in a tournament
and is usually based on the player's rank or previous performance in that specific
tournament; thus, these variables are also excluded from the data. Three variables still
have missing values: “winner_ht”, “loser_ht” and “minutes”, which are the heights of the
winner and loser and the number of minutes the match took. The minutes variable will not be
included, so nothing is done with its missing values. Height, on the other hand, does seem
like an important variable, and as it has a lot of missing values it needs to be dealt
with. This was done by manually filling in every player's height retrieved from the ATP
tour's official homepage. After inspecting all the players' heights, it was observed that
David Goffin was registered as the shortest player at 163 cm, which is in fact not the case
as he is 180 cm, and this was corrected.

Three new variables are created. The first is “match_id”, a unique identifier for every
match, ranging from 1 to 55 598. The second is “h2h_id”, which is simply the variable
“winner_id” combined with the variable “loser_id”. The third is “h2h_id_match”, also a
combination of “winner_id” and “loser_id”, but it does not take into account who won or
lost the match; instead the player with the lower id comes first and the player with the
higher id comes second. Take Nadal and Federer as an example: Nadal's player id is 104745
and Federer's player id is 103819. If Nadal wins, their “h2h_id” becomes 104745_103819, but
“h2h_id_match” becomes 103819_104745. Every time Nadal and Federer meet they have the same
“h2h_id_match”, because the variable only records which players played the match, not who
won. The player with the lower id, the one written first in this variable, becomes Player
1, and the one written last becomes Player 2.
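A sketch of these three id variables on the cleaned matches DataFrame:

```python
# Running match number 1..55598.
matches["match_id"] = range(1, len(matches) + 1)

# Winner-first pairing.
matches["h2h_id"] = (matches["winner_id"].astype(str) + "_"
                     + matches["loser_id"].astype(str))

# Order-independent pairing: the lower player id always comes first, so
# Federer (103819) vs Nadal (104745) is "103819_104745" whoever wins.
low = matches[["winner_id", "loser_id"]].min(axis=1).astype(int).astype(str)
high = matches[["winner_id", "loser_id"]].max(axis=1).astype(int).astype(str)
matches["h2h_id_match"] = low + "_" + high
```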

The data is then reshaped so that the information about the winner and the loser becomes 2
observations instead of 1, and the variables “winner_id” and “loser_id” become a new
variable called “player_id”. A player's id is always the same regardless of whether he won
or lost a match; it is tied to the player, not to the outcome. The data is then put in
chronological order so that the new variables can be created. Nine new variables are
created from the original variables, in addition to the three original variables height,
age and rank. All of the variables are presented, explained and motivated in Table 3.1
below.

| Variable | Meaning | Purpose |
|---|---|---|
| Avg_ace | The average aces per match. | Shows the player's ability to win easy points. |
| Avg_df | The average double faults per match. | Shows the player's ability to lose easy points. |
| Avg_BPsaved_faced | The average ratio of breakpoints saved compared to the number faced. | Shows the mental strength of the player. |
| Avg_first_won | The average ratio of serve points won when the player's own first serve went in. | Shows the ability of the player to convert first serves into points. |
| Avg_svpt_won | The average ratio of serve points won compared to the player's number of own serve points played. | Shows the ability of the player to hold their own serve. |
| Winrate | The percentage of matches won compared to matches played. | Shows how good the player has been during his career. |
| Winrate10 | The percentage of matches won during the last 10 matches. | Shows the form of the player. |
| Winratesurface | The percentage of matches won compared to matches played on the specific surface. | Shows what kind of surface the player prefers. |
| Winrate_h2h | The percentage of matches won compared to matches played against a specific player. | Shows if a player has problems or not with another player. |
| Height | The height of the player. | Height could give an advantage with reach and service power. |
| Age | The age of the player. | Shows how experienced a player is. |
| Rank | The rank of the player. | Shows the overall strength of the player. |

Table 3.1. The variables that will be used for the statistical models, with their meaning and purpose.

All of the 55 598 observations are used to create these variables. After they are created,
every player with fewer than 10 matches, and every player's first 10 matches, are excluded
from the data used for the statistical models. The reasoning behind this is that some of
the created variables are extremely dependent on a number of matches having been played to
give a reliable number. If a player has only played 1 match, the numbers could be very
misleading: for example, the win rate going into the next match would be either 100 percent
or 0 percent, and the same applies to all the newly created variables.
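A sketch of how such career averages can be built strictly from earlier matches, assuming a
player-per-row frame `long` in chronological order with a binary `won` column (both names
are assumptions about the reshaped data); shifting by one keeps the upcoming match out of
its own average:

```python
long = long.sort_values("tourney_date")
grp = long.groupby("player_id")

# Career win rate before the current match, and form over the last 10 matches.
long["Winrate"] = grp["won"].transform(lambda s: s.shift(1).expanding().mean())
long["Winrate10"] = grp["won"].transform(lambda s: s.shift(1).rolling(10).mean())

# Drop every player's first 10 matches (rolling(10) is NaN there anyway).
long["match_no"] = grp.cumcount() + 1
long = long[long["match_no"] > 10]
```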

Two new datasets are then created: one for Player 1, the player with the lower player id,
and one for Player 2, the one with the higher player id. These are used to create a third
dataset called “Player_diff”, which contains the stats of Player 1 minus the stats of
Player 2, i.e., the difference between the players. The last variable created is a binary
variable called “Player1_win”, which is 1 if Player 1 wins the match and 0 if Player 1
loses it. The data is now almost ready for the statistical models, but the ranges of some
variables are very different: for example, the difference in rank ranges between -1798 and
1815, whereas some of the other variables range between -1 and 1. Therefore, min-max
normalization is applied so that all variables range between 0 and 1 and the statistical
models are not misled about the importance of each variable. Now the data is ready to be
trained and tested on.
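A sketch of this final step, assuming `p1` and `p2` are the per-match feature frames for
Player 1 and Player 2 aligned on match, `features` lists the twelve variables of Table 3.1,
and the train/test proportion (which the thesis does not state) is set to 80/20 for
illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

diff = pd.DataFrame(p1[features].values - p2[features].values, columns=features)
y = (p1["player_id"] == p1["winner_id"]).astype(int)   # Player1_win

# Min-max normalize every difference variable to [0, 1].
diff = (diff - diff.min()) / (diff.max() - diff.min())

X_train, X_test, y_train, y_test = train_test_split(
    diff, y, test_size=0.2, random_state=1)
```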

4 Results
4.1 Logistic regression

Most of the variables had a p-value under the 0.05 threshold for statistical significance,
but the three variables measuring the difference in average double faults, average serve
points won when the first serve went in, and height all had higher p-values. A second
logistic regression without these variables was tested and showed an accuracy 0.07
percentage points higher; therefore, the results of the reduced logistic regression are
presented.

Table 4.1 shows the number of incorrectly and correctly classified predictions compared to
the true values. Table 4.2 shows that the accuracy of the model's predictions was 66.52
percent. The sensitivity is 69.82 percent and the specificity is 62.91 percent, which tells
us that the logistic regression had an easier time classifying losses than wins.

Reference
 Prediction 0 1
 0 3452 1675
 1 1492 2841
 Table 4.1. Prediction vs reference for the logistic regression

 Accuracy Sensitivity Specificity
 66.52 % 69.82 % 62.91 %
 Table 4.2. The logistic regression accuracy, sensitivity and specificity

4.2 Random Forest

Figure 4.1 shows the importance of every variable for classifying the response variable.
The top 3 most important variables, both by mean decrease accuracy and mean decrease gini,
are the difference in total win rate, the win rate on the different surfaces and the rank.
On the other hand, the two other stats variables, height and age, as well as the difference
in head-to-head matches, do not appear to be as important for the random forest's
classification.

 Figure 4.1. Variable importance for classification for the Random Forest model

Table 4.3 shows how many of the losses and wins, respectively, were classified correctly.
Table 4.4 shows that the accuracy of the predictions ended up at 65.96 percent, whereas the
sensitivity and specificity were 69.03 and 62.60 percent, respectively. It appears that it
was a little easier for the random forest model to classify losses correctly than wins.

 Reference
 Prediction 0 1
 0 3413 1689
 1 1531 2827
 Table 4.3. Prediction vs Reference for the Random Forest

 Accuracy Sensitivity Specificity
 65.96 % 69.03 % 62.60 %
 Table 4.4. The Random Forests accuracy, sensitivity and specificity

A second model was tested by removing the least important variables according to the random
forest, but the model showed neither improvement nor deterioration, and therefore the
original model is presented.

4.3 XGBoost

Figure 4.2 shows the variable importance underlying the XGBoost model's classification. It
is very similar to that of the random forest: total win rate and win rate on the different
surfaces are in the top 3, whereas height and head-to-head matches are at the bottom.

Figure 4.2. Variable importance for classification for the XGBoost model

Table 4.5 and Table 4.6 show the predicted values compared to the true values, along with
the accuracy, sensitivity and specificity. The best optimized XGBoost model showed an
accuracy of 66.19 percent, a sensitivity of 69.13 percent and a specificity of 62.98
percent. As for the other two models, the XGBoost model appears to be better at classifying
losses than wins.

 Reference
 Prediction 0 1
 0 3418 1672
 1 1526 2844
 Table 4.5. Prediction vs Reference for the XGBoost

 Accuracy Sensitivity Specificity
 66.19 % 69.13 % 62.98 %
 Table 4.6. The XGBoost accuracy, sensitivity and specificity

5 Discussion
The aim of this paper was to try to beat a benchmark prediction based on tennis players'
ranks on the ATP tour. The models chosen were logistic regression, random forest and
XGBoost. As seen in Table 5.1, all models performed a little better than the benchmark
prediction. The differences in accuracy between the models were very small, with the
logistic regression performing best but all models landing around 66 percent.

 Model Accuracy

 Benchmark Prediction 64.58 %
 Logistic Regression 66.52 %
 Random Forest 65.96 %
 XGBoost 66.19 %
 Table 5.1. Comparison of the model performances

One factor in why the models did not perform much better than the benchmark prediction
could be that the chosen variables do not allow the models to see the actual form a player
is in before the match. The idea was that the win rate over the last 10 matches would show
whether a player was in form or not, but it unfortunately did not appear to capture that.
The best-ranked players are probably the ones with the highest overall averages and win
rates. Therefore, it would have been more interesting to use the same variables but with a
rolling average over 10 matches for every variable, to see whether the models could catch
losing or winning streaks, instead of using the career average up to the match at hand.
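Such form features would be a small change to the feature construction sketched earlier;
here, as an illustration, for an assumed per-match `aces` column in the player-per-row
frame:

```python
def rolling_form(series, window=10):
    # Average over the 10 most recent matches, excluding the current one.
    return series.shift(1).rolling(window).mean()

long["Avg_ace_roll10"] = long.groupby("player_id")["aces"].transform(rolling_form)
```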

The head-to-head variable was, surprisingly enough, one of the least important variables in
the random forest and XGBoost models. This could be because tennis players do not actually
meet each other that often, except for the top-ranked players, who usually go far in
tournaments and therefore have the opportunity to meet each other more often as they play
more matches on the ATP tour.

6 Further research
One way of avoiding the exclusion of every player's first 10 matches, and of players with
fewer than 10 matches, would be to include the challenger tour or futures tournaments. The
challenger tour is usually where players who are not yet good enough for the ATP tour play,
and the futures tournaments are for younger players who have not made it to the ATP tour.

Databases for these can also be found on Jeff Sackmann's GitHub. A possible problem is that
the challenger tour and futures tournaments could have a different style of play, and this
would need to be examined before including the data.

With the betting market growing bigger and bigger, it would be interesting to first try to
beat the different bookmakers' predictions, i.e., taking the player with the lowest odds as
the deemed winner. That benchmark would probably be much higher than the one based on
rankings. If the models were to beat the bookmakers, it would then be interesting to see
whether any profit could be made. When it comes to betting, especially in tennis, the best
players usually have very low odds and would give very little profit even with a lot of
money on them. To make any profit, it would probably be important to beat the bookmakers'
favorite by quite a margin, to predict correctly in matches where the odds are close to
each other, or to find the big upsets, rather than only predicting correctly for the player
deemed the favorite.

Other variables could also be included in the models, such as whether a player is left- or
right-handed, how many minutes a player has spent on court during the past month (to
capture tiredness), or how many matches they have played during the past year (to detect
possible injuries).

It would also be of interest to use only the top 20-30 ranked players, as they have the
opportunity to meet each other many times, to see whether the head-to-head variable would
become more important.

References

Breiman, L. (2001) ‘Random Forests’, Machine Learning, 45(1), pp. 5–32.

Chen, T. and He, T. (2021) ‘xgboost: eXtreme Gradient Boosting’, p. 4.

Clarke, S. R. and Dyte, D. (2000) ‘Using official ratings to simulate major tennis
tournaments’, International Transactions in Operational Research, 7(6), pp. 585–594.

del Corral, J. and Prieto-Rodríguez, J. (2010) ‘Are differences in ranks good predictors for
Grand Slam tennis matches?’, International Journal of Forecasting, 26(3), pp. 551–563.

Han, H., Guo, X. and Yu, H. (2016) ‘Variable selection using Mean Decrease Accuracy and
Mean Decrease Gini based on Random Forest’, in 2016 7th IEEE International Conference
on Software Engineering and Service Science (ICSESS). 2016 7th IEEE International
Conference on Software Engineering and Service Science (ICSESS), pp. 219–224.

Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning. New
York, NY: Springer New York (Springer Series in Statistics).

Oshiro, T. M., Perez, P. S. and Baranauskas, J. A. (2012) ‘How Many Trees in a Random
Forest?’, in Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition.
Berlin, Heidelberg: Springer Berlin Heidelberg (Lecture Notes in Computer Science), pp.
154–168.

Sackmann, J. (2021) JeffSackmann/tennis_atp. Available at:
https://github.com/JeffSackmann/tennis_atp (Accessed: 18 May 2021).
