Ranking Prediction of Premier League Study on Regression Modelling

Ranking Prediction of Premier League

 Study on Regression Modelling

 By
 LAI Cheuk Kwan
 17226791

A thesis submitted in partial fulfillment of the requirements for the degree
 of

 Bachelor of Science (Honours)
 in Mathematics and Statistics

 at

 Hong Kong Baptist University

 Date: 4th Dec 2020
Contents

1 ABSTRACT
2 BACKGROUND INTRODUCTIONS
3 METHODOLOGIES – REGRESSION MODEL

4 DATA AND MODEL I
 4.1 Data Description
 4.2 Variables Description
 4.3 Standardization
 4.4 Model Description

5 MODEL EVALUATIONS I
 5.1 Residual Analysis I
 5.2 Coefficient of Determination I

6 RESULT OF MODEL I

7 MODEL REFORMING
 7.1 Multi-collinearity
 7.2 Stepwise regression method

8 MODEL EVALUATIONS II
 8.1 Residual Analysis II
 8.2 Coefficient of Determination II

9 DATA AND MODEL II
10 RESULTS OF MODEL II
11 COMPARISON OF MODEL I AND II

12 PREDICTIONS OF 20/21 SEASON FINAL TABLE
13 DISCUSSIONS OF LIMITATIONS
14 CONCLUSIONS

15 REFERENCE
ACKNOWLEDGEMENT

Part of the work presented in this thesis was done in collaboration with my supervisor,
Dr. C.K. Yau, Lecturer in the Department of Mathematics, Hong Kong Baptist
University. Without his guidance and help, this project could not have been
completed.

 Signature of Student

 Student Name

 Department of Mathematics
 Hong Kong Baptist University

 Date:
1 Abstract

This thesis studies the prediction of the final Premier League ranking by regression.
Its aims are to identify the key variables that determine the final ranking table and to
predict the final ranking of the Premier League 2020 – 2021 season.

First, the thesis develops a regression model for ranking prediction. It identifies which
variables should be considered for a football league, and then tests whether the
resulting model is trustworthy.

Second, the thesis focuses on improving the accuracy and effectiveness of the model. It
applies several techniques, namely a multi-collinearity check, standardization (Z-score
normalization), and stepwise selection, to refit the regression model. After constructing
the new model, the thesis compares its accuracy with that of the original one.
2 Introduction

Premier League (PL)

The Premier League (PL) is the top tier of the football pyramid in England and one of
the top five football leagues in Europe, owing to its popularity, its quality of
competition, and its star players and coaches.

Each season starts in mid-August and ends in mid-May. Each team plays every other
team twice, once at its home stadium and once as the away team at the opponent’s
stadium. There are 20 teams contesting the title each season, so each team plays 38
matches and there are 380 matches per season in total.

The fundamental aspect of every football game is, no doubt, scoring goals. Other
aspects such as defending, saving, and possession matter too, but the most memorable
moments of each match are those when the ball is put into the net. There are three
possible results of a match: win, draw, or loss.

In the PL, a team is awarded 3 points for a win, 1 point for a draw, and none for a
loss. At the end of the season, the team with the most points is crowned champion.
The final table is ranked by points. If two or more clubs have the same number of
points, they are ranked by goal difference.

At the end of each season, the top 4 teams in the table qualify for the UEFA
Champions League (UCL), a world-class competition involving the top football clubs
from different leagues in Europe. The 5th – 7th teams have a chance to join the UEFA
Europa League (UEL), a second-tier international competition involving different clubs.
On the other hand, the PL implements a system called promotion and relegation. It
means that the 4 teams at the bottom of the table drop down into the second tier of
English football, the English Football League Championship (EFL), and the top 4 teams
in the EFL can be promoted to the PL.
This shows the importance of predicting the final table: it determines which teams
qualify for the international leagues and which teams need to drop down.

Prediction of Premier League Table

There are many football predictions made by different methods, most of which aim to
predict the result of each game. They are useful for people who bet and gamble.

 Figure 1 – Screen Capture of footballpredictions.com

These predictions try to find the result of each game by analyzing team performance.
Our prediction, however, targets the final ranking of the PL. This is very useful because
the final ranking determines qualification for the international leagues. A higher
ranking in the PL, and the qualification for international leagues that comes with it, can
bring a large profit. Joel. O. (2009) mentions that the UEFA Champions League (UCL)
pays clubs between €2 million and €15 million as a fixed reward, and up to €40 million
as a bonus based on market value. A higher ranking in the PL also attracts more
supporters, who buy tickets to watch the games and buy the team’s souvenirs. The
prediction of the table is therefore very useful for the financial planning of each team.

Joel. O. (2009) built a similar regression model to predict the final ranking, with 6
variables, including % goals to shots and the ratio of short to long passes. The aim of
this thesis is to find a more precise model.
3 Methodology - Regression Analysis

Regression analysis is a set of statistical methods for estimating the relationship
between a dependent (target) variable and a set of independent variables (predictors).
Multiple linear regression, one of these methods, is a powerful tool for predicting the
dependent variable ($y$) from more than one predictor ($x_1, x_2, \dots, x_k$).

Suppose we have a data set of $n$ observations, obtained as follows:

 $y$      Predictor variables ($x_1, x_2, \dots, x_k$)
 $y_1$    $x_{11}$   $x_{12}$   ⋯   $x_{1k}$
 $y_2$    $x_{21}$   $x_{22}$   ⋯   $x_{2k}$
 ⋮        ⋮
 $y_n$    $x_{n1}$   $x_{n2}$   ⋯   $x_{nk}$
 Table 1 – Sample of Data Set

The multiple linear regression model can be written as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

Each of the $n$ observations follows the same model:
$$y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_k x_{1k} + \varepsilon_1$$
$$y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_k x_{2k} + \varepsilon_2$$
$$\vdots$$
$$y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_k x_{nk} + \varepsilon_n$$

where $\beta_0, \beta_1, \beta_2, \dots, \beta_k$ are a set of unknown parameters, denoted ($\beta$), which are
the regression coefficients associated with ($x_1, x_2, \dots, x_k$). We are going to estimate the
parameters ($\beta$) from the data set.

Some assumptions on the error ($\varepsilon$) are needed in the regression model in order
to draw statistical inferences:
$$\varepsilon_i \sim N(0, \sigma^2)$$
and
$$\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0, \quad i \neq j$$
which means the $\varepsilon_i$’s are uncorrelated, with zero covariance.
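
The estimation step described above can be illustrated with a minimal sketch. It is not the SAS procedure used in this thesis: it solves the least squares normal equations in plain Python, and the function names (`solve_linear_system`, `fit_ols`) and the small data set are hypothetical, chosen only for illustration.

```python
# Minimal ordinary least squares via the normal equations (X'X) b = X'y.
# Standard library only; the data below are made up for illustration.

def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so that row operations act on both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [b0, b1, ..., bk] minimising the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from y = 2 + 3*x1 - 1*x2 (no error term).
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [2 + 3 * x1 - x2 for x1, x2 in rows]
coefficients = fit_ols(rows, y)
print([round(b, 6) for b in coefficients])
```

Because the toy data contain no error term, the fitted coefficients recover the generating values up to rounding. With real data the estimates would differ from any "true" values by sampling error.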
4 Data Description and Setting Up Model

This chapter mainly focuses on the data description and on how the multiple linear
regression model is constructed. It first introduces the source of the data set and then
the predictors we use.

4.1 Data Description

The data come from www.whoscored.com. As no existing data set contains all the
predictors we need, I constructed the data set myself by recording the predictors of
every single match considered.

 Figure 2 – Source Page of One Match

The data set used in this research consists of 11 Premier League seasons, from 09/10 to
19/20, and includes the predictors of every football team that played in the PL. To
construct the data set, we use the predictors of all teams in the first half of each season,
which is 19 matches per team. From these, we predict the final points of each team.

Firstly, we use the data of 09/10 – 17/18, 9 seasons in total, to fit the model and
predict 18/19. Then, we use 09/10 – 18/19, 10 seasons in total, to predict 19/20. The
aim is to check whether the regression model is precise or not.
4.2 Variables Description

By the principle of multiple linear regression, we need a set of independent
variables ($x_1, x_2, \dots, x_k$) in order to predict the target variable ($y$).

Dependent Variable ($y$)

In this research, as the ranking table is ordered by final points, we predict the final
points of each team over the whole season. The target variable is therefore the final
points for the whole season.

 Dependent Variable    Data Description
 $y$                   Final points of each team for the whole season
 Table 2 – Dependent Variable ($y$)

Independent Variables ($x_1, x_2, \dots, x_{25}$)

We consider 25 predictors, which can be separated into 3 parts. The first part is the
points in the first half of the season. The second part is the performance of each team.
The third part covers the different types of streaks.

First, as we predict the final points of each team from the statistics of the first half of
the season, we use the points in the half season as a predictor of our model. Points are
generated by wins, draws, and losses. As the numbers of wins, draws, and losses sum to
exactly 19, only two of them need to be considered. So, we let $x_1$ and $x_2$ be the
numbers of wins and draws.

 Independent Variable    Data Description
 $x_1$                   Number of Won
 $x_2$                   Number of Drew
 Table 3 – Independent Variables ($x_1, x_2$)
Secondly, we consider the performance of each team in the first half of the season.
Joel. O. (2009) mentioned that football team performance can generally be separated
into five groups: attacking, passing, defending, possession, and discipline.

 Figure 3 – Team Performance (attacking, passing, defending, possession, discipline)

For attacking, we consider 7 predictors: $x_3, x_4, x_5, x_6, x_7, x_8, x_9$ are goal
difference, number of shots, number of shots on target, number of aerials won, % of
aerials won, number of corners, and % of corner accuracy, respectively.

For passing, we consider 3 predictors: $x_{10}, x_{11}, x_{12}$ are number of passes, number of
key passes, and % of pass success, respectively.

For defending, we consider 5 predictors: $x_{13}, x_{14}, x_{15}, x_{16}, x_{17}$ are number of
successful tackles, % of tackle success, number of clearances, number of interceptions,
and number of times dispossessed, respectively.

For possession, we consider 1 predictor: $x_{18}$ is % of possession.

For discipline, we consider 3 predictors: $x_{19}, x_{20}, x_{21}$ are number of fouls, number
of yellow cards, and number of red cards, respectively.
 Independent Variable    Data Description
 $x_3$                   Goal Difference
 $x_4$                   Number of Shots
 $x_5$                   Number of Shot on Targets
 $x_6$                   Number of Aerials Won
 $x_7$                   % of Aerials Won
 $x_8$                   Number of Corners
 $x_9$                   % of Corner Accuracy
 $x_{10}$                Number of Pass
 $x_{11}$                Number of Key Pass
 $x_{12}$                % of Pass Success
 $x_{13}$                Number of Successful Tackles
 $x_{14}$                % of Tackle Success
 $x_{15}$                Number of Clearances
 $x_{16}$                Number of Interception
 $x_{17}$                Number of Dispossessed
 $x_{18}$                % of Possession
 $x_{19}$                Number of Fouls
 $x_{20}$                Number of Yellow Cards
 $x_{21}$                Number of Red Cards
 Table 4 – Independent Variables ($x_3, \dots, x_{21}$)

Thirdly, we consider the different types of streaks. A streak is an unbroken run of
matches with a specified kind of result. Streaks are essential to the morale and
performance of each team. We consider 4 types of streaks: the longest winning streak,
longest unbeaten streak, longest no-won streak, and longest losing streak, denoted
as $x_{22}, x_{23}, x_{24}, x_{25}$, respectively.

 Independent Variable    Data Description
 $x_{22}$                Longest Winning Streak
 $x_{23}$                Longest Unbeaten Streak
 $x_{24}$                Longest No-won Streak
 $x_{25}$                Longest Losing Streak
 Table 5 – Independent Variables ($x_{22}, \dots, x_{25}$)
4.3 Standardization

 In data analysis, and especially in this thesis, we need to deal with various types of
 data measured on different scales. Standardization, also known as Z-score
 normalization, gives each variable zero mean and unit standard deviation. After
 standardization, the regression coefficients are easier to compare than in the original
 model. The general calculation is as follows:
 $$x' = \frac{x - \bar{x}}{\sigma}$$
 where $x$ is the original vector, $\bar{x}$ is the mean of $x$, and $\sigma$ is the standard deviation of $x$.

 In this thesis, our data set looks as follows. A sample is shown below.

 Final Point ($y$)   ⋯   Goal Difference ($x_3$)   ⋯   % of Aerials Won ($x_7$)   ⋯   Number of Pass ($x_{10}$)   ⋯
 86                  ⋯   28                        ⋯   0.4934303                  ⋯   9481                        ⋯
 85                  ⋯   22                        ⋯   0.4883383                  ⋯   10087                       ⋯
 75                  ⋯   30                        ⋯   0.4625078                  ⋯   9536                        ⋯
 64                  ⋯   12                        ⋯   0.5130489                  ⋯   6263                        ⋯
 ⋮                   ⋮   ⋮                         ⋮   ⋮                          ⋮   ⋮                           ⋮

 Table 6 – Sample of Original Data Set

 We can see that the predictors differ hugely in scale. For example, the Number of
 Pass ($x_{10}$) is in the thousands, whereas the % of Aerials Won ($x_7$) is a decimal
 fraction. To make the coefficients of our model easier to interpret, we standardize the
 predictors. The result is as follows.

 Final Point ($y$)   ⋯   Goal Difference ($x_3$)   ⋯   % of Aerials Won ($x_7$)   ⋯   Number of Pass ($x_{10}$)   ⋯
 86                  ⋯   1.98011                   ⋯   -1.10244                   ⋯   0.79245                     ⋯
 85                  ⋯   1.55605                   ⋯   -1.13118                   ⋯   1.17913                     ⋯
 75                  ⋯   2.12146                   ⋯   -0.61389                   ⋯   0.82755                     ⋯
 64                  ⋯   0.84929                   ⋯   0.52129                    ⋯   -1.2609                     ⋯
 ⋮                   ⋮   ⋮                         ⋮   ⋮                          ⋮   ⋮                           ⋮

 Table 7 – Sample of Standardized Data Set
After standardization, all the predictors are on the same scale. This is very useful
because the fitted regression coefficients become directly comparable across
predictors.
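
As an illustration of the standardization step, here is a minimal sketch using only Python's `statistics` module. It assumes the sample standard deviation (divisor n − 1); the thesis does not state which divisor SAS applied. The four pass counts are the sample rows of Table 6, so the z-scores printed here are computed over four values only and will not match Table 7, which was standardized over the full data set.

```python
import statistics


def standardize(values):
    """Return the z-scores (x - mean) / sd of a list of numbers."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]


# Number of Pass for the four sample rows shown in Table 6.
passes = [9481, 10087, 9536, 6263]
z_scores = standardize(passes)
print([round(z, 5) for z in z_scores])
```

After the transformation the values have mean 0 and standard deviation 1, whatever the original unit was.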

4.4 Model Description

At the beginning of model construction, we use the data of 09/10 – 17/18, 9 seasons
in total, to fit the model for the 18/19 season in SAS. The data include 180
observations, each being the performance of one football club in one season, from
which we calculate the regression coefficients ($\beta$) associated with ($x_1, x_2, \dots, x_{25}$).

Table 8 summarizes the coefficients of our model.

 Regression Coefficients ($\beta$)

 Coefficient    Estimate     Coefficient    Estimate
 $\beta_0$      52.1222      $\beta_{13}$   0.65934
 $\beta_1$      10.1837      $\beta_{14}$   0.04247
 $\beta_2$      1.31957      $\beta_{15}$   -0.51823
 $\beta_3$      2.25778      $\beta_{16}$   -0.18936
 $\beta_4$      0.523        $\beta_{17}$   -0.13833
 $\beta_5$      0.58895      $\beta_{18}$   0.78929
 $\beta_6$      0.7976       $\beta_{19}$   0.06089
 $\beta_7$      -0.49193     $\beta_{20}$   -0.05777
 $\beta_8$      -0.13768     $\beta_{21}$   -0.09977
 $\beta_9$      -0.6441      $\beta_{22}$   0.64542
 $\beta_{10}$   0.23822      $\beta_{23}$   1.11766
 $\beta_{11}$   1.94281      $\beta_{24}$   0.55455
 $\beta_{12}$   0.75817      $\beta_{25}$   -0.58511

 Table 8 – Regression Coefficients of Model I
Now, we get the least squares equation of our regression model from SAS:

$$\begin{aligned}
\hat{y} = {} & 52.1222 + 10.1837 x_1 + 1.31957 x_2 + 2.25778 x_3 + 0.523 x_4 + 0.58895 x_5 \\
& + 0.7976 x_6 - 0.49193 x_7 - 0.13768 x_8 - 0.6441 x_9 + 0.23822 x_{10} \\
& + 1.94281 x_{11} + 0.75817 x_{12} + 0.65934 x_{13} + 0.04247 x_{14} \\
& - 0.51823 x_{15} - 0.18936 x_{16} - 0.13833 x_{17} + 0.78929 x_{18} \\
& + 0.06089 x_{19} - 0.05777 x_{20} - 0.09977 x_{21} + 0.64542 x_{22} \\
& + 1.11766 x_{23} + 0.55455 x_{24} - 0.58511 x_{25}
\end{aligned}$$

 Figure 4 – SAS Report (1) of Model I

From the regression report in SAS, we found that the Root MSE (root mean square
error) is 6.39332, which estimates the standard deviation ($\sigma$) of our random error ($\varepsilon$).
Under the normality assumption, this means that for about 95% of the observations the
difference between the predicted final points and the real final points is within
$2\sigma = 2 \times 6.39332 = 12.78664$.
5 Model Evaluation I

This chapter mainly focuses on determining whether our regression model is precise
or not. It consists of two parts: residual analysis and the coefficient of
determination.

5.1 Residual Analysis I

Residual analysis is a process for determining whether the assumptions behind our
regression model hold. Recall that some assumptions on the error ($\varepsilon$) are needed in
the regression model in order to draw statistical inferences:
$$\varepsilon_i \sim N(0, \sigma^2)$$
We need to check whether these assumptions are satisfied.

The residual, which is our estimate of the error ($\varepsilon$), is the difference between the
observed value ($y$) and the value predicted by the model ($\hat{y}$). Every data point has
exactly one residual. The equation is as follows:
$$\varepsilon_i = y_i - \hat{y}_i$$
Both the sum and the mean of the residuals are exactly equal to 0, which means
$\sum_{i=1}^{n} \varepsilon_i = 0$ and $\bar{\varepsilon} = 0$.

We need to check whether the residuals are normally distributed and uncorrelated by
checking normality and homogeneity.
For normality, we check the normal probability plot, which plots the ordered residuals
against the values expected for their ranks under a normal distribution. If the residuals
are normally distributed, the points lie on a straight line.
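
The construction of a normal probability plot can be sketched as follows. This is an illustration only, not the SAS output used in this thesis: the residuals are hypothetical, the plotting positions (i − 0.5)/n are one common convention among several, and the correlation between the two coordinates is used here merely as a rough numerical stand-in for "the points lie on a straight line".

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each ordered residual with its expected standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are an assumed convention.
    expected = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical residuals, in points.
residuals = [4.4, -9.1, 0.3, 6.0, -2.5, 1.2, -6.2, 8.2, -1.1, 2.8, -4.0]
points = normal_plot_points(residuals)
straightness = pearson([p[0] for p in points], [p[1] for p in points])
print(round(straightness, 3))
```

A correlation close to 1 indicates that the points are nearly collinear, which is what the visual check looks for.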

 Figure 5 – Normal Probability Plot of Model I

In the right plot, the observed values ($y$) are plotted against the predicted
values ($\hat{y}$). In the left plot, the residuals are normally distributed, as the points lie
along a straight line.

 Figure 6 – Normal Distribution Bar Graph of Model I

In figure 6, we can clearly see that the residuals follow a normal distribution, which
satisfies the first part of our assumption.
For homogeneity, we check the residual plots. The residuals are homogeneous if the
plots show no pattern or trend. The check is separated into two parts: the first plots
the residuals ($\varepsilon$) against the predicted values ($\hat{y}$), and the second plots the
residuals against our predictors ($x_1, x_2, \dots, x_{25}$).

 Figure 7 – Residual Plot of $\varepsilon$ against $\hat{y}$ of Model I

In figure 7, we can clearly see that there is no trend or pattern between the
residuals ($\varepsilon$) and the predicted values ($\hat{y}$).

 Figure 8 – Residual Plot of $\varepsilon$ against $x_1, x_2, \dots, x_6$ of Model I

 Figure 9 – Residual Plot of $\varepsilon$ against $x_7, x_8, \dots, x_{12}$ of Model I

 Figure 10 – Residual Plot of $\varepsilon$ against $x_{13}, x_{14}, \dots, x_{18}$ of Model I

 Figure 11 – Residual Plot of $\varepsilon$ against $x_{19}, x_{20}, \dots, x_{24}$ of Model I

 Figure 12 – Residual Plot of $\varepsilon$ against $x_{25}$ of Model I

For won ($x_1$), drew ($x_2$), red cards ($x_{21}$), the longest winning streak ($x_{22}$), the
longest unbeaten streak ($x_{23}$), the longest no-won streak ($x_{24}$), and the longest losing
streak ($x_{25}$), the plots might look different from the others, since these predictors
take only a few distinct values. However, there is no systematic trend or pattern that
would indicate non-linearity.

So, from figure 5 to figure 12, we can conclude that our regression model fulfills the
assumptions on the residuals, so that our model is trustworthy.
5.2 Coefficient of Determination I

The coefficient of determination ($R^2$) is also a key part of regression analysis. It is
useful for checking whether the regression model is precise or not.

$R^2$ is a statistical measure of the proportion of the variance of the target variable
that can be explained by our predictors ($x_1, x_2, \dots, x_k$). The equation is as follows:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

with $SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ and $SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$.

$R^2$ is the square of the correlation between the observed value ($y$) and the
predicted value ($\hat{y}$). It ranges from 0 to 1. If $R^2 = 0$, $y$ cannot be predicted
by the predictors. If $R^2 = 1$, $y$ can be predicted by the predictors without any
error. So, if $R^2$ is close to 1, the regression model is precise.
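
The definition above translates directly into a few lines of code. The sketch below is illustrative only: the function name `r_squared` and the observed and predicted values are hypothetical, not taken from the thesis data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical final points and model predictions for five teams.
observed = [90.0, 75.0, 60.0, 45.0, 30.0]
predicted = [86.0, 78.0, 58.0, 47.0, 33.0]

print(round(r_squared(observed, predicted), 4))
print(r_squared(observed, observed))           # perfect prediction gives 1.0
print(r_squared(observed, [60.0] * 5))         # predicting the mean gives 0.0
```

The last two calls show the two boundary cases described in the text: no error at all, and a prediction that ignores the predictors and always returns the mean.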

Recall from figure 4 that, for the model of the 09/10 – 17/18 seasons,
$R^2 = 0.8809$, which means that 88.09% of the observed variation can be explained by
our regression model.

For the data of the 09/10 – 18/19 seasons, which are used to predict the 19/20 result,
$R^2 = 0.8866$, as shown in figure 13. This means that both regression models are
trustworthy for our thesis.

 Figure 13 - SAS Report (2) of Model I
6 Result of Model I

This chapter focuses on the predictions of the regression model. Two results are given
below: the prediction of the 18/19 season and the prediction of the 19/20 season. The
18/19 prediction is fitted on 180 observations from 09/10 to 17/18, and the 19/20
prediction is fitted on 200 observations from 09/10 to 18/19.

There are 20 clubs each season, and they are ranked by points: the more points, the
higher the rank. To make the prediction table easier to understand, this chapter uses
different colors to mark the groups of real rankings.

 Top Rank 1 – 4
 Rank 5 – 8
 Rank 9 – 12
 Rank 13 – 16
 Last Rank 17 – 20
 Figure 13 – Sample of Real Ranking Table

I use green to mark the top 4 teams in the real table, which qualify for the UCL.

I use yellow to mark the teams ranked 5 – 8, which have a chance to join the UEL.

I use red to mark the teams ranked 9 – 12, and blue to mark the teams ranked
13 – 16.

I use grey to mark the last teams, ranked 17 – 20, which need to drop down to the
second tier of English football.
19/20 18/19
 Team Prediction Real Team Prediction Real
 Rank Points Rank Points Rank Points Rank Points
Liverpool 1 97.7 1 99 Liverpool 1 85.9 2 97
Man City 2 76.1 2 81 Man City 2 82.7 1 98
Lei City 3 74.6 5 62 Chelsea 3 74.8 3 72
Chelsea 4 66.9 4 66 Hotspur 4 70.5 4 71

Man United 5 58.6 3 66 Arsenal 5 67.8 5 70
Wolves 6 55.4 7 59 Man United 6 58.2 6 66
Hotspur 7 55.4 6 59 Everton 7 54.3 8 54
Arsenal 8 51.5 8 56 Wolves 8 53.1 7 57

Sheffield 9 51.3 10 54 Lei City 9 51.6 9 52
Everton 10 46.6 12 49 Watford 10 51.1 11 50
C. Palace 11 46.3 14 43 West Ham 11 51.0 10 52
Southampton 12 45.2 11 52 C. Palace 12 50.0 12 49

West Ham 13 44.0 16 39 Bournemouth 13 44.9 13 45
Brighton 14 43.2 15 41 Southampton 14 42.4 16 39
Burnley 15 42.3 9 54 Brighton 15 40.6 17 36
Newcastle 16 42.1 13 44 Newcastle 16 38.4 14 45
Aston Villa 17 41.4 17 35 Huddersfield 17 33.9 20 16
Bournemouth 18 37.1 18 34 Cardiff 18 33.7 18 34
Norwich 19 35.5 20 21 Fulham 19 32.2 19 26
Watford 20 33.9 19 34 Burnley 20 30.4 15 40

 Table 9 – Result of Model I
7 Model Reforming

This chapter focuses on improving the original regression model and developing a new
one. Its aim is to construct a more effective and more precise model. The techniques
we use are (1) checking multi-collinearity and (2) the stepwise regression method.

7.1 Multi-collinearity

Multi-collinearity occurs when there are high inter-correlations between two or more
predictors in the regression model. It can make the fitted result misleading: the
coefficient estimates become unstable and may change considerably when the data
change slightly.

The method we use is the Variance Inflation Factor (VIF), which checks multi-
collinearity for every single predictor. A higher VIF value means a higher correlation
between that predictor and the others. The equation we use is as follows:

$$VIF_j = \frac{1}{1 - R_j^2}, \quad j = 1, 2, \dots, k$$

where $R_j^2$ is the coefficient of determination obtained by regressing the $j$th
predictor on the other predictors.

If $VIF_j > 10$, there is a high correlation between $x_j$ and the other predictors,
which needs to be fixed.
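
The relationship between $R_j^2$ and the VIF threshold can be made concrete with a short sketch. The function name and the $R_j^2$ values below are hypothetical. In the general case $R_j^2$ comes from regressing one predictor on all the others; the sketch only converts a given $R_j^2$ into a VIF and applies the cut-off of 10 used in this thesis.

```python
def variance_inflation_factor(r_squared_j):
    """VIF_j = 1 / (1 - R_j^2) for a predictor with auxiliary R^2 below 1."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("R_j^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_j)


VIF_THRESHOLD = 10.0  # cut-off used in this thesis

# Hypothetical auxiliary R^2 values for three predictors.
for name, r2 in [("a", 0.20), ("b", 0.80), ("c", 0.95)]:
    vif = variance_inflation_factor(r2)
    flag = "high" if vif > VIF_THRESHOLD else "ok"
    print(name, round(vif, 2), flag)
```

A VIF above 10 corresponds to $R_j^2 > 0.9$, that is, more than 90% of the variation of that predictor is explained by the other predictors.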
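 Parameter Estimates

 Predictor    VIF         Predictor    VIF         Predictor    VIF
 $x_1$        21.19002    $x_{10}$     28.39601    $x_{19}$     2.01099
 $x_2$        4.09251     $x_{11}$     18.70659    $x_{20}$     1.47441
 $x_3$        12.1252     $x_{12}$     7.78576     $x_{21}$     1.21164
 $x_4$        14.06843    $x_{13}$     2.28117     $x_{22}$     4.13565
 $x_5$        6.70586     $x_{14}$     3.89876     $x_{23}$     3.92448
 $x_6$        3.04517     $x_{15}$     2.148       $x_{24}$     2.97426
 $x_7$        1.64342     $x_{16}$     1.6085      $x_{25}$     3.12884
 $x_8$        3.35813     $x_{17}$     1.74578
 $x_9$        1.50899     $x_{18}$     19.67835

 Table 10 – VIF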

From table 10, after calculating the VIF value of each predictor in SAS, we found that
the VIF values of the number of won ($x_1$), goal difference ($x_3$), number of
shots ($x_4$), number of passes ($x_{10}$), number of key passes ($x_{11}$), and % of
possession ($x_{18}$) are greater than 10. This means that those predictors have a high
correlation with other variables.

So, we need to determine which pairs of predictors we should deal with, by checking
the correlation between every pair of predictors.
 Pearson Correlation Coefficients, N = 180
 Prob > |r| under H0: Rho = 0
 (excerpt; rows and columns of the other predictors are omitted)

                      Won       Goal         Number     Number     Number of   Possession
                                Difference   of Shot    of Pass    Key Pass    %
 Won                  1         0.91793      0.70557    0.64143    0.67610     0.63823
 Goal Difference      0.91793   1            0.73560    0.68836    0.71599     0.68778
 Number of Shot       0.70557   0.73560      1          0.72686    0.93659     0.32428
 Number of Pass       0.64143   0.68836      0.72686    1          0.68405     0.94265
 Number of Key Pass   0.67610   0.71599      0.93659    0.68405    1           0.76185
 Possession %         0.63823   0.68778      0.32428    0.94265    0.76185     1

 Table 11 – Correlation Coefficient

 After calculating the correlation coefficients in SAS, we found that the number of
 won ($x_1$) has a high correlation with goal difference ($x_3$). As the number of won and
 goal difference are two different indicators that are both essential to a football league,
 we keep both of them.

 The number of shots ($x_4$) and the number of key passes ($x_{11}$) are also highly
 correlated. A key pass is a pass that leads directly to a shot, such as an assist, so when
 the number of key passes increases, the number of shots increases. So, we only
 consider the number of shots ($x_4$). For the number of passes ($x_{10}$) and % of
 possession ($x_{18}$), a team with a higher possession rate holds the ball for a longer time
 and therefore makes more passes. So, we only consider one of them, and we have
 chosen the number of passes ($x_{10}$).
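 Parameter Estimates

 Predictor    VIF         Predictor    VIF        Predictor    VIF
 $x_1$        21.18311    $x_{10}$     9.23417    $x_{21}$     1.20611
 $x_2$        4.06739     $x_{12}$     7.50583    $x_{22}$     4.12053
 $x_3$        12.07777    $x_{13}$     2.21051    $x_{23}$     3.85683
 $x_4$        4.98750     $x_{14}$     3.80409    $x_{24}$     2.97043
 $x_5$        5.84718     $x_{15}$     2.09526    $x_{25}$     3.10425
 $x_6$        3.02930     $x_{16}$     1.56892
 $x_7$        1.64193     $x_{17}$     1.58008
 $x_8$        3.02930     $x_{19}$     1.83172
 $x_9$        1.39665     $x_{20}$     1.46409

 Table 12 – Updated VIF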

After dealing with multi-collinearity, we have removed two predictors, the number of
key passes ($x_{11}$) and % of possession ($x_{18}$), each of which was highly correlated with
another predictor.

7.2 Stepwise Regression Method

The stepwise regression method is a combination of the backward and forward
selection methods. It is a useful method for choosing the predictors with which to fit
the regression model. At each step of the stepwise method, predictors are considered
for addition to or removal from the model. The goal is to select the predictors that
affect the result most strongly.

We choose the result with the highest coefficient of determination ($R^2$), since the
highest $R^2$ means that most of the variation in the target variable can be
explained.
Figure 14 – Summary of Stepwise Selection

 Figure 15 – Stepwise Result

From SAS, we found that when the model includes the number of won ($x_1$), number of
drew ($x_2$), goal difference ($x_3$), number of shots ($x_4$), and number of passes ($x_{10}$),
$R^2$ is the highest, at 0.8740.
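
To give a feel for how predictors enter a model one at a time, here is a simplified sketch. It is not the SAS stepwise procedure used above: it performs forward selection only, adds the predictor that raises $R^2$ the most, and stops when the gain falls below a hypothetical threshold `min_gain`, whereas SAS stepwise selection enters and removes predictors using significance levels. All names and data are made up for illustration.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of an OLS fit of y on an intercept plus the given predictor columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_select(predictors, y, min_gain=0.01):
    """Greedily add the predictor with the largest R^2 gain until gains are small."""
    chosen, best = [], 0.0
    remaining = list(predictors)
    while remaining:
        scores = {name: r_squared([predictors[c] for c in chosen + [name]], y)
                  for name in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best < min_gain:
            break
        chosen.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return chosen, best


# Hypothetical data: y depends on "won" and "goal_diff" only; "noise" is irrelevant.
predictors = {
    "won": [1, 2, 3, 4, 5, 6, 7, 8],
    "goal_diff": [2, -1, 3, 0, 4, -2, 5, 1],
    "noise": [1, 0, 1, 0, 0, 1, 1, 0],
}
y = [10 + 3 * w + 2 * g for w, g in zip(predictors["won"], predictors["goal_diff"])]
chosen, best_r2 = forward_select(predictors, y)
print(chosen, round(best_r2, 4))
```

In this toy example the irrelevant predictor is never added, because it cannot raise $R^2$ once the two relevant predictors are in the model.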
8 Model Evaluation II

This chapter focuses on determining whether the new regression model is precise or
not. The methods we use are residual analysis and the coefficient of determination.

8.1 Residual Analysis II

For normality, we are going to check the normal probability plot.

 Figure 16 – Normal Probability Plot of Model II

We can clearly see that, in the left plot, the residuals are normally distributed, as the
points lie along a straight line.

 Figure 17 – Normal Distribution Bar Graph of Model II

In figure 17, we can clearly see that the residuals follow a normal distribution, which
satisfies the first part of our assumption.
For homogeneity, we are going to check the residual plots.

 Figure 18 – Residual Plot of $\varepsilon$ against $\hat{y}$ of Model II

In figure 18, we can clearly see that there is no trend or pattern between the
residuals ($\varepsilon$) and the predicted values ($\hat{y}$).

 Figure 19 – Residual Plot of $\varepsilon$ against $x_1, x_2, x_3, x_4, x_{10}$ of Model II

In figure 19, we can clearly see that there is no trend or pattern between the
residuals ($\varepsilon$) and our predictors.
8.2 Coefficient of Determination II

Recall from figure 14 that, for the model of the 09/10 – 17/18 seasons,
$R^2 = 0.8740$, which means that 87.4% of the observed variation can be explained by
our regression model.

For the data of the 09/10 – 18/19 seasons, which are used to predict the 19/20 result,
$R^2 = 0.8780$, as shown in figure 20. This means that both regression models are
trustworthy for our thesis.

 Figure 20 – SAS Report (2) of Model II
9 Model Description II

By checking multi-collinearity and applying stepwise selection, we have constructed a
new regression model with 5 predictors.

Dependent Variable ($y$)

 Dependent Variable    Data Description
 $y$                   Final points of each team for the whole season
 Table 13 – Dependent Variable ($y$)

Independent Variable ($x$)

 Independent Variable    Data Description
 $x_1$                   Number of Won
 $x_2$                   Number of Drew
 $x_3$                   Goal Difference
 $x_4$                   Number of Shot
 $x_5$                   Number of Pass
 Table 14 – Independent Variable ($x$)

Regression Coefficient ($\beta$)

 Coefficient    Estimate
 $\beta_0$      52.1222
 $\beta_1$      11.01653
 $\beta_2$      2.11723
 $\beta_3$      2.94299
 $\beta_4$      2.48799
 $\beta_5$      1.45327
 Table 15 – Regression Coefficient of Model II

Least Squares Equation
$$\hat{y} = 52.1222 + 11.01653 x_1 + 2.11723 x_2 + 2.94299 x_3 + 2.48799 x_4 + 1.45327 x_5$$
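
The least squares equation of model II can be applied as in the following sketch. The coefficients are copied from Table 15; the dictionary keys, the function name, and the standardized inputs for the example team are hypothetical. The inputs are assumed to be standardized predictors (z-scores), as in section 4.3.

```python
# Coefficients of model II, copied from Table 15.
MODEL_II_INTERCEPT = 52.1222
MODEL_II_COEFFICIENTS = {
    "won": 11.01653,
    "drew": 2.11723,
    "goal_difference": 2.94299,
    "shots": 2.48799,
    "passes": 1.45327,
}


def predict_points(z_scores):
    """Predicted final points from standardized first-half-season predictors."""
    return MODEL_II_INTERCEPT + sum(
        coefficient * z_scores[name]
        for name, coefficient in MODEL_II_COEFFICIENTS.items()
    )


# Hypothetical team: one standard deviation above average in wins, average otherwise.
example_team = {"won": 1.0, "drew": 0.0, "goal_difference": 0.0, "shots": 0.0, "passes": 0.0}
print(round(predict_points(example_team), 4))
```

A team that is exactly average on every standardized predictor is predicted the intercept, 52.1222 points, and each coefficient is the change in predicted points per standard deviation of its predictor.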
10 Result of Model II

 This chapter focuses on the predictions of the new regression model. Two results are
 given below: the prediction of the 18/19 season and the prediction of the 19/20
 season. The 18/19 prediction is fitted on 180 observations from 09/10 to 17/18, and
 the 19/20 prediction is fitted on 200 observations from 09/10 to 18/19.

 19/20 18/19
 Team Prediction Real Team Prediction Real
 Rank Points Rank Points Rank Points Rank Points
Liverpool 1 93.8 1 99 Liverpool 1 84.0 2 97
Man City 2 80.9 2 81 Man City 2 81.0 1 98
Lei City 3 72.5 5 62 Chelsea 3 74.1 3 72
Chelsea 4 65.7 4 66 Hotspur 4 70.8 4 71

Man United 5 58.4 3 66 Arsenal 5 66.7 5 70
Wolves 6 56.6 7 59 Man United 6 59.7 6 66
Hotspur 7 56.5 6 59 Everton 7 53.9 8 54
Sheffield 8 52.8 10 54 Lei City 8 52.5 9 52

Arsenal 9 49.3 8 56 Wolves 9 52.4 7 57
C. Palace 10 47.4 14 43 West Ham 10 50.4 10 52
Everton 11 46.3 12 49 Watford 11 50.3 11 50
Brighton 12 45.3 15 41 Bournemouth 12 48.3 13 45

Newcastle 13 44.0 13 44 C. Palace 13 44.1 12 49
Burnley 14 43.8 9 54 Southampton 14 42.1 16 39
Southampton 15 42.7 11 52 Brighton 15 41.8 17 36
Bournemouth 16 42.0 18 34 Newcastle 16 38.9 14 45

Aston Villa 17 41.7 17 35 Cardiff 17 34.6 18 34
West Ham 18 40.5 16 39 Fulham 18 34.5 19 26
Norwich 19 33.3 20 21 Huddersfield 19 32.8 20 16
Watford 20 31.6 19 34 Burnley 20 29.2 15 40

 Table 16 – Result of Model II
11 Comparison of Model I and Model II

This chapter focuses on the comparison of model I and model II, that is, the models
before and after development. From it we can determine which model is more
accurate.

We consider 6 indicators: (1) number of correctly predicted teams in the top 4 and
last 4, (2) number of correctly predicted teams in ranks 5 – 8, (3) number of correctly
predicted teams in ranks 9 – 16, (4) absolute point difference, (5) ranking difference,
(6) number of correct positions.

For (1), the top 4 football teams matter because they qualify for the UEFA Champions
League, and the last 4 football teams matter because they need to drop down to the
EFL Championship. A higher number means higher accuracy.

For (2), the teams ranked 5 – 8 matter because they have a chance to join the UEFA
Europa League. A higher number means higher accuracy.

For (3), the teams ranked 9 – 16 cover the remaining ranks. A higher number means
higher accuracy.

For (4), as our target variable is the final points, the point difference is necessary to
determine the accuracy of the model. A lower number means higher accuracy.

For (5), we measure the rank difference between our predicted ranking and the real
ranking. A lower number means higher accuracy.

For (6), we count the number of predicted positions that are exactly correct. A higher
number means higher accuracy.
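
The rank-based indicators can be computed as in the sketch below, which uses the predicted and real ranks of the 18/19 season under model I, copied from Table 9. The reading of indicators (1) to (3) as "the team is predicted in a band and also finishes in that band" is an assumption, and the function names are hypothetical. Indicator (4) is left out because it depends on unrounded predicted points.

```python
# (predicted rank, real rank) for 18/19 under model I, copied from Table 9.
RANKS_1819 = {
    "Liverpool": (1, 2), "Man City": (2, 1), "Chelsea": (3, 3), "Hotspur": (4, 4),
    "Arsenal": (5, 5), "Man United": (6, 6), "Everton": (7, 8), "Wolves": (8, 7),
    "Lei City": (9, 9), "Watford": (10, 11), "West Ham": (11, 10), "C. Palace": (12, 12),
    "Bournemouth": (13, 13), "Southampton": (14, 16), "Brighton": (15, 17),
    "Newcastle": (16, 14), "Huddersfield": (17, 20), "Cardiff": (18, 18),
    "Fulham": (19, 19), "Burnley": (20, 15),
}


def band_hits(ranks, low, high):
    """Teams whose predicted and real ranks both fall in the band [low, high]."""
    return sum(1 for predicted, real in ranks.values()
               if low <= predicted <= high and low <= real <= high)


def ranking_difference(ranks):
    """Indicator (5): sum of absolute differences between predicted and real rank."""
    return sum(abs(predicted - real) for predicted, real in ranks.values())


def correct_positions(ranks):
    """Indicator (6): number of teams whose predicted rank equals the real rank."""
    return sum(1 for predicted, real in ranks.values() if predicted == real)


indicator_1 = band_hits(RANKS_1819, 1, 4) + band_hits(RANKS_1819, 17, 20)
indicator_2 = band_hits(RANKS_1819, 5, 8)
indicator_3 = band_hits(RANKS_1819, 9, 16)
indicator_5 = ranking_difference(RANKS_1819)
indicator_6 = correct_positions(RANKS_1819)
print(indicator_1, indicator_2, indicator_3, indicator_5, indicator_6)
```

Under this reading the five values agree with the 18/19 row of Table 17.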
For Model I,
 Model I
 (1) (2) (3) (4) (5) (6)
 18/19 7 4 7 99 20 9

 19/20 7 3 6 99 28 5

 Total 14 7 13 198 48 14

 Table 17 – Efficacy of Model I

For Model II,
 Model II
 (1) (2) (3) (4) (5) (6)
 18/19 7 3 4 109 21 6

 19/20 6 2 2 98 32 4

 Total 13 5 6 207 53 10

 Table 18 – Efficacy of Model II

After the comparison, we found that model I is more accurate than model II, as
model I has greater predictive power.

However, for (1), model II predicts 13 football teams correctly while model I predicts
14, even though model II considers only 5 independent variables while model I
considers 25. This shows that model II is also effective.

All in all, in this thesis, we can conclude that model I is more accurate, while model II
is more efficient, since it uses far fewer predictors.
12 Prediction of 20/21 Final Ranking Table

Here is my prediction of the 20/21 final ranking table, obtained by applying model I.

 Team                Predicted Rank   Predicted Points
 Man City                   1              68.6
 Liverpool                  2              67.3
 Man United                 3              66.5
 Lei City                   4              63.2
 Hotspur                    5              62.9
 Everton                    6              61.5
 Chelsea                    7              61.2
 Aston Villa                8              60.2
 West Ham                   9              57.0
 Southampton               10              56.2
 Leeds                     11              52.8
 Wolves                    12              52.5
 Newcastle                 13              48.1
 Arsenal                   14              46.5
 Crystal Palace            15              46.3
 Fulham                    16              40.8
 Burnley                   17              39.8
 Brighton                  18              34.8
 West Brom                 19              32.1
 Sheffield United          20              29.3

 Table 18 – 20/21 Prediction Table
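
The ranks in the prediction table follow from ordering the teams by their predicted points. The following is a minimal Python sketch of that step, using the predicted points listed in the table. The function name `rank_by_points` is hypothetical, and ties are not treated specially (none occur in this table).

```python
# Predicted 20/21 points from model I, copied from the prediction table.
predicted_points = {
    "Man City": 68.6, "Liverpool": 67.3, "Man United": 66.5, "Lei City": 63.2,
    "Hotspur": 62.9, "Everton": 61.5, "Chelsea": 61.2, "Aston Villa": 60.2,
    "West Ham": 57.0, "Southampton": 56.2, "Leeds": 52.8, "Wolves": 52.5,
    "Newcastle": 48.1, "Arsenal": 46.5, "Crystal Palace": 46.3, "Fulham": 40.8,
    "Burnley": 39.8, "Brighton": 34.8, "West Brom": 32.1,
    "Sheffield United": 29.3,
}


def rank_by_points(points):
    """Return (rank, team, points) tuples, highest predicted points first."""
    ordered = sorted(points.items(), key=lambda item: item[1], reverse=True)
    return [(rank, team, pts) for rank, (team, pts) in enumerate(ordered, start=1)]


table = rank_by_points(predicted_points)
for rank, team, pts in table:
    print(f"{rank:>2}  {team:<17} {pts:5.1f}")
```
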

13 Discussion of Limitations

This chapter focuses on the limitations of our regression model for predicting the
final table.

Unexpected Incidents in the Second Half of the Season

First, as we use only the data of the first half of the season, which is the first 19
matches for each team, unexpected incidents happening in the second half of the
season cannot be taken into account.

For instance, any team can change its manager or coach in order to get better results,
and tactics and strategy can change with a different coach. Also, each club can sign
new players, or lose players, in the winter transfer window, which opens in the middle
of the season. This may lead to a different performance between the first half and the
second half.

Take Manchester United F.C. (Man United) in 19/20 as an example. After the first half
of the season, they were actually in rank 8, and my prediction for their final rank was 5.
However, Man United signed a new attacking midfielder, Bruno Fernandes, in the winter
transfer window, and he had a huge influence on the club. He scored 10 goals and
provided 6 assists, a very valuable contribution to the club, and won two Player of the
Month awards and one Goal of the Month award in the 19/20 season. His contribution
helped Man United to reach a high final position, rank 3.

Apart from changes within teams, some incidents in the second half of the season may
affect the whole league. Take the 19/20 season as an example: in March 2020, the
Premier League was suspended for about 3 months because of COVID-19. This led to a
large difference in each team's performance between the period before the suspension
and the period after the resumption, which we cannot account for. The health of players
has a huge influence on their performance. Also, during the suspension, players could
only train individually, not with the team. This might lead to worse results for them
and to worse predictions for us.

Take Leicester City F.C. (Lei City) in 19/20 as an example. Before the suspension, Lei
City had played 29 games and was in rank 3, which was close to our prediction, with a
winning percentage of 55%. However, after the resumption, Lei City played 9 games
and won only 2, a winning percentage of 22%. Their performance was far worse after
the resumption.

This was an unexpected incident that our model did not consider and that affected our
accuracy, as we may overestimate or underestimate the performance of each team.

Unconsidered Tactics of Every Team

In our regression model, number of shots (variable 4), number of shots on target
(variable 5), number of passes (variable 10), number of key passes (variable 11), and
% of possession (variable 18) are essential predictors. They can be easily interpreted
as measures of scoring chances. When a team wants to score, it needs to organize its
attack and try to put the ball into the net. If a team has a higher number of shots on
target, it has a higher probability of winning.

However, there are special cases in which a team with a lower % of possession and
fewer shots wins the game and gets 3 points. This is due to the tactics and style of
different coaches. For example, the tactics of José Mourinho, the manager and head
coach of PL club Hotspur, are rather different from those of other coaches. His
approach can be described as counter-attacking, which gives up possession %, shot
numbers, etc. It focuses on taking fewer chances and still getting the points.

This means that different tactics may lead to different patterns in the data, which may
affect the accuracy of our model.

Unconsidered Factors

Every club has a different length of break between two games, due to international
competitions. The length of the break affects their performance in the PL. However, as
our model focuses on the final table of the PL, we have not considered the break length
of each team. This may affect the accuracy of our model.

Insufficient Prediction Accuracy for Midstream Clubs

The performances of midstream clubs, i.e. those in ranks 9 – 16, are quite similar, for
example in the number of shots, the number of wins, the goal difference, etc. The
differences between their points are small, which means their ranking can change
easily after a few games. It is hard for our model to predict their ranking precisely.

 19/20 First Half

 Team           Rank   Points
 ⋮               ⋮       ⋮
 C. Palace        9      26
 Newcastle       10      25
 Arsenal         11      24
 Burnley         12      24
 Everton         13      22
 Southampton     14      21
 ⋮               ⋮       ⋮

 Table 19 – 19/20 Point Table of First Half of Season

For example, in table 19, we can see that the points of midstream clubs are quite
similar. If Burnley (rank 12) won their next game, gaining 3 more points, their rank
could rise to rank 8 or even higher.

It is hard for our model to predict the ranking of midstream clubs precisely, as their
performances are similar.
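
The sensitivity of the midstream ranking can be illustrated with the six clubs listed in table 19. The following is a minimal Python sketch under simplifying assumptions: only the listed clubs are included (the clubs above rank 9 are not shown in the table, so the sketch cannot confirm the exact final rank), all other clubs keep their points, and tie-breakers such as goal difference are ignored. The function names `order_by_points` and `add_win` are hypothetical.

```python
# Points after the first half of 19/20, copied from table 19 (ranks 9 - 14 only).
first_half = {
    "C. Palace": 26,
    "Newcastle": 25,
    "Arsenal": 24,
    "Burnley": 24,
    "Everton": 22,
    "Southampton": 21,
}


def order_by_points(points):
    """Club names from most to fewest points (equal points keep listing order)."""
    return sorted(points, key=lambda club: points[club], reverse=True)


def add_win(points, club):
    """Copy of the table in which `club` has gained 3 points for one win."""
    updated = dict(points)
    updated[club] += 3
    return updated


# Gap between the best and worst of the six listed clubs.
spread = max(first_half.values()) - min(first_half.values())

before = order_by_points(first_half)
after = order_by_points(add_win(first_half, "Burnley"))

print("Point spread among listed clubs:", spread)
print("Burnley's place among listed clubs before:", before.index("Burnley") + 1)
print("Burnley's place among listed clubs after one win:", after.index("Burnley") + 1)
```

A single win is enough to move Burnley above every other listed club, which shows how small point gaps make the ranking in this part of the table unstable.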

14 Conclusion

In this thesis, we have set up a linear regression model in order to predict the final
table of the Premier League.

First, we collected the useful data of each team from the 09/10 season to the 19/20
season. We set up regression model I from those data and evaluated it.

Then, in order to improve the accuracy and efficiency of our model, we checked for
multi-collinearity and applied the stepwise regression method. We set up a new model
II that contains only 5 predictors.

After that, we compared the accuracy of model I and model II. We found that model I
is more precise while model II is more effective.

Finally, we used model I to predict the 20/21 final ranking table.