Regression Analysis on NBA Players Background and Performance using Gaussian Processes

Can NBA-drafts be improved by taking socioeconomic background into consideration?

                          LUDVIG PERSSON LEJON
                               ludper@kth.se

                            FREDRIK BERNTSSON
                               fbernts@kth.se

  Degree Project in Engineering Physics at KTH School of Computer Sciences and
                                 Communications
                                   Supervisors:
                              Petter Ögren (General)
                        Carl-Henrik Ek (Machine Learning)
                             Examiner: Mårten Olsson

                               TRITA xxx yyyy-nn
Abstract
In modern society it is well known that an individual’s background matters in her
career, but should it be taken into consideration in recruiting processes in general,
and in the recruiting of NBA players in particular? Previous research shows that white
basketball players from high-income families have a 75% higher chance of becoming
an NBA player than white basketball players from low-income families. In this paper,
we examine whether there is a connection between an NBA player’s background and
the chances of succeeding in the NBA, given that the player has been picked in the
NBA-draft. The results were obtained using machine learning algorithms based on
Gaussian Processes. They show that draft decisions will not be improved by taking
socioeconomic background into consideration.
Referat

 Regression Analysis with Gaussian Processes
      of NBA Players’ Success and
         Socioeconomic Background

In today’s society there is an awareness that background matters for an individual’s
career, but should it be taken into account in recruiting processes in general and in
the recruiting of NBA players in particular? Research has previously shown that white
American basketball players from a high-income background have a 75% greater
chance of reaching the NBA than white basketball players from an American
low-income background. In this report we have examined whether there is a
connection between the environment a player grew up in and the possibility of
succeeding as an NBA player, given that the player has been picked in the NBA-draft.
The results were produced using machine learning algorithms derived from Gaussian
Processes. These results show that the choice of players in the draft is not improved
by taking socioeconomic background into account.
Contents

1 Introduction
  1.1 Background
  1.2 Data Analysis
  1.3 NBA-Draft
  1.4 Previous Research on Player Background
  1.5 Research Question
  1.6 Delimitations

2 Parameters
  2.1 Socioeconomic Background Parameters
  2.2 Successful Draft Pick Parameter
      2.2.1 How to Measure General Success
      2.2.2 Introducing BPL-index and BPL-level

3 Method
  3.1 Data Scraping Method
  3.2 Regression Method Requirements
  3.3 Possible alternative Approach: Neural Networks
  3.4 Non Technical Description of Gaussian Processes
  3.5 Technical Description of Gaussian Processes
      3.5.1 Important Definitions
      3.5.2 What This Means
      3.5.3 Worth Mentioning About the Multivariable Case
  3.6 How to use Gaussian Processes as Regression

4 Results
  4.1 Games Played versus Single Hometown Parameter

5 Analysis
  5.1 Single Parameter Analysis
  5.2 Multivariable Analysis

6 Conclusions
  6.1 Recommendations
  6.2 General Discussion
      6.2.1 Sociology
      6.2.2 Data
      6.2.3 Method
      6.2.4 Success measure

Bibliography

A Graphs

B Data variables
1. Introduction

This introductory chapter provides a detailed background on how we arrived at the
research question: “Can the NBA-draft picks, by looking at players’ performance and
socioeconomic background data, be improved using Gaussian Processes as regression?”

1.1    Background

In businesses all around the world, recruiting competent people is a key factor for
success [27]. When modern companies recruit, they do not only take hard facts
(e.g. number of products sold in the previous year, number of employees or grades)
into consideration. They also consider the personality traits of their prospective
employees [11], and the standard way of doing this is through interviews and
references.

From the perspective of sociology, the personality traits considered in job
interviews can partly be traced back to socioeconomic background. In "Le sens
pratique" [6], the French sociologist and anthropologist Pierre Bourdieu thoroughly
describes the term habitus. Sociologist Donald Broady provides a short version in
which habitus is explained as the result of social experiences and collective
memories: a way of moving and thinking that is imprinted on people’s minds and
bodies [8]. A tangible observation directly related to habitus is that students’
different career paths and levels of success can be traced back to their background
[7]. Furthermore, studies connect specific city traits with achievements in different
areas; one example is that wages are systematically higher in cities with rich
linguistic diversity [21]. With this in mind, we ask whether it is possible to find
specific examples that connect the chances of succeeding with hometown traits using
machine learning algorithms. This question is, however, too broad for this paper, so
we narrow it down to a more feasible scope.

One way of limiting the scope of our research is to focus on a niche. We look for
a niche that offers:


i) The possibility to compare individuals’ level of success. This is important since
     success is not easily measured in general.

ii) Easy access to the information that lets us compare success. This is important
     for two main reasons: firstly, an easily accessed dataset increases the
     confirmability, and with it the trustworthiness, of the research; secondly,
     using easily accessed data saves time.

On i), success in sports has been measured previously [23]. Success in fields other
than sports is often measured by comparing salaries, as for example in the research
by Thomas W. H. Ng, Lillian T. Eby, Kelly L. Sorensen and Daniel C. Feldman on
career success [28]. Regarding wish list item ii), finding individuals’ salaries and
connecting them with their hometowns within the timeframe of this research would
require a tremendous amount of work, and the result would not be especially
confirmable since we do not want to make these individuals’ personal data publicly
available. Sports statistics, however, are easily accessed and fit the second
criterion well.

We will examine how hometown traits affect the possibility of succeeding in a certain
sport, since there is a connection between city traits and success via the common
denominator known as personality.

There are over 3,000 sports in the World Sports Encyclopedia [17], so which one should
we choose? There are certain criteria we look for; we would like a sport that has:

i) Valid statistics. Not every sport is a good choice statistics-wise. An extreme
     example is fishing, where statistics tell us less than in other sports since
     local conditions heavily influence the outcome.

ii) Teams and not individuals. Activities in many organizations are designed
      entirely around groups [19], which could make it possible to connect our
      results from the world of sports to ordinary working life.

iii) Players who can be traced to their hometowns, where hometown data is
      accessible.

iv) A considerable number of practitioners, since a larger sport reduces the
     selection bias among practitioners.

v) One highest league. This is important for comparison: several of the European
    football leagues are considered to be really good, but it is hard to tell who
    is the better scorer between a player who scored 12 goals in the Spanish
    Primera Division in 1996 and a player who scored 16 goals the same year in the
    German Bundesliga.


This wish list leaves us with a couple of alternatives, but we chose basketball since
i) it has statistics for all players back to 1947 [4]; ii) an overwhelming majority of
the players are from the US [4], which iii) makes them easier to trace than if they
were of many different nationalities; iv) the NBA had over 17,000,000 visitors in the
2011-2012 season; and v) the NBA is considered substantially better than other
leagues. An indication of the latter is that the average NBA franchise is now worth
$634 million [3] and that NBA arenas hold over 20,000 visitors [30], compared to
5,000 in the Italian league [29].

1.2     Data Analysis

When NBA talent scouts look for new talent, they look for players who fulfill a
combination of subjective and objective criteria for what a basketball player needs
in order to become a top player. The gut feeling on which the subjective criteria are
based is described and discussed by Nobel Prize laureate Daniel Kahneman [15], who
claims that we are not able to take all facts into consideration when making
decisions, which often leads to biased decisions.

In recent years pattern recognition has developed and is now used in everything from
the US Postal Service, which automatically reads handwritten letters [25], to
determining which Bordeaux wine will be the most expensive twenty years after its
production date [1].

The often biased decisions described by Kahneman could be reduced by analysing
patterns. Since computer analysis tools are strong at identifying patterns, we will
use a modern computer analysis tool based on Gaussian Processes to investigate the
connections between hometown traits and the success level of the player. This should
lead to decisions that are less influenced by the sometimes misleading gut feeling.

1.3     NBA-Draft

The NBA-draft is an event that occurs every year. Its purpose is to enroll
non-NBA players into the NBA. Basketball players who have not previously played in
the NBA are eligible for the NBA-draft. Typically this means that foreign players and
players in the college leagues apply for the NBA-draft, where a total of about 50
players are recruited to NBA teams each year.

The NBA teams pick players in an order determined by a lottery. The probability
distribution of the pick order depends heavily on the results of the previous season:
an NBA team with a poor record is likely to be among the first teams to pick.


The players picked early in the draft are expected to perform better than players
picked late in the draft. This is discussed further and shown in section 2.2.2.

1.4    Previous Research on Player Background

A number of studies have been conducted in the specific area of sports and
background. One example is a study of the likelihood of reaching the NBA, which tells
us that players from low-income African American families have a 37% lower chance of
reaching the NBA than players from high-income African American families. The
difference in chance between white low- and high-income families is 75% in favor of
the high-income families [10].

However, after interviewing sports journalist Mattias Lühr (Expressen), we found
out that North American teams look almost exclusively at on-field performance and
physical performance (e.g. points made, +/- statistics or 40 yard time) rather than
personality traits before they draft players. Despite the previous research, we do
not know whether a player from a low-income family is more or less likely to succeed
in the NBA than a player from a high-income family, given that both have been
drafted.

1.5    Research Question

Given all this background, the question we ask is “Can the NBA-draft picks, by
looking at players’ performance and socioeconomic background data, be improved
using Gaussian Processes as regression?”.

1.6    Delimitations

This research covers only players who were born in the US and who have played at
least one game in the NBA.

2. Parameters

In order to measure draft pick success, we introduce in this chapter the Berntsson
Persson Lejon (BPL) level and BPL index. We also explain in depth why the chosen
variables for socioeconomic background are income, education, housing and criminal
activity.

The players examined are all players drafted in the years 1990-1999 who have played
one or more games in the NBA and were born in the US.

2.1    Socioeconomic Background Parameters

Determining a player’s chance of succeeding is not only a question of physiological
profiling. Such profiling can contribute to determining the likelihood of success,
but the determinants of success are multi-factorial [14], so we also take
socioeconomic factors into consideration.

To be able to focus on the socioeconomic background, we have to know what
socioeconomic background actually means. The definition of the word “socioeconomic”
from the Oxford Dictionaries is “Relating to or concerned with the interaction of
social and economic factors” [26]. But what is a viable way of measuring
socioeconomic background from a statistical point of view? Previous research that
has measured socioeconomic status has often used income, education and occupation
as defining factors [32], but there are also well-cited articles that use variables
such as social support [18] in their definition of socioeconomic background.

The data we have collected on the backgrounds of the NBA players in our study
contains the variables in appendix B. However, as we increase the number of variables
used in the regression, we also need more players to support it. Increasing the
number of American NBA players drafted between 1990 and 1999 is, unfortunately for
our research, impossible. We therefore have to choose certain variables from our
data set to test against the player’s likelihood of success.

G.A. Kaplan and J.E. Keil claim that the most frequent way of measuring
socioeconomic status is to look only at educational level, because education data is
easily accessed. On the other hand, they also stress that combining several ways of
measuring socioeconomic status has merit [16]. Another simple and powerful way to
distinguish between socioeconomic statuses is housing tenure [13].

Research has also shown that the level of criminal activity is correlated to class
status [2].

If we now summarize what other socioeconomic research bases its arguments on,
we say that socioeconomic background is measured by:

i) Income

ii) Education

iii) Occupation

iv) Housing tenure

v) Criminal activity

vi) Population

The quantitative measurements of these background traits will be based on the
following parameters from the player’s hometown:

i) Population

ii) Income per Capita

iii) Household Income

iv) Home Appreciation

v) Homes Owned

vi) Housing Vacant

vii) Homes Rented

viii) Violent Crime

ix) 2 Year College

x) 4 Year College


xi) Graduate Degree

xii) Highschool Graduate

2.2     Successful Draft Pick Parameter

2.2.1   How to Measure General Success

Games played has previously been used to measure success in team sports. Other
measures favor a certain type of player, while the number of games played favors
the player whom the coach believes is good enough to play. We follow the example of
Tingling, Masri and Martell’s research on the NHL-draft [23], where they argue in
favor of measuring the number of games played. Three of their arguments are also
valid for basketball:

i) It is verifiable and easy to measure

ii) It is easy to compare players across positions and teams

iii) Players who do not contribute to the team are unlikely to play

This is why games played is part of the way we measure the success level of a
draft pick. The vast majority of the players in our data have retired, hence the
draft year will not affect the success measure.

An in-depth discussion of success is given in section 6.2.4.

2.2.2   Introducing BPL-index and BPL-level

The BPL-index is the expected number of games played for a player drafted at a
specific draft position. The BPL-index is calculated from a set of drafted players,
where every player is represented by two numbers:

i) Draft pick number.

ii) Number of games played in career.

From this set of data, an exponential regression is performed with $x$ as the draft
pick number and $y$ as the number of games played. This yields the exponential
function $y_{bpl}(x) = e^{ax}$, which gives the expected number of games played for a
specific draft pick number and is denoted the BPL-index.


The BPL-level is a measurement of draft pick success. It is obtained from formula
2.1, where $n_d$ is the number of games played by the drafted player, $y_{bpl}(p)$ is
the exponential function defining the BPL-index and $p$ is the draft pick number of
the measured player.

\[ \text{BPL-level} = \frac{n_d - y_{bpl}(p)}{y_{bpl}(p)} \tag{2.1} \]

The BPL-level is a dimensionless number. A positive BPL-level represents a player
who has played more games than the expected number of games, and a negative
BPL-level represents a draft pick who has played fewer games than the expected
number of games. The BPL-level is the chosen method of measuring draft pick success.
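
The two definitions above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the sample numbers are made up for illustration, and the fit includes an intercept $b$ (that is, $y = e^{ax+b}$), which is an assumption, since the formula in the text shows only $e^{ax}$. The regression is done as an ordinary least-squares fit on the logarithm of games played.

```python
import math

def fit_bpl_index(picks, games):
    """Least-squares fit of log(games) = a * pick + b; returns (a, b).

    The intercept b is an assumption of this sketch; the text writes e^(a x).
    """
    n = len(picks)
    logs = [math.log(g) for g in games]  # requires games >= 1 for every player
    mean_x = sum(picks) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in picks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(picks, logs))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def bpl_index(pick, a, b):
    """Expected number of games played for a given draft pick number."""
    return math.exp(a * pick + b)

def bpl_level(games_played, pick, a, b):
    """Formula (2.1): relative deviation from the expected number of games."""
    expected = bpl_index(pick, a, b)
    return (games_played - expected) / expected

# Hypothetical (pick number, career games) pairs, not taken from the data set.
picks = [1, 10, 20, 30, 40, 50]
games = [900, 600, 400, 250, 150, 100]
a, b = fit_bpl_index(picks, games)
print(f"a = {a:.4f}, b = {b:.4f}")
print(f"BPL-level of a 20th pick with 600 games: {bpl_level(600, 20, a, b):.2f}")
```

Because the delimitations require at least one NBA game per player, the logarithm is always defined for the players in scope.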

Figure 2.1 shows a regression where the BPL-index is shown in blue.

[Figure: scatter plot titled “BPL-index for NBA-players drafted 1990-1999”; x-axis:
Pick Number (0 to 60); y-axis: Games Played (0 to 1400); legend: BPL-index, Player]

Figure 2.1: BPL-index in blue, calculated from the NBA players drafted in the
years 1990-1999
3. Method

The choice of method that will address the research question (see 1.5) will be dis-
cussed in this chapter, including a brief motivation and some limitations of the
method. The method used to collect data is data scraping and the method used to
analyse the collected data is regression based on Gaussian Processes (GP).

3.1    Data Scraping Method

Given all the information available online and the computing power that can be
bought at affordable prices, a natural approach is to try to find the relevant
information through data scraping. Data scraping is the practice of examining large
pre-existing databases in order to generate new information [26].

Selected variables are collected from online content through data scraping. Many
different types of programs fall under the wide definition of data scrapers, so we
will not describe data scrapers in general but only our particular data scraper, on a
conceptual level. The popular general-purpose scripting language PHP has some
built-in functions that make it very easy to program data scrapers [24]. The data
scraper is first set to visit a certain page and scan through all the HTML on that
page; the HTML is parsed as an XML document. From this XML document we can identify
the interesting variables and save them into our dataset, and we can also follow
links and repeat the same procedure on the linked page. Once all the data has been
collected, we present it in a MATLAB-compatible format so that it can be analysed in
MATLAB.
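
The scraping procedure can be sketched conceptually as follows. The scraper in this work was written in PHP; the sketch below uses Python's standard `html.parser` module instead, and the class name `PlayerPageParser`, the sample page and its field names are all hypothetical. It parses an HTML string that is embedded in the code rather than fetched from a real site, collects the table cells as variables and records the links that a scraper would visit next.

```python
from html.parser import HTMLParser

class PlayerPageParser(HTMLParser):
    """Collects the text of table cells (td) and the targets of links (a href)."""

    def __init__(self):
        super().__init__()
        self.cells = []      # text content of every <td>, in document order
        self.links = []      # href of every <a>, candidates for the next visit
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data.strip()

# Hypothetical page content; a real scraper would download this instead.
SAMPLE_PAGE = """
<html><body>
<table>
<tr><td>Hometown</td><td>Example City</td></tr>
<tr><td>Games</td><td>512</td></tr>
</table>
<a href="/players/next.html">Next player</a>
</body></html>
"""

parser = PlayerPageParser()
parser.feed(SAMPLE_PAGE)
# Pair every label cell with the value cell that follows it.
record = dict(zip(parser.cells[0::2], parser.cells[1::2]))
print(record)
print(parser.links)
```

Repeating the same parsing step on each collected link corresponds to the "follow links and repeat" step described above.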

3.2    Regression Method Requirements

In order to answer the research question 1.5 we will predict the outcome of a player’s
rate of success given some traits of his hometown. In order to accomplish this we
want to use a method that:


i) Gives a quantitative result

ii) Is robust

iii) Has a straight forward work flow

Criteria i) and ii) are close to self-explanatory, and we will not discuss them
further. The third criterion, however, rules out some algorithms, as we will see
when we discuss Neural Networks in section 3.3. Gaussian Processes meet these
criteria and are therefore the method of choice; they are discussed further in
sections 3.4 to 3.6.

3.3     Possible alternative Approach: Neural Networks

In this section we give a brief introduction to the field of Neural Networks,
discuss its properties and explain why we chose GP instead of Neural Networks.

In order to describe Neural Networks (NN), we begin by describing the smallest
part of the NN, namely the neuron. The neuron can be described by the function

\[ y_j = \sum_{i=1}^{n} w_i x_i + w_0 \tag{3.1} \]

where $w_i$ is the weight for the input $x_i$ and $w_0$ is a bias term. If we put
together several neurons we get what is called a Neural Network, which may look like
figure 3.1. By convention, the bias term $w_0$ is omitted in the illustration. The
leftmost column of circles in figure 3.1 is the input data, denoted by $x_i$. The
middle column is a hidden layer of neurons; the value generated by a neuron in the
hidden layer is denoted by $y_j$. Observe that the bias term in equation 3.1 is not
explicitly shown. The last column is the output data, denoted by $z_k$; the bias term
is not explicitly shown for that column either. If we now apply the equation of a
single neuron to a NN, we get the equation

\[ z_k = w'_{kj} y_j + c'_k \;\rightarrow\; z_k = w'_{kj} (w_{ij} x_i + c_j) + c'_k \tag{3.2} \]

where summation over repeated indices is implied. For a set of training data
$[x_i, o_k]$, we can train the weights $w_{ij}$ and $w'_{kj}$ as well as the bias terms
$c_j$ and $c'_k$ so that the output $z_k$ of the NN corresponds to the targets $o_k$.
If this training is done correctly, we will be able to predict the outcome $z_k^*$
from some input data $x_i^*$.
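
As a small worked example of equations 3.1 and 3.2, the following Python sketch evaluates a network with two inputs, three hidden neurons and one output. The function names and all weights are illustrative assumptions, and no training is performed. As in the equations above, no non-linear activation function is applied to the neurons; practical networks usually add one.

```python
def layer(inputs, weights, biases):
    """Apply eq. (3.1) to every neuron: out[j] = sum_i weights[j][i] * inputs[i] + biases[j]."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def forward(x, w_hidden, c_hidden, w_out, c_out):
    """Eq. (3.2): feed the hidden-layer values y_j into the output layer."""
    y = layer(x, w_hidden, c_hidden)
    return layer(y, w_out, c_out)

# Made-up weights: one row of weights per neuron.
x = [1.0, 2.0]
w_hidden = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]
c_hidden = [0.0, 1.0, -1.0]
w_out = [[1.0, 1.0, 1.0]]
c_out = [0.5]

y = layer(x, w_hidden, c_hidden)   # hidden values: [-1.5, 3.0, 2.0]
z = forward(x, w_hidden, c_hidden, w_out, c_out)
print(y, z)                        # z = [-1.5 + 3.0 + 2.0 + 0.5] = [4.0]
```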

Some very interesting applications have been built using Neural Networks, such as
identifying handwritten characters [25]. But since no rigorous theory has been
developed for how to design a Neural Network, constructing one relies heavily on
trial and error as well as black box programming [31]. Hence criterion iii), a
straightforward workflow, is not fulfilled by Neural Networks.

               Figure 3.1: Schematic illustration of a neural network

3.4    Non Technical Description of Gaussian Processes

Machine learning is a type of artificial intelligence where the computer program is
able to predict a certain outcome based on examples given to the program. Gaussian
Processes are a machine learning method in which the prediction is based on normal
distributions, as illustrated in figure 3.3. In Gaussian Processes we say that the
likelihood of a certain outcome is normally distributed, and based on the training
set (i.e. the examples given to the program) we are able to approximate the standard
deviation and from there describe a function that best fits the training set.

A simple explanation is that we start with an arbitrarily chosen value at $x_0$ and
then draw the value at the next point $x_1$ at random from a normal distribution
based on the previous point $x_0$, and so on for $x_2, \ldots, x_n$. The more values
generated, the less impact $x_0$ has on the outcome. The differently colored
functions in figure 3.2 are examples of random functions generated by this method.


                   Figure 3.2: Some random functions

Figure 3.3: An intuitive illustration of the concepts of gaussian processes


3.5     Technical Description of Gaussian Processes

3.5.1   Important Definitions

In this section we introduce some important definitions and basic explanations
that are necessary for the further discussion of Gaussian Processes. Assume that we
are about to perform a measurement of some data points
$[x_1, y_1], [x_2, y_2], \ldots, [x_n, y_n]$. We believe that $y_i$ can be described by

\[ y_i = [w_j \phi(x_j)]_i \tag{3.3} \]

where $\phi(x)$ is fixed, but typically non-linear, and is called the basis function.
$w$ has a prior normal distribution

\[ w_i \in N(\mu_i, \alpha^{-1} I) \tag{3.4} \]

Assuming that the mean of $w_i$ is zero is no restriction, since we can add a bias
contribution to $\phi_i(x_j)$. Now we are able to calculate the expected value and
the covariance of $y_i$:

\[ E[y_i] = E[w_i \phi(x_i)] = \phi(x_j) E[w_j] = \mu_i \tag{3.5} \]

\[ \mathrm{Cov}[y_i, y_j] = E[y_i y_j] = \phi(x_i) E[w_i w_j] \phi(x_j) = \alpha^{-1} \phi(x_j) \phi(x_i) \tag{3.6} \]

We define

\[ k(x_i, x_j) = \alpha^{-1} \phi(x_j) \phi(x_i) \tag{3.7} \]

which we call the kernel function. Now we have everything we need in order to
define a specific Gaussian process:

\[ y_i \in GP(\mu_i, k(x_i, x_j)) \tag{3.8} \]

The kernel we will use is a variant of the squared exponential, which is a commonly
used kernel:

\[ k(x_i, x_j) = \theta_2 \exp\left( -\frac{1}{\theta_1} |x_i - x_j|^2 \right) \tag{3.9} \]

where $\theta_1$ and $\theta_2$ are hyperparameters. Some functions generated using
this kernel can be seen in figure 3.2.

In any practical use, though, the kernel $k(x_i, x_j)$ will only be evaluated at a
finite number of points, and the kernel may then be represented as a matrix, which
we denote

\[ C_{ij} = k(x_i, x_j) \tag{3.10} \]
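
To make equations 3.9 and 3.10 concrete, the following sketch builds the kernel matrix on a grid of points and draws a few random functions from the zero-mean Gaussian process, similar in spirit to the functions in figure 3.2. It is a minimal illustration in Python with NumPy, not the code used in this work: the hyperparameter values, the grid and the small diagonal "jitter" term added for numerical stability are all assumptions.

```python
import numpy as np

def squared_exponential(xa, xb, theta1=1.0, theta2=1.0):
    """Eq. (3.9) for 1-D inputs: theta2 * exp(-|xa_i - xb_j|^2 / theta1)."""
    diff = xa[:, None] - xb[None, :]          # all pairwise differences
    return theta2 * np.exp(-(diff ** 2) / theta1)

x = np.linspace(0.0, 5.0, 50)                 # finite set of evaluation points
C = squared_exponential(x, x)                 # eq. (3.10): C_ij = k(x_i, x_j)

# Assumption: a tiny diagonal term keeps C numerically positive definite.
jitter = 1e-8 * np.eye(len(x))

rng = np.random.default_rng(0)
# Each row is one random function evaluated at the 50 grid points.
samples = rng.multivariate_normal(np.zeros(len(x)), C + jitter, size=3)
print(C.shape, samples.shape)
```

A smaller value of `theta1` makes the correlation between neighbouring points decay faster, which gives more rapidly varying sample functions.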


3.5.2    What This Means

In this section we will discuss what Gaussian processes do in practice, with focus
on an intuitive understanding.

Assume that we are about to measure some data points $([x_1, y_1], \ldots, [x_n, y_n])$.
We want to find some function $y_i = f(x_i)$, but we have no idea what this relation
may look like. A naïve approach to finding the best function $f$ is to try every
available function. It turns out that this approach is not as crazy as it may seem,
because we can make use of the tools provided by Gaussian processes.

We can make a prior guess of the first value $y_1$. Since we do not know anything
about $y_1$ beforehand, we might as well guess that it is zero; see point $x_0$ in
figure 3.3. This is no limitation since we can choose our zero level arbitrarily.
The approach proposed in the previous paragraph is not applied without conditions:
we require that similar points $x_k$ and $x_l$ (i.e. $k \approx l$) generate similar
values $y_k$ and $y_l$. In our case this means that the value at $x_1$ in figure 3.3
is more similar to the value at $x_0$ than the values at $x_2$ and $x_3$ are. We can
observe this decrease in dependency in the increasing variance of the shaded areas
in figure 3.3.
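
The growing uncertainty away from a known point can be checked numerically. The sketch below is an illustration only: it uses the kernel of equation 3.9 with assumed hyperparameters $\theta_1 = \theta_2 = 1$ and the standard formula for conditioning a Gaussian on one noise-free observation at $x_0$, namely $\mathrm{Var}[y(x)] = k(x, x) - k(x, x_0)^2 / k(x_0, x_0)$. This formula is a general property of jointly normal variables and is not derived in the text above.

```python
import math

def k(xa, xb, theta1=1.0, theta2=1.0):
    """Squared exponential kernel of eq. (3.9) for scalar inputs."""
    return theta2 * math.exp(-((xa - xb) ** 2) / theta1)

def conditional_variance(x, x0):
    """Variance of y(x) once the value at x0 is known exactly."""
    return k(x, x) - k(x, x0) ** 2 / k(x0, x0)

x0 = 0.0
points = (0.0, 0.5, 1.0, 2.0, 3.0)
variances = [conditional_variance(x, x0) for x in points]
for x, v in zip(points, variances):
    print(f"x = {x:.1f}  variance = {v:.4f}")
```

The variance is zero at the observed point and approaches the prior variance $\theta_2$ as the distance grows, which corresponds to the widening shaded areas described above.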

3.5.3    Worth Mentioning About the Multivariable Case

The theory of Gaussian Processes is not limited to one-dimensional features, but
allows in theory any number of feature dimensions. The theory in section 3.5.1 is
described for the single-variable case, but only very small changes are required in
order to describe the multivariable case. The computational complexity is $O(n^3)$
in the number of data points $n$, since we calculate the inverse of an $n \times n$
matrix, and hence the computational load sets the limit for how many data points and
features we can add. Another issue that arises when the dimensionality increases is
that Euclidean norms may become unsatisfactory in high-dimensional vector spaces.

One kernel that is commonly used when fitting multivariate data, and incorporates
the use of Automatic Relevance Determination (ARD) is the kernel

    k(x_i, x_j) = \theta_0 \exp\left( -\frac{1}{2} \sum_{n=1}^{D} \theta_n (x_{in} - x_{jn})^2 \right)        (3.11)

where D denotes the number of dimensions of the input data and the hyperparameters
θn are the weights determining the relevance of each dimension. If the outcome
depends strongly on a certain feature, that feature’s θ-value will be large. The
hyperparameters are determined by a maximum likelihood algorithm.
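The effect of the ARD weights can be seen in a short sketch of equation 3.11. The function name ard_kernel and all numerical values are assumptions made for illustration; they are not fitted values from this work.

```python
import math

def ard_kernel(xi, xj, theta0, theta):
    """theta0 * exp(-0.5 * sum_n theta[n] * (xi[n] - xj[n])**2), as in eq. 3.11."""
    weighted_sq_dist = sum(t * (a - b) ** 2 for t, a, b in zip(theta, xi, xj))
    return theta0 * math.exp(-0.5 * weighted_sq_dist)

xi = [0.0, 0.0]
xj_dim0 = [1.0, 0.0]   # differs from xi only in dimension 0
xj_dim1 = [0.0, 1.0]   # differs from xi only in dimension 1
theta = [4.0, 0.01]    # hypothetical weights: dimension 0 relevant, dimension 1 nearly ignored

k_dim0 = ard_kernel(xi, xj_dim0, 1.0, theta)   # equals exp(-2)
k_dim1 = ard_kernel(xi, xj_dim1, 1.0, theta)   # equals exp(-0.005)
```

A unit step along the heavily weighted dimension reduces the covariance much more than the same step along the lightly weighted dimension, so a small θn effectively switches a feature off.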


3.6     How to use Gaussian Processes as Regression

In our case, and in most other cases where Gaussian Processes are applicable, we are
not interested in generating a random function from scratch, but instead in doing a
regression on some measured data. We want to find the most likely distribution of
weights wj given some data ti . We assume that ti is

    t_i = y_i + \epsilon_i        (3.12)

where εi is a noise term which we assume to be normally distributed, εi ∼ N (0, β^{-1}).
We may express the conditional probability as p(wj |ti ). This can be rewritten, using
Bayes’ rule, as

    p(w_j | t_i, \theta_k) = \frac{p(t_i | w_j) \, p(w_j | \theta_k)}{p(t_i | \theta_k)}        (3.13)
where θk are some hyperparameters such as the characteristic length or internal weights
between the different dimensions of the input data. We see that the denominator is
independent of wj ; it is hence possible to omit the denominator and instead express
the probability as

    p(w_j | t_i, \theta_k) \propto p(t_i | w_j) \, p(w_j | \theta_k)        (3.14)

and reintroduce the normalization constant when convenient, typically when all
calculations are made. The first probability factor is normally distributed since εi
is normally distributed. The second probability factor is normally distributed as well,
as stated in equation 3.4. Hence the probability distribution in equation 3.14 is
normally distributed too.

Now, by using an algorithm to find the set of wj that is most likely, we can calculate
its mean function \mu_i^{(\text{posterior})} and covariance function C_{ij}^{(\text{posterior})} and get a
posterior prediction at some points xj .
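The posterior mean and covariance can be sketched with the standard closed-form expressions for GP regression with Gaussian noise, as found in e.g. Bishop [5]. This is an illustration under assumptions and not necessarily the algorithm used in this work: the kernel is the one in equation 3.9, the noise precision is β as in equation 3.12, and the function names, data and hyperparameter values are hypothetical.

```python
import numpy as np

def se_kernel(xa, xb, theta1=1.0, theta2=1.0):
    """Kernel matrix between 1-D point sets xa and xb (equation 3.9)."""
    d2 = (np.asarray(xa, float)[:, None] - np.asarray(xb, float)[None, :]) ** 2
    return theta2 * np.exp(-d2 / theta1)

def gp_posterior(x_train, t_train, x_test, beta=100.0, theta1=1.0, theta2=1.0):
    """Posterior mean and covariance of y at x_test given noisy targets t_train."""
    # Covariance of the noisy targets: kernel matrix plus noise variance 1/beta.
    C = se_kernel(x_train, x_train, theta1, theta2) + np.eye(len(x_train)) / beta
    K_s = se_kernel(x_train, x_test, theta1, theta2)
    K_ss = se_kernel(x_test, x_test, theta1, theta2)
    mean = K_s.T @ np.linalg.solve(C, np.asarray(t_train, float))
    cov = K_ss - K_s.T @ np.linalg.solve(C, K_s)
    return mean, cov

# Hypothetical measurements, for illustration only.
x_train = np.array([-1.0, 0.0, 1.0])
t_train = np.array([0.5, 0.0, -0.5])
x_test = np.array([0.0, 5.0])
mean, cov = gp_posterior(x_train, t_train, x_test)
```

At the test point far away from the data (x = 5) the posterior falls back to the prior, with mean close to zero and variance close to θ2, while the variance at x = 0 is small because a measurement was made there.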

For further reading about GP we recommend the books by Rasmussen and Williams
[9] and by Bishop [5], and visiting the homepage gaussianprocesses.org.

4. Results

4.1         Games Played versus Single Hometown Parameter

Results for all twelve hometown parameters are presented in appendix A. We use
figure 4.1 to explain what the graphs show.

[Figure 4.1: Success vs High School Graduates. Plot title: Highschool Graduate vs
Success. x-axis: quantitative representation of Highschool Graduate. y-axis: BPL-level.
Legend: 90 %, 80 %, 70 %, 60 % and 51 % confidence intervals; player; mean games
played per draft position; US median (vertical); NBA-players median (vertical).]

Starting with the information on the two axes, the x-axis shows the percentage of
high school graduates in the respective player’s hometown in normalized form. The
normalization is

    \frac{\text{High school graduates in hometown (\%)} - \text{Mean high school graduates in the US (\%)}}{\text{Mean high school graduates in the US (\%)}}

The y-axis shows the BPL level with 0 as BPL index (see section 2.1).
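The normalization of the x-axis described above can be written as a one-line function. The function name and the percentages in the example are hypothetical and only illustrate the arithmetic.

```python
def relative_deviation(hometown_pct, us_mean_pct):
    """Deviation of a hometown value from the US mean, relative to the US mean."""
    return (hometown_pct - us_mean_pct) / us_mean_pct

# Hypothetical numbers, for illustration only.
x_above = relative_deviation(hometown_pct=90.0, us_mean_pct=80.0)   # 0.125
x_equal = relative_deviation(hometown_pct=80.0, us_mean_pct=80.0)   # 0.0
```

A hometown at exactly the US mean maps to zero, and the sign of the result tells whether the hometown lies above or below the US mean.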

Each red dot represents a player. A dot with a positive BPL level means that the
player has played more games than the mean number of games played for that
player’s draft position.


The vertical red line represents the mean percentage of high school graduates among
NBA players’ hometowns. A red dot positioned with an x-value greater than that of
the vertical red line thus represents a player from a town with a higher percentage
of high school graduates than the arithmetic mean amongst NBA players.

The vertical black line represents the mean percentage of high school graduates in
the whole US. A red dot positioned with an x-value greater than that of the vertical
black line thus represents a player from a town with a higher percentage of high
school graduates than the mean in the US. A black vertical line positioned to the left
of the red vertical line means that the mean percentage of high school graduates in
the US is lower than the mean percentage of high school graduates in the NBA
players’ hometowns.

The different confidence intervals are marked with different colors. The confidence
intervals represent the probability that a function from the distribution will lie
within the coloured area.
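As an illustration of how such bands can be obtained, the sketch below computes central intervals of a Gaussian for the levels shown in the legend. It assumes that the bands are pointwise central intervals of a Gaussian predictive distribution, which the text does not state explicitly; the function name and the standardized mean and standard deviation are hypothetical.

```python
from statistics import NormalDist

def central_interval(mean, std, level):
    """Bounds containing `level` of the probability mass of N(mean, std**2)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * std, mean + z * std

levels = [0.51, 0.60, 0.70, 0.80, 0.90]
# Standardized case (mean 0, std 1), chosen only for illustration.
bands = {p: central_interval(0.0, 1.0, p) for p in levels}
```

The bands are nested, so a higher confidence level always gives a wider coloured area around the mean.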

5. Analysis

The results for the single socioeconomic variables can be analysed by inspection
alone, with very little background knowledge. The multivariable analysis demands
a deeper understanding of GP.

5.1    Single Parameter Analysis

If we study each of the confidence interval levels, we see that the BPL-index is
enclosed by almost every confidence interval level plotted in the graph. This means
that there is no significant correlation between the socioeconomic parameters and
the BPL-level.

5.2    Multivariable Analysis

Since we were not able to see any correlation between the rate of success and the
features independently, we seek a joint distribution such that the success is a function
of the background variables x1 , ..., x12 described in section 2.1.

                            success = success(x1 , ..., x12 )                    (5.1)

In this multivariate case we will use the kernel in equation 3.11 in order to take
advantage of the benefits of ARD.

Since it is difficult for us to determine the values of the hyperparameters a priori,
we will vary the hyperparameters and find the combination that minimizes the
error when calculating the GP.
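One common choice of error for this kind of search is the negative log marginal likelihood of the data. The sketch below evaluates it on a small grid of values for a single hyperparameter. It is an illustration under assumptions, not the procedure used in this work: it uses the one-dimensional kernel of equation 3.9 instead of the ARD kernel, synthetic data, and a plain grid search; all names and values are hypothetical.

```python
import numpy as np

def neg_log_marginal_likelihood(x, t, theta1, theta2=1.0, beta=100.0):
    """-log p(t | theta) for a zero-mean GP with the kernel of equation 3.9."""
    x, t = np.asarray(x, float), np.asarray(t, float)
    d2 = (x[:, None] - x[None, :]) ** 2
    C = theta2 * np.exp(-d2 / theta1) + np.eye(len(x)) / beta
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(C, t)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (t @ alpha + log_det + len(x) * np.log(2.0 * np.pi))

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
t = np.sin(x) + 0.1 * rng.standard_normal(len(x))

# Evaluate the error for each candidate value and keep the smallest one.
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
errors = [neg_log_marginal_likelihood(x, t, th) for th in grid]
best_theta1 = grid[int(np.argmin(errors))]
```

In practice a numerical optimizer would replace the grid, and with the ARD kernel the search runs over all θn at once.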

The error of the GP can be seen in figure 5.1. We can see that the error converges,
and hence that the hyperparameters have reached an optimum (at least a local
one). The weights of the different features can be seen in figure 5.2. The weights do
not differ by as much as one order of magnitude, and hence we cannot neglect any
of them at this stage.

Figure 5.1: The error plotted versus the number of evaluations, when minimizing
over the hyperparameters

               0.8872     0.8206   1.1392    1.3566   1.1973   1.2648
               1.0076     1.4203   1.3271    1.3759   1.0337   1.2628
                        Figure 5.2: Weights of different features
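The statement about the order of magnitude can be checked directly on the weights in figure 5.2. The variable names are illustrative; the values are those listed in the figure.

```python
# Weights of the twelve features as listed in figure 5.2.
weights = [0.8872, 0.8206, 1.1392, 1.3566, 1.1973, 1.2648,
           1.0076, 1.4203, 1.3271, 1.3759, 1.0337, 1.2628]

# Ratio between the largest and the smallest weight.
ratio = max(weights) / min(weights)
within_one_order = ratio < 10.0
```

The largest weight is less than twice the smallest one, so no feature stands out as irrelevant.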

The mean and standard deviation cannot easily be displayed in this high-dimensional
case in the same way as in the single-variable case. In order to view the result
we will instead sample from this GP.
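The idea of comparing sampled values with expected values can be sketched as follows. The mean vector and covariance matrix are hypothetical stand-ins for the predictive distribution of the GP at three inputs, and the number of samples is chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictive distribution at three inputs (values are illustrative).
expected_mean = np.array([0.0, 0.1, -0.2])
expected_cov = np.array([[1.0, 0.5, 0.2],
                         [0.5, 1.0, 0.5],
                         [0.2, 0.5, 1.0]])

# Draw many samples and compare their statistics with the expected ones.
samples = rng.multivariate_normal(expected_mean, expected_cov, size=20000)
sampled_mean = samples.mean(axis=0)
sampled_std = samples.std(axis=0, ddof=1)
expected_std = np.sqrt(np.diag(expected_cov))
```

Plotting sampled_mean against expected_mean, and sampled_std against expected_std, gives points close to the line f(x) = x when the sampling agrees with the expected values.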

We can see the result of the sampling in figure 5.3. The data is the same for
the two left plots and for the two right plots; the difference between them is the
scale of the axes. On the x-axis we see the expected mean in subfigures 1 and 3
and the expected standard deviation (std) in subfigures 2 and 4. The corresponding
sampled values are plotted on the y-axis. Each black dot corresponds to a series of
samples for identical in-data; its x-position shows the expected value for that in-data
and its y-position shows the mean of the samples for that in-data.

Figure 5.3: The mean and std, both expected and sampled, of the multivariable
GP. We will enumerate the subfigures with 1,2,3,4 where 1 is the upper left, 2 is
the upper right, 3 is the lower left and 4 the lower right.

The red lines are the function f (x) = 1 · x. In subfigures 1 and 2 we can see two
points located on those lines. Those points are in fact several points located very
close to each other, as we can see in subfigures 3 and 4 where the scale is smaller.

We see that the expected mean value does not depend on the in-data, since the
function values are independent of the in-data. The sampled means and stds lie very
close to the expected values for the same reason.

In conclusion we can with great confidence say that there is no correlation between
the different hometown traits and the chances of beating the BPL-index.

6. Conclusions

As described in the analysis, we can with great confidence state that there is no
correlation between hometown traits and beating the BPL-index. The answer to
our research question in section 1.5 is that we cannot improve draft picks with this
method and these parameters.

This result is not surprising since we know from previous research that the difference
in resources has taken its toll before the NBA draft. Despite the lack of spectacular
results, these types of studies are important in order to understand how individuals
are affected by the society in which they grow up.

6.1    Recommendations

An interesting next step in this interdisciplinary study, with one leg in machine
learning and the other in sociology, would be to use similar machine learning
methods on background parameters in industries other than sports. Since data
scraping is a very powerful tool for fast data collection and Gaussian Processes are
well suited for pattern recognition, only an analysis of which variables to collect is
needed in order to apply this method to other industries. It would also be
interesting to use the same regression on the players’ physical traits.

6.2    General Discussion

The general discussion and criticism can be divided into four parts: the sociology,
the data, the chosen method, and the heavy reliance on games played as the
measurement of success.


6.2.1   Sociology

Beginning with the sociology criticism, we want to be transparent about our lack of
background and previous knowledge in the field of sociology. We have conducted
a brief interview with a Prof. Emer. at the Department of Sociology at Uppsala
University just to make sure that we are not way off in our interpretations on the
sociology part. The main criticism is that we have not taken the individual player’s
family background into account, but rather the background of the area in which
they grew up.

6.2.2   Data

We have used data from the towns in which the players were born. Some of the
players have most certainly moved to a different town while they were young, which
would have been a more accurate town to define as their hometown. We have also
used current hometown data. Since the players were drafted between 1990 and
1999, they grew up roughly between 1975 and 1995, which in turn means that our
analysis is based on data that is up to almost 40 years off.

6.2.3   Method

The kernel we have used is the squared exponential (SE), or one closely related to
SE

    k(x_i, x_j) = \theta_0 \cdot \exp\left( -\theta_n (x_i - x_j)^2 \right)        (6.1)

This kernel is commonly used and generates a smooth function, as we could see in
figure 3.2. Even though these characteristics make this kernel a good choice, we
cannot guarantee that our results hold for every possible kernel.

We also do not know whether we preserved the characteristics of the data when we
did the pre-processing. Pre-processing is often necessary if the data is too irregular.
For example, if we examine figure A.2, almost all points are located to the very left.
This is because well populated cities like New York or Los Angeles are few but
large in comparison to most other US cities. Translation and scaling preserve many
characteristics of the data and are hence very convenient ways to pre-process the
data. However, this is also the only way we have pre-processed our data.
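One characteristic that translation and scaling preserve is the linear correlation between a feature and the outcome, as long as the scale factor is positive. The sketch below illustrates this with hypothetical numbers; the function names, city sizes and success values are assumptions and are not taken from our data.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def translate_and_scale(values, shift, scale):
    """Subtract a shift and divide by a positive scale factor."""
    return [(v - shift) / scale for v in values]

population = [8.4e6, 3.9e6, 2.7e5, 5.0e4, 1.2e4]   # hypothetical city sizes
success = [0.3, -0.1, 0.8, 0.0, -0.4]              # hypothetical BPL-levels
scaled = translate_and_scale(population, shift=1.0e5, scale=1.0e5)

r_raw = pearson(population, success)
r_scaled = pearson(scaled, success)
```

The two correlation coefficients agree up to rounding, and the ordering of the cities is unchanged. The skewness of the population data is also unchanged, which is why the points in figure A.2 remain concentrated to the left.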


6.2.4   Success measure

Having games played as the only measurement of success is something that we have
received quite a few questions about during our research. In our opinion it is the
“least bad” choice. The most important questions we had to answer for ourselves were

i) Why do we not use salaries as a measurement of success?

ii) Why do we not use more stats, such as points made, +/- statistics etc.?

iii) Some players are really great but are injured for long periods of their careers;
      this will not show in our data.

These are all relevant questions, but they are more problematic than they first
appear to be.

Our response to i) is that the players’ salaries are first of all limited by the “Collective
Bargaining Agreement Between the National Basketball Association (NBA) and the
National Basketball Players Association (NBPA)” [20]. This agreement includes
salary caps, revenue distribution and player contracts, among other things, which
leads to a situation where salaries are not a good measurement of success level.

In answer to ii), we do not use other statistics because they tend to favour a certain
type of player and could be misleading. One example is that Steve Nash was elected
most valuable player (MVP) for the 05-06 season when he scored an average of 18.8
points per game (ppg) [4], despite the fact that Kobe Bryant scored an average
of 35.0 ppg [4]. Both Nash and Bryant are held to be great players and were both
top choices for the MVP award 2006 [22], even though Bryant averaged 86% higher
ppg than Nash. What they also had in common was that they played lots of games:
Nash played 78 games [4] and Bryant played 80 games [4] that season. To continue
on ii), we do not use a combination of statistics since it would be too much of a project
to determine what the most valid combination would look like, but we do encourage
that research since it could be of great help in the future.
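The figure of 86% follows directly from the two scoring averages quoted above. The variable names in this small check are illustrative.

```python
# Points per game for the 05-06 season as quoted in the text.
nash_ppg, bryant_ppg = 18.8, 35.0

# Relative increase of Bryant's average over Nash's average.
relative_increase = (bryant_ppg - nash_ppg) / nash_ppg
percent = round(100 * relative_increase)   # 86
```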

As for iii), we try to measure how a certain background can reflect a player’s chances
of succeeding, and we have in this research presumed that, for an NBA player, there
is no correlation between the chances of becoming injured and hometown background.

Bibliography

[1]   Orley Ashenfelter. Predicting the quality and prices of bordeaux wine*. The
      Economic Journal, 118(529):F174–F184, 2008.

[2]   W F Gabrielli S A Mednick B McGarvey, P M Bentler. Rearing social class,
      education, and criminality - a multiple indicator model. Journal of Abnormal
      Psychology, 90:354–365, 1981.

[3]   Kurt Badenhausen. As stern says goodbye, knicks, lakers set records as nba’s
      most valuable teams.
      http://www.forbes.com/sites/kurtbadenhausen/2014/01/22/as-stern-says-goodbye-knicks-lakers-set-records-as-nbas-most-valuable-teams/.
      [Online; January 2014].

[4]   Basketball-Reference.com. Nba and aba basketball statistics and history.
      http://www.basketball-references.com. [Online; february 2014].

[5]   Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer,
      2006.

[6]   Pierre Bourdieu. Le Sens commun. Les Editions de Minuit, 1980.

[7]   Pierre Bourdieu. Les Héritiers : Les étudiants et la culture. Les Editions de
      Minuit, 1984.

[8]   Donald Broady. Sociologi och epistemologi. Pierre Bourdieus forfattarskap och
      den historiska epistemologin. HLS Forlag, 1991.

[9]   Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for
      Machine Learning. The MIT Press, 2006.

[10] Joshua Kjerulf Dubrow and jimi adams. Hoop inequalities: Race, class and
     family structure background and the odds of playing in the national basketball
     association. International Review for the Sociology of Sport, 2010.

[11] Jac Fitz-enz. ROI of Human Capital. AMACOM, American Management
     Association, 2000.


[12] gaussianprocesses.org. The gaussian processes web site.
     http://www.gaussianprocess.org/. [Online; may 2014].

[13] Robert M. Hauser. Measuring socioeconomic status in studies of child devel-
     opment. Child Development, 65:1541–1545, 1994.

[14] Deborah G Hoare. Predicting success in junior elite basketball players — the
     contribution of anthropometic and physiological attributes. Journal of science
     and medicine in sport, 3:391–405, 2000.

[15] Daniel Kahneman. Thinking Fast and Slow. Farrar, Straus and Giroux, 2011.

[16] G A Kaplan and J E Keil. Socioeconomic factors and cardiovascular disease:
     a review of the literature. Circulation, 88(4):1973–98, 1993.

[17] Wojciech Liponski. World Sports Encyclopedia. Quarto Publishing Group USA,
     2003.

[18] Jane D. McLeod and Ronald C. Kessler. Socioeconomic status differences in
     vulnerability to undesirable life events. Journal of Health and Social Behavior,
     31:162–172, 1990.

[19] Richard L. Moreland, Linda Argote, and Ranjani Krishnan. Training people
     to work in groups. In R. Scott Tindale, Linda Heath, John Edwards, Emil J.
     Posavac, Fred B. Bryant, Yolanda Suarez-Balcazar, Eaaron Henderson-King,
     and Judith Myers, editors, Theory and Research on Small Groups, volume 4 of
     Social Psychological Applications to Social Issues, pages 37–60. Springer US,
     2002.

[20] NBA.com. Highlights of the 2011 collective bargaining agreement between
     the national basketball association (nba) and the national basketball players
     association (nbpa). http://www.nba.com/media/CBA101_9.12.pdf. [Online;
     march 2014].

[21] Gianmarco I.P. Ottaviano and Giovanni Peri. Cities and cultures. Journal of
     Urban Economics, 58:304–337, 2005.

[22] Kevin Pelton. Numbers don’t lie.
     http://sportsillustrated.cnn.com/2006/writers/82games/04/13/mvp/2.html.
     [Online; April 2014].

[23] Matthew Martell Peter Tingling, Kamal Masri. Does order matter? an em-
     pirical analysis of nhl draft decisions. Sport, Business and Management: An
     International Journal, 1:155–171, 2011.

[24] php.net. Php: Documentation. http://php.net/docs.php. [Online; March
     2014].


[25] Sargur N. Srihari and Edward J. Kuebert. Integration of hand-written address
     interpretation technology into the united states postal service remote computer
     reader system. In Proceedings of the 4th International Conference on Document
     Analysis and Recognition, ICDAR ’97, pages 892–896, Washington, DC, USA,
     1997. IEEE Computer Society.

[26] Angus Stevenson. Oxford Dictionary of English. Oxford University Press, 2010.

[27] Robert I Sutton. The No Asshole Rule: Building a Civilized Workplace and
     Surviving One That Isn’t. Business Plus, 2006.

[28] K. L. Sorensen K Feldman T. W. H. Ng, L. T. Eby. Predictors of objective and
     subjective career success: A meta-analysis. Journal of Personnel Psychology,
     58:367–408, 2005.

[29] Wikipedia. Lega basket serie a.
     http://en.wikipedia.org/wiki/Lega_Basket_Serie_A. [Online; may 2014].

[30] Wikipedia. List of national basketball association arenas.
     http://en.wikipedia.org/wiki/List_of_National_Basketball_Association_arenas.
     [Online; may 2014].

[31] wikipedia.com. Artificial neural network.
     http://en.wikipedia.org/wiki/Artificial_neural_network#Criticism.
     [Online; April 2014].

[32] M A Winkleby, D E Jatulis, E Frank, and S P Fortmann. Socioeconomic status
     and health: how education, income, and occupation contribute to risk factors
     for cardiovascular disease. American Journal of Public Health, 82:816–820,
     1992.

A. Graphs

This appendix contains graphs representing each single hometown parameter versus
success, where both axes are dimensionless and normalized.

[Figure A.1: Homes Owned. Plot title: Highschool Graduate vs Success. x-axis:
quantitative representation of Highschool Graduate. y-axis: BPL-level. Legend:
90 %, 80 %, 70 %, 60 % and 51 % confidence intervals; player; mean games played
per draft position; US median (vertical); NBA-players median (vertical).]

The legend in figure A.1 also applies to the following figures. A more detailed
description of the figures is given in Chapter 4.

[Figure A.2: Population. Plot title: Population vs Success.]

[Figure A.3: Income per Capita. Plot title: Income per Capita vs Success. x-axis:
quantitative representation of Income per Capita. y-axis: BPL-level.]

[Figure A.4: Household Income. Plot title: Household Income vs Success. x-axis:
quantitative representation of Household Income. y-axis: BPL-level.]

[Figure A.5: Home Appreciation. Plot title: Home Appreciation vs Success. x-axis:
quantitative representation of Home Appreciation. y-axis: BPL-level.]

[Figure A.6: Homes Owned. Plot title: Homes Owned vs Success. x-axis:
quantitative representation of Homes Owned. y-axis: BPL-level.]

[Figure A.7: Housing Vacant. Plot title: Housing Vacant vs Success. x-axis:
quantitative representation of Housing Vacant. y-axis: BPL-level.]

[Figure A.8: Homes Rented. Plot title: Homes Rented vs Success. x-axis:
quantitative representation of Homes Rented. y-axis: BPL-level.]

[Figure A.9: Violent Crime. Plot title: Violent Crime vs Success. x-axis:
quantitative representation of Violent Crime. y-axis: BPL-level.]

[Figure A.10: 2 Year College. Plot title: 2 Year College vs Success. x-axis:
quantitative representation of 2 Year College. y-axis: BPL-level.]

[Figure A.11: 4 Year College. Plot title: 4 Year College vs Success. x-axis:
quantitative representation of 4 Year College. y-axis: BPL-level.]

[Figure A.12: Graduate Degree. Plot title: Graduate Degree vs Success. x-axis:
quantitative representation of Graduate Degree. y-axis: BPL-level.]
B. Data variables

Population                                  Pop. Density
Pop.Change                                  Median Age
Households                                  Household Size
Male Population                             Female Population
Married Population                          Single population
Air Quality                                 Water Quality
Superfund Sites                             Physicians per 100k
Unemployment Rate                           Recent Job Growth
Future Job Growth                           Sales Taxes
Income Taxes                                Income per Cap.
Household Income                            Income Less Than 15K
Income between 15K and 25K                  Income between 25K and 35K
Income between 35K and 50K                  Income between 50K and 75K
Income between 75K and 100K                 Income between 100K and 150K
Income between 150K and 250K                Income between 250K and 500K
Income greater than 500K                    Management, Business, and Financial Operations
Professional and Related Occupations        Service
Sales and Office                            Farming, Fishing, and Forestry
Construction, Extraction, and Maintenance   Production, Transportation, and Material Moving
Median Home Age                             Median Home Cost
Home Appreciation                           Homes Owned
Housing Vacant                              Homes Rented
Property Tax Rate                           Property Tax Rate Less Than $20,000
Property Tax Rate $20,000 to $39,999        Property Tax Rate $40,000 to $59,999
Property Tax Rate $60,000 to $79,999        Property Tax Rate $80,000 to $99,999
Property Tax Rate $100,000 to $149,999      Property Tax Rate $150,000 to $199,999
Property Tax Rate $200,000 to $299,999      Property Tax Rate $300,000 to $399,999
Property Tax Rate $400,000 to $499,999      Property Tax Rate $500,000 to $749,999
Property Tax Rate $1,000,000 or more


1999 to October 2005     1995 to 1998
1990 to 1994             1980 to 1989
1970 to 1979             1960 to 1969
1950 to 1959             1940 to 1949
1939 or Earlier          Violent Crime
Property Crime           Rainfall (in.)
Snowfall (in.)           Precipitation Days
Sunny Days               Avg. July High
Avg. Jan. Low            Comfort Index (higher=better)
UV Index                 Elevation ft.
School Expend.           Pupil/Teacher Ratio
Students per Librarian   Students per Counselor
2 yr College Grad.       4 yr College Grad.
Graduate Degrees         High School Grads.
Commute Time             Auto (alone)
Carpool                  Mass Transit
Work at Home             Commute Less Than 15 min.
Commute 15 to 29 min.    Commute 30 to 44 min.
Commute 45 to 59 min.    Commute greater than 60 min.
Overall                  Food
Utilities                Miscellaneous
Percent Religious        Catholic
Protestant               LDS
Baptist                  Episcopalian
Pentecostal              Lutheran
Methodist                Presbyterian
Other                    Christian
Jewish                   Eastern
Islam                    Democrat
Republican               Independent
Other
