Working with sports data for the first time: Building a hockey model

Since I started this blog, several people have approached me about projects involving sports prediction, either for real sports or for fantasy leagues. Sadly, I do not follow any sports and lack the domain knowledge needed to carry out this task from scratch. However, my friend and colleague Waco has been doing sports betting as a hobby for some time and was kind enough to share some of his hockey data and knowledge so that we could work together on improving his current model. Note that he should take most of the credit for the insights leading to these results. This article contains my first impressions of using data science to build sports models, along with some of the results we have been able to obtain.

Learning curve of a classifier model trying to predict whether the home team will win. Our model reaches an impressive 77% accuracy in cross-validation (the blue line is the expectation from chance).

Sports data has several characteristics that make predicting outcomes difficult. The events have strong statistical drivers, such as how good the players are, yet they also have significant contributions from randomness, such as whether players get injured during a game or are having internal conflicts. Things like the weather, game attendance or even the players' family lives can all affect games, irrespective of how good or bad the players are on a level statistical playing field. Whether the sport is a true team sport or a more individual-centered one, like baseball, also plays a huge role in determining which variables are useful for prediction and which sources of randomness matter most.

In sports, domain knowledge plays a huge role. The dataset I started with has more than 40 variables, each a statistic known before the start of a game for both teams playing, and it covers more than 10K games over more than 10 years. The chance of winning by picking a team randomly is close to 50%, but it increases to 54% if you always pick the home team! That’s a very simple edge that comes from “very basic” domain knowledge.
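To make these two baselines concrete, here is a minimal sketch in Python. The games are simulated (the real dataset is not public), using the 54% home win rate quoted above. The function names are my own and purely illustrative.

```python
import random

# Hypothetical toy data: each entry is True when the home team won.
# The 54% home win rate is the figure quoted above; the games are simulated.
random.seed(0)
games = [random.random() < 0.54 for _ in range(10_000)]


def always_home_accuracy(home_won):
    """Accuracy of the trivial strategy 'always pick the home team'."""
    return sum(home_won) / len(home_won)


def coin_flip_accuracy(home_won, rng):
    """Accuracy of picking home or away at random for every game."""
    picks = [rng.random() < 0.5 for _ in home_won]
    hits = sum(pick == won for pick, won in zip(picks, home_won))
    return hits / len(home_won)


baseline_home = always_home_accuracy(games)
baseline_coin = coin_flip_accuracy(games, random.Random(1))
print(f"always home: {baseline_home:.3f}, coin flip: {baseline_coin:.3f}")
```

Any model we build has to beat the "always home" number, not the coin flip, to be worth anything.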

Home/away status is a very important discriminator. If we try to predict games without distinguishing which set of statistics belongs to the home team and which to the away team, we get scores close to what we expect from randomness, regardless of what fancy model we use. Aligning our statistics so that there are separate variables for the home and away teams gives us a very substantial statistical edge. This is because some statistics matter far more when you’re playing at home, while others matter more when you’re playing away. Knowing each team’s home/away status lets the model pay attention to the set of features that is relevant for that team.
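As a rough illustration of this alignment step, the sketch below flattens one game record into a single row with `home_` and `away_` prefixed features. The record layout and the stat names (`goals_per_game`, `save_pct`) are made up for the example and are not the statistics we actually used.

```python
# Hypothetical game record; field names are invented for illustration only.
def align_home_away(game):
    """Flatten one game into a row with home_/away_ prefixed features."""
    row = {}
    for side in ("home", "away"):
        for stat, value in game[side].items():
            # The same statistic becomes two distinct model inputs.
            row[f"{side}_{stat}"] = value
    # Target: 1 if the home team won, 0 otherwise.
    row["home_win"] = int(game["home_goals"] > game["away_goals"])
    return row


game = {
    "home": {"goals_per_game": 3.1, "save_pct": 0.915},
    "away": {"goals_per_game": 2.7, "save_pct": 0.908},
    "home_goals": 4,
    "away_goals": 2,
}
row = align_home_away(game)
print(row)
```

With this layout a model can learn a different weight for, say, the home team's save percentage than for the away team's, which is the whole point of the alignment.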

Importance as measured by an extra-trees classifier for 25 randomly sampled variables from our total set of features. Note that this model was only used to measure importances.

Interestingly, the predictive ability of the variables is fairly homogeneous, and the variable importance – as shown in the image above – is fairly well distributed across the variables we have. The image shows only 25 randomly drawn variables from our full set, but it retains the overall shape of the complete curve. Although there are around 10 features with significantly higher importance, using only those ten reduces the accuracy of the resulting models very substantially.
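For readers who want to reproduce this kind of importance plot on their own data, here is a minimal sketch using scikit-learn's `ExtraTreesClassifier`. It runs on synthetic data from `make_classification`, since our real features are not public. The sample counts, the number of trees and the other parameters are assumptions chosen for the example, not our actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the real (private) data: 40 features, 10 informative.
X, y = make_classification(
    n_samples=2000, n_features=40, n_informative=10, n_redundant=0,
    random_state=0,
)

# Fit an extra-trees ensemble purely to read off feature importances.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_

# Draw 25 features at random, as in the figure, and list them by importance.
rng = np.random.default_rng(0)
sample = rng.choice(X.shape[1], size=25, replace=False)
for idx in sorted(sample, key=lambda i: -importances[i]):
    print(f"feature_{idx:02d}: {importances[idx]:.4f}")
```

The importances are normalized to sum to one across all features, so a flat profile like ours means no small subset carries most of the signal.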

After properly grouping our variables, we proceeded to build models, with our best attempts achieving a mean accuracy of around 77% in cross-validation, significantly higher than the 54% you get by always picking the home team. The learning curve shown in the first image of this post represents how our model learns as a function of the number of examples. This curve shows that we gain most of our insights from the first 2K games and then see only very marginal improvements from the rest of the data. This is a good example of irreducible error: we are unlikely to get better results with more examples or more complicated models, and would instead need data that provides insights not present in our current statistics. Even then, further improvement is expected to become impossible at some point, mainly due to the random component inherent to sports in general.
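A learning curve like the one described above can be computed with scikit-learn's `learning_curve`. The sketch below uses synthetic data and a plain logistic regression as a placeholder. Neither is our actual data or model, and the training sizes are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real (private) data.
X, y = make_classification(
    n_samples=3000, n_features=40, n_informative=10, random_state=0
)

# Placeholder model; cross-validated accuracy at five training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over folds; a plateau in the validation score means that
# extra examples have stopped helping.
mean_val = val_scores.mean(axis=1)
for n, score in zip(sizes, mean_val):
    print(f"{n:5d} training examples -> validation accuracy {score:.3f}")
```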

Distribution of accuracy scores in 50-fold cross-validation

The image above also shows that our model is unlikely to be curve-fitted, since the dispersion of its scores in 50-fold cross-validation is small. This suggests that the insights it has gained generalize well over the problem of hockey games, at least in the historical period we have studied. We will see if it is able to make predictions that are this good during the next hockey season!
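The dispersion check itself is easy to sketch with scikit-learn's `cross_val_score`. As before, the data is synthetic and the logistic regression is only a placeholder for our actual model, so the numbers it prints say nothing about hockey.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real (private) data.
X, y = make_classification(
    n_samples=5000, n_features=40, n_informative=10, random_state=0
)

# 50 stratified folds, shuffled so each fold mixes games from the whole set.
cv = StratifiedKFold(n_splits=50, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# A narrow spread around the mean suggests the model is not curve-fitted.
print(
    f"mean={scores.mean():.3f} std={scores.std():.3f} "
    f"min={scores.min():.3f} max={scores.max():.3f}"
)
```

One caveat: shuffled folds mix past and future games, so for a real betting model a time-ordered split would be a stricter test than the one sketched here.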

Of course, you might have noticed that the data source and the actual statistics and models we used have been purposefully obscured throughout this post, as they are part of Waco’s IP. If you would like more information about this model or additional insights into hockey or other sports, I would encourage you to contact Waco on his Twitter or LinkedIn accounts.
