This year (well, since July anyway) I have been looking at the various statistical and mathematical models relating to sporting events. The idea is simple: use statistics to identify probabilities and patterns, and see whether the accuracy of such models can exceed the ‘gut call’ approach. Now I have a fair amount of the gut call data to hand, and I know it isn’t that good! The main problem is that there are a lot of models out there, and even the most straightforward concepts have been refined by people with far more experience and statistical knowledge than me. It can be very daunting, so rather than trying to start with a refined model I thought I would start with a base model, see how well that performs, and then work out how to amend it.
I am working on a model based on the concept of Elo ratings, but that will wait until later in the year, as far more games are required before an effective rating emerges for each team. Consequently I will start with a basic prediction model based on ‘Goal Expectancy’. This model predicts the number of goals a team will score and, combined with a Poisson distribution, can be used to predict the likelihood of various scorelines as well as (naturally) the likely winner of a game. In addition, the data can be used to indicate the status of a game in the typical Over/Under markets.
As indicated above, I am not trying to take credit for the base formulas and calculations. In the case of goal expectancy and Poisson distribution I have taken the work from an excellent blog by SoccerDude. I heartily recommend his blog to anyone interested in football statistics, whilst acknowledging any errors or omissions as my own!
The principle underlying this method is to take the number of goals a team has scored thus far, either in total or split home/away, and combine that with a Poisson distribution to get a probability for each number of goals that a specific team might score. Repeating this for both the home and away team gives a list of probabilities for the number of goals each side will score, and multiplying a home probability by an away probability gives the probability for that scoreline. Simple really!
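For anyone who wants the formula itself, this is the standard Poisson probability written out. The symbols are my own labels rather than anything from SoccerDude's blog: λ is a team's goal expectancy and k is the number of goals we are asking about. Multiplying the two teams' probabilities together quietly assumes that their goal counts are independent of each other.

```latex
P(k;\lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!},
\qquad
P(\text{home scores } h,\ \text{away scores } a)
  = P(h;\lambda_{\text{home}}) \times P(a;\lambda_{\text{away}})
```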
Goal expectancy has three values: the overall rate (total goals scored/matches played), the home rate (goals scored at home/goals scored away) and the away rate (goals scored away/goals scored at home). The goals expected from any team when playing at home are calculated as ‘overall rate * home rate’, and for a team playing away it is (if you can’t guess…) ‘overall rate * away rate’.
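The three rates are easy enough to put into code. This is only a minimal sketch of the calculation as described above; the function name and its arguments are made up for illustration, and returning None when the divisor is zero is my own assumption about how to handle a rate that cannot be calculated.

```python
def goal_expectancy(home_goals, away_goals, matches_played, venue):
    """Expected goals for one team, using the rates described in the post.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    overall_rate = (home_goals + away_goals) / matches_played

    if venue == "home":
        numerator, divisor = home_goals, away_goals   # home rate
    elif venue == "away":
        numerator, divisor = away_goals, home_goals   # away rate
    else:
        raise ValueError("venue must be 'home' or 'away'")

    if divisor == 0:
        # Assumption: the rate is undefined, so report "no data".
        return None

    return overall_rate * (numerator / divisor)


# Two goals at home, one away, five matches played, team at home
print(goal_expectancy(2, 1, 5, "home"))
```

The obvious weakness is visible straight away: one of the rates cannot be calculated at all until a team has scored at both venues.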
So, as an example: as of this weekend, Fulham have scored three times in five matches (overall rate 0.60), with two goals at home and one goal away (home rate 2.00). This gives an expectancy of 0.60*2.00 or 1.20. Cardiff have scored four in five games (overall rate 0.80), with three at home and one goal away (away rate 0.33). Their expectancy is 0.80*0.33 or 0.27. By plugging the number of goals we are interested in, along with the respective expectancy, into the Poisson formula we can get a probability for the number of goals the team might score, as shown below. (The reason these are pasted in as images is that I really haven’t got a good handle on tables in WordPress yet!)
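The per-goal probabilities can also be reproduced in a few lines of Python using nothing but the standard library. Treat this as a sketch: the function and variable names are my own, and I have used the unrounded away rate of 1/3 for Cardiff rather than 0.33.

```python
from math import exp, factorial


def poisson_probability(goals, expectancy):
    """Poisson probability of scoring exactly `goals` given an expectancy."""
    return expectancy ** goals * exp(-expectancy) / factorial(goals)


fulham_expectancy = 0.60 * 2.00       # overall rate * home rate
cardiff_expectancy = 0.80 * (1 / 3)   # overall rate * away rate (unrounded)

print("Goals  Fulham   Cardiff")
for goals in range(6):
    fulham_p = poisson_probability(goals, fulham_expectancy)
    cardiff_p = poisson_probability(goals, cardiff_expectancy)
    print(f"{goals:>5}  {fulham_p:.5f}  {cardiff_p:.5f}")
```

The probabilities for each team should add up to very nearly 1 once enough goal counts are included, which is a handy sanity check.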
The probability of any scoreline is therefore calculated by multiplying one by the other, e.g. 1-2 is 0.36143*0.02723 or 0.00984. This equates to 1%, or odds of 101.60 (using the 1/x formula). The grid is shown below.
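Building the whole grid is just a double loop over the two lists of probabilities. The sketch below uses names of my own choosing, and cutting the grid off at ten goals a side is an arbitrary assumption (anything beyond that is vanishingly unlikely at these expectancies). It also adds up the cells to get home/draw/away and Over/Under 2.5 figures, which is my own extension of the same grid.

```python
from math import exp, factorial


def poisson_probability(goals, expectancy):
    """Poisson probability of scoring exactly `goals` given an expectancy."""
    return expectancy ** goals * exp(-expectancy) / factorial(goals)


def scoreline_grid(home_expectancy, away_expectancy, max_goals=10):
    """Map (home goals, away goals) to probability, assuming independence."""
    return {
        (h, a): poisson_probability(h, home_expectancy)
        * poisson_probability(a, away_expectancy)
        for h in range(max_goals + 1)
        for a in range(max_goals + 1)
    }


# Fulham (home) v Cardiff (away), using the expectancies from the example
grid = scoreline_grid(0.60 * 2.00, 0.80 * (1 / 3))

p_1_2 = grid[(1, 2)]
print(f"1-2: probability {p_1_2:.5f}, implied odds {1 / p_1_2:.2f}")

# Summing cells gives the match result and Over/Under 2.5 probabilities
home_win = sum(p for (h, a), p in grid.items() if h > a)
draw = sum(p for (h, a), p in grid.items() if h == a)
away_win = sum(p for (h, a), p in grid.items() if h < a)
over_2_5 = sum(p for (h, a), p in grid.items() if h + a >= 3)
print(f"home {home_win:.3f}, draw {draw:.3f}, away {away_win:.3f}")
print(f"over 2.5 goals {over_2_5:.3f}, under {1 - over_2_5:.3f}")

# The three most probable scorelines
for score, p in sorted(grid.items(), key=lambda item: item[1], reverse=True)[:3]:
    print(score, f"{p:.5f}", f"{1 / p:.2f}")
```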
So what does this mean for the weekend’s fixtures? Well, the three most probable results (a dubious definition!) are shown below along with the probability and implied odds for each. The Spurs-Chelsea game has no data simply because Chelsea haven’t scored away from home yet!
But, before we all go and put the mortgage payments on some nice accumulators, let’s be clear that there are some significant issues here. The first, and in my personal opinion (with the level of understanding I have of statistics at the moment) the most important, is that this approach doesn’t account for the quality of the opposition that the goals were scored against. If two teams have played at home twice, and both have scored 10 goals whilst conceding 2, there is something of a difference if one team was playing Chelsea and Man Utd and the other was playing two teams at the bottom of the table. There is also, I am reliably informed, a problem with Poisson distributions in that they don’t marry up well with real results at the lowest scorelines (0-0 and 1-1 are under-forecast whilst 1-0 and 0-1 are over-forecast, which worries me as 6/9 of my predictions are 0-0!). And finally the most obvious problem: basing goal expectancy on five games of the season isn’t going to provide a very large data pool to work with.
But this is a first shot into this world, so let’s keep it simple for now and run this base model a few times to see how it performs before looking at amendments.