Simple Bayesian Model for Euro Cup score prediction
Predicting the score of a football match is a complex modelling task that would usually involve umpteen variables (from the managerial style of the current manager to the availability of in-form players, etc.). In this article, I build a simple Bayesian model that predicts scores from the difference in rankings between the two teams in a match.
We shall try to answer two questions:
- Given the difference in ranking, what is the probability of the first team scoring a certain number of goals (say, 0 to 10)?
- Given the difference in ranking, what is the probability of the second team scoring a certain number of goals (say, 0 to 10)?
Like all Bayesian analyses, we start with a prior distribution of goals scored in a football match. We assume that the goals scored by one team do not exceed 10. For the prior probabilities, we can either use a uniform distribution (which is unrealistic; how often do you hear of a team scoring 8 or 9 goals!) or assign our own probabilities based on our knowledge and understanding. I have taken the probabilities depicted in the figure below. (You can also model this as a Poisson distribution.)
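As a rough sketch of the Poisson alternative mentioned above, the prior can be built by evaluating a Poisson distribution on 0 to 10 goals and rescaling it so that the truncated probabilities sum to 1. The function name truncated_poisson_pmf and the mean of 1.4 goals per team are assumptions made for illustration; they are not the values used for the figure.

```python
import math

def truncated_poisson_pmf(mean_goals, max_goals=10):
    """Poisson probabilities for 0..max_goals, rescaled to sum to 1."""
    weights = [
        math.exp(-mean_goals) * mean_goals ** k / math.factorial(k)
        for k in range(max_goals + 1)
    ]
    total = sum(weights)
    # Dividing by the total puts the small tail mass above max_goals back
    # onto the allowed goal counts.
    return {k: w / total for k, w in enumerate(weights)}

# mean_goals=1.4 is an assumed value, not one estimated from the dataset
pmf_goals = truncated_poisson_pmf(mean_goals=1.4)
```

A hand-picked prior and a Poisson prior play the same role in the rest of the analysis; the Poisson version only replaces eleven hand-set numbers with a single parameter.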
Now for the observed data used to obtain the likelihoods. Fortunately, a very good dataset has been shared on Kaggle by martj42. To this dataset, I have added Elo ratings for each year for both the away team and the home team to obtain the rank difference (which will serve as our variable of interest).
This is what our processed data looks like:
Using this dataset, we calculate the likelihood: for each possible number of goals, of all the matches in which the team scored that many goals, what fraction were played at the same rank difference between the team and its opponent as in our match of interest (the match for which we are making a prediction)?
Let’s say we are trying to predict France vs Germany (which has already taken place in Munich). We have a prior distribution of goals that is not necessarily based on matches between France and Germany, or between opponents with a specific rank difference. We calculate the likelihood via the following function.
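One possible way to get from a match-level table to the per-team rows used later (columns country, goals and rank_diff) is sketched here. The two rows are made up purely for illustration, and the column names home_rank and away_rank are assumptions about how the yearly ranks were attached, not the actual schema. The sign convention follows the France vs Germany example: a team's own rank minus its opponent's rank.

```python
import pandas as pd

# Two made-up matches for illustration only; home_rank and away_rank are
# hypothetical column names for the ranks added to the Kaggle data.
matches = pd.DataFrame({
    "home_team": ["A", "C"],
    "away_team": ["B", "A"],
    "home_score": [2, 0],
    "away_score": [1, 0],
    "home_rank": [3, 20],
    "away_rank": [13, 3],
})

def to_team_rows(matches):
    """Return one row per team per match: country, goals, rank_diff."""
    home = pd.DataFrame({
        "country": matches["home_team"],
        "goals": matches["home_score"],
        # Own rank minus opponent rank: negative for the better-ranked team
        "rank_diff": matches["home_rank"] - matches["away_rank"],
    })
    away = pd.DataFrame({
        "country": matches["away_team"],
        "goals": matches["away_score"],
        "rank_diff": matches["away_rank"] - matches["home_rank"],
    })
    return pd.concat([home, away], ignore_index=True)

team_rows = to_team_rows(matches)
```

Each match contributes two rows, one from each team's point of view, so the same table can be filtered by either team.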
import pandas as pd

def compute_likelihood(df, country, rank_diff, prior):
    # One likelihood value per goal count in the prior's index
    likelihood = pd.Series(0.0, index=prior.index)
    # A rank difference of +1 or -1 is matched exactly; any other value is
    # matched together with its two neighbours (rank_diff - 1, rank_diff + 1)
    if rank_diff in (1, -1):
        window = [rank_diff]
    else:
        window = [rank_diff - 1, rank_diff, rank_diff + 1]
    team = df[df['country'] == country]
    for goals in prior.index:
        scored = team[team['goals'] == goals]
        # The +1 avoids division by zero when the team never scored this many goals
        total_len = len(scored) + 1
        len_rank_diff = scored['rank_diff'].isin(window).sum()
        # Assign by label so the result stays aligned with prior.index
        likelihood.loc[goals] = len_rank_diff / total_len
    return likelihood


def normalize(my_series):
    # Rescale the series so that its values sum to 1
    return my_series / my_series.sum()
[I have also counted matches with a rank difference of rank_diff+1 and rank_diff-1 for a better approximation.]
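Written out, the quantity that compute_likelihood estimates for each goal count, and the way it is combined with the prior by Bayes' rule, look roughly as follows. The symbols g (goals), d (rank difference of the match being predicted), P(g) (the prior) and the hat notation are introduced here for illustration only; for d = +1 or d = -1 the function matches rank_diff = d exactly instead of using the window.

```latex
\hat{L}(g) \;=\;
\frac{\#\{\text{team's matches with goals} = g \text{ and } |\text{rank\_diff} - d| \le 1\}}
     {\#\{\text{team's matches with goals} = g\} + 1},
\qquad
P(g \mid d) \;=\; \frac{P(g)\,\hat{L}(g)}{\sum_{g'} P(g')\,\hat{L}(g')}
```

The sum in the denominator runs over every goal count in the prior's index, which is what the normalize function computes.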
Now suppose we want to calculate our posterior distribution of goals when France and Germany are facing each other. We know that France is ranked 3 and Germany 13. The rank difference for France is -10 and for Germany +10. We insert these values in our function for calculating the likelihood, multiply the result by the prior (pmf_goals), and then normalize the product to obtain our posterior distribution.
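The update step itself is small enough to sketch with plain dictionaries. The numbers here are made up to keep the arithmetic easy to follow (only three goal counts); they are not the France or Germany values. The helper name posterior is hypothetical.

```python
def posterior(prior, likelihood):
    """Multiply prior by likelihood goal-by-goal, then rescale to sum to 1."""
    unnormalized = {g: prior[g] * likelihood[g] for g in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("likelihood is zero for every goal count")
    return {g: value / total for g, value in unnormalized.items()}

# Made-up prior and likelihood, for illustration only
prior = {0: 0.25, 1: 0.5, 2: 0.25}
likelihood = {0: 0.25, 1: 0.5, 2: 0.0}

post = posterior(prior, likelihood)
# The predicted score is the goal count with the highest posterior probability
most_likely_goals = max(post, key=post.get)
```

With the pandas objects from this article, the same step would presumably read normalize(pmf_goals * compute_likelihood(df, 'France', -10, pmf_goals)), where df stands for the processed data.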
The following are the results for France and Germany, respectively.
It can be seen that France is most likely to score 1 goal and Germany 0 goals (which was in fact the result; yay!).
The prior and posterior distributions for France are plotted below for the rank difference of -10.
Lest you accuse me of deliberately picking a favorable example, I am attaching the predictions obtained from this model for the matches before the knockout stage.
Indeed, a few results differ from the predictions. Let’s see how the remaining matches pan out. You can add more relevant variables to fine-tune this.