Analyzing and Predicting European Soccer Match Outcomes
Introduction
Soccer, in my opinion, is not only the most popular but also the best sport in the world. I always wake up early on Saturday and Sunday mornings to watch the matches on television. I love the emotion, the skills, the drama, and everything about it.
That is why, for my capstone project, I wanted to find out whether I could create something of value from the many hours I have devoted to watching my favorite sport. I decided to build a Shiny app to visualize the data and to apply the machine learning algorithms I had learned in an attempt to correctly predict the outcome of soccer matches. Below I describe where I obtained my data, the data cleansing, the feature selection, the interactive plots of the data, and the algorithms used to predict the outcome of soccer matches.
Data Source
I was able to identify a comprehensive data source of football matches on Kaggle. The data source was a .sqlite file which contained 7 tables:
Table Name | Table Description |
Country | A table containing the country name of soccer teams |
League | A table containing the name of all the soccer leagues |
Match | A table containing all the match details from 2007 to 2016 |
Player | A table containing all player ID and player name for all the teams |
Player Attributes | A table containing additional information of all the players such as attacking ability, defensive ability, strength, etc. |
Team | A table containing all the team ID and team name |
Team Attributes | A table containing additional information of all the teams such as their attacking and defensive ability, etc. |
Using the RSQLite library, I was able to load all the tables into R, each as its own data table.
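The loading step can be sketched as follows. The author used RSQLite in R; this is a Python illustration with the standard `sqlite3` module, and the in-memory database with its two tiny tables is a hypothetical stand-in for the Kaggle file.

```python
import sqlite3

def load_tables(conn):
    """Return {table_name: list of row dicts} for every user table in the database."""
    conn.row_factory = sqlite3.Row
    names = [
        row["name"]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    return {
        name: [dict(row) for row in conn.execute(f'SELECT * FROM "{name}"')]
        for name in names
    }

# Hypothetical stand-in for the .sqlite file; with the real file one would
# call sqlite3.connect(<path to the file>) instead.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE League (id INTEGER PRIMARY KEY, country_id INTEGER, name TEXT);
    INSERT INTO Country VALUES (1, 'England');
    INSERT INTO League VALUES (1, 1, 'England Premier League');
    """
)
tables = load_tables(conn)
print(sorted(tables))
```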
Data Cleansing
When given a new data set, the first check performed is the number of missing inputs in the raw data. As we will see in the figures below, missingness was a big issue, especially with the match, team attributes, and player attributes tables. Either the data had incomprehensible information or the data was missing.
The figure above contains a description of the missingness in the player attributes table. The histogram on the left shows the percentage of data that is missing per feature. The plot on the right is a grid that shows the combinations of features most prevalent in the data, with red indicating features that are missing and blue signifying available data. We can see that for a big portion of the data, ~98%, there is no missingness (all the features are available). Although only a small percentage of some features is missing in this table, it is still something we will need to handle in order not to blindly throw away observations.
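The two quantities summarized by that kind of figure, the share of missing values per feature and the share of fully observed rows, can be computed along these lines. This is a minimal Python sketch rather than the author's R code; the column names and rows are hypothetical.

```python
def missing_percent(rows):
    """Percentage of missing (None) values per column in a list of row dicts."""
    if not rows:
        return {}
    return {
        column: 100.0 * sum(row[column] is None for row in rows) / len(rows)
        for column in rows[0].keys()
    }

def complete_case_percent(rows):
    """Percentage of rows in which no column is missing."""
    complete = sum(all(value is not None for value in row.values()) for row in rows)
    return 100.0 * complete / len(rows)

# Hypothetical player-attribute rows.
rows = [
    {"overall_rating": 71, "strength": 60},
    {"overall_rating": 68, "strength": None},
    {"overall_rating": 80, "strength": 75},
    {"overall_rating": 77, "strength": 66},
]
per_feature = missing_percent(rows)
complete_share = complete_case_percent(rows)
print(per_feature, complete_share)
```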
As we can see from the figure above, only one feature from the team attributes table is missing a significant amount of data. There are more observations with that data missing than not, and since each team is different, it does not make sense to replace the missing observations within the feature with a mean value or a randomly imputed value.
Finally, in the match table, we can see that a huge percentage of some features is missing. As in the team attributes table, combinations of features with missing data are more prevalent than combinations of features with no missing data. The reason why features appear to be missing three at a time is that most of the features indicated above are betting odds for winning, losing, and drawing a game from different betting companies.
For each company, it appears that if one of the odds (winning, losing, or drawing a game) is missing, then there is a good chance that the remaining two will be missing as well. In that case, we can say that the odds features are missing at random (MAR), since the probability of one odds feature being missing depends heavily on the availability of the remaining odds. Since the remaining combinations of missing data show no such pattern, we can conclude that those features are missing completely at random (MCAR): the probability of a value being missing does not depend on the value of another feature, as it would under MAR.
Also we can definitely rule out missing not at random (MNAR) since the feature value itself has no bearing on whether or not the value will be missing. As we will discuss in the upcoming section, a lot of the betting features from different companies are highly correlated with one another, so we can drop certain features without losing significant information. This allows us to keep more observations and prevent any bias that might have been introduced from dropping observations with missing data.
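The co-missingness of one bookmaker's three odds can be checked by counting how often none, some, or all of them are missing in a match row. The sketch below is a Python illustration, not the author's code; the column names `B365H`, `B365D`, and `B365A` and the rows are assumptions made for the example.

```python
from collections import Counter

def missing_pattern_counts(rows, columns):
    """Count rows by how many of `columns` are missing: 'none', 'some', or 'all'."""
    counts = Counter()
    for row in rows:
        n_missing = sum(row[column] is None for column in columns)
        if n_missing == 0:
            counts["none"] += 1
        elif n_missing == len(columns):
            counts["all"] += 1
        else:
            counts["some"] += 1
    return counts

# Hypothetical match rows with home/draw/away odds from one company.
odds_columns = ["B365H", "B365D", "B365A"]
matches = [
    {"B365H": 1.8, "B365D": 3.4, "B365A": 4.5},
    {"B365H": None, "B365D": None, "B365A": None},
    {"B365H": 2.1, "B365D": 3.2, "B365A": 3.6},
    {"B365H": None, "B365D": None, "B365A": None},
]
patterns = missing_pattern_counts(matches, odds_columns)
print(dict(patterns))
```

If the "some" count stays close to zero while "all" is large, the three odds go missing together, which is the pattern described above.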
Since the match table will need to be merged with the player attributes table and team attributes table, it is vital to select the right features from the three tables in order to decrease the number of observations with missing data and to develop custom functions to properly handle missing values rather than using mean imputation, random imputation, or some form of regression imputation.
To perform preliminary feature selection that accounts for missing data, I decided to use a correlation plot to find the correlation between all the features in their respective tables. If two or more features are highly correlated, then there is a good chance they carry the same information. Consequently, I would need to pick only one of those features.
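The idea of keeping one feature from each highly correlated group can be sketched as a greedy filter on pairwise Pearson correlations. The author made this choice by inspecting correlation plots in R; the Python code below, the 0.9 threshold, and the feature names are assumptions used only to illustrate the principle.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical features: two bookmakers' home odds and an unrelated rating.
features = {
    "home_odds_a": [1.5, 2.0, 3.1, 1.8, 2.6],
    "home_odds_b": [1.55, 2.05, 3.0, 1.75, 2.7],
    "rating": [5, 1, 4, 2, 3],
}
selected = drop_correlated(features)
print(selected)
```

Because the filter is greedy, the order of the columns decides which member of a correlated group survives; placing the feature with the least missing data first mirrors the choice made later for the betting odds.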
From the figure above, we can see that although there is some correlation between the attacking and defensive attributes in the team attributes table, none of the features were highly correlated with one another. I decided to merge all the features from this table with the match table. As I still had to deal with the missingness with some features, I decided to write a custom function that performed in the following manner:
- Perform a left join between the match table and the team attributes table
- Run a for loop over each team
- From 2016 back to 2007, check whether any of the team’s features has a missing value
- If a feature contains a missing value, I followed these steps:
  - Replace it with the previous year’s value
  - If the previous year’s value is missing, replace it with the value from the nearest earlier year that is not null
  - If the value is still missing, run another for loop from 2007 to 2016 that accomplishes the same task as described above
  - For example, if the value is null in 2010, look from 2009 back to 2007 for a value that is not null. If those values are all null, then look from 2011 to 2016 for a value that is not null.
  - If the value is still null, leave it as null, since the observation will be discarded later.
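The fallback order in the steps above (earlier years first, then later years, otherwise leave the gap) can be sketched for a single team and a single feature as follows. This is a Python illustration under assumed data structures, not the author's R function; `fill_from_nearest_year` is a hypothetical name.

```python
def fill_from_nearest_year(values_by_year):
    """Fill missing yearly values from the nearest earlier year, else the nearest later one.

    values_by_year maps year -> value (None when missing). Years that cannot
    be filled stay None so the observation can be discarded later.
    """
    years = sorted(values_by_year)
    filled = dict(values_by_year)
    for year in reversed(years):  # walk from the latest year back to the earliest
        if values_by_year[year] is not None:
            continue
        earlier = [y for y in reversed(years) if y < year]  # e.g. 2009, 2008, 2007
        later = [y for y in years if y > year]              # e.g. 2011, ..., 2016
        for candidate in earlier + later:
            if values_by_year[candidate] is not None:
                filled[year] = values_by_year[candidate]
                break
    return filled

# Hypothetical yearly values of one team attribute.
team_feature = {2008: None, 2009: 52, 2010: None, 2011: None, 2012: 60}
filled_feature = fill_from_nearest_year(team_feature)
print(filled_feature)
```

The lookup always reads from the original values, so a filled year never becomes the source for another fill; with this fallback order the result is the same either way.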
In the correlation plot in the player attributes table, we can see that attacking features are highly correlated with each other and defense features as well, the same result we observed in the team attributes table. Due to a shortage of time, I decided to use only the overall player rating feature from the player attributes table since it was a good representation of all the features.
Another reason I decided to use only the overall player ratings feature was to avoid piling on too many features. Each player had one value per year, and since each team has 11 players, selecting only one feature from this table would translate into adding 22 features to the match table (home and away team per match). So if each player had two features from the player attributes table, it would double to 44 features added to the match table.
As different players would need different features (attackers to attacking features, defenders to defensive features, etc.), it made sense to use overall player ratings for now and, based on model results, see if adding more features would lead to improved results. For the player attributes table, I wrote a slightly different custom function for identifying and replacing missing data:
- Perform a left join between the match table and the player attributes table
- Run a for loop over each player
- From 2016 back to 2007, check whether any of the player’s features has a missing value
- If a feature contains a missing value, I followed these steps:
  - Replace it with the previous year’s value
  - If the previous year’s value is missing, replace it with the value from the nearest earlier year that is not null
  - If the value is still missing, run another for loop from 2007 to 2016 that accomplishes the same task
  - For example, if the value is null in 2010, look from 2009 back to 2007 for a value that is not null. If those values are all null, then look from 2011 to 2016 for a value that is not null.
  - If the value is still null, replace the null value with the mean rating of the players on the selected team.
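The last step of the player procedure, falling back to the team's mean rating, can be sketched as below. It is a Python illustration rather than the author's R code; the function name and the ratings are hypothetical.

```python
def fill_with_team_mean(ratings):
    """Replace ratings that are still missing (None) with the mean of the known ones."""
    known = [r for r in ratings if r is not None]
    if not known:
        return list(ratings)  # nothing to average; leave the gaps in place
    mean_rating = sum(known) / len(known)
    return [mean_rating if r is None else r for r in ratings]

# Hypothetical overall ratings of a team's lineup with one unresolved gap.
lineup = [70, None, 80]
filled_lineup = fill_with_team_mean(lineup)
print(filled_lineup)
```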
The match attributes table was interesting; missing data corresponded to features related to betting odds (odds of home team winning, away team winning, and draw). As we can see from the figure above, there are a number of marked correlations. All the odds related to home teams from different companies are highly correlated with each other. All the odds related to the away team from different companies are highly correlated with each other.
Even all the odds related to the match ending a draw from different companies are highly correlated with each other. Due to this I decided to use only the odds from the betting company B365 because it was the one with the least missing data. Some features also contained garbage information (incorrectly scraped from respective websites) so I dropped those features from match table.
After merging the match table, player attributes table and team attributes table, I was left with the overall ratings per player, all the team attributes, the betting numbers for home team win, away team win and draw odds, and the goals scored by each team in a game.
Without some form of imputation, only ~7% of the observations were complete cases. With the custom functions, and after analyzing the missing data, I was able to retain ~68% of the data as complete cases.
Data Visualization
For the data visualization section, I decided to create a Shiny app that shows trends of wins/losses/draws for each team, home and away, from 2008 to 2016, trends of the team attributes from 2008 to 2016, and box plots highlighting the overall ratings of the 11 players on each team from 2008 to 2016. Rather than highlighting only one league, the app lets the user look at the English, French, Belgian, Spanish, German, Italian, Dutch, Scottish, and Portuguese leagues to see which teams had the most wins per year, along with the ratings of their players and of the overall teams.
Predictive Models
I decided to create models that predict the outcome of the home team winning/losing/drawing a game. This is a multi-class classification problem since there are three outcomes, win (W), loss (L), and draw (D). Below is the distribution of classes.
DRAW | LOSS | WIN |
4479 | 5084 | 8179 |
The win category has almost twice as many outcomes as each of the other classes, so this is something we will need to be wary of, especially when splitting the data into training and test sets. We want the classes in the training and test sets to be similarly distributed. To do this, I used the createDataPartition function in the caret package. We should also be wary of this distribution of classes because always predicting a win yields an accuracy of 46%. Consequently, any model that we build needs to beat this accuracy. For analyzing the results, I will be using the following metrics:
- Overall Accuracy
  - Overall accuracy is important because we want to make sure that overall, we are predicting better than the null case (predicting all results as wins).
- Sensitivity
  - This metric indicates, out of all “True” outcomes, how many we correctly predicted as True. “True”, in this case, would be winning a match. Having a high sensitivity value is important because a model could have a great overall accuracy but a poor sensitivity value, meaning that our model is doing a poor job of predicting that class. We want to make sure that both values are as high as possible without overfitting to the training set.
- Specificity
  - This metric indicates, out of all “False” outcomes, how many we correctly predicted as False. “False”, in this case, would be losing/drawing a match. Having a high specificity value is important because a model could have a great overall accuracy but a poor specificity value, meaning that our model is doing a poor job of predicting that class. As in the sensitivity case, we want to make sure that this value is as high as possible without overfitting to the training set.
Although having all the metrics above as high as possible is the best-case scenario, I will be tuning for overall accuracy because it does not matter whether the result is a win, draw, or loss. All that matters is correctly predicting the outcome of the soccer matches.
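The class-preserving split and the 46% baseline can be illustrated with a small sketch. The author used caret's createDataPartition in R; the Python function below is a hand-written stand-in for the same idea (sampling within each class), and the 90% training share is taken from the model descriptions in this post.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.9, seed=0):
    """Split row indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Class counts from the distribution table above.
counts = {"DRAW": 4479, "LOSS": 5084, "WIN": 8179}
labels = [label for label, n in counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels)

# Null model: always predict the most frequent class.
baseline_accuracy = max(counts.values()) / sum(counts.values())
win_share_in_test = sum(labels[i] == "WIN" for i in test_idx) / len(test_idx)
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"share of wins in the test set: {win_share_in_test:.3f}")
```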
For my first model, I chose xgboost because it is quick, works well for classification, and does not force the model to assume a particular functional form the way regression models do. For xgboost, I used 10-fold cross-validation over the following parameter grid:
Parameter | Grid |
Max Depth | seq(1, 10, by = 4) |
Learning Rate | seq(0.05, 0.3, length.out = 6) |
Gamma | seq(0, 6, by = 2) |
Minimum child weight | seq(0.5, 2.5, by = 5) |
After 10-fold cross-validation, I ended up with the following parameters for xgboost:
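To make the size of that search space concrete, the grid can be expanded into explicit candidate settings. The sketch below is in Python rather than the author's R; `seq_by` and `seq_len` are hypothetical helpers that mimic R's `seq`, the dictionary keys are illustrative names, and the values are taken from the table exactly as written.

```python
from itertools import product

def seq_by(start, stop, by):
    """Mimic R's seq(start, stop, by = ...): start, start + by, ... up to stop."""
    values, i = [], 0
    while start + i * by <= stop + 1e-9:
        values.append(round(start + i * by, 10))
        i += 1
    return values

def seq_len(start, stop, length_out):
    """Mimic R's seq(start, stop, length.out = ...): evenly spaced values."""
    step = (stop - start) / (length_out - 1)
    return [round(start + i * step, 10) for i in range(length_out)]

grid = {
    "max_depth": seq_by(1, 10, 4),
    "learning_rate": seq_len(0.05, 0.3, 6),
    "gamma": seq_by(0, 6, 2),
    # As written in the table, a step of 5 over [0.5, 2.5] yields a single value.
    "min_child_weight": seq_by(0.5, 2.5, 5),
}

# Every combination of grid values is one candidate for cross-validation.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(candidates), candidates[0])
```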
Training Percentage | Max Depth | Objective | # of classes | Eval Metric | Early Stopping Rounds | Minimum Child Weight | Gamma | Learning Rate |
90% | 5 | SoftMax | 3 | Multi logloss | 7 | 1 | 0 | 0.01 |
One thing I noticed using xgboost was that removing the individual player ratings as features had a negligible effect on the results of the model (those features had very low variable importance). I reran the model with the grid provided above and arrived at the same optimal parameters mentioned above. After training and testing the model, I obtained the following results:
| Draw (Real) | Loss (Real) | Win (Real) |
Draw (Predicted) | 2 | 0 | 2 |
Loss (Predicted) | 120 | 241 | 115 |
Win (Predicted) | 325 | 267 | 700 |
Sensitivity for Win/Draw | Specificity for Win/Draw | Sensitivity for Win/Loss | Specificity for Win/Loss | Overall Accuracy |
99% | 1% | 86% | 47% | 53% |
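The figures in this table can be recomputed from the confusion matrix. The reported sensitivities and specificities appear consistent with a pairwise reading, in which only the two named classes are considered; that interpretation is an assumption, and the Python sketch below (not the author's R code) simply shows the arithmetic under it.

```python
# Rows are predictions, columns are actual outcomes (xgboost confusion matrix).
confusion = {
    "Draw": {"Draw": 2, "Loss": 0, "Win": 2},
    "Loss": {"Draw": 120, "Loss": 241, "Win": 115},
    "Win": {"Draw": 325, "Loss": 267, "Win": 700},
}

def overall_accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row.values()) for row in cm.values())
    return sum(cm[label][label] for label in cm) / total

def pairwise_rates(cm, positive, negative):
    """Sensitivity and specificity restricted to two classes.

    cm[predicted][actual]; matches involving the third class are ignored.
    """
    tp = cm[positive][positive]
    fn = cm[negative][positive]  # actual positive, predicted negative
    tn = cm[negative][negative]
    fp = cm[positive][negative]  # actual negative, predicted positive
    return tp / (tp + fn), tn / (tn + fp)

accuracy = overall_accuracy(confusion)
sens_wd, spec_wd = pairwise_rates(confusion, "Win", "Draw")
sens_wl, spec_wl = pairwise_rates(confusion, "Win", "Loss")
print(f"accuracy {accuracy:.3f}")
print(f"Win/Draw sensitivity {sens_wd:.3f}, specificity {spec_wd:.3f}")
print(f"Win/Loss sensitivity {sens_wl:.3f}, specificity {spec_wl:.3f}")
```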
Right away we can see that the overall accuracy is better than the null case. It also appears that since the win category almost doubles the other categories, the model is predicting a lot of wins when the results are draws or losses (high sensitivity and low specificity).
Next, I decided to use neural networks from the nnet library. Neural networks are fast and have a history of being very accurate, but the downside of the nnet library is that it supports only a single hidden layer. Again I tried this algorithm with and without the individual player ratings, and I got extremely similar results. I trained the model with 90% of the data, and without the individual player ratings as attributes, I obtained the following results:
| Draw (Real) | Loss (Real) | Win (Real) |
Draw (Predicted) | 5 | 4 | 2 |
Loss (Predicted) | 103 | 217 | 106 |
Win (Predicted) | 339 | 287 | 709 |
Sensitivity for Win/Draw | Specificity for Win/Draw | Sensitivity for Win/Loss | Specificity for Win/Loss | Overall Accuracy |
99% | 1.5% | 87% | 43% | 52.5% |
The results obtained from this algorithm perform better than predicting all matches as wins, but it is not an improvement on the results of xgboost. There is a slight improvement in the specificity for Win/Draw category, but it is still very poor. This again is due to the fact that the distribution of the results is heavily weighted towards the win category (high sensitivity, low specificity).
Rather than using only one algorithm, I decided to use a stacking method that used the following optimized algorithms to create meta features and use those meta features along with the initial features presented as inputs to an xgboost model:
- K Nearest Neighbors (grid)
Size | Decay | maxit |
1 to 10 | Exp([-15 to -5 with by = 5]) | [200 to 1000] by = 100 |
- Neural Networks (grid)
Size |
[1 to 134] by = 2 |
- LDA (Linear Discriminant Analysis)
- QDA (Quadratic Discriminant Analysis)
- Multinomial logistic regression
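One common way to build meta features for stacking is to use out-of-fold predictions, so that each row's meta feature comes from a model that never saw that row. Whether the author did it this way is not stated; the Python sketch below is an illustration under that assumption, and it uses a simple nearest-centroid classifier as a stand-in for the base learners listed above. The data and function names are hypothetical.

```python
def centroid_fit(X, y):
    """Stand-in base learner: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def centroid_predict(centroids, row):
    """Predict the class whose centroid is closest in squared distance."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(sorted(centroids), key=distance)

def out_of_fold_predictions(X, y, fit, predict, n_folds=5):
    """Predict each row with a model trained on the other folds only."""
    meta = [None] * len(X)
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        held_out = [i for i in range(len(X)) if i % n_folds == fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            meta[i] = predict(model, X[i])
    return meta

def add_meta_features(X, meta_columns):
    """Append one meta feature per base learner to the original features."""
    return [row + [column[i] for column in meta_columns] for i, row in enumerate(X)]

# Hypothetical, well-separated toy data: two features, two outcomes.
X = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [0.9, 1.1],
     [4.9, 5.2], [1.1, 1.0], [5.0, 5.1], [1.0, 0.8], [5.2, 5.0]]
y = ["L", "L", "W", "W", "L", "W", "L", "W", "L", "W"]

meta = out_of_fold_predictions(X, y, centroid_fit, centroid_predict)
stacked = add_meta_features(X, [meta])
print(stacked[0])
```

The stacked rows would then be passed to the final model (xgboost in this post); in practice the class labels used as meta features would be encoded numerically, or replaced with predicted class probabilities.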
Once the meta features were created, I used xgboost to predict the results of the matches and got the following results:
| Draw (Real) | Loss (Real) | Win (Real) |
Draw (Predicted) | 0 | 3 | 0 |
Loss (Predicted) | 133 | 248 | 110 |
Win (Predicted) | 314 | 257 | 707 |
Sensitivity for Win/Draw | Specificity for Win/Draw | Sensitivity for Win/Loss | Specificity for Win/Loss | Overall Accuracy |
100% | 0% | 86% | 49% | 54% |
Although this model performed better than predicting all wins, its overall accuracy of 54% is only marginally higher than the 53% and 52.5% obtained with xgboost and the neural network alone, so for its complexity it is not a meaningful improvement.
Conclusion/Recommendation
For this project, I collected data from Kaggle, cleaned it up to deal with null cases, merged certain tables, and performed feature selection in order to visualize the data and apply several machine learning algorithms in an attempt to correctly predict the outcome of soccer games.
Although a range of simple and complicated models were built to predict the outcomes of soccer games, and we were able to predict better than the null case, it appears that the features need to be revisited in order to obtain better model results. We could consolidate certain features, such as player attributes, or drop certain features to reduce model complexity. Another issue could be that more data needs to be collected to reflect a more even distribution of the win/loss/draw classes. This could potentially help in correctly predicting the outcome of the soccer matches.