
Analyzing and Predicting European Soccer Match Outcomes

Efezino Erome-Utunedi
Posted on Oct 18, 2017

The skills the author demonstrated here can be learned through taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Introduction

Soccer, in my opinion, is not only the most popular but the best sport in the world. I always wake up early on Saturday and Sunday mornings to watch the matches on television. I love the emotion, the skills, the drama, and everything about it.

How to Use Odds to Predict Soccer

That is why, for my Capstone project, I wanted to find out if I could create something of value from the numerous hours I have devoted to watching my favorite sport. I decided to create a Shiny app to visualize the data and use the machine learning algorithms I had learned in an attempt to correctly predict the outcome of soccer matches. Below I describe where I obtained my data, the data cleansing, feature selection, interactive plots of the data, and the algorithms used to predict the outcome of soccer matches.

Data Source

I was able to identify a comprehensive data source of football matches on Kaggle. The data source was a SQLite file which contained 7 tables:

Table Name          Table Description
Country             The country name of each soccer team
League              The names of all the soccer leagues
Match               All the match details from 2007 to 2016
Player              The player ID and player name for all the teams
Player Attributes   Additional information on all the players, such as attacking ability, defensive ability, strength, etc.
Team                The team ID and team name for all teams
Team Attributes     Additional information on all the teams, such as their attacking and defensive ability, etc.

Using the RSQLite library, I was able to read all the tables into R as data tables.
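A minimal sketch of that loading step, assuming the file is named database.sqlite as in the Kaggle download:

```r
# Read every table in the SQLite file into a named list of data.tables.
# The file name is assumed from the Kaggle download.
library(DBI)
library(RSQLite)
library(data.table)

con <- dbConnect(SQLite(), dbname = "database.sqlite")
tables <- dbListTables(con)
dts <- lapply(tables, function(tbl) {
  as.data.table(dbGetQuery(con, paste0("SELECT * FROM ", tbl)))
})
names(dts) <- tables
dbDisconnect(con)
```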

Data Cleansing

When given a new data set, the first check to perform is for missing values in the raw data. As we will see in the figures below, missingness was a big issue, especially in the match, team attributes, and player attributes tables: either the data was incomprehensible or it was missing.

The figure above contains a description of the missingness in the player attributes table. The histogram on the left shows the percentage of data that is missing per feature. The plot on the right is a grid that shows the combinations of features most prevalent in the data, with red indicating features that are missing and blue signifying available data. We can see that for a big portion of the data, ~98%, there is no missingness (all the features are available). Although only a small percentage of some features is missing in this table, it is still something we will need to handle in order to not blindly throw away observations.

As we can see from the figure above, only one feature from the team attributes table is missing a significant amount of data. There are more observations with that data missing than not, and since each team is different, it does not make sense to replace the missing observations within the feature with a mean value or a randomly imputed value.

Finally, in the match table, we can see that a huge percentage of some features is missing. As in the team attributes table, combinations of features with missing data are more prevalent than combinations of features with no missing data. The reason features appear to go missing three at a time is that most of the features indicated above are betting odds for winning, losing, and drawing a game from different betting companies.

For each company, it appears that if one of the values is missing (the odds of winning, losing, or drawing a game), then there is a good chance the remaining two will be missing as well. In that case, we can say that the feature (odds of winning, losing, and drawing a match) is missing at random (MAR), since the probability of one of the odds features being missing depends heavily on the availability of the remaining odds. Since the remaining combinations of missing data appear to be random, we can conclude that the remaining missing features are missing completely at random (MCAR), since the probability of a value being missing does not depend on any other feature's value.

Also we can definitely rule out missing not at random (MNAR) since the feature value itself has no bearing on whether or not the value will be missing. As we will discuss in the upcoming section, a lot of the betting features from different companies are highly correlated with one another, so we can drop certain features without losing significant information. This allows us to keep more observations and prevent any bias that might have been introduced from dropping observations with missing data.

Since the match table will need to be merged with the player attributes table and team attributes table, it is vital to select the right features from the three tables in order to decrease the number of observations with missing data and to develop custom functions to properly handle missing values rather than using mean imputation, random imputation, or some form of regression imputation.

To perform preliminary feature selection that accounts for missing data, I decided to use a correlation plot to find the correlation between all the features in their respective tables. If two or more features are highly correlated, then there is a good chance they carry the same information; consequently, I would need to pick only one of those features.
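A minimal sketch of producing such a correlation plot with the corrplot package, where team_attributes is a hypothetical data frame standing in for the table:

```r
# Pairwise-complete correlations tolerate the missing values discussed
# above; corrplot then renders the correlation matrix as a heatmap.
library(corrplot)

num_cols <- sapply(team_attributes, is.numeric)
cors <- cor(team_attributes[, num_cols], use = "pairwise.complete.obs")
corrplot(cors, method = "color", type = "upper", tl.cex = 0.6)
```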

From the figure above, we can see that although there is some correlation between the attacking and defensive attributes in the team attributes table, none of the features were highly correlated with one another. I decided to merge all the features from this table with the match table. As I still had to deal with missingness in some features, I decided to write a custom function that performed in the following manner (a sketch follows the list):

  • Perform a left join between the match table and the team attributes table
  • Run a for loop for each team
  • From 2016 to 2007, check whether each team's features have a missing value
  • If a feature contains a missing value, follow these steps:
    • Replace it with the previous year's value
    • If the previous year's value is missing, replace it with the most recent earlier year's value that is not null
    • If the value is still missing, run another for loop from 2007 to 2016 that accomplishes the same task as described above
      • For example, if the value is null in 2010, look from 2009 back to 2007 for a value that is not null. If those values are all null, then look from 2011 to 2016 for a value that is not null.
    • If the value is still null, leave it as null, as the observation will be discarded later.
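A hedged sketch of that routine; the data layout and column names (team_api_id, year) are assumptions, not the exact code used here:

```r
# Fill a team's missing feature values by searching earlier years first
# (most recent first), then later years; anything still NA is dropped later.
fill_team_feature <- function(dt, feature) {
  for (id in unique(dt$team_api_id)) {             # assumed ID column
    rows <- which(dt$team_api_id == id)
    rows <- rows[order(dt$year[rows])]             # oldest season first
    vals <- dt[[feature]][rows]
    for (i in seq_along(vals)) {
      if (is.na(vals[i])) {
        prev <- if (i > 1) rev(vals[1:(i - 1)]) else vals[0]
        nxt  <- if (i < length(vals)) vals[(i + 1):length(vals)] else vals[0]
        cand <- c(prev[!is.na(prev)], nxt[!is.na(nxt)])
        if (length(cand) > 0) vals[i] <- cand[1]
      }
    }
    dt[[feature]][rows] <- vals
  }
  dt
}
```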

In the correlation plot of the player attributes table, we can see that the attacking features are highly correlated with each other, as are the defensive features, the same result we observed in the team attributes table. Due to a shortage of time, I decided to use only the overall player rating feature from the player attributes table, since it was a good representation of all the features.

Another reason I decided to use only the overall player rating feature was to avoid piling on too many features. Each player had a corresponding value per year, and as each team has 11 players, selecting even one feature from this table translates into adding 22 features to the match table (11 for the home team and 11 for the away team per match). If each player had two features from the player attributes table, that would double to 44 features added to the match table.

As different players would need different features (attackers need attacking features, defenders need defensive features, etc.), it made sense to use overall player ratings for now and, based on the model results, see if adding more features would lead to improved results. For the player attributes table, I used a slightly different custom function for identifying and replacing missing data:

  • Perform a left join between the match table and the player attributes table
  • Run a for loop for each player
  • From 2016 to 2007, check whether each player's features have a missing value
  • If a feature contains a missing value, follow these steps:
    • Replace it with the previous year's value
    • If the previous year's value is missing, replace it with the most recent earlier year's value that is not null
    • If the value is still missing, run another for loop from 2007 to 2016 that accomplishes the same task
      • For example, if the value is null in 2010, look from 2009 back to 2007 for a value that is not null. If those values are all null, then look from 2011 to 2016 for a value that is not null.
    • If the value is still null, replace it with the mean rating of the players on the selected team (a sketch of this fallback follows the list).
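For that final fallback, a small data.table-flavored sketch; match_dt and the rating column names are hypothetical:

```r
# Replace any still-missing home-player rating in row i with the mean
# rating of the rest of the selected team. Column names are hypothetical.
library(data.table)

cols <- paste0("home_player_rating_", 1:11)
vals <- unlist(match_dt[i, ..cols])            # one row of 11 ratings
vals[is.na(vals)] <- mean(vals, na.rm = TRUE)  # team-mean fallback
match_dt[i, (cols) := as.list(vals)]
```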

The match table was interesting: the missing data corresponded to features related to betting odds (odds of the home team winning, the away team winning, and a draw). As we can see from the figure above, there are a number of marked correlations. All the odds related to home teams from different companies are highly correlated with each other, and all the odds related to the away team from different companies are highly correlated with each other.

Even the odds of the match ending in a draw from different companies are highly correlated with each other. Because of this, I decided to use only the odds from the betting company B365, since it was the one with the least missing data. Some features also contained garbage information (incorrectly scraped from the respective websites), so I dropped those features from the match table.

After merging the match table, player attributes table and team attributes table, I was left with the overall ratings per player, all the team attributes, the betting numbers for home team win, away team win and draw odds, and the goals scored by each team in a game.

Without some form of imputation, only ~7% of the data had complete cases. But after analyzing the missing data and applying the custom functions, I was able to retain ~68% of the data as complete cases.

Data Visualization

For the data visualization section, I decided to create a Shiny app that shows trends of wins/losses/draws for each team, home and away, from 2008 to 2016; trends of the team attributes from 2008 to 2016; and box plots highlighting the overall ratings of the 11 players on each team from 2008 to 2016. Rather than highlighting only one league, the user is able to look at the English, French, Belgian, Spanish, German, Italian, Dutch, Scottish, and Portuguese leagues to see which teams had the most wins per year, the ratings of their players, and the ratings of the overall teams.

Shiny demonstration
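A bare-bones sketch of the app's structure; the team_results data frame and its columns are illustrative assumptions rather than the actual app code:

```r
# Skeleton of the dashboard: pick a league, pick a team, see its
# win trend by season. Data objects here are assumptions.
library(shiny)
library(ggplot2)

ui <- fluidPage(
  selectInput("league", "League", choices = sort(unique(team_results$league))),
  selectInput("team", "Team", choices = NULL),
  plotOutput("trend")
)

server <- function(input, output, session) {
  observeEvent(input$league, {
    teams <- sort(unique(team_results$team[team_results$league == input$league]))
    updateSelectInput(session, "team", choices = teams)
  })
  output$trend <- renderPlot({
    req(input$team)
    df <- subset(team_results, team == input$team)
    ggplot(df, aes(season, wins)) + geom_line() + geom_point()
  })
}

shinyApp(ui, server)
```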

Predictive Models

I decided to create models that predict the outcome of the home team winning/losing/drawing a game. This is a multi-class classification problem, since there are three outcomes: win (W), loss (L), and draw (D). Below is the distribution of classes.

Outcome   Count
DRAW       4479
LOSS       5084
WIN        8179

The win category has almost twice as many outcomes as the other classes, so this is something we will need to be wary of, especially when splitting into train and test sets: we want the class distributions of the train and test sets to be similar. To do this, I used the createDataPartition function from the caret package (a minimal sketch of this split follows the metrics list below). We should also be wary of this distribution because simply predicting a win every time already yields an accuracy of 46%; any model we build needs to beat that. For analyzing the results, I will be using the following metrics:

  • Overall Accuracy
    • Overall accuracy is important because we want to make sure that, overall, we are predicting better than the null case (predicting all results as wins).
  • Sensitivity
    • This metric indicates, out of all "True" outcomes, how many we correctly predicted as True; True, in this case, means winning a match. A high sensitivity value is important because a model can have great overall accuracy yet a poor sensitivity value, meaning it does a poor job of predicting that class. We want both values to be as high as possible without overfitting to the training set.
  • Specificity
    • This metric indicates, out of all "False" outcomes, how many we correctly predicted as False; "False", in this case, means losing or drawing a match. A high specificity value is important because a model can have great overall accuracy yet a poor specificity value, meaning it does a poor job of predicting that class. As in the sensitivity case, we want this value to be as high as possible without overfitting to the training set.
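As mentioned above, the stratified split can be done with caret's createDataPartition; a minimal sketch, where model_data and outcome are assumed names:

```r
# Stratified 90/10 split: createDataPartition samples within each class,
# so the W/L/D proportions stay similar in train and test.
library(caret)

set.seed(42)
idx   <- createDataPartition(model_data$outcome, p = 0.9, list = FALSE)
train <- model_data[idx, ]
test  <- model_data[-idx, ]

prop.table(table(train$outcome))   # compare class proportions
prop.table(table(test$outcome))
```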

Although having all of the metrics above as high as possible is the best-case scenario, I will be tuning for overall accuracy, because it does not matter whether the result is a win, draw, or loss; all that matters is correctly predicting the outcome of the soccer matches.

For my first model, I chose xgboost because it is quick, works well with classification, and does not force the data into a particular functional form the way regression models do. For xgboost, I used 10-fold cross-validation with the following parameter grid:

Parameter             Grid searched
Max depth             seq(1, 10, by = 4)
Learning rate         seq(0.05, 0.3, length.out = 6)
Gamma                 seq(0, 6, by = 2)
Minimum child weight  seq(0.5, 2.5, by = 5)
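A hedged sketch of wiring this grid into caret's xgbTree method; nrounds is not listed above, so its value here is an assumption, as are the data and outcome names:

```r
# 10-fold CV over the grid above. caret's xgbTree method requires all
# seven tuning parameters to appear in the grid.
library(caret)

grid <- expand.grid(
  nrounds          = 100,                            # assumed; not given above
  max_depth        = seq(1, 10, by = 4),
  eta              = seq(0.05, 0.3, length.out = 6),
  gamma            = seq(0, 6, by = 2),
  min_child_weight = seq(0.5, 2.5, by = 5),
  colsample_bytree = 1,                              # assumed defaults
  subsample        = 1
)

fit <- train(outcome ~ ., data = train, method = "xgbTree",
             trControl = trainControl(method = "cv", number = 10),
             tuneGrid = grid)
fit$bestTune
```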

After 10-fold cross-validation, I ended up with the following parameters for xgboost:

Parameter              Value
Training percentage    90%
Max depth              5
Objective              softmax
Number of classes      3
Eval metric            mlogloss
Early stopping rounds  7
Minimum child weight   1
Gamma                  0
Learning rate          0.01
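A minimal sketch of training xgboost directly with the tuned parameters above; the feature matrices and the 0/1/2 label encoding are assumptions:

```r
library(xgboost)

# Assumed: train_x/test_x are numeric feature matrices, train_y/test_y
# are labels encoded 0 = draw, 1 = loss, 2 = win.
dtrain <- xgb.DMatrix(as.matrix(train_x), label = train_y)
dtest  <- xgb.DMatrix(as.matrix(test_x),  label = test_y)

params <- list(objective = "multi:softmax", num_class = 3,
               eval_metric = "mlogloss", max_depth = 5,
               min_child_weight = 1, gamma = 0, eta = 0.01)

model <- xgb.train(params, dtrain, nrounds = 1000,
                   watchlist = list(test = dtest),
                   early_stopping_rounds = 7)
pred <- predict(model, dtest)            # predicted class labels (0/1/2)
table(Predicted = pred, Real = test_y)   # confusion matrix as below
```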

One thing I noticed using xgboost was that removing the individual player ratings as features had a negligible effect on the results of the model (those features had very low variable importance). I reran the model with the grid provided above and arrived at the same optimal parameters. After training and testing the model, I obtained the following results:

                   Draw (Real)   Loss (Real)   Win (Real)
Draw (Predicted)             2             0            2
Loss (Predicted)           120           241          115
Win (Predicted)            325           267          700

Sensitivity for Win/Draw   99%
Specificity for Win/Draw   1%
Sensitivity for Win/Loss   86%
Specificity for Win/Loss   47%
Overall accuracy           53%

Right away we can see that the overall accuracy is better than the null case. It also appears that, since the win category is almost double the size of the other categories, the model predicts a lot of wins when the true results are draws or losses (high sensitivity, low specificity).

Next I decided to use neural networks from the nnet library. Neural networks are fast and have a history of being very accurate, but the downside of the nnet library is that it allows only one hidden layer. Again, I tried this algorithm with and without the individual player ratings and got extremely similar results. I trained the model on 90% of the data, and with no individual player ratings as features, I obtained the following results:

                   Draw (Real)   Loss (Real)   Win (Real)
Draw (Predicted)             5             4            2
Loss (Predicted)           103           217          106
Win (Predicted)            339           287          709

Sensitivity for Win/Draw   99%
Specificity for Win/Draw   1.5%
Sensitivity for Win/Loss   87%
Specificity for Win/Loss   43%
Overall accuracy           52.5%
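For reference, a minimal sketch of fitting such a single-hidden-layer network with nnet; the size and decay values here are placeholders, not the tuned settings:

```r
# nnet fits one hidden layer; with a factor outcome it automatically
# uses a softmax output over the three classes.
library(nnet)

fit <- nnet(outcome ~ ., data = train, size = 10, decay = 0.1,
            maxit = 500, MaxNWts = 10000)
pred <- predict(fit, test, type = "class")
mean(pred == test$outcome)               # overall accuracy
```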

The results obtained from this algorithm perform better than predicting all matches as wins, but they are not an improvement on the results of xgboost. There is a slight improvement in the specificity for the Win/Draw category, but it is still very poor. This again is due to the fact that the distribution of results is heavily weighted towards the win category (high sensitivity, low specificity).

Rather than using only one algorithm, I decided to use a stacking method: the following optimized algorithms create meta features, and those meta features, along with the initial features, are presented as inputs to an xgboost model (a sketch of this setup follows the list):

  • K Nearest Neighbors (grid)
    k: seq(1, 134, by = 2)
  • Neural Networks (grid)
    size: 1 to 10
    decay: exp(seq(-15, -5, by = 5))
    maxit: seq(200, 1000, by = 100)
  • LDA (Linear Discriminant Analysis)
  • QDA (Quadratic Discriminant Analysis)
  • Multinomial logistic regression
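A hedged sketch of generating the meta features with caret; for brevity this uses in-sample predictions, whereas a proper stack would use out-of-fold predictions, and the data and outcome names are assumptions:

```r
library(caret)

base_methods <- c("knn", "nnet", "lda", "qda", "multinom")
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)

meta <- lapply(base_methods, function(m) {
  fit <- train(outcome ~ ., data = train, method = m, trControl = ctrl)
  p <- predict(fit, train, type = "prob")       # class probabilities
  names(p) <- paste(m, names(p), sep = "_")     # e.g. knn_W, knn_L, knn_D
  p
})

# Meta features joined to the original features, then fed to xgboost
# as in the earlier snippet.
meta_train <- cbind(train, do.call(cbind, meta))
```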

Once the meta features were created, I used xgboost to predict the results of the matches and got the following results:

                   Draw (Real)   Loss (Real)   Win (Real)
Draw (Predicted)             0             3            0
Loss (Predicted)           133           248          110
Win (Predicted)            314           257          707

Sensitivity for Win/Draw   100%
Specificity for Win/Draw   0%
Sensitivity for Win/Loss   86%
Specificity for Win/Loss   49%
Overall accuracy           54%

Although this model performed better than predicting all wins, given its complexity it is not an improvement on the results obtained from using only the xgboost and neural network algorithms.

Conclusion/Recommendation

For this project, I collected data from Kaggle, cleaned it up to deal with null cases, merged certain tables, and performed feature selection in order to visualize the data and apply several machine learning algorithms in an attempt to correctly predict the outcome of soccer games.

Although many models, simple and complicated, were created to predict the outcomes of soccer games, and we were able to predict better than the null case, it appears that the features need to be revisited in order to obtain better model results. We could consolidate certain features, such as the player attributes, or drop certain features to reduce model complexity. Another possibility is that more data needs to be collected to reflect a more even distribution of the win/loss/draw classes. This could help in correctly predicting the outcomes of soccer matches.

Sources

https://www.kaggle.com/hugomathien/soccer

https://rstudio.github.io/shinydashboard/

About Author

Efezino Erome-Utunedi

Efezino recently completed his MEng in Mechatronics Design at the University of British Columbia, focusing on controls engineering. He now works full-time at an engineering consulting firm while enrolled in the NYCDSA's 2017 January to May online cohort,...

Comments

Tom March 21, 2018
The odds used for features are closing odds. As such they can't be used in reality to predict upcoming matches. I feel that because of this, the strong bias between these odds and home and away wins is a reason you don't predict many draws. Have you tested your model on future matches and if so could you provide some accuracy measurements?
Flora Erome Utunedi October 18, 2017
Efezino, this is a very thoughtful analysis. Good job. Congratulations.
