Soccer Data Analysis

Posted on Jul 2, 2019
Although soccer is one of the most popular competitive sports in the world, comparatively little statistical and analytical work has been done on it. With the advent of big data, however, the trend of using data to improve the game is becoming more and more obvious. Soccer analysis companies such as OPTA and Prozone are already growing very fast. Motivated by the 2018-2019 UEFA Champions League, I set out not only to collect data, but also to analyze it in a way that serves the sport.

Here, you can find my Shiny App and Github.

Data Collection

The first part of the dataset came from Kaggle. It contains more than 25 thousand matches from the 2008/2009 season to the 2015/2016 season, 10 thousand players, and 11 European leagues, each the lead championship of its country. After calculating the number of goals and grouping the results by country or season, we obtain an overall picture of these 11 leagues. As a rule, every pair of teams in the same league plays each other every season. The chart on the left shows that Spain, France and England have the most games, which means they have the most teams and therefore more intense internal competition. The chart on the right shows the average number of goals per league in each season. Users can compare the leagues' performance year by year by adding or removing the bars that represent their average goals. Notably, the majority of previous UEFA Champions League winners came from the five leagues with higher average goals.
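The grouping step described above can be sketched in plain Python. This is a minimal illustration, not the code behind the app: the row layout `(league, season, home_goals, away_goals)` and the function name `average_goals` are assumptions, and the four rows are invented toy values rather than data from the Kaggle dataset.

```python
from collections import defaultdict

# Toy rows invented for illustration only (not real results):
# (league, season, home_goals, away_goals)
matches = [
    ("Spain", "2008/2009", 2, 1),
    ("Spain", "2008/2009", 0, 0),
    ("England", "2008/2009", 3, 2),
    ("England", "2009/2010", 1, 1),
]


def average_goals(rows):
    """Return {(league, season): mean total goals per match}."""
    totals = defaultdict(lambda: [0, 0])  # [sum of goals, number of matches]
    for league, season, home_goals, away_goals in rows:
        bucket = totals[(league, season)]
        bucket[0] += home_goals + away_goals
        bucket[1] += 1
    return {key: goals / count for key, (goals, count) in totals.items()}


avg = average_goals(matches)
for (league, season), value in sorted(avg.items()):
    print(league, season, value)
```

Keying the result by `(league, season)` keeps one number per bar, which matches how the chart is described: one average per league per season.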

To figure out how the value of its players influences a team's performance, I also scraped data from transfermarkt using Python. It contains basic information on the 250 most valuable players of each year from 2008 to 2015.

Application Features

Data Comparison between teams

Users can compare the win percentage of one team over time or of different teams. For example, let's compare some of the most famous teams from England, Germany and Spain: Liverpool, Bayern Munich and Barcelona. Interestingly, one of the best teams in England does not have as high a win percentage as the teams from the other leagues. Remember that this dataset only contains matches between teams from the same league, so this might be because there are too many great teams in England.
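For readers who want to reproduce the comparison, a win percentage can be computed from match results roughly as follows. This is a sketch under assumptions: the tuple layout and the name `win_percentage` are hypothetical, the scores are invented toy values, and draws are counted as matches played but not won, which may differ from how the app defines it.

```python
from collections import Counter

# Toy rows invented for illustration only (not real results):
# (home_team, away_team, home_goals, away_goals)
matches = [
    ("Liverpool", "Chelsea", 2, 0),
    ("Chelsea", "Liverpool", 1, 1),
    ("Liverpool", "Arsenal", 0, 1),
    ("Arsenal", "Chelsea", 2, 2),
]


def win_percentage(rows):
    """Return {team: percentage of its matches that it won}."""
    played, won = Counter(), Counter()
    for home, away, home_goals, away_goals in rows:
        played[home] += 1
        played[away] += 1
        if home_goals > away_goals:
            won[home] += 1
        elif away_goals > home_goals:
            won[away] += 1
        # A draw adds to matches played for both teams but to neither's wins.
    return {team: 100.0 * won[team] / played[team] for team in played}


pct = win_percentage(matches)
for team, value in sorted(pct.items()):
    print(f"{team}: {value:.1f}%")
```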

Data Comparison between players

This application also gives users access to a table of player information, especially their ratings and potential at every match. Users can search by typing the names of one or more players and get a line plot of their overall rating over these years.

Intuition 1: Home or Away?

It is interesting to find that home teams win almost twice as many games as away teams. This is reasonable, because at the home ground most of the fans and supporters are there for the home team. This is called home advantage, and it has been attributed to the psychological effects that supporting fans have on the competitors or referees.
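The home-versus-away comparison comes down to counting outcomes. The sketch below shows one way to do it; the name `outcome_counts` is hypothetical and the five scorelines are invented toy values, so the ratio it prints says nothing about the real dataset.

```python
from collections import Counter

# Toy scorelines invented for illustration only: (home_goals, away_goals)
scores = [(2, 1), (0, 0), (1, 3), (4, 2), (1, 0)]


def outcome_counts(rows):
    """Count home wins, away wins and draws."""
    counts = Counter()
    for home_goals, away_goals in rows:
        if home_goals > away_goals:
            counts["home"] += 1
        elif home_goals < away_goals:
            counts["away"] += 1
        else:
            counts["draw"] += 1
    return counts


counts = outcome_counts(scores)
# Ratio of home wins to away wins; assumes at least one away win.
ratio = counts["home"] / counts["away"]
print(dict(counts), ratio)
```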

Intuition 2: How to be a good player?

This dataset also contains each player's overall rating and 36 parameters that evaluate players. I calculated the correlation between every parameter and the overall rating. The larger the correlation, the more important the parameter is taken to be. We can conclude that reaction is the most important factor affecting a player's performance. Age is also a useful parameter when a coach is choosing players for a team.
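The ranking idea can be illustrated with a hand-written Pearson correlation. This is a sketch, assuming Pearson correlation is the measure meant by "correlation"; the post does not say which one was used. The attribute names `reactions` and `sprint_speed`, the helper names, and all numbers are illustrative assumptions, not values from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy attribute table invented for illustration only (one value per player).
players = {
    "overall_rating": [60, 70, 80, 90],
    "reactions": [58, 69, 81, 92],
    "sprint_speed": [80, 60, 85, 70],
}


def rank_attributes(table, target="overall_rating"):
    """Sort attributes by absolute correlation with the target, strongest first."""
    scores = {
        name: pearson(values, table[target])
        for name, values in table.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_attributes(players)
for name, r in ranking:
    print(f"{name}: {r:.3f}")
```

Sorting by absolute value treats a strong negative correlation as equally informative as a strong positive one. Note that a high correlation shows that an attribute moves together with the overall rating, not that it causes a higher rating.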

Intuition 3: The importance of market value.

Finally, the plot on the left is a scatter plot of the win percentage and market value of each team. Market value here means how much the team spent to buy its players, so it is effectively the value of the players on each team. The value of players reflects not only their ability; it can also attract more fans, resources and sponsors. The points are concentrated in the upper-left area, so a team with low value can still achieve a high win percentage, but a team with high value is less likely to perform badly. On the right, the blue line is the win percentage of Manchester City in each year; it fits the trend of its market value well.

Future Work

This application can be improved in the following aspects:

  1. Adding data about matches between teams from different leagues.
  2. Using every player's performance and previous match results to predict, with a machine learning algorithm, the win probability of either team in a game.
