Soccer Data Analysis
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Soccer is one of the most popular competitive sports in the world, yet comparatively little statistical and analytical work has been done on it. With the advent of big data, however, the trend of using data to improve the sport is becoming more and more evident. Soccer analysis companies such as OPTA and Prozone are already growing very fast. Motivated by the 2018-2019 UEFA Champions League, I set out not only to collect data but also to analyze it in a way that serves the sport.
Here, you can find my Shiny app and GitHub repository.
Data Collection
The first part of the dataset came from Kaggle. It contains more than 25,000 matches from the 2008/2009 season to the 2015/2016 season, 10,000 players, and the lead championship of 11 European countries. After counting goals and grouping them by country or season, we get an overall picture of these 11 leagues. Within a league, every pair of teams plays each other every season. On the left, we can see that Spain, France and England have the largest number of games, which means they have the most teams and therefore more intense internal competition. On the right is the average number of goals for every league in every season. Users can compare the leagues' performance season by season by adding or removing the bars that represent each league's average goals. In fact, most winners of previous UEFA Champions League seasons came from the five leagues with the highest average number of goals.
To figure out how the value of players influences the performance of a team, I also scraped data from transfermarkt using Python. It contains basic information on each year's 250 most valuable players from 2008 to 2015.
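The grouping step described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the Shiny app: the row layout `(league, season, home_goals, away_goals)`, the function name `average_goals` and the sample values are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample rows: (league, season, home_goals, away_goals).
# The real Kaggle data has far more rows and different column names.
matches = [
    ("Spain", "2008/2009", 2, 1),
    ("Spain", "2008/2009", 0, 0),
    ("England", "2008/2009", 3, 2),
    ("England", "2009/2010", 1, 1),
]


def average_goals(rows):
    """Return {(league, season): mean total goals per match}."""
    totals = defaultdict(lambda: [0, 0])  # [sum of goals, number of matches]
    for league, season, home_goals, away_goals in rows:
        entry = totals[(league, season)]
        entry[0] += home_goals + away_goals
        entry[1] += 1
    return {key: goals / count for key, (goals, count) in totals.items()}


avg = average_goals(matches)
for (league, season), value in sorted(avg.items()):
    print(league, season, round(value, 2))
```

Keying the result by `(league, season)` makes it easy to show or hide one league at a time, which is how the bar chart is meant to be used.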
Application Features
Data Comparison between teams
Users can compare win percentages for a single team or across different teams. For example, let's compare the most famous teams from England, Germany and Spain: Liverpool, Bayern Munich and Barcelona. Interestingly, one of the best teams in England does not have as high a win percentage as the teams from the other leagues. Remember that this dataset only contains matches between teams from the same league, so this may be because there are too many great teams in England.
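As a rough sketch of how a win percentage like this could be computed, the example below counts wins and matches played per team. The tuple layout, the function name `win_percentage` and the scores are hypothetical; they are not taken from the dataset or the app.

```python
from collections import Counter

# Hypothetical sample rows: (home_team, away_team, home_goals, away_goals).
matches = [
    ("Liverpool", "Chelsea", 2, 1),
    ("Chelsea", "Liverpool", 1, 1),
    ("Liverpool", "Arsenal", 0, 3),
    ("Arsenal", "Chelsea", 2, 0),
]


def win_percentage(rows):
    """Return {team: wins / matches played * 100}; draws count as non-wins."""
    played, won = Counter(), Counter()
    for home, away, home_goals, away_goals in rows:
        played[home] += 1
        played[away] += 1
        if home_goals > away_goals:
            won[home] += 1
        elif away_goals > home_goals:
            won[away] += 1
    return {team: 100.0 * won[team] / played[team] for team in played}


pct = win_percentage(matches)
```

Because only domestic matches are counted, a team's percentage depends on the strength of its own league, which is the caveat raised above.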
Data Comparison between players
This application also gives users access to a table of player information, in particular their ratings and potential at every match. Users can search by typing the names of one or more players and get a line plot of their overall rating over these years.
Intuition 1: Home or Away?
It is interesting to find that home teams win almost twice as many games as away teams. This is reasonable because, at the home ground, most of the fans and supporters are there for the home team. This is called home advantage, a benefit that has been attributed to the psychological effects supporting fans have on the competitors or referees.
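The home versus away comparison comes down to counting match outcomes. A minimal sketch, assuming each match is reduced to a `(home_goals, away_goals)` pair; the scores below are made up and the function name `outcome_counts` is only illustrative.

```python
from collections import Counter


def outcome_counts(scores):
    """Count home wins, away wins and draws from (home_goals, away_goals) pairs."""
    counts = Counter()
    for home_goals, away_goals in scores:
        if home_goals > away_goals:
            counts["home"] += 1
        elif home_goals < away_goals:
            counts["away"] += 1
        else:
            counts["draw"] += 1
    return counts


# Hypothetical scores, not taken from the dataset.
scores = [(2, 1), (0, 0), (1, 3), (3, 0), (1, 0)]
counts = outcome_counts(scores)
home_to_away_ratio = counts["home"] / counts["away"]
```

On the real data, a ratio close to 2 would correspond to the "almost twice" observation above.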
Intuition 2: How to be a good player?
This dataset also contains each player's overall rating and 36 parameters used to evaluate players. I calculated the correlation between every parameter and the overall rating. The larger the correlation, the more important the parameter is. We can conclude that reactions is the most important factor affecting a player's performance. Age is also a useful parameter for a coach choosing players for a team.
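To make the ranking idea concrete, here is a small sketch that computes the Pearson correlation of each parameter with the overall rating and sorts by absolute value. The choice of Pearson correlation is an assumption, since the text does not say which correlation was used, and the column names and numbers are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical player table, one list per column.
players = {
    "overall_rating": [60, 70, 80, 90],
    "reactions": [58, 69, 81, 92],
    "age": [30, 22, 27, 25],
}

target = players["overall_rating"]
ranked = sorted(
    ((name, pearson(values, target))
     for name, values in players.items() if name != "overall_rating"),
    key=lambda item: abs(item[1]),  # strongest relationship first
    reverse=True,
)
```

Sorting by absolute value keeps strongly negative relationships near the top as well. A high correlation shows that a parameter moves together with the rating; it does not by itself show that the parameter causes a better rating.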
Intuition 3: The importance of market value.
Finally, on the left is a scatter plot of win percentage against the market value of a team. The market value is how much the team spent to buy players, so it is effectively the value of the players on each team. The value of players reflects not only their ability; it can also attract more fans, resources and sponsors. The points are concentrated in the upper-left area. With a low market value, a team can still achieve a high win percentage, but with a high market value, a team is less likely to perform badly. On the right, the blue line is the win percentage of Manchester City every year, which fits well with the trend of its market value.
Future Work
This application can be improved in the following aspects:
- Adding data about matches between teams from different leagues.
- Using every player's performance and previous match results with a machine learning algorithm to predict the win probability for any two teams in a game.