Data-Driven Identification of Overvalued and Undervalued Soccer Players Relative to In-Season Performance
In European football, the growth of billionaire ownership and rising revenues have produced a transfer market with astronomical player valuations. The 10 largest transfer fees in the history of the sport have all come from 2017 onward, including Paris Saint-Germain’s purchases of Neymar and Mbappe for a combined fee of 367 million euros. Since valuations are highly speculative, clubs need a data-driven system that can identify whether a player is overvalued, undervalued, or valued within reason relative to their performances on the field.
A great example of a club overspending on talent is Barcelona. In the last 5 years, the club has paid massive transfer fees for players such as Philippe Coutinho, Ousmane Dembele, and Antoine Griezmann, all of whom delivered lackluster on-field performances. Combine this with poor resale value and massive contracts and you get a club that has accumulated more than a billion dollars in debt. The impact goes beyond the boardroom, as they now field a team sitting in 5th place with 35 points through 21 matches, just shy of Champions League qualification.
In contrast, clubs such as Leicester City identified undervalued talent to build a team that went on to win the Premier League and rock the world of football. They famously signed players like Jamie Vardy and Riyad Mahrez at absurdly low prices, only for them to reach the pinnacle of their respective positions. As player valuations continue to grow, it will be important for clubs to spend wisely to compete at the highest level while sustaining themselves financially.
To achieve this, we find the correlation between key team metrics and points across several seasons. These correlations are used as weights that are applied to players’ individual metrics for a season. In addition, factors such as age, position bracket, and league difficulty contribute to an overall weighted score.
Using a Loess line of best fit, we can find players who are overvalued, undervalued, or valued within reason relative to their current performance on the pitch. This will help clubs realize if they are spending wisely in the market, or if they should look to identify a different talent to bring into the squad.
Correlation to Points
To identify a correlation to points, we imported 3 years of team data from the 2018/19 season through the end of the 2020/21 season in the big 5 leagues of Europe (Premier League, Serie A, La Liga, Bundesliga, and Ligue 1). The 2019/20 season of Ligue 1 was removed, as it was halted without resumption due to COVID.
As expected, goals had the strongest positive correlation to points at 0.88, followed by xG (expected goals) at 0.80. Numerous other offensive statistics topped the chart, such as touches in the penalty area, key passes, and passes to the final third. For some metrics with negative correlations, such as interceptions, we will use the absolute value. A team with high interception numbers is typically out of possession often and therefore has fewer chances to score goals (which we see have the strongest correlation to point generation).
However, if a player successfully intercepts the ball often, it could indicate that they are a great reader of the game, and they regain possession for their team. These correlations will be saved as variables and applied as weights to the statistics of individual players later.
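As a rough illustration of this step, the sketch below computes a Pearson correlation between each team metric and points and stores the results as weights. The team-season rows are made-up numbers for illustration only, and the names `pearson`, `team_seasons`, and `weights` are assumptions, not the code used for the article.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up team-season rows (NOT the article's data).
team_seasons = [
    {"points": 90, "goals": 95, "interceptions": 310},
    {"points": 70, "goals": 68, "interceptions": 350},
    {"points": 52, "goals": 50, "interceptions": 390},
    {"points": 35, "goals": 34, "interceptions": 420},
]

points = [row["points"] for row in team_seasons]

# One correlation per metric, saved for later use as a weight.
weights = {
    metric: pearson([row[metric] for row in team_seasons], points)
    for metric in ("goals", "interceptions")
}
```

In this toy data, goals correlate positively with points and interceptions negatively, mirroring the pattern described above.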
Import Player Data
After finding our correlations, we import individual player data for all outfield players across the top 5 leagues in Europe. The data comprises all metrics within league play for the 2021/22 season from Wyscout. It includes general information (such as player name, age, preferred foot, and market value) and a variety of attacking, defending, and passing metrics. These include but are not limited to goals, xG, defensive duels per 90, aerial duels per 90, and progressive passes. In total, we have 75 features describing on-field performance for 2,345 players.
Upon filtering, we kept only players who are currently on a club’s roster, weeding out players who were transferred outside of the top 5 leagues or are on a youth team. This shrank the total to 2,153 players.
Modifying the Player Data Frame
- Inserting a league column based on the team the player is currently at. The leagues used are the Premier League, La Liga, Serie A, Bundesliga, and Ligue 1
- Adding a column for main position, which took the initial generalized position information and cut it down to the main position recognized in Wyscout.
- Grouping these players by position brackets. This was done so individuals who play in similar roles on opposite sides of the field would be grouped, since they would have similar key performance indicators. The position brackets are described below:
  - Center Forward – for center forwards
  - CAM – for central attacking midfielders
  - Center Mid – for central and defensive midfielders
  - Center Back – for center backs
  - Outside Back – for outside backs and wing backs
  - Wide Attacker – for wing players and wide attacking midfield options
- Grouping players based on development status. We referenced a study by Tom Worville at The Athletic, which identified peak ages for players by position. A brief video on this was made by Tifo Football and can be watched at https://www.youtube.com/watch?v=ZFUX1qXj6J8. The article written by Tom Worville can be found at https://theathletic.com/2935360/2021/11/15/what-age-do-players-in-different-positions-peak/. Although it is important to note that not all player progression is linear, we grouped our players into the following brackets:
  - Developing – players who are below peak ages for their position bracket
  - Peak – players who are in their peak ages for their position bracket
  - Veteran – players who are beyond the peak ages of their position bracket
- Lastly, we converted all player data into percentiles within each position bracket. This way, the top goal scorer in the center forward bracket sits at the 100th percentile, as does the top goal scorer in the wide attacker bracket. This eliminated conflicts across position brackets in the data. The percentile metric will play a role in the formula described below.
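The within-bracket percentile conversion can be sketched as follows. This is a minimal sketch under one assumption: a percentile is taken to be the share of players in the same bracket at or below a value, so the best player in each bracket scores 100. The player records and the name `add_bracket_percentiles` are hypothetical.

```python
from collections import defaultdict


def add_bracket_percentiles(players, metric):
    """Attach '<metric>_pct': percent of same-bracket players at or below this value."""
    values_by_bracket = defaultdict(list)
    for p in players:
        values_by_bracket[p["bracket"]].append(p[metric])
    for p in players:
        values = values_by_bracket[p["bracket"]]
        at_or_below = sum(1 for v in values if v <= p[metric])
        p[metric + "_pct"] = 100 * at_or_below / len(values)
    return players


# Hypothetical players, for illustration only.
players = add_bracket_percentiles(
    [
        {"name": "CF one", "bracket": "Center Forward", "goals": 12},
        {"name": "CF two", "bracket": "Center Forward", "goals": 6},
        {"name": "WA one", "bracket": "Wide Attacker", "goals": 7},
        {"name": "WA two", "bracket": "Wide Attacker", "goals": 3},
    ],
    "goals",
)
```

Here the top scorer of each bracket lands at the 100th percentile even though their raw goal totals differ, which is the point of ranking within brackets.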
Generate a Weighted Player Value
To generate a weighted player value for each metric, we deployed one of the following formulas:
- Percentile * ((Correlation ± 1) * 2)
- Percentile * ((1 + abs(Correlation)) * 2)
In the first formula, +1 was used when a positive correlation was found and -1 was used when there was a negative correlation (such as the fouls metric, because this negatively impacts a player). The formula with the absolute value was used when a metric was negatively correlated to points at the team level but has a positive impact for an individual player. To return to the earlier example, a team with a lot of interceptions is often out of possession and therefore has more chances to intercept the ball. On the other hand, a player with an ample number of interceptions is regaining possession for their team and showing an ability to read passing lanes.
In some cases, a player metric had no exactly matching team-level correlation coefficient. In these cases, the metric was either omitted or the coefficient of a similar metric was applied. For example, there was no assist feature in our team data frame. However, since an assist leads to the same outcome as a goal, the goal correlation weight was applied when calculating the value of the assist feature.
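The two formulas can be written as a single function. The sketch below is an illustration only: the function name `weighted_metric` and the flag `positive_for_player` are assumptions, the 0.88 goal correlation comes from the text above, and the negative correlation values are hypothetical.

```python
def weighted_metric(percentile, correlation, positive_for_player=False):
    """Weighted value of one metric for one player.

    positive_for_player=True selects the absolute-value formula, for metrics
    that correlate negatively with team points but reflect well on a player.
    """
    if positive_for_player:
        return percentile * ((1 + abs(correlation)) * 2)
    sign = 1 if correlation >= 0 else -1  # +1 for positive, -1 for negative
    return percentile * ((correlation + sign) * 2)


goals_value = weighted_metric(90, 0.88)  # goal correlation from the article
assists_value = weighted_metric(90, 0.88)  # assists reuse the goal weight
fouls_value = weighted_metric(50, -0.30)  # hypothetical correlation
interceptions_value = weighted_metric(50, -0.25, positive_for_player=True)  # hypothetical
```

With these inputs the fouls metric contributes a negative value, while interceptions contribute a positive one despite the negative team-level correlation.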
Another weight that we added pertains to the league a player’s team is based in. To generate this weight, we scraped Wikipedia’s UEFA Coefficient page. A league’s UEFA coefficient is the sum of its league coefficients over the previous 5 years. To earn points, the clubs in a league must get positive results in European competitions.
This coefficient indicates the quality of a league because it is generated by clubs playing outside of their domestic competition against other European teams. It is also significant because a league’s ranking determines how many qualification places its clubs receive in European competitions. Before adding this metric to our data frame, we multiplied the league coefficient by 0.25. This was done to avoid massive gaps in player valuations based solely on league, which is reasonable because clubs in one league frequently recruit from clubs in another.
From the data frame above, we can see that the Premier League is rated the highest. La Liga is ranked second, falling from its previous peak of first as recently as the 2019/20 season. Serie A and the Bundesliga are neck and neck, ranked 3rd and 4th respectively. In 5th, we have Ligue 1 in France.
Lastly, we added numerical weights for players’ development status. Developing was assigned the largest weight, as a player in the developing category is assumed to progress under the right guidance. As these players progress and display stronger metrics, a club will be able to sell them for a large profit. Therefore, purchasing a 21-year-old with the same metrics as a 28-year-old would be the wiser investment.
Players in their peak ages are generally at the height of their career. They have typically accumulated an abundance of game time, and clubs may look to sign a player in this bracket to make an immediate impact. Over the course of a few years, a player in their peak should retain their market value if performance levels are maintained. Finally, we have veteran players. These players often have the most experience but little to no resale value, as their abilities fade with age. Veterans are brought in to fill immediate needs or to bolster depth but should not be looked at as an investment. The assigned weights are listed below:
- Developing – 1.5
- Peak – 1.25
- Veteran – 1.0
| League | Position Peak | Average Market Value |
|---|---|---|
| La Liga | Developing | $17,002,597 |
| La Liga | Peak | $15,537,500 |
| La Liga | Veteran | $7,242,500 |
| Ligue 1 | Developing | $11,838,400 |
| Ligue 1 | Peak | $10,976,190 |
| Ligue 1 | Veteran | $7,030,952 |
| Premier League | Developing | $27,016,304 |
| Premier League | Peak | $27,666,667 |
| Premier League | Veteran | $17,903,509 |
| Serie A | Developing | $14,754,455 |
| Serie A | Peak | $15,420,000 |
| Serie A | Veteran | $8,273,770 |
The data above shows that the average market value for players in the developing and peak stages of their careers remains relatively similar within each league but falls significantly once players reach veteran status. Peak players in the Premier League also average a higher market value than developing and peak players outside of the Premier League. This is attributed to the high quality of play and the financial disparity between the Premier League and the rest.
Before calculating the final player value, the main data frame was split into 6 separate data frames based on the position brackets listed above. Key performance indicators were then selected for each of these 6 data frames according to position. For example, the data frame for center forwards is filled with features pertaining to goals, shots, touches in the penalty box, and so on, while the data frame for center backs contains defensive duels, aerial duels, interceptions, and more.
To calculate the final player value, we summed the weighted values of all key performance indicators for the player's position bracket. We then added the league weight identified earlier to this subtotal. Lastly, we multiplied the new subtotal by the weight assigned to development status. This final score shows how we rate each player based on their current season’s performance. The formula is listed below:
(rowSums(columns relating to player data) + League Coefficient Weight) * Position Peak Weight
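A minimal sketch of this formula for a single player is shown below. The development status weights and the 0.25 scaling come from the text above. The function name `final_score`, the metric values, and the league coefficient of 80.0 are hypothetical placeholders.

```python
# Development status weights as listed in the article.
STATUS_WEIGHTS = {"Developing": 1.5, "Peak": 1.25, "Veteran": 1.0}


def final_score(metric_values, league_coefficient, status):
    """(sum of weighted KPI values + league weight) * development status weight."""
    league_weight = 0.25 * league_coefficient  # scaled to limit league-only gaps
    return (sum(metric_values) + league_weight) * STATUS_WEIGHTS[status]


# Hypothetical weighted KPI values and league coefficient, for illustration.
peak_player = final_score([10, 20, 30], 80.0, "Peak")
developing_player = final_score([10, 20, 30], 80.0, "Developing")
```

Two players with identical metrics in the same league differ only by the status multiplier, so the developing player ends up with the higher score.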
Plotting a Line of Best Fit to Identify Talent Valuation
When plotting a line of best fit, we used the Loess method. We chose it because as a player’s weighted score increases, their market value increases at a higher rate. Fewer players perform at an extremely high level (relative to the others), and therefore their talents demand a higher market value. The Loess curve creates a much smoother line through the scatterplot, which is more indicative of this trend than a line of best fit from a linear model. An example comparing the two for wide attackers can be seen below:
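The idea of fitting a local curve rather than one straight line can be sketched in plain Python. This is a simplified illustration and not the fit used for the article: it performs local linear regression with tricube weights, and the name `loess_fit`, the `span` parameter, and the sample scores and values are all assumptions.

```python
def loess_fit(xs, ys, span=0.75):
    """Simplified Loess: local linear regression with tricube weights.

    Returns one fitted y per input x. `span` is the fraction of points
    that defines the neighbourhood of each local fit.
    """
    n = len(xs)
    k = max(2, int(round(span * n)))  # neighbours per local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        # Tricube weights: near points count most, points beyond h count zero.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Hypothetical weighted scores and market values (in millions).
scores = [10, 20, 30, 40, 50, 60, 70, 80]
values = [1, 2, 3, 5, 8, 13, 21, 34]
curve = loess_fit(scores, values)
```

Because each fitted point only uses nearby players, the curve can bend upward at high scores, where a single linear model would be forced to keep one slope.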
Each player is then labeled according to their residual from the curve, expressed as a percentage of their market value:
- Within Reason – if a player’s residual is between 0% and 20% of their market value, their market value is within reason of their performances this season.
- Slightly Overvalued – if a player’s residual is above 20% and up to 30% of their market value, they are slightly overvalued relative to their season performance.
- Significantly Overvalued – if a player’s residual is greater than 30% of their market value, they are significantly overvalued relative to their season performance.
- Within Reason – if a player’s residual is between 0% and -20% of their market value, their market value is within reason of their performances this season.
- Slightly Undervalued – if a player’s residual is below -20% and down to -30% of their market value, they are slightly undervalued relative to their season performance.
- Significantly Undervalued – if a player’s residual is less than -30% of their market value, they are significantly undervalued relative to their season performance.
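The labeling rules can be sketched as a small function. One assumption is made explicit here: the residual is taken as market value minus the value on the fitted curve, so a positive residual means the market value sits above what the curve suggests. The name `valuation_label` and the example numbers are hypothetical.

```python
def valuation_label(market_value, fitted_value):
    """Label a player from the residual as a percentage of market value.

    Assumes residual = market value - fitted value (positive = overvalued).
    """
    residual_pct = 100 * (market_value - fitted_value) / market_value
    size = abs(residual_pct)
    if size <= 20:
        return "Within Reason"
    direction = "Overvalued" if residual_pct > 0 else "Undervalued"
    degree = "Slightly" if size <= 30 else "Significantly"
    return f"{degree} {direction}"


# Hypothetical player valued at 10 million with a fitted value of 7.5 million.
label = valuation_label(10_000_000, 7_500_000)
```

Treating both signs with one threshold check keeps the two "Within Reason" bands and the two pairs of over/undervalued bands symmetric.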
This method is best deployed either prior to the January transfer window, using data from the first half of the season, or during the off-season, when a team can evaluate a player’s data from the completed prior season.
To evaluate our model, we randomly selected 3 players who had a weighted score around the 3rd quartile for each position bracket. We then created bar graphs to compare the market values and weighted scores of the players selected. Lastly, we selected a few KPIs to be displayed for the players, depending on their positional bracket. The percentile comparison helps us judge whether each valuation label is justified. Below, we go over 5 of the 6 position brackets; the center forward bracket is covered in a use case later.
- Each positional bracket has at least 2 valuation labels.
- Labels next to the bar display the valuation bracket along with the player's market value.
- The discrepancy in weighted scores between positional brackets is attributed to the fact that positions like center mid had more features applied than positions like center back. This is because there is a larger variety of attributes that can be found in this position.
- This graph shows that each positional bracket has players with a similar weighted score. Since this is indicative of a player's performance in the season selected, these players should in theory have comparable market values. This is obviously not the case, with the most glaring example being Sadio Mané. His market value is about 8.9 times that of Simon ($80,000,000 versus $9,000,000), while he has the second largest weighted score. For this reason, he is labeled as "Significantly Overvalued" while Simon is "Significantly Undervalued". We will compare each player's percentiles for KPIs relative to their position shortly.
- Labels next to the bar display each player's valuation label and their weighted score.
- Percentiles for the CAM bracket max out at 84. This is because there were only 84 players in this positional bracket, so 84 is the maximum score.
- Market Values
- O. Duda - $6,000,000
- Denis Suárez - $10,000,000
- N. Mbuku - $9,000,000
- Duda is a strong shooter of the ball. He generates a high shot volume with a respectable shot % on target relative to his peers. This has led to an xG per 90 that sits around the 3rd quartile for the position. Since we know goals come at a premium when it comes to point generation, he could be a great addition for a team looking to improve their goal tally.
- Suárez sits around the middle of the pack in most categories. However, he has a high pass output per 90 along with a strong desire to play the ball progressively. His overall pass accuracy leaves more to be desired.
- Mbuku gets a lot of his shots on target and attempts to dribble defenders, but is subpar in most other key categories. However, he is only 19 years old and falls into the developmental category. In this case, it may be worth evaluating the player further with a scouting report. This will allow us to judge the player by in game performances and see if he shows a high potential.
- The valuation labels for Duda and Suárez seem accurate relative to their market value. Mbuku's ratings are a bit underwhelming relative to his valuation label and market value, but it is important to note that he is extremely young at 19 years old. For this reason, an in-person scouting report would be more indicative of potential upside relative to his current market value.
- Market Values
- B. Santamaria - $14,000,000
- M. Loum - $4,000,000
- Jae-Sung Lee - $3,500,000
- The most obvious standout here is Jae-Sung Lee. Valued at $3,500,000, his goal-related metrics are amongst the strongest in the top 5 leagues for the central midfield position bracket. In addition to this, he has generated a respectable number of assists per 90, with an expected assist metric that sits near the 3rd quartile. Although his passing and dribbling abilities are lacking, his ability to contribute to a team's goal tally at a cut-rate price justifies his valuation label.
- In contrast, Santamaria has a market value four times that of Lee ($14,000,000 versus $3,500,000). He is currently in his peak age bracket, but only his successful dribble percentage is at a top level. With that being said, he attempts a low number of dribbles per 90, indicating he is not a player who dribbles often. Although his xG per 90 metric sits just shy of the 3rd quartile, there doesn't seem to be a justifiable reason to spend so much on him when other viable options are available at a cheaper price. His valuation label of significantly overvalued seems warranted.
- Market Values
- I. Diop - $12,000,000
- K. Danso - $5,500,000
- Germán Sánchez - $800,000
- In the case of all 3 center backs randomly selected, no player jumps out as a must-sign target. With the exception of Danso's ability to win aerial duels, all positive KPIs for the players displayed are around the middle or back of the pack for their respective position.
- Diop's valuation label of "Within Reason" is extremely generous and should be rejected by a club. His performances in game do not warrant the highest valuation of the trio, as he consistently has a lower percentile rating compared to his peers.
- Germán Sánchez is an interesting case. Although his metrics aren't at a high level, he has a respectable defensive duel win %. He is 35 years old and considered a veteran, while having one of the lowest market values in the position. This would warrant his label of "Significantly Undervalued", as a team on a tight budget can look to him as an option to bolster squad depth.
- Market Values
- Emerson Palmieri - $14,000,000
- Aihen Muñoz - $6,000,000
- W. Kechrida - $1,300,000
- Emerson Palmieri has the highest market value of the trio. His largest strength comes from his ability to pass the ball accurately, which is rated in the 92nd percentile. However, his defensive metrics are abysmal. His defensive duel win % doesn't even reach the 1st quartile. He would be much more useful on a team that allows fullbacks to get forward freely with minimal defensive duties. However, as he just exited his peak years, we believe his label of "Significantly Overvalued" is warranted.
- On the flip side, Muñoz has extremely strong defensive duel metrics, sitting around the 3rd quartile in defensive duels per 90 and defensive duels won. His accuracy is worse than that of Emerson, but his general cross output and crosses into the goalie box are far higher. Pair this with the fact that he just entered his peak years at the age of 24 and you have a full back whose market value is below his playing ability. In theory, his ability should continue to progress over the coming years under the right guidance. This progression in skill should coincide with a rise in market value, allowing a club to make a potential profit down the road.
- Market Values
- S. Mané - $80,000,000
- Iker Muniain - $12,000,000
- M. Simon - $9,000,000
- Sadio Mané is one of the most famous names in world football. His success at Liverpool includes titles in both the Premier League and Champions League. Mané shines most in his shooting/goal scoring abilities. His xG per 90, goals per 90, and shots per 90 are all at a minimum of the 88th percentile. However, he sits around the median for most other key metrics. In addition to this, his accurate pass % is at an abysmal level sitting at the 26th percentile. Although his goal scoring ability is elite, he is currently 29 years old and will most likely see his market value drop significantly as he goes through his early thirties. For these reasons, we believe his label of "Significantly Overvalued" is warranted.
- Mané's market value is listed as more than 6.5 times that of either of the other two players on this list. Despite this valuation, both Muniain and Simon provide significant value to their sides and are extremely strong in other KPIs. For example, Muniain's passing and assist-related metrics are consistently at the 84th percentile or higher. In addition to this, Simon is an excellent crosser of the ball and has generated an xA metric at an elite level in the 95th percentile. He is also in his peak years and would be expected to maintain or grow in market value throughout. For these reasons, we believe the valuation labels for both Muniain and Simon are correct. These are players who can be brought in and deliver significant output for the team, while the difference in cost between them and Mané can be reinvested in other areas of the squad. Overall, a stronger squad should have more of an impact on point generation than an individual player.
Deployment – A Premier League Club Looking to Sign a Center Forward
Let's say we are a recently promoted Premier League club looking to avoid relegation. We had a shaky start to the season and find ourselves in 17th place, 1 spot above the relegation zone. To get out of this situation, we look to sign players in the January transfer window to help bolster the squad for the second half of the season.
We are confident in our defense’s ability to prevent goals relative to the teams in the bottom half of the table but lack a reliable striker up top. Since we want a player who has some match sharpness to make an immediate impact, we filter the data frame to display center forwards with a score between 70 and 80, along with a minimum of 630 minutes played.
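A filter like this can be sketched in a few lines. The thresholds (a score between 70 and 80 and at least 630 minutes) come from the text above. The function name `shortlist`, the field names, and the player records are hypothetical.

```python
def shortlist(players, bracket, score_range, min_minutes):
    """Keep players in a bracket whose score and minutes meet the thresholds."""
    low, high = score_range
    return [
        p
        for p in players
        if p["bracket"] == bracket
        and low <= p["score"] <= high
        and p["minutes"] >= min_minutes
    ]


# Hypothetical records, for illustration only.
players = [
    {"name": "Forward A", "bracket": "Center Forward", "score": 74.2, "minutes": 1450},
    {"name": "Forward B", "bracket": "Center Forward", "score": 78.9, "minutes": 410},
    {"name": "Forward C", "bracket": "Center Forward", "score": 91.0, "minutes": 1800},
    {"name": "Winger D", "bracket": "Wide Attacker", "score": 75.0, "minutes": 1600},
]

candidates = shortlist(players, "Center Forward", (70, 80), 630)
```

In this toy data only one forward passes all three conditions; the others fail on minutes, score range, or position bracket.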
After sifting through the shortlist, we narrow our search down to Ollie Watkins, Ludovic Ajorque, and Ihlas Bebou. All three players are in their peak age years for the position but play in different leagues. Ollie Watkins is rated as significantly overvalued, Bebou is rated within reason, and Ajorque is significantly undervalued. Below, we will dive into some of the strikers' metrics.
Ollie Watkins has the largest market value at $35,000,000. Both Bebou and Ajorque are valued far lower at $16,000,000 and $15,000,000, respectively. Watkins is labeled as "Significantly Overvalued", Bebou is "Within Reason", and Ajorque is "Significantly Undervalued".
- All players have very similar xG metrics. Watkins is right on par with his xG, scoring 5 goals. Bebou is slightly over performing his xG, scoring 7 goals. Ajorque is nearly doubling his xG, scoring 10 goals so far this season.
- Although Ajorque has slightly fewer shots per 90 than Bebou and Watkins, he is extremely accurate with his shooting. Over 70% of his shots are on target. This is likely a major factor in his high goal tally.
- Bebou is the least accurate shooter, finding the target around 40% of the time. Watkins is hitting the target just shy of 50%.
- Watkins is finding himself getting more touches in the penalty area than both Ajorque and Bebou. This is important, as chances in the penalty area will typically lead to better goal scoring opportunities.
- When it comes to efficiency in these dangerous areas, Watkins is not up to par. His goal conversion rate is below 15%, which is significantly lower than both Bebou and Ajorque.
- On the contrary, Ajorque is extremely efficient. His goal conversion rate of 34% indicates that 34% of his shots are resulting in a goal.
Although Ajorque will likely see his goal tally regress (a goal tally nearly double a player's xG is typically unsustainable long term), his goal conversion rate and shot accuracy are phenomenal. Combine this with the facts that he has an xG similar to Watkins and a much lower market value, and we begin to see why the club should tap him for further evaluation going forward. The cost difference between the players could be used to further bolster the squad, which in turn increases the chances of the club avoiding relegation.
In conclusion, the valuation labels generated provide accurate guidance when comparing the metrics of players within a given time frame against their market value. In the minority of cases where our valuation label did not fit the player's performance relative to their market value, diving into specific KPIs allowed us to disqualify the candidate.
Although we know market values are highly speculative, this system can provide a guiding hand when a team is deciding on whether on field performances warrant a specific market value. It is important to note that these labels should not be the only deciding factor when signing a player. To be most effective, this should be paired with various scouting reports on individual players.