Tennis Data Association Circuit Analysis
Introduction
The Association of Tennis Professionals (ATP) is the governing body for the men's tennis circuit. Data from tournaments is stored on the site https://www.atptour.com. The site contains a ranked list of the top men's players, spread across several pages. The list of the top 100 players is here: https://www.atptour.com/en/rankings/singles. The other players are listed on similar pages. This data could be useful in building models to predict the likelihood of a player winning a point, game, or match, but first the data has to be scraped.
The Scraper
The first part of the project was to build a scraper. The goal of the scraper is to get data for the top 500 ATP players. The scraper gets the link for each player and then requests the URL for that player's stats page. An example stats page for Novak Djokovic is: https://www.atptour.com/en/players/novak-djokovic/d643/player-stats. The stats page allows one to select a year and a surface, and for each selection, produces service and return stats for that player. After going to the player's stats page, the scraper gets the data for each year and for each surface. The scraper then goes to the next player on the list.
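The first step, turning the player links on a rankings page into stats-page URLs, can be sketched as follows. This is a minimal illustration rather than the scraper used for the project: the link pattern `/en/players/<slug>/<id>/...` is inferred from the Djokovic URL above, and the `sample_html` snippet and the `overview` suffix are assumptions about how the rankings page is marked up. Fetching the pages and selecting a year and surface are left out.

```python
import re
from html.parser import HTMLParser

BASE_URL = "https://www.atptour.com"
# Assumed link shape, inferred from the example stats URL: /en/players/<slug>/<id>/<page>
PLAYER_HREF = re.compile(r"^/en/players/([a-z0-9-]+)/([a-z0-9]+)/")


class PlayerLinkParser(HTMLParser):
    """Collects (slug, player_id) pairs from <a href="..."> tags, in page order."""

    def __init__(self):
        super().__init__()
        self.players = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = PLAYER_HREF.match(href)
        # Skip non-player links and players already seen on this page.
        if match and match.groups() not in self.players:
            self.players.append(match.groups())


def stats_urls(rankings_html):
    """Return one player-stats URL per distinct player linked in the HTML."""
    parser = PlayerLinkParser()
    parser.feed(rankings_html)
    return [
        f"{BASE_URL}/en/players/{slug}/{player_id}/player-stats"
        for slug, player_id in parser.players
    ]


# Hypothetical stand-in for a downloaded rankings page.
sample_html = """
<table>
  <tr><td><a href="/en/players/novak-djokovic/d643/overview">Novak Djokovic</a></td></tr>
  <tr><td><a href="/en/players/novak-djokovic/d643/overview">N. Djokovic</a></td></tr>
  <tr><td><a href="/en/rankings/singles">Rankings</a></td></tr>
</table>
"""

urls = stats_urls(sample_html)
print(urls)
```

In a real run, `rankings_html` would come from an HTTP request for each rankings page, and each returned URL would then be visited once per year and surface.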
Data
Before investigating the data, it was necessary to perform some preliminary cleaning. Some of the rows contained all N/As, since there was no data available for the player for a particular year and surface. I removed all of these rows. Some of the rows contained all 0s. These are also not valid, so I removed these as well. Finally, I converted the percent data to ratios.
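The three cleaning steps could look something like this in pandas. The column names and the sample rows are hypothetical, as is the assumption that missing values arrive as the string "N/A" and percentages as strings such as "64%"; the real scraped table may differ.

```python
import pandas as pd

# Hypothetical scraped rows; real headers and values may differ.
raw = pd.DataFrame({
    "player": ["A", "A", "B", "B"],
    "year": [2019, 2019, 2019, 2019],
    "surface": ["Clay", "Grass", "Clay", "Hard"],
    "aces": ["120", "N/A", "0", "85"],
    "first_serve_pct": ["64%", "N/A", "0%", "58%"],
})

id_cols = ["player", "year", "surface"]
stat_cols = [c for c in raw.columns if c not in id_cols]

# Strip "%" signs and convert to numbers; "N/A" becomes NaN.
stats = raw[stat_cols].apply(
    lambda col: pd.to_numeric(col.str.rstrip("%"), errors="coerce")
)

# Drop rows where every stat is missing or every stat is zero.
all_missing = stats.isna().all(axis=1)
all_zero = (stats == 0).all(axis=1)
clean = pd.concat([raw[id_cols], stats], axis=1)[~(all_missing | all_zero)].copy()

# Convert percentage columns (0-100) to ratios (0-1).
pct_cols = [c for c in stat_cols if c.endswith("_pct")]
clean[pct_cols] = clean[pct_cols] / 100

print(clean)
```

Only rows where all stats are missing or all are zero get dropped, so a row with a single genuine zero (for example, no aces in a short season) is kept.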
EDA
I performed some EDA on the data. The first plot shows the relationship between double faults and aces. We might expect that as a player hits more aces, he will also generate more double faults due to the risk of trying for an ace. The plot below seems to bear this out, as we see a positive correlation between number of aces and number of double faults:
The EDA also revealed differences between court surfaces. It is common knowledge that clay courts differ in significant ways from grass courts, for example. We would like to know if the data bears out this difference. The next plot shows the percentage of return games won broken down by surface.
As expected, there seems to be a difference between the surfaces. Players win more return games on clay than on the other surfaces. The hardest surfaces on which to win return games are grass and carpet.
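A numeric summary can back up a comparison like this alongside the plot. The sketch below groups a return-games-won ratio by surface. The column names are hypothetical and the values are made up purely to show the mechanics; they are not the scraped results.

```python
import pandas as pd

# Made-up illustrative rows, NOT the scraped data.
df = pd.DataFrame({
    "surface": ["Clay", "Clay", "Grass", "Grass", "Hard", "Hard"],
    "return_games_won": [0.30, 0.26, 0.18, 0.20, 0.24, 0.22],
})

# Mean, median, and row count per surface, highest mean first.
by_surface = (
    df.groupby("surface")["return_games_won"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)

print(by_surface)
```

Reporting the count next to the mean is useful here, since a surface with few rows (carpet is a likely candidate) gives a much less reliable average.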
Finally, I wanted to see whether there is an empirical relationship between the ability to win service points and the ability to win return points. We might expect that a player's overall tennis skill would make him more likely to win both service and return points. However, does the data bear this out?
There doesn't appear to be any strong relationship between the percentage of service points won and the percentage of return points won. This may point to there being two distinct tennis skills: service ability and return ability.
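One way to put a number on "no strong relationship" is the Pearson correlation coefficient, which ranges from -1 to 1 and is near 0 when there is little linear association. The sketch below computes it from scratch. The input values are made up for illustration and are not taken from the scraped data. The same function would apply to the aces and double faults comparison.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up ratios for five hypothetical players, NOT scraped values.
service_points_won = [0.60, 0.62, 0.64, 0.66, 0.68]
return_points_won = [0.38, 0.36, 0.40, 0.35, 0.39]

r = pearson_r(service_points_won, return_points_won)
print(f"r = {r:.3f}")
```

A coefficient near 0 only rules out a linear relationship, so it is best read together with the scatter plot.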
Data Model
To further explore the data, I decided to construct a simple logistic regression model to see if surface type can be predicted from the available data. I fit the model using 10-fold cross-validation. The model did not perform very well, though. The mean accuracy was ~55% with a standard deviation of ~3.25%. Using negative log loss, the mean score was -1.045 with a standard deviation of 0.037.
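A 10-fold cross-validated logistic regression of this kind can be sketched with scikit-learn as below. The post does not say which library or settings were used, so this setup is an assumption. The features and labels are random synthetic data, so the scores it prints are around chance level and say nothing about the tennis data. The four surface labels are also an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned serve/return stats (NOT real data).
rng = np.random.default_rng(0)
n_rows, n_features = 400, 6
X = rng.normal(size=(n_rows, n_features))
y = rng.choice(["Clay", "Grass", "Hard", "Carpet"], size=n_rows)

model = LogisticRegression(max_iter=1000)

# With an integer cv and a classifier, scikit-learn uses stratified folds.
accuracy = cross_val_score(model, X, y, cv=10, scoring="accuracy")
neg_log_loss = cross_val_score(model, X, y, cv=10, scoring="neg_log_loss")

print(f"accuracy:     mean={accuracy.mean():.3f} sd={accuracy.std():.3f}")
print(f"neg log loss: mean={neg_log_loss.mean():.3f} sd={neg_log_loss.std():.3f}")
```

As a reference point for reading log-loss values, a model that assigns equal probability to each of four classes scores -ln(4), about -1.386. Accuracy is best compared against the share of the most common surface in the data.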
Conclusion
I performed some basic EDA on the tennis data. There appears to be a relationship between serve and return statistics and surface type. There doesn't appear to be any correlation between the likelihood of winning a service point and the likelihood of winning a return point. Finally, a simple logistic regression model did not predict surface type very well.
In the future, more EDA can be performed to look at different questions. Also, better models can be constructed to try to improve prediction accuracy.