Visualizing tennis stats

Posted on Sep 20, 2019


The Association of Tennis Professionals (ATP) is the governing body for the men’s tennis circuit. Data from tournaments is stored on its site, which also contains a list of the top men’s players. The list is spread over several pages: the top 100 players appear on one page, and the other players are listed on similar pages. This data could be useful in building models to predict the likelihood of a player winning a point, game, or match, but first the data has to be scraped.

The Scraper

The first part of the project was to build a scraper that collects data for the top 500 ATP players. The scraper gets the link for each player and then requests the URL of that player’s stats page (Novak Djokovic’s stats page is one example). The stats page lets the user select a year and a surface and, for each selection, shows service and return stats for that player. After loading a player’s stats page, the scraper collects the data for each year and each surface, then moves on to the next player on the list.
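The crawl order described above (player, then year, then surface) can be sketched as a nested loop. This is only an illustration, not the scraper used in the project: `scrape_players`, `fetch_stats`, the player links, and the stat names are all hypothetical, and the page download and HTML parsing are hidden behind a stand-in function so the sketch runs offline.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical signature: takes (player_link, year, surface) and returns a dict
# of stats, or None when no data exists for that selection.
FetchStats = Callable[[str, int, str], Optional[Dict[str, float]]]


def scrape_players(
    player_links: Iterable[str],
    years: Iterable[int],
    surfaces: Iterable[str],
    fetch_stats: FetchStats,
) -> List[dict]:
    """Visit each player, then each year and surface, collecting one row per selection."""
    years = list(years)
    surfaces = list(surfaces)
    rows = []
    for link in player_links:
        for year in years:
            for surface in surfaces:
                stats = fetch_stats(link, year, surface)
                # Keep a row even when stats are missing, so gaps can be
                # handled later during cleaning.
                row = {"player_link": link, "year": year, "surface": surface}
                row.update(stats or {})
                rows.append(row)
    return rows


# Stand-in fetcher (assumption): a real one would request the stats page
# and parse the service and return tables out of the HTML.
def fake_fetch(link: str, year: int, surface: str) -> Optional[Dict[str, float]]:
    if surface == "Carpet":
        return None  # pretend there is no data for this selection
    return {"aces": 10.0, "double_faults": 3.0}


# Placeholder links, years, and surfaces, chosen only for the example.
rows = scrape_players(
    ["/players/a", "/players/b"], [2018, 2019], ["Clay", "Carpet"], fake_fetch
)
```

Passing the fetcher in as an argument keeps the looping logic separate from the network code, which makes it easy to try the loop on a couple of players before crawling all 500.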


Before investigating the data, it was necessary to perform some preliminary cleaning. Some rows contained only N/As because no data was available for that player in a particular year and on a particular surface, so I removed all of these rows. Other rows contained only 0s. These are not valid either, so I removed them as well. Finally, I converted the percent data to ratios.
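The three cleaning steps can be sketched with pandas roughly as follows. The column names and sample values are made up for illustration, and the sketch assumes the N/As have already been read in as missing values and that percentages are stored as numbers such as 64 rather than strings such as "64%".

```python
import pandas as pd

# Hypothetical sample of scraped rows; column names and values are assumptions.
raw = pd.DataFrame({
    "player": ["A", "A", "B", "B"],
    "year": [2018, 2018, 2018, 2018],
    "surface": ["Clay", "Grass", "Clay", "Hard"],
    "aces": [120, None, 0, 85],
    "first_serve_pct": [64, None, 0, 58],
    "return_games_won_pct": [31, None, 0, 22],
})


def clean_stats(df, stat_cols, pct_cols):
    """Drop rows whose stats are all missing or all zero, then scale percents to ratios."""
    stats = df[stat_cols]
    all_missing = stats.isna().all(axis=1)
    all_zero = (stats == 0).all(axis=1)
    out = df.loc[~(all_missing | all_zero)].copy()
    out[pct_cols] = out[pct_cols] / 100.0  # e.g. 64 -> 0.64
    return out.reset_index(drop=True)


stat_cols = ["aces", "first_serve_pct", "return_games_won_pct"]
pct_cols = ["first_serve_pct", "return_games_won_pct"]
cleaned = clean_stats(raw, stat_cols, pct_cols)
```

Only the stat columns are checked for the all-missing and all-zero conditions, since the player, year, and surface columns are always filled in.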


I performed some exploratory data analysis (EDA) on the data. The first plot shows the relationship between double faults and aces. We might expect that as a player hits more aces, he will also hit more double faults, because going for an ace means taking more risk on the serve. The plot seems to bear this out: there is a positive correlation between the number of aces and the number of double faults.
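One way to put a number on a relationship like this is the Pearson correlation coefficient. The sketch below computes it in plain Python; the function name and the ace and double fault counts are invented for the example and are not taken from the scraped data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up season totals for five hypothetical players.
aces = [50, 120, 200, 310, 400]
double_faults = [20, 35, 60, 70, 110]

r = pearson_r(aces, double_faults)
print(f"Pearson r = {r:.3f}")
```

A value near +1 indicates a strong positive linear relationship, a value near 0 indicates little or no linear relationship, and a value near -1 indicates a strong negative one. The same function could be applied to service points won versus return points won.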

The EDA also revealed differences between court surfaces. It is common knowledge that clay courts differ in significant ways from grass courts, for example. We would like to know if the data bears out this difference. The next plot shows the percentage of return games won, broken down by surface.

As expected, there seems to be a difference between the surfaces. Players win more return games on clay than on other surfaces. The hardest surfaces on which to win return games are grass and carpet.
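A per-surface comparison like this one boils down to a group-by. The sketch below shows the idea with pandas; the column names are assumptions and the ratios are invented for the example, so the output says nothing about the real data.

```python
import pandas as pd

# Invented ratios of return games won, one row per player-year-surface.
df = pd.DataFrame({
    "surface": ["Clay", "Clay", "Grass", "Grass", "Hard", "Hard"],
    "return_games_won": [0.30, 0.26, 0.18, 0.20, 0.22, 0.24],
})

# Mean ratio and number of rows per surface, highest mean first.
by_surface = (
    df.groupby("surface")["return_games_won"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(by_surface)
```

Including the row count next to the mean helps when one surface has far fewer rows than the others, since its mean is then less reliable.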

Finally, I wanted to see whether there is an empirical relationship between the ability to win service points and the ability to win return points. We might expect that a player’s overall tennis skills would make him more likely to win both service and return points. However, does the data bear this out?

There doesn’t appear to be any strong relationship between the percentage of service points won and the percentage of return points won. This may point to there being two distinct tennis skills: service ability and return ability.


To further explore the data, I decided to construct a simple logistic regression model to see if surface type can be predicted from the available data. I evaluated the model with 10-fold cross-validation. The model did not perform very well, though. The mean accuracy was ~55% with a standard deviation of ~3.25%. Using negative log loss, the mean score was -1.045 with a standard deviation of 0.037.
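An evaluation of this kind could be set up with scikit-learn along the following lines. This is a sketch under several assumptions rather than the model from the project: the features are synthetic random numbers, there are three surface labels, and the feature scaling, stratified folds, and fixed random seeds are choices made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned stats: 3 numeric features per row,
# with a small shift per surface so the classes overlap heavily.
rng = np.random.default_rng(0)
n_per_class = 60
surfaces = ["Clay", "Grass", "Hard"]
X = np.vstack([
    rng.normal(loc=shift, scale=1.0, size=(n_per_class, 3))
    for shift in (0.0, 0.5, 1.0)
])
y = np.repeat(surfaces, n_per_class)

# Scale the features, then fit a logistic regression on the surface labels.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 10 folds, each keeping roughly the same mix of surfaces.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
neg_log_loss = cross_val_score(model, X, y, cv=cv, scoring="neg_log_loss")

print(f"accuracy: mean={accuracy.mean():.3f}, sd={accuracy.std():.3f}")
print(f"neg log loss: mean={neg_log_loss.mean():.3f}, sd={neg_log_loss.std():.3f}")
```

With the `neg_log_loss` scorer, values closer to 0 are better. A useful reference point is a model that assigns equal probability to each of k classes, which scores -ln(k), so about -1.10 for three classes and about -1.39 for four.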


I performed some basic EDA on the tennis data. There appears to be a relationship between serve and return statistics and surface type. There doesn’t appear to be any correlation between the likelihood of winning a service point and the likelihood of winning a return point. Finally, a simple logistic regression model did not predict surface type very well.

In the future, more EDA can be performed to look at different questions. Also, better models can be constructed to try to improve prediction accuracy.




