Tennis Data Association Circuit Analysis

Posted on Sep 20, 2019

Introduction

The Association of Tennis Professionals (ATP) is the governing body for the men’s tennis circuit. Tournament data is published on its site, https://www.atptour.com, which also lists the top men’s players. The rankings span several pages: the top 100 players are at https://www.atptour.com/en/rankings/singles, and the remaining players are listed on similar pages. This data could be useful for building models that predict the likelihood of a player winning a point, game, or match, but first it has to be scraped.

The Scraper

The first part of the project was to build a scraper that collects data for the top 500 ATP players. For each player on the rankings list, the scraper takes the player’s link and requests that player’s stats page. For example, Novak Djokovic’s stats page is https://www.atptour.com/en/players/novak-djokovic/d643/player-stats. The stats page lets the user select a year and a surface and, for each selection, shows that player’s service and return stats. The scraper collects the data for every year and every surface, then moves on to the next player on the list.
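
The player, year, and surface loop described above can be sketched as follows. This is only an illustration, not the project's actual scraper: the profile link shape (ending in /overview), the query parameter names year and surfaceType, and the surface labels are all assumptions that would need to be checked against the live site.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "https://www.atptour.com"

# Assumed surface labels; verify them against the stats page's dropdown.
SURFACES = ("clay", "grass", "hard", "carpet")


def stats_url(profile_path, year, surface):
    """Build a player-stats URL from a player's profile link."""
    # Assumes profile links look like /en/players/<name>/<id>/overview.
    player_root = profile_path.rsplit("/", 1)[0]
    # "year" and "surfaceType" are assumed query parameter names.
    query = urlencode({"year": year, "surfaceType": surface})
    return f"{BASE_URL}{player_root}/player-stats?{query}"


def iter_stats_urls(profile_paths, years, surfaces=SURFACES):
    """Yield (profile_path, year, surface, url) for every combination."""
    # Players vary slowest, so one player is finished before the next starts.
    for path, year, surface in product(profile_paths, years, surfaces):
        yield path, year, surface, stats_url(path, year, surface)


if __name__ == "__main__":
    paths = ["/en/players/novak-djokovic/d643/overview"]
    for _, year, surface, url in iter_stats_urls(paths, [2018, 2019]):
        print(year, surface, url)
```

Each generated URL would then be fetched and parsed with whichever HTTP client or scraping framework the project uses; that part is omitted here because it depends on the page's markup.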

Data

Before investigating the data, I had to do some preliminary cleaning. Some rows contained only N/A values because no data was available for that player, year, and surface; I removed these rows. Other rows contained only zeros. These are not valid either, so I removed them as well. Finally, I converted the percentage values to ratios.
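
A minimal pandas sketch of these three cleaning steps is shown below. The column names and the "65%" string format are hypothetical, since the post does not show the scraped table; the function simply treats every non-identifier column as a statistic.

```python
import pandas as pd


def clean_stats(df, id_cols, pct_cols):
    """Drop all-N/A and all-zero rows, then turn percentages into ratios.

    id_cols  : identifier columns (e.g. player, year, surface); hypothetical names
    pct_cols : columns holding percentages, assumed to look like "65%" or 65
    """
    stat_cols = [c for c in df.columns if c not in id_cols]
    raw = df[stat_cols].copy()

    # Strip a trailing percent sign so the values can be parsed as numbers.
    for col in pct_cols:
        raw[col] = raw[col].astype(str).str.rstrip("%")

    # Anything that is not a number (such as "N/A") becomes NaN.
    stats = raw.apply(pd.to_numeric, errors="coerce")

    all_missing = stats.isna().all(axis=1)
    all_zero = (stats == 0).all(axis=1)

    # Convert percentages (0-100) to ratios (0-1).
    stats[pct_cols] = stats[pct_cols] / 100.0

    cleaned = pd.concat([df[id_cols], stats], axis=1)
    return cleaned[~(all_missing | all_zero)].reset_index(drop=True)


if __name__ == "__main__":
    # Made-up rows, only to exercise the function.
    toy = pd.DataFrame({
        "player": ["A", "A", "A"],
        "surface": ["clay", "grass", "hard"],
        "aces": [10, "N/A", 0],
        "first_serve_pct": ["65%", "N/A", "0%"],
    })
    print(clean_stats(toy, ["player", "surface"], ["first_serve_pct"]))
```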

EDA

I then performed some exploratory data analysis (EDA). The first plot shows the relationship between double faults and aces. We might expect that a player who hits more aces will also hit more double faults, because going for an ace means taking more risk on the serve. The plot below seems to bear this out: there is a positive correlation between the number of aces and the number of double faults.

[Figure: number of double faults versus number of aces]

The EDA also looked at differences between playing surfaces. It is common knowledge that clay courts, for example, differ significantly from grass courts, and we would like to know whether the data bears this out. The next plot shows the percentage of return games won, broken down by surface.

[Figure: percentage of return games won, by surface]

As expected, there seems to be a difference between the surfaces. Players win more return games on clay than on other surfaces. Grass and carpet are the hardest surfaces on which to win return games.
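
The surface comparison can also be summarized numerically. The sketch below groups a cleaned table by surface and reports the mean, median, and row count of the return-games-won ratio; the column names surface and return_games_won are hypothetical, and the toy values are made up purely to show the output shape.

```python
import pandas as pd


def return_games_by_surface(df, surface_col="surface",
                            value_col="return_games_won"):
    """Summarize the return-games-won ratio for each surface."""
    summary = df.groupby(surface_col)[value_col].agg(["mean", "median", "count"])
    # Put the surface with the highest average first.
    return summary.sort_values("mean", ascending=False)


if __name__ == "__main__":
    # Made-up values, not results from the scraped data.
    toy = pd.DataFrame({
        "surface": ["clay", "clay", "grass", "grass"],
        "return_games_won": [0.30, 0.20, 0.10, 0.20],
    })
    print(return_games_by_surface(toy))
```

Reporting the count next to the mean matters here because a surface with few rows can produce a noisy average.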

Finally, I wanted to see whether there is an empirical relationship between the ability to win service points and the ability to win return points. We might expect that a player’s overall tennis skill would make him more likely to win both service and return points. Does the data bear this out?

[Figure: percentage of service points won versus percentage of return points won]

There doesn’t appear to be any strong relationship between the percentage of service points won and the percentage of return points won. This may point to there being two distinct tennis skills: service ability and return ability.
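
One way to put a number on this visual impression is a Pearson correlation coefficient between the two columns. The sketch below assumes hypothetical column names service_points_won and return_points_won; a value near 0 would be consistent with the plot, while a value near 1 or -1 would contradict it.

```python
import pandas as pd


def serve_return_correlation(df, serve_col="service_points_won",
                             return_col="return_points_won"):
    """Pearson correlation between service and return point ratios."""
    # Use only rows where both values are present.
    pair = df[[serve_col, return_col]].dropna()
    return pair[serve_col].corr(pair[return_col])  # Pearson by default


if __name__ == "__main__":
    # Made-up values that lie on a straight line, so r is (numerically) 1.
    toy = pd.DataFrame({
        "service_points_won": [0.60, 0.65, 0.70, None],
        "return_points_won": [0.35, 0.40, 0.45, 0.50],
    })
    print(serve_return_correlation(toy))
```

Pearson correlation only measures linear association, so a rank-based alternative such as Spearman (method="spearman" in pandas) may be worth checking as well.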

Data Model

To explore the data further, I built a simple logistic regression model to see whether surface type can be predicted from the available statistics, and evaluated it with 10-fold cross-validation. The model did not perform very well: the mean accuracy was ~55% with a standard deviation of ~3.25%. Scored by negative log loss, the mean was -1.045 with a standard deviation of 0.037.
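
A hedged sketch of this kind of evaluation with scikit-learn follows. The post does not say how the features were prepared or how the folds were formed, so the standardization step, the stratified shuffled folds, and the max_iter setting are assumptions. The demo data is random noise, so its scores say nothing about the tennis data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_surface_model(X, y, n_splits=10, seed=0):
    """Cross-validate a logistic regression that predicts surface type.

    Returns {metric: (mean, std)} for accuracy and negative log loss.
    """
    # Scaling inside the pipeline keeps each fold's test data out of the fit.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # Stratified folds keep every surface represented in each training split.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

    results = {}
    for metric in ("accuracy", "neg_log_loss"):
        scores = cross_val_score(model, X, y, cv=cv, scoring=metric)
        results[metric] = (float(np.mean(scores)), float(np.std(scores)))
    return results


if __name__ == "__main__":
    # Synthetic noise features with four balanced classes, for illustration only.
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(200, 5))
    y_demo = np.repeat(["clay", "grass", "hard", "carpet"], 50)
    print(evaluate_surface_model(X_demo, y_demo))
```

Comparing these scores against a majority-class baseline (for example scikit-learn's DummyClassifier) would show how much the serve and return features actually contribute.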

Conclusion

I performed some basic EDA on the tennis data. Serve and return statistics appear to be related to surface type. There doesn’t appear to be any strong correlation between the likelihood of winning a service point and the likelihood of winning a return point. Finally, a simple logistic regression model did not predict surface type very well.

In the future, more EDA can be performed to look at different questions. Also, better models can be constructed to try to improve prediction accuracy.
