Data Analysis of WTA Player Styles, 2012-2020

Posted on Dec 15, 2020


Naomi Osaka | Getty Images

Data Science Introduction

The Women's Tennis Association (WTA) features great diversity among its players, from their national origin to how they approach the game of tennis. A common, albeit non-scientific, belief among fans and commentators holds that an aggressive style of tennis is what sets the all-time great players apart. Looking through a list of famous names--Serena Williams, Martina Navratilova, Steffi Graf, Chris Evert--there seems to be ample anecdotal evidence for this assertion. This post uses match data to analyze WTA player styles from 2012 to 2020.

Another reasonable expectation is that the most successful players make relatively few errors over the course of a match. There is additional anecdotal evidence for this, as Caroline Wozniacki, Angelique Kerber, and Simona Halep--players who are lauded in particular for their consistency--have all been ranked #1 in the world in recent seasons.

The goal of this project is to apply a scientific approach to player style classification and to explore whether there are indeed measurable associations between playing style and success on the WTA tour.

I created an interactive Shiny app that allows users to display various pieces of information related to player style by year, including head-to-head win percentages between different styles and a detailed look at the playing styles of Grand Slam tournament winners. I hope to offer insights that could be of use to both tennis players and instructors/coaches, in terms of which approaches to the game can offer the highest chance of success and in terms of what improvements can be made by players of specific styles.

Using Data to Analyze: Quantifying Player Style

The data used in this project is made available by GitHub user JeffSackmann, located here and here. The data contains detailed information about WTA matches dating back to the 1970s. To control for the effects that changes in racket technology and court surface have had on playing style, I restricted my analysis to matches from 2012-2020.

The most important features for the purposes of this project were the score of each match, the number of winners struck by each player per match, and the number of unforced errors committed by each player per match. Inspired by the conventional tennis wisdom mentioned in the introduction, I engineered two new features, aggression and consistency, which I defined in the following ways:

Intuitively, aggression measures how frequently a player ends a point on her own terms, either by hitting a winner or making an error. Consistency measures how successfully a player "keeps the ball inside the court". Since a player's style can change over time, I did not want these figures to be static. Hence, aggression and consistency are calculated separately from year to year.
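
The exact formulas are not reproduced in this text, so the following Python sketch only illustrates one plausible reading of the intuition above. It assumes aggression is the share of points a player ends with a winner or an unforced error, and consistency is the ratio of winners to unforced errors; both definitions, the sample rows, and the function name are assumptions made for illustration, not the project's actual code.

```python
from collections import defaultdict

# Hypothetical per-match rows: (year, player, winners, unforced_errors, points_played).
matches = [
    (2019, "Player A", 30, 25, 140),
    (2019, "Player A", 22, 18, 110),
    (2019, "Player B", 12, 10, 150),
    (2018, "Player A", 20, 30, 125),
]

def yearly_style_features(rows):
    """Sum match stats per (year, player), then derive the two features."""
    totals = defaultdict(lambda: [0, 0, 0])  # winners, unforced errors, points
    for year, player, winners, errors, points in rows:
        t = totals[(year, player)]
        t[0] += winners
        t[1] += errors
        t[2] += points

    features = {}
    for key, (winners, errors, points) in totals.items():
        # ASSUMED definitions (the original formulas are not shown in the text):
        # aggression  = points ended by the player's own winner or error / points played
        # consistency = winners / unforced errors
        aggression = (winners + errors) / points
        consistency = winners / errors if errors else float("inf")
        features[key] = {"aggression": aggression, "consistency": consistency}
    return features

features = yearly_style_features(matches)
for (year, player), values in sorted(features.items()):
    print(year, player, round(values["aggression"], 3), round(values["consistency"], 3))
```

Keying the totals by (year, player) is what makes the figures non-static: the same player gets a separate aggression and consistency value for each season.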

Finally, to separate players into discrete categories, I imposed specific cutoffs: In a given year, a player is considered aggressive if her aggression is above the mean for that year and is defensive otherwise. A player is considered consistent if her consistency is above the median for that year and is inconsistent otherwise. I chose the mean as the cutoff for aggressive vs. defensive, as the distribution of players' aggression tends to be symmetric. Consistency, on the other hand, tends to have a right-skewed distribution, so I selected the median as the cutoff point. The distributions of aggression and consistency for the year 2019 are visualized below.

[Figures: distributions of aggression and consistency, 2019]
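
The cutoff rule described above (mean for aggression, median for consistency, applied within a single year) can be sketched in a few lines of Python. The player names and feature values here are made up for illustration, and `classify_styles` is a hypothetical helper, not a function from the project.

```python
from statistics import mean, median

# Hypothetical (aggression, consistency) values for the players of one year.
season = {
    "Player A": (0.38, 1.21),
    "Player B": (0.15, 1.20),
    "Player C": (0.30, 0.70),
    "Player D": (0.25, 2.40),
}

def classify_styles(players):
    """Label each player using that year's mean aggression and median consistency."""
    aggression_cutoff = mean(a for a, _ in players.values())
    consistency_cutoff = median(c for _, c in players.values())
    styles = {}
    for name, (aggression, consistency) in players.items():
        first = "aggressive" if aggression > aggression_cutoff else "defensive"
        second = "consistent" if consistency > consistency_cutoff else "inconsistent"
        styles[name] = f"{first} {second}"
    return styles

styles = classify_styles(season)
for name, style in styles.items():
    print(name, "->", style)
```

Because the cutoffs are recomputed from each year's players, the same raw aggression value can fall on different sides of the line in different seasons.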


The analysis of the available data revealed that consistency was associated with various measures of success. In the years studied, 29 Grand Slams were won by consistent players, while only three were won by inconsistent players. The scatter plots below show the locations of Grand Slam champions amid regions of different player styles for 2019 and 2018, respectively.

[Figures: scatter plots of player styles with Grand Slam champions marked, 2019 and 2018]

Consistency was also associated with positive outcomes in terms of win percentage. Consistent players nearly always had a positive overall win percentage; the only exception to this was 2012, when defensive consistent players won under 50% of total games. In terms of matchups between styles, consistent players, whether aggressive or defensive, won a majority of games against inconsistent players in every year studied. Inconsistent players, both aggressive and defensive, never had a win percentage above 50%. The overall and head-to-head win percentage figures for 2019 are shown below.

[Figure: overall and head-to-head win percentages by player style, 2019]
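
As a rough illustration of how head-to-head figures like these could be tallied, the Python sketch below sums games won between each ordered pair of styles and converts the totals to percentages. The rows, the counts, and the `head_to_head` helper are hypothetical; the project's own aggregation may differ.

```python
from collections import defaultdict

# Hypothetical rows for one year: (style_1, style_2, games_won_1, games_won_2).
results = [
    ("aggressive consistent", "defensive inconsistent", 12, 7),
    ("defensive inconsistent", "aggressive consistent", 9, 13),
    ("aggressive consistent", "defensive consistent", 10, 12),
]

def head_to_head(rows):
    """Return {(style, opponent_style): percentage of games won by style}."""
    won = defaultdict(int)
    played = defaultdict(int)
    for style_1, style_2, games_1, games_2 in rows:
        if style_1 == style_2:
            continue  # same-style matches carry no head-to-head information
        total = games_1 + games_2
        won[(style_1, style_2)] += games_1
        won[(style_2, style_1)] += games_2
        played[(style_1, style_2)] += total
        played[(style_2, style_1)] += total
    return {pair: 100 * won[pair] / played[pair] for pair in played}

percentages = head_to_head(results)
for (style, opponent), pct in sorted(percentages.items()):
    print(f"{style} vs {opponent}: {pct:.1f}%")
```

Each matchup is recorded from both sides, so the percentage for a pair and the percentage for its reverse always sum to 100.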

To look specifically at the results of Grand Slam winners over time, I created the following visualizations to clearly show the association between percentage of games won, success at the Grand Slam level, and changes in playing style. The charts below show the percentage of games won and the playing style for Sloane Stephens and Angelique Kerber, respectively. Note that these players tended to win Grand Slams and have higher overall win percentages in years when they were classified as consistent. In the Shiny app, similar plots can be found for fourteen other Grand Slam winners.

[Figures: percentage of games won and playing style by year for Sloane Stephens and Angelique Kerber]

Aggression was not as strongly associated with success, either in terms of win percentage or in terms of Grand Slams won. While aggressive players did hold an edge in the number of Grand Slams won--nineteen for aggressive players versus thirteen for defensive--these figures were heavily skewed by the presence of Serena Williams, who won ten Grand Slams in the time period studied and was almost always classified as aggressive. Without Serena Williams, defensive players actually led in Grand Slams won, twelve to ten.
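
The arithmetic behind that comparison can be checked directly from the counts quoted above. The split of Serena Williams's ten titles between the two classifications is not stated in the text; it is inferred here from the before and after totals.

```python
# Counts quoted in the text.
aggressive_total, defensive_total = 19, 13       # all Grand Slams in the data
aggressive_without, defensive_without = 10, 12   # excluding Serena Williams
serena_total = 10

# Inferred split of her titles by classification (not stated explicitly in the text).
serena_as_aggressive = aggressive_total - aggressive_without
serena_as_defensive = defensive_total - defensive_without

# The inferred split must account for all of her titles.
assert serena_as_aggressive + serena_as_defensive == serena_total
print(serena_as_aggressive, serena_as_defensive)
```

The totals are mutually consistent only if all but one of her titles came in seasons classified as aggressive, which matches the "almost always" wording.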

Interestingly, aggression did seem to be a significant determining factor in matchups between players of the same consistency classification: In all years except 2013 and 2020, aggressive consistent players won at least 50% of games against defensive consistent players. On the other hand, in each year after 2014, defensive inconsistent players had a winning record against aggressive inconsistent players.

Finally, for different player styles, I examined the correlations between win percentage and four other figures: aggression, consistency, the mean number of unforced errors per game, and the mean number of winners per game. This revealed several interesting trends. Committing more errors, which ostensibly should put a player at a disadvantage, does not seem to negatively impact defensive players at all. For defensive inconsistent players, a higher rate of errors was actually slightly positively correlated with win percent. Players of all styles benefited from hitting more winners and from playing more consistently, but only defensive players saw a general increase in win percentage as aggression increased.

[Figure: correlations between win percentage and aggression, consistency, unforced errors per game, and winners per game, by player style]
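
A per-style correlation analysis of this kind could be sketched as follows. The text does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the player-season rows, field names, and helper functions are invented purely for illustration.

```python
from math import sqrt

METRICS = ("aggression", "consistency", "errors_per_game", "winners_per_game")

# Hypothetical player-season rows; the values are made up.
rows = [
    {"style": "defensive consistent", "win_pct": 52.0, "aggression": 0.20,
     "consistency": 1.30, "errors_per_game": 1.1, "winners_per_game": 1.0},
    {"style": "defensive consistent", "win_pct": 55.0, "aggression": 0.22,
     "consistency": 1.50, "errors_per_game": 1.2, "winners_per_game": 1.3},
    {"style": "defensive consistent", "win_pct": 58.0, "aggression": 0.24,
     "consistency": 1.40, "errors_per_game": 1.0, "winners_per_game": 1.4},
    {"style": "aggressive consistent", "win_pct": 60.0, "aggression": 0.30,
     "consistency": 1.60, "errors_per_game": 1.5, "winners_per_game": 2.0},
    {"style": "aggressive consistent", "win_pct": 56.0, "aggression": 0.32,
     "consistency": 1.45, "errors_per_game": 1.8, "winners_per_game": 1.9},
    {"style": "aggressive consistent", "win_pct": 52.0, "aggression": 0.34,
     "consistency": 1.40, "errors_per_game": 1.9, "winners_per_game": 1.7},
]

def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlations_by_style(data):
    """Correlate each metric with win percentage separately within each style."""
    groups = {}
    for row in data:
        groups.setdefault(row["style"], []).append(row)
    return {
        style: {m: pearson([r[m] for r in group], [r["win_pct"] for r in group])
                for m in METRICS}
        for style, group in groups.items()
    }

correlations = correlations_by_style(rows)
for style, values in correlations.items():
    print(style, {metric: round(r, 2) for metric, r in values.items()})
```

Grouping by style before correlating is the important step: a metric such as aggression can move with win percentage in one group and against it in another, which a single pooled correlation would hide.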

Conclusions and Insights

The conventional wisdom that served as the primary inspiration for this project--that aggression and consistency are keys to success in professional tennis--is at least partially supported by the data. Most notably, win percentage and Grand Slam victories were dominated by consistent players. It is conceivable that aggression was a more significant factor in the past, but for the years studied in this project, it was not a strong predictor of success in terms of either Grand Slam victories or percentage of games won.

The main predictive value of aggression was in determining success in matches between players of the same consistency classification: Aggressive players performed better against defensive ones when both players were consistent, while defensive players outperformed aggressive ones when both players were inconsistent. As a result, I would recommend that consistent players make an effort to play aggressively, while inconsistent players could benefit from strengthening their defensive skills.

Insights derived from the correlation graph suggest that it is possible for aggressive players to be "too aggressive," as there is actually a negative correlation for aggressive players between aggression and win percent. As such, aggressive players should be mindful of the advantage of playing with some level of restraint and control. It is also important for aggressive players to make an effort to keep their unforced error count low.

For defensive consistent players, there is a relatively low correlation between win percent and every other quantity considered, so for a player who seeks dependable, positive results, this playing style may be a good approach. This is reflected in real-life results, particularly in the case of Caroline Wozniacki. She was classified in this project as defensive consistent in seven out of nine years, and had been a perpetual fixture near the top of the WTA rankings before her retirement in 2020, while perhaps lagging a step behind the very top aggressive players like Serena Williams. Finally, it appears quite important for inconsistent players to make efforts to increase their consistency, as well as their winner count.

Below are links to my GitHub repository for this project, as well as the Shiny App.

Shiny App

GitHub repository
