Data Analysis of WTA Player Styles, 2012-2020
Naomi Osaka | Getty Images
Data Science Introduction
The Women's Tennis Association (WTA) features great diversity among its players, from their national origin to how they approach the game of tennis. Common, albeit non-scientific, wisdom shared by many fans and commentators holds that an aggressive style of tennis is what sets the all-time great players apart. Looking through a list of famous names--Serena Williams, Martina Navratilova, Steffi Graf, Chris Evert--there seems to be ample anecdotal evidence for this assertion. This post uses match data to analyze WTA player styles from 2012 to 2020.
Another reasonable expectation is that the most successful players make relatively few errors over the course of a match. There is additional anecdotal evidence for this, as Caroline Wozniacki, Angelique Kerber, and Simona Halep--players who are lauded in particular for their consistency--have all been ranked #1 in the world in recent seasons.
The goal of this project is to apply a scientific approach to player style classification and to explore whether there are indeed measurable associations between playing style and success on the WTA tour.
I created an interactive Shiny app that allows users to display various pieces of information related to player style by year, including head-to-head win percentages between different styles and a detailed look at the playing styles of Grand Slam tournament winners. I hope to offer insights that could be of use to both tennis players and instructors/coaches, in terms of which approaches to the game can offer the highest chance of success and in terms of what improvements can be made by players of specific styles.
Using Data to Analyze: Quantifying Player Style
The data used in this project is made available by GitHub user JeffSackmann, located here and here. The data contains detailed information about WTA matches dating back to the 1970s. To control for the effects that changes in racket technology and court surface have had on playing style, I restricted my analysis to matches from 2012-2020.
The most important features for the purposes of this project were the score of each match, the number of winners struck by each player per match, and the number of unforced errors committed by each player per match. Inspired by the conventional tennis wisdom mentioned in the introduction, I engineered two new features, aggression and consistency, which I defined in the following ways:
Intuitively, aggression measures how frequently a player ends a point on her own terms, either by hitting a winner or making an error. Consistency measures how successfully a player "keeps the ball inside the court". Since a player's style can change over time, I did not want these figures to be static. Hence, aggression and consistency are calculated separately from year to year.
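The intuition above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the exact formulas used in the project: it assumes aggression is the share of points a player ends herself (winners plus unforced errors, divided by points played) and consistency is the share of points she does not lose to her own unforced error. Other readings, such as a winners-to-errors ratio, would also fit the description. The row layout, player names, and numbers are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-match rows: (year, player, winners, unforced_errors, points_played).
# Field names and values are illustrative, not taken from the source data.
matches = [
    (2019, "Player A", 30, 25, 120),
    (2019, "Player A", 20, 15, 80),
    (2019, "Player B", 10, 8, 100),
    (2018, "Player A", 12, 18, 100),
]


def style_features(rows):
    """Total the counts per (year, player), then compute the two assumed features."""
    totals = defaultdict(lambda: [0, 0, 0])  # winners, unforced errors, points
    for year, player, winners, errors, points in rows:
        entry = totals[(year, player)]
        entry[0] += winners
        entry[1] += errors
        entry[2] += points

    features = {}
    for key, (winners, errors, points) in totals.items():
        # Assumed definition: share of points the player ends on her own racket.
        aggression = (winners + errors) / points
        # Assumed definition: share of points not lost to her own unforced error.
        consistency = 1 - errors / points
        features[key] = (aggression, consistency)
    return features


features = style_features(matches)
for (year, player), (aggression, consistency) in sorted(features.items()):
    print(year, player, round(aggression, 3), round(consistency, 3))
```

Keying the totals by (year, player) is what keeps the figures from being static: the same player gets a separate pair of values for each season.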
Finally, to separate players into discrete categories, I imposed specific cutoffs: In a given year, a player is considered aggressive if her aggression is above the mean for that year and is defensive otherwise. A player is considered consistent if her consistency is above the median for that year and is inconsistent otherwise. I chose the mean as the cutoff for aggressive vs. defensive because the distribution of players' aggression tends to be symmetric. Consistency, on the other hand, tends to have a right-skewed distribution, so I selected the median as the cutoff point. The distributions of aggression and consistency for the year 2019 are visualized below.
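The cutoff rule described above can be sketched as follows. The function handles one year at a time, using the mean of aggression and the median of consistency for that year as the thresholds. The player names and feature values are hypothetical.

```python
from statistics import mean, median

# Hypothetical (aggression, consistency) values for the players in one year.
season = {
    "Player A": (0.40, 0.90),
    "Player B": (0.20, 0.85),
    "Player C": (0.30, 0.70),
    "Player D": (0.34, 0.80),
}


def classify(season):
    """Label each player using that year's mean (aggression) and median (consistency)."""
    aggression_cutoff = mean(a for a, _ in season.values())
    consistency_cutoff = median(c for _, c in season.values())
    labels = {}
    for player, (aggression, consistency) in season.items():
        style = "aggressive" if aggression > aggression_cutoff else "defensive"
        steadiness = "consistent" if consistency > consistency_cutoff else "inconsistent"
        labels[player] = f"{style} {steadiness}"
    return labels


labels = classify(season)
for player, label in labels.items():
    print(player, "->", label)
```

Because the cutoffs are recomputed for every year, a player whose numbers stay the same can still change category if the rest of the tour shifts around her.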
The analysis of the available data revealed that consistency was associated with various measures of success. In the years studied, twenty-nine Grand Slams were won by consistent players, while only three were won by inconsistent players. The scatter plots below show the locations of Grand Slam champions amid regions of different player styles for 2019 and 2018, respectively.
Consistency was also associated with positive outcomes in terms of win percentage. Consistent players nearly always had a positive overall win percentage; the only exception to this was 2012, when defensive consistent players won under 50% of total games. In terms of matchups between styles, consistent players, whether aggressive or defensive, won a majority of games against inconsistent players in every year studied. Inconsistent players, both aggressive and defensive, never had a win percentage above 50%. The overall and head-to-head win percentage figures for 2019 are shown below.
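As a rough illustration of how head-to-head figures like these could be tabulated, the sketch below adds up games won and games played for every ordered pair of styles. It assumes each match has already been reduced to the two players' style labels and their game counts (for example, parsed from the match score). The row layout and numbers are hypothetical.

```python
from collections import defaultdict

# Hypothetical matches from one year: (style_a, games_won_a, style_b, games_won_b).
matches = [
    ("aggressive consistent", 12, "defensive inconsistent", 5),
    ("defensive inconsistent", 13, "aggressive consistent", 11),
    ("defensive consistent", 12, "aggressive inconsistent", 8),
]


def head_to_head(matches):
    """Return {(style, opponent_style): percentage of games won by style}."""
    won = defaultdict(int)
    played = defaultdict(int)
    for style_a, games_a, style_b, games_b in matches:
        if style_a == style_b:
            continue  # same-style matches say nothing about a matchup between styles
        total = games_a + games_b
        won[(style_a, style_b)] += games_a
        won[(style_b, style_a)] += games_b
        played[(style_a, style_b)] += total
        played[(style_b, style_a)] += total
    return {pair: 100 * won[pair] / played[pair] for pair in played}


table = head_to_head(matches)
for (style, opponent), pct in sorted(table.items()):
    print(f"{style} vs {opponent}: {pct:.1f}% of games won")
```

Counting games rather than matches follows the wording above ("won a majority of games"). A match-level version would only need to replace the game counts with 1 for the winner and 0 for the loser.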
To look specifically at the results of Grand Slam winners over time, I created the following visualizations to clearly show the association between percentage of games won, success at the Grand Slam level, and changes in playing style. The charts below show the percentage of games won and the playing style for Sloane Stephens and Angelique Kerber, respectively. Note that these players tended to win Grand Slams and have higher overall win percentages in years when they were classified as consistent. In the Shiny app, similar plots can be found for fourteen other Grand Slam winners.
Aggression was not as strongly associated with success, either in terms of win percentage or in terms of Grand Slams won. While aggressive players did hold an edge in the number of Grand Slams won--nineteen for aggressive players versus thirteen for defensive--these figures were heavily skewed by Serena Williams, who won ten Grand Slams in the period studied and was almost always classified as aggressive. These totals imply that nine of her ten titles came in seasons when she was classified as aggressive and one in a season when she was classified as defensive, so without Serena Williams, defensive players actually led in Grand Slams won, twelve to ten.
Interestingly, aggression did seem to be a significant determining factor in matchups between players of the same consistency classification: In all years except 2013 and 2020, aggressive consistent players won at least 50% of games against defensive consistent players. On the other hand, in each year after 2014, defensive inconsistent players had a winning record against aggressive inconsistent players.
Finally, for different player styles, I examined the correlations between win percentage and four other figures: aggression, consistency, the mean number of unforced errors per game, and the mean number of winners per game. This revealed several interesting trends. Committing more errors, which ostensibly should put a player at a disadvantage, does not seem to negatively impact defensive players at all. For defensive inconsistent players, a higher rate of errors was actually slightly positively correlated with win percent. Players of all styles benefited from hitting more winners and from playing more consistently, but only defensive players saw a general increase in win percentage as aggression increased.
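A per-style correlation of this kind can be sketched as below. The text does not say which correlation coefficient was used, so this example assumes the Pearson coefficient. It groups hypothetical player-seasons by style and correlates one feature (aggression) with win percentage. All rows are made up for illustration and are deliberately perfectly linear so the result is easy to check by eye.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical player-seasons: (style, aggression, win_percentage).
rows = [
    ("aggressive consistent", 0.40, 58.0),
    ("aggressive consistent", 0.45, 55.0),
    ("aggressive consistent", 0.50, 52.0),
    ("defensive consistent", 0.20, 50.0),
    ("defensive consistent", 0.25, 52.0),
    ("defensive consistent", 0.30, 54.0),
]


def correlation_by_style(rows):
    """Correlate the feature with win percentage separately within each style."""
    grouped = defaultdict(lambda: ([], []))
    for style, feature, win_pct in rows:
        grouped[style][0].append(feature)
        grouped[style][1].append(win_pct)
    return {style: pearson(xs, ys) for style, (xs, ys) in grouped.items()}


correlations = correlation_by_style(rows)
for style, r in correlations.items():
    print(f"{style}: r = {r:+.2f}")
```

Computing the coefficient within each style, rather than across all players at once, is what allows the sign to differ between groups, as it does for aggression in the findings above.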
Conclusions and Insights
The conventional wisdom that served as the primary inspiration for this project--that aggression and consistency are keys to success in professional tennis--is at least partially supported by the data. Most notably, win percentage and Grand Slam victories were dominated by consistent players. It is conceivable that aggression was a more significant factor in the past, but for the years studied in this project, it was not a strong predictor of success in terms of either Grand Slam victories or percentage of games won.
The main predictive value of aggression was in determining success in matches between players of the same consistency classification: Aggressive players performed better against defensive ones when both players were consistent, while defensive players outperformed aggressive ones when both players were inconsistent. As a result, I would recommend that consistent players make an effort to play aggressively, while inconsistent players could benefit from strengthening their defensive skills.
Insights derived from the correlation graph suggest that it is possible for aggressive players to be "too aggressive," as there is actually a negative correlation for aggressive players between aggression and win percent. As such, aggressive players should be mindful of the advantage of playing with some level of restraint and control. It is also important for aggressive players to make an effort to keep their unforced error count low.
For defensive consistent players, there is a relatively low correlation between win percent and every other quantity considered, so for a player who seeks dependable, positive results, this playing style may be a good approach. This is reflected in real-life results, particularly in the case of Caroline Wozniacki. She was classified in this project as defensive consistent in seven out of nine years, and had been a perpetual fixture near the top of the WTA rankings before her retirement in 2020, while perhaps lagging a step behind the very top aggressive players like Serena Williams. Finally, it appears quite important for inconsistent players to make efforts to increase their consistency, as well as their winner count.
Below are links to my GitHub repository for this project, as well as the Shiny App.