Analyzing Data to Predict the Outcome of Tennis Matches
Betting on tennis is becoming increasingly popular. As a first step to developing a betting strategy, it is necessary to develop a data model to predict the outcome of individual tennis matches. The men's professional tennis circuit (Association of Tennis Professionals or ATP) hosts many tournaments throughout the year. The ATP also provides rankings of the players, which are updated on a weekly basis.
The question I sought to answer in this project was whether it is possible to use available data to develop a classification model to predict the outcome of an individual tennis match. Such a model could then be used with odds data to develop a full-blown betting strategy.
The data was taken from Jeff Sackmann's GitHub, https://github.com/JeffSackmann/tennis_atp. The data includes nearly all ATP matches from 1968 through part of 2019. It also includes a number of interesting features, such as the player rankings, the number of ranking points accumulated at the time of the match, and in-match statistics such as the number of aces each player hit during the match. Unfortunately, there are a number of features the data did not include, such as return statistics, and a number of the early observations did not include all of the features.
The first step in the preprocessing was to combine all the individual datasets into one big dataset. Since all the datasets contained the same features, this was straightforward.
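The combining step can be sketched as follows. This is a minimal illustration, not the author's code: the helper name `combine_match_files` is hypothetical, and the file pattern in the usage comment is an assumption about how a local clone of the repository is laid out.

```python
import pandas as pd

def combine_match_files(paths):
    """Read each yearly CSV and stack the rows into a single DataFrame."""
    frames = [pd.read_csv(p) for p in paths]
    # ignore_index gives the combined table one continuous row index
    return pd.concat(frames, ignore_index=True)

# Hypothetical usage, assuming a local clone with one file per year:
# import glob
# matches = combine_match_files(sorted(glob.glob("tennis_atp/atp_matches_[0-9]*.csv")))
```

Because every file shares the same columns, `pd.concat` needs no column alignment options here.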
The second step was to remove the bias in the dataset. The original data labels every column as "winner" or "loser", depending on whether the value belongs to the winning or the losing player. Used as is, these labels would leak the outcome into the features, so it was necessary to relabel all the relevant columns. To do so, I randomly assigned player1 to either the winner or the loser, and player2 to the other player. The random assignment resulted in player1 being the winner around half the time.
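One way to do this relabelling is sketched below. The function name, the `p1_`/`p2_` prefixes, and the `p1_won` target column are hypothetical choices for illustration. It handles only columns that come in `winner_`/`loser_` pairs; the in-match statistics in the source data use `w_`/`l_` prefixes and would need the same treatment.

```python
import numpy as np
import pandas as pd

def relabel_players(df, seed=0):
    """Randomly map (winner, loser) columns to (player1, player2) columns."""
    rng = np.random.default_rng(seed)
    swap = rng.random(len(df)) < 0.5  # True: player1 is the loser of that match
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        if not col.startswith("winner_"):
            continue
        stem = col[len("winner_"):]
        loser_col = "loser_" + stem
        if loser_col not in df.columns:
            continue
        out["p1_" + stem] = np.where(swap, df[loser_col], df[col])
        out["p2_" + stem] = np.where(swap, df[col], df[loser_col])
    # Response variable: 1 if player1 won the match, 0 otherwise
    out["p1_won"] = (~swap).astype(int)
    return out
```

With a fair coin per row, `p1_won` should average close to 0.5 on a large dataset, which matches the roughly even split reported above.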
The third step was to filter the dataset to only include those observations where the ranking of both players was available, since I intuited that this would be the strongest predictor. This reduced the number of observations from around 170k to around 90k. Although this was a dramatic reduction, it turns out that most of the discarded data lacked significant information anyway.
Data on Feature Engineering
Some prior machine learning models used only the ranking of the two players to predict match outcome. This makes sense, since the ranking captures a player's performance over the past year, and is likely a strong predictor of the player's current ability. However, there are many other types of information that might be useful in predicting the outcome of a match. For example, the past head-to-head record of player1 and player2 could be extremely relevant, especially the most recent matches. The quality of a player's service game and return game is also likely of importance.
To better capture the nuances of each player, I decided to compute, for each match, the past head-to-head record of the two players and service metrics from past matches for both players. These data were only available after 1991. To ensure that enough historical data was available, I decided to further restrict the dataset to matches after 1999. That way, I would have about 10 years of past match data to compute these statistics.
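A head-to-head feature has to be built from earlier matches only, otherwise the result of the match being predicted leaks into its own features. A minimal sketch, assuming the matches are already sorted chronologically and using a hypothetical helper name:

```python
from collections import defaultdict

def head_to_head_before(matches):
    """For each (winner, loser) pair in chronological order, return
    (prior wins of winner over loser, prior wins of loser over winner)."""
    wins = defaultdict(int)  # (player, opponent) -> wins so far
    records = []
    for winner, loser in matches:
        # Record the tally before this match is counted
        records.append((wins[(winner, loser)], wins[(loser, winner)]))
        wins[(winner, loser)] += 1
    return records

print(head_to_head_before([("A", "B"), ("B", "A"), ("A", "B")]))
# [(0, 0), (0, 1), (1, 1)]
```

The tally is read before it is updated, so the first meeting of any pair always gets (0, 0).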
The features I computed included: aces per point, double faults per point, head to head results between the two players, first serve percentage, second serve percentage, etc. I scaled the serve data by point to avoid the bias that would occur if, for example, I had used number of aces, since a player may have had more opportunities to hit an ace than his opponent.
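The per-point scaling can be sketched as a running ratio over each player's earlier matches. The layout below is an assumption for illustration: one row per player per match, with hypothetical column names `player`, `date`, `ace`, and `svpt` (service points played). "Per point" is read here as per service point, which the text does not state explicitly.

```python
import pandas as pd

def prior_rate(long_df, num_col, den_col):
    """Per-player cumulative num/den using only matches before the current row."""
    ordered = long_df.sort_values(["player", "date"])
    grouped = ordered.groupby("player")
    # cumsum includes the current match, so subtract it to keep only the past
    past_num = grouped[num_col].cumsum() - ordered[num_col]
    past_den = grouped[den_col].cumsum() - ordered[den_col]
    rate = past_num / past_den  # 0/0 becomes NaN for a player's first match
    return rate.reindex(long_df.index)

rows = pd.DataFrame({
    "player": ["A", "B", "A", "A"],
    "date": [20000103, 20000103, 20000110, 20000117],
    "ace": [5, 3, 10, 0],
    "svpt": [50, 30, 50, 100],
})
rows["ace_rate_before"] = prior_rate(rows, "ace", "svpt")
print(rows)
```

A player's first recorded match comes out as NaN rather than 0, which keeps "no history" distinct from "worst possible rate".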
Issue on Features
One issue that arose is that some observations included new players, for whom there was no prior record of performance. One option was to set all statistics for such a player to 0, but that would likely produce biased results: 0 is the lowest possible value, and a player with no prior matches in the record should not automatically be assigned the worst score. I ultimately decided to delete all observations with 0s in them, which does not seem like the best solution. This is something to look into in the future.
Finally, most modelling in the past combined the player 1 feature and the player 2 feature into a single feature. For example, for the ranking feature, subtracting the two rankings would consolidate the two features into a single feature. This has the advantage of producing a symmetric model and reducing the feature space by half. However, it has the disadvantage of eliminating information. I decided to not consolidate any of the features into a single feature.
To get a feel for the data and the effect of the player's ranking on the outcome, I decided to first try a two feature model that uses only the player rankings. To see whether the data was linearly separable, I first plotted the two rankings along with a color coded response indicating whether player 1 won or lost.
Logistic Regression Model
This figure shows that the data for these two features seems to be linearly separable. Since a linear model looks like it would do a good job with this classification task, I first tried to fit a logistic regression model to the data. I first split the data into a 90% - 10% train test split. I then fit a logistic regression model to the data and plotted the boundary:
The decision boundary is a straight line that looks like it passes through the origin. The intercept term is -.017, and the slope coefficients are -.005 and .005. The training accuracy was 66% and the test set accuracy was 65%.
For comparison, I decided to fit an LDA model to the data. As might be expected, the LDA model yielded similar results to the logistic regression model. The decision boundary is shown in the figure below.
Again, the decision boundary is a straight line that looks like it passes through the origin. The intercept term is -.012, and the coefficients are -.004 and +.004. The training accuracy was 66% and the test accuracy was 65%.
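The two-feature comparison can be sketched with scikit-learn. The data below is synthetic (random rankings with a made-up win probability), so the printed numbers will not match the figures reported above; only the workflow of a 90/10 split followed by fitting both linear models is illustrated.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: the better-ranked player
# (lower ranking number) wins more often.
rng = np.random.default_rng(0)
n = 2000
p1_rank = rng.integers(1, 200, size=n)
p2_rank = rng.integers(1, 200, size=n)
prob_p1 = 1.0 / (1.0 + np.exp(0.01 * (p1_rank - p2_rank)))
p1_won = (rng.random(n) < prob_p1).astype(int)

X = np.column_stack([p1_rank, p2_rank])
X_train, X_test, y_train, y_test = train_test_split(
    X, p1_won, test_size=0.1, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.intercept_, model.coef_)
    print("  train:", model.score(X_train, y_train), "test:", model.score(X_test, y_test))
```

For either model the boundary is the line where intercept + w1 * rank1 + w2 * rank2 = 0. Assuming the first reported coefficient belongs to player 1's ranking, the logistic regression values above give rank2 = rank1 + 3.4 and the LDA values give rank2 = rank1 + 3, which is consistent with both lines appearing to pass through the origin.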
Modeling using the full feature set
For the next round of modeling, I added all the features. As mentioned above, in addition to the rankings, these features included the first serve percentage, second serve percentage, ace percentage, double fault percentage, and head-to-head score. I expected that these extra features would improve the accuracy over the two-feature model, but, as we shall see, they did not.
To test the models, I first split the data into a 90% - 10% train test split. I used cross validation with grid search to select the best hyperparameters, refit the best hyperparameters to the full train set, and tested the model on the test set. I used 5 fold cross validation.
Data on Random Forest
The first model I tested using all the features was a random forest. The parameters I tuned were the number of estimators, the measure of impurity, the minimum samples per leaf, and the minimum samples per split.
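The tuning procedure described above (a held-out test set, 5-fold cross-validated grid search on the training set, refit on the full training set) can be sketched with scikit-learn's `GridSearchCV`. The data is synthetic and the grid values are placeholders chosen to run quickly; the grid the author actually searched is not given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered feature matrix
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Placeholder grid over the four tuned parameters
param_grid = {
    "n_estimators": [20, 50],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 22],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)  # refit=True (default) retrains the best model on all of X_train

print(search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```

Keeping the test set out of the search means the final accuracy is measured on rows that played no part in choosing the hyperparameters.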
The optimal cross-validated parameters are listed in the table below:
|number of estimators|100|
|minimum samples per leaf|5|
|minimum samples per split|22|
The test accuracy was 65%. Surprisingly, this is no better than the simple logistic regression and LDA model test accuracy above. This needs to be further investigated.
As expected, the features that were most important were the rankings. The feature importance bar chart is shown below:
The player rankings are by far the most important features. This may explain why adding extra features did not improve the performance - if the rankings are swamping out all other features, then it makes sense that the performance of the model may not improve with extra features.
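The ranking of features by importance can be read from a fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data in which only the first column drives the label, and the feature names are hypothetical labels chosen to echo the features above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
names = ["p1_rank", "p1_ace_rate", "p1_df_rate", "h2h"]  # hypothetical names
X = rng.normal(size=(600, 4))
# Only the first column carries signal; the rest are noise
y = (X[:, 0] + 0.3 * rng.normal(size=600) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1
ranked = sorted(zip(names, forest.feature_importances_), key=lambda p: p[1], reverse=True)
for name, importance in ranked:
    print(f"{name:12s} {importance:.3f}")
```

When one feature dominates like this, the remaining features have little room to change the predictions, which is the situation the paragraph above describes for the rankings.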
Support vector machine
I next tried to fit a support vector machine model to the full feature set. At first, I tried to use a cross validated grid search to select the optimal hyperparameter C, but for some reason, the simulation would not terminate. Instead, I ran several different models varying the value of C. The results are summarized below:
We see that the test accuracy starts at 60%, but then drops down to 58% for the rest of the values of C. Thus, the SVM model performs worse than the simple logistic regression model and the random forest model.
I then fit a logistic regression model to the full feature set. I performed a grid search on the regularization constant, C. The optimal value was C = 1, with a test accuracy of 64%.
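Fitting one model per value of C can be sketched as below. The data is synthetic and the C values are placeholders. The text does not say which kernel was used or whether the features were scaled, so the default RBF kernel and the `StandardScaler` step are assumptions. Scaling is included because kernel SVMs are sensitive to feature scale, and their training time grows faster than linearly with the number of rows, which may be relevant to the grid search that would not finish.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the engineered feature matrix
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

accuracy_by_c = {}
for c in [0.01, 0.1, 1.0, 10.0]:  # placeholder values of C
    # Scale inside the pipeline so the scaler is fit on training rows only
    model = make_pipeline(StandardScaler(), SVC(C=c))
    model.fit(X_train, y_train)
    accuracy_by_c[c] = model.score(X_test, y_test)

print(accuracy_by_c)
```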
Linear Discriminant Analysis
Finally, I fit an LDA model to the full feature set. The LDA model returned a test accuracy of 53%.
A logistic regression model with two features performed just as well as a random forest model with multiple features. This is likely because the rankings are overwhelmingly the most important of the features I engineered.
Follow-up work includes engineering other features that may add predictive value over the rankings and developing a full betting model using the results. For features, using the universal tennis ratings might improve the quality of the predictions.