Predicting the Outcome of Professional Tennis Matches

Posted on Oct 29, 2019


Betting on tennis is becoming increasingly popular. As a first step to developing a betting strategy, it is necessary to develop a model to predict the outcome of individual tennis matches. The men's professional tennis circuit (Association of Tennis Professionals or ATP) hosts many tournaments throughout the year. The ATP also provides rankings of the players, which are updated on a weekly basis. The question I sought to answer in this project was whether it is possible to use available data to develop a classification model to predict the outcome of an individual tennis match. Such a model could then be used with odds data to develop a full-blown betting strategy.


The data was taken from Jeff Sackmann's GitHub repository. It includes nearly all ATP matches from 1968 through part of 2019. The data includes a number of interesting features, such as the player rankings, the number of points accumulated at the time of the match, and in-match statistics such as the number of aces each player hit during the match. Unfortunately, there are a number of features the data did not include, such as return statistics, and a number of the early observations did not include all of the features.



The first step in the preprocessing was to combine all the individual datasets into one big dataset. Since all the datasets contained the same features, this was straightforward.
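As a rough illustration of this step, the sketch below stacks per-year tables with pandas. It assumes one CSV per year with identical columns. The column names mirror those in Sackmann's files, but the rows are made up and read from in-memory strings so that the example is self-contained.

```python
import io
import pandas as pd

# Two tiny in-memory stand-ins for yearly match files.
# The rows are invented for illustration only.
csv_2018 = io.StringIO(
    "tourney_date,winner_name,loser_name,winner_rank,loser_rank\n"
    "20180101,Player A,Player B,5,40\n"
    "20180108,Player C,Player A,12,5\n"
)
csv_2019 = io.StringIO(
    "tourney_date,winner_name,loser_name,winner_rank,loser_rank\n"
    "20190107,Player B,Player C,38,11\n"
)

# Read each yearly table and stack them row-wise.
# ignore_index=True gives the combined table a clean 0..n-1 index.
frames = [pd.read_csv(f) for f in (csv_2018, csv_2019)]
matches = pd.concat(frames, ignore_index=True)
```

With real files, the list of in-memory strings would be replaced by the paths of the yearly CSVs.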

The second step was to remove the bias in the dataset. Since the original data labelled every column as "winner" or "loser", depending on whether the value belonged to the winning player or the losing player, it was necessary to relabel all the relevant columns to avoid the bias that might result from using the data as is. To do so, I randomly assigned player1 to either the winner or the loser, and player2 to the other player. The random assignment resulted in player1 being the winner around half the time.
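One way the random relabelling could be done is sketched below. This is an assumption about the mechanics, not the code used for the project; the output names p1_rank, p2_rank and p1_won are hypothetical, and only the ranking columns are shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up matches in the original winner/loser layout.
matches = pd.DataFrame({
    "winner_rank": [5, 12, 38, 3],
    "loser_rank": [40, 5, 11, 90],
})

# Flip a fair coin per match: True means player1 is the winner.
p1_is_winner = rng.random(len(matches)) < 0.5

# Route each winner/loser value to player1 or player2 according to the coin.
relabelled = pd.DataFrame({
    "p1_rank": np.where(p1_is_winner, matches["winner_rank"], matches["loser_rank"]),
    "p2_rank": np.where(p1_is_winner, matches["loser_rank"], matches["winner_rank"]),
    "p1_won": p1_is_winner.astype(int),  # the response variable
})
```

The same np.where pattern would apply to every other winner/loser column pair.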

The third step was to filter the dataset to only include those observations where the ranking of both players was available, since I intuited that this would be the strongest predictor. This reduced the number of observations from around 170k to around 90k. Although this was a dramatic reduction, it turns out that most of the discarded data lacked significant information anyway. 

Feature Engineering

Some prior machine learning models used only the ranking of the two players to predict match outcome. This makes sense, since the ranking captures a player's performance over the past year, and is likely a strong predictor of the player's current ability. However, there are many other types of information that might be useful in predicting the outcome of a match. For example, the past head-to-head record of player1 and player2 could be extremely relevant, especially the most recent matches. The quality of a player's service game and return game is also likely of importance.

To better capture the nuances of each player, I decided to compute, for each match, the past head-to-head record of the two players and service metrics from each player's past matches. The underlying statistics were only available post 1991. To ensure that enough history was available, I decided to further restrict the dataset to matches post 1999. That way, I would have roughly eight years of past match data from which to compute these statistics.
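A head-to-head feature of this kind can be built with a running tally, as in the sketch below. It is a minimal illustration with made-up results, assuming the matches are already sorted chronologically; the key point is that the tally is read before it is updated, so each match only sees earlier results.

```python
from collections import defaultdict

# (date, winner, loser) tuples in chronological order; made-up results.
results = [
    (20180101, "A", "B"),
    (20180108, "A", "B"),
    (20180115, "B", "A"),
    (20180122, "A", "C"),
]

wins = defaultdict(int)  # wins[(x, y)] = number of times x has beaten y so far
head_to_head = []
for date, winner, loser in results:
    # Record the tally BEFORE this match, so the feature uses only the past.
    head_to_head.append((wins[(winner, loser)], wins[(loser, winner)]))
    wins[(winner, loser)] += 1

# head_to_head is [(0, 0), (1, 0), (0, 2), (0, 0)]:
# each pair is (prior wins of this match's winner, prior wins of its loser).
```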

The features I computed included: aces per point, double faults per point, head to head results between the two players, first serve percentage, second serve percentage, etc. I scaled the serve data by point to avoid the bias that would occur if, for example, I had used number of aces, since a player may have had more opportunities to hit an ace than his opponent. One issue that arose is that some observations included new players, for which there was no prior record of performance. One option was to label all statistics for this player as 0, but that would likely produce biased results, since 0 is the lowest metric, and just because a player has no prior matches in the record, does not mean that he should be assigned the worst score. I ultimately decided to delete all observations with 0s in them, which does not seem like the best solution. This is something to look into in the future.
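The per-point scaling can be sketched as follows. This is an illustrative assumption about the computation, using a made-up long-format table with one row per player per match. It also shows one alternative to the zero-filling problem described above: a player with no earlier matches gets a missing value (NaN) rather than 0, so "no record" is not confused with "worst score".

```python
import pandas as pd

# One row per player per match (made-up numbers).
history = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "date": [20180101, 20180108, 20180115, 20180101, 20180115],
    "aces": [10, 4, 6, 2, 8],
    "serve_points": [80, 50, 60, 40, 100],
})
history = history.sort_values(["player", "date"]).reset_index(drop=True)

grouped = history.groupby("player")
# Running totals per player, minus the current match, so that only
# matches played strictly before the current one are counted.
past_aces = grouped["aces"].cumsum() - history["aces"]
past_points = grouped["serve_points"].cumsum() - history["serve_points"]

# Aces per service point over past matches; NaN when there is no history.
history["past_ace_rate"] = past_aces / past_points.where(past_points > 0)
```

For player A's second match, for example, the rate is 10 / 80 = 0.125, taken from the first match only.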

Finally, most modelling in the past combined the player 1 feature and the player 2 feature into a single feature. For example, for the ranking feature, subtracting the two rankings would consolidate the two features into a single feature. This has the advantage of producing a symmetric model and reducing the feature space by half. However, it has the disadvantage of eliminating information. I decided to not consolidate any of the features into a single feature.
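The trade-off can be seen in a toy example with made-up rankings. The column name rank_diff is hypothetical.

```python
import pandas as pd

pairs = pd.DataFrame({
    "p1_rank": [1, 100, 36],
    "p2_rank": [36, 135, 1],
})

# Consolidated feature: one column instead of two.
pairs["rank_diff"] = pairs["p1_rank"] - pairs["p2_rank"]

# rank_diff is [-35, -35, 35].
# Rows 0 and 2 are the same pairing with the players swapped, and the
# feature simply flips sign (the symmetry). Rows 0 and 1 get the same
# value even though rank 1 vs 36 is a very different match-up from
# rank 100 vs 135 (the lost information).
```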


Initial Modelling

To get a feel for the data and the effect of the player's ranking on the outcome, I decided to first try a two feature model that uses only the player rankings. To see whether the data was linearly separable, I first plotted the two rankings along with a color coded response indicating whether player 1 won or lost.

This figure suggests that a straight line separates the two outcomes reasonably well, although the two classes clearly overlap. Since a linear model looks like it would do a reasonable job with this classification task, I first tried to fit a logistic regression model to the data. I split the data into a 90%/10% train/test split, fit a logistic regression model to the training data, and plotted the decision boundary:

The decision boundary is a straight line that looks like it passes through the origin. The intercept term is -0.017, and the slope coefficients are -0.005 and 0.005. The training accuracy was 66% and the test set accuracy was 65%.
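A two-feature fit of this kind might look like the sketch below with scikit-learn. The data here is synthetic (random rankings, with the better-ranked player winning more often), so the numbers it produces say nothing about the real dataset; max_iter=1000 is an assumption added to avoid convergence warnings on unscaled rankings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
p1_rank = rng.integers(1, 201, size=n)
p2_rank = rng.integers(1, 201, size=n)
# Synthetic outcome: the player with the smaller ranking number wins more often.
prob_p1_wins = 1 / (1 + np.exp(0.01 * (p1_rank - p2_rank)))
p1_won = (rng.random(n) < prob_p1_wins).astype(int)

X = np.column_stack([p1_rank, p2_rank])
X_train, X_test, y_train, y_test = train_test_split(
    X, p1_won, test_size=0.1, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# The boundary is where b + w1 * p1_rank + w2 * p2_rank = 0,
# i.e. p2_rank = -(b + w1 * p1_rank) / w2.
w1, w2 = model.coef_[0]
b = model.intercept_[0]
```

Plugging in the coefficients reported above, with equal and opposite slopes of 0.005 and an intercept of -0.017, the boundary is a line of slope 1 offset by 0.017 / 0.005 = 3.4 ranking places, which is why it appears to pass through the origin.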

For comparison, I decided to fit an LDA model to the data. As might be expected, the LDA model yielded similar results to the logistic regression model. The decision boundary is shown in the figure below.

Again, the decision boundary is a straight line that looks like it passes through the origin. The intercept term is -0.012, and the coefficients are -0.004 and +0.004. The training accuracy was 66% and the test accuracy was 65%.
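The equivalent LDA fit is sketched below, again on synthetic rankings rather than the real data. LDA also produces a linear boundary, which is why its coefficients have the same signs as the logistic regression ones.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
p1_rank = rng.integers(1, 201, size=n)
p2_rank = rng.integers(1, 201, size=n)
# Synthetic outcome: the player with the smaller ranking number wins more often.
prob_p1_wins = 1 / (1 + np.exp(0.01 * (p1_rank - p2_rank)))
p1_won = (rng.random(n) < prob_p1_wins).astype(int)

X = np.column_stack([p1_rank, p2_rank])
X_train, X_test, y_train, y_test = train_test_split(
    X, p1_won, test_size=0.1, random_state=0
)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
lda_test_acc = lda.score(X_test, y_test)

# For a two-class problem, coef_ has one row of per-feature weights.
lda_w1, lda_w2 = lda.coef_[0]
lda_b = lda.intercept_[0]
```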

Modeling using the full feature set

For the next round of modeling, I added all the features. As mentioned above, in addition to the rankings, these features included the first serve percentage, second serve percentage, ace percentage, double fault percentage, and head-to-head score. I expected that these extra features would improve the accuracy over the two-feature model, but, as we shall see, they did not.

To test the models, I first split the data into a 90%/10% train/test split. I used 5-fold cross-validation with grid search to select the best hyperparameters, refit the model with those hyperparameters on the full training set, and evaluated it on the test set.
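This protocol maps closely onto scikit-learn's GridSearchCV. The sketch below is an illustration on a synthetic dataset from make_classification, with logistic regression and a made-up grid of C values standing in for whichever model is being tuned.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the match feature matrix and win/loss labels.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Hold out 10% of the data; it is not touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# 5-fold cross-validated grid search over the regularization constant C.
# With refit=True (the default), the best setting is refit on all of X_train.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
# Accuracy of the refit best model on the held-out 10%.
test_accuracy = search.score(X_test, y_test)
```

Swapping in a different estimator and parameter grid leaves the rest of the procedure unchanged.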


Random Forest

The first model I tested using all the features was a random forest. The parameters I tuned were the number of estimators, the measure of impurity, the minimum samples per leaf, and the minimum samples per split.

The optimal cross-validated parameters are listed in the table below:

Parameter                    Value
number of estimators         100
impurity measure             gini
minimum samples per leaf     5
minimum samples per split    22

The test accuracy was 65%. Surprisingly, this is no better than the simple logistic regression and LDA model test accuracy above. This needs to be further investigated.

As expected, the features that were most important were the rankings. The feature importance bar chart is shown below:

The player rankings are by far the most important features. This may explain why adding extra features did not improve the performance - if the rankings are swamping out all other features, then it makes sense that the performance of the model may not improve with extra features.
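Feature importances of this kind can be read off a fitted scikit-learn random forest, as sketched below with the hyperparameters from the table above. The data is synthetic and deliberately constructed so that only the rankings carry signal, so the output illustrates the mechanics rather than confirming the finding; the ace-rate columns are made-up noise features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 3000
p1_rank = rng.integers(1, 201, size=n)
p2_rank = rng.integers(1, 201, size=n)
# Two made-up serve features that carry no signal in this toy data.
p1_ace_rate = rng.normal(0.08, 0.02, size=n)
p2_ace_rate = rng.normal(0.08, 0.02, size=n)
# Synthetic outcome driven by the rankings alone.
prob_p1_wins = 1 / (1 + np.exp(0.05 * (p1_rank - p2_rank)))
p1_won = (rng.random(n) < prob_p1_wins).astype(int)

feature_names = ["p1_rank", "p2_rank", "p1_ace_rate", "p2_ace_rate"]
X = np.column_stack([p1_rank, p2_rank, p1_ace_rate, p2_ace_rate])

forest = RandomForestClassifier(
    n_estimators=100,
    criterion="gini",
    min_samples_leaf=5,
    min_samples_split=22,
    random_state=0,
)
forest.fit(X, p1_won)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
rank_share = importances["p1_rank"] + importances["p2_rank"]
```

Impurity-based importances tend to favor features with many distinct values, so a permutation-based measure would be a useful cross-check on the real data.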

Support vector machine

I next tried to fit a support vector machine model to the full feature set. At first, I tried to use a cross-validated grid search to select the optimal hyperparameter C, but for some reason the search would not terminate. Instead, I fit several models with different values of C. The results are summarized below:

C       Test accuracy
1       60%
10      58%
100     58%
1000    58%

We see that the test accuracy starts at 60% for C = 1, but then drops to 58% for the larger values of C. Thus, the SVM model performs worse than both the simple logistic regression model and the random forest model.
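The manual sweep over C can be sketched as a simple loop. This version runs on a small synthetic dataset and makes two assumptions that the description above does not settle: the kernel is left at scikit-learn's default (RBF), and the features are standardized first. Unscaled features are a common cause of very slow SVM fits, though whether that was the issue here is not known.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in for the full feature set.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

accuracy_by_C = {}
for C in [1, 10, 100, 1000]:
    # Standardize inside the pipeline so the scaler is fit on training data only.
    model = make_pipeline(StandardScaler(), SVC(C=C))
    model.fit(X_train, y_train)
    accuracy_by_C[C] = model.score(X_test, y_test)
```

Choosing C by comparing test accuracies in this way uses the test set for model selection, so the best score in the sweep is an optimistic estimate.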

Logistic Regression

I then fit a logistic regression model to the full feature set. I performed a grid search on the regularization constant, C. The optimal value was C = 1, with a test accuracy of 64%.

Linear Discriminant Analysis

Finally, I fit an LDA model to the full feature set. The LDA model returned a test score of 53%.


A logistic regression model with two features performed just as well as a random forest model with the full feature set. The likely reason is that the rankings are overwhelmingly the most important of the features I engineered.

Follow-up work includes engineering other features that may add predictive value over the rankings, and developing a full betting model using the results. For features, using the Universal Tennis Rating (UTR) might improve the quality of the predictions.


