Analyzing Data to Predict the Outcome of Tennis Matches

Posted on Oct 29, 2019
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.


Betting on tennis is becoming increasingly popular. As a first step to developing a betting strategy, it is necessary to develop a data model to predict the outcome of individual tennis matches. The men's professional tennis circuit (Association of Tennis Professionals or ATP) hosts many tournaments throughout the year. The ATP also provides rankings of the players, which is updated on a weekly basis.

The question I sought to answer in this project was whether it is possible to use available data to develop a classification model to predict the outcome of an individual tennis match. Such a model could then be used with odds data to develop a full blown betting strategy.


The data was taken from Jeff Sackmann's github, The data includes near all ATP matches from 1968 through part of 2019. It also includes a number of interesting features, such as the player rankings, the number of points accumulated at the time of the match, in match statistics, such as the number of aces each player hit during the match, etc. Unfortunately, there a number of features the data did not include, such as return statistics, and, a number of the early observations did not include all of the features.


Preprocessing Data

The first step in the preprocessing was to combine all the individual datasets into one big dataset. Since all the datasets contained the same features, this was straightforward.

The second step was to remove the bias in the dataset. Since the original data labelled all the data with the column names "winner" and "loser", depending on whether the data belonged to the winning player or the losing player, it was necessary to relabel all the relevant column to avoid the bias that might result when using the data as is. To do so, I randomly assigned player1 to either the winner or loser, and player2 to the other player. The random assignment resulted in player1 as the winner around half the time.

The third step was to filter the dataset to only include those observations where the ranking of both players was available, since I intuited that this would be the strongest predictor. This reduced the number of observations from around 170k to around 90k. Although this was a dramatic reduction, it turns out that most of the discarded data lacked significant information anyway.Β 

Data on Feature Engineering

Some prior machine learning models used only the ranking of the two players to predict match outcome. This makes sense, since the ranking captures a players performance over the past year, and is likely a strong predictor of the player's current ability. However, there are many other types of information that might be useful in predicting the outcome of a match. For example, the past head to head of player1 and player2 could be extremely relevant, especially the most recent matches. The quality of a player's service game and return game is also likely of importance.

To better capture the nuances of each player, I decided to compute the past head to head of each player for each match, and service metrics from past matches for both players. These data were only available post 1991. To ensure that enough data was available in the past, I decided to further restrict the dataset to matches post 1999. That way, I would have about 10 years of past match data to compute these statistics.Β 

The features I computed included: aces per point, double faults per point, head to head results between the two players, first serve percentage, second serve percentage, etc. I scaled the serve data by point to avoid the bias that would occur if, for example, I had used number of aces, since a player may have had more opportunities to hit an ace than his opponent.

Issue on Features

One issue that arose is that some observations included new players, for which there was no prior record of performance. One option was to label all statistics for this player as 0, but that would likely produce biased results, since 0 is the lowest metric, and just because a player has no prior matches in the record, does not mean that he should be assigned the worst score. I ultimately decided to delete all observations with 0s in them, which does not seem like the best solution. This is something to look into in the future.

Finally, most modelling in the past combined the player 1 feature and the player 2 feature into a single feature. For example, for the ranking feature, subtracting the two rankings would consolidate the two features into a single feature. This has the advantage of producing a symmetric model and reducing the feature space by half. However, it has the disadvantage of eliminating information. I decided to not consolidate any of the features into a single feature.


Initial Modelling

To get a feel for the data and the effect of the player's ranking on the outcome, I decided to first try a two feature model that uses only the player rankings. To see whether the data was linearly separable, I first plotted the two rankings along with a color coded response indicating whether player 1 won or lost.

Analyzing Data to Predict the Outcome of Tennis Matches

Logistic Regression Model

This figure shows that the data for these two features seems to be linearly separable. Since a linear model looks like it would do a good job with this classification task, I first tried to fit a logistic regression model to the data. I first split the data into a 90% - 10% train test split. I then fit a logistic regression model to the data and plotted the boundary:

Analyzing Data to Predict the Outcome of Tennis Matches

The decision boundary is a straight line that looks like it passes through the origin. The intercept term is -.017, and the slope coefficients are -.005 and .005. The training accuracy was 66% and the test set accuracy was 65%.

For comparison, I decided to fit an LDA model to the data. As might be expected, the LDA model yielded similar results to the logistic regression model. The decision boundary is shown in the figure below.

Analyzing Data to Predict the Outcome of Tennis Matches

Again, the decision boundary is a straight line that looks like it passes through the origin. The intercept term is -.012, and the coefficients are -.004 and +.004. The training accuracy was 66% and the test accuracy was 65%.

Modeling using the full feature set

For the next round of modeling, I added all the features. As mentioned above, in addition to the rankings, these features included the first serve percentage, second serve percentage, ace percentage, dbl fault percentage, and head to head score. I expected that these extra features would improve the accuracy over the two feature model, but, as we shall see, they did not.

To test the models, I first split the data into a 90% - 10% train test split. I used cross validation with grid search to select the best hyperparameters, refit the best hyperparameters to the full train set, and tested the model on the test set. I used 5 fold cross validation.


Data on Random Forest

The first model I tested using all the features was a random forest. The parameters I tuned were the number of estimators, the measure of impurity, the minimum samples per leaf, and the minimum samples per split.

Validated parameters of the optimal cross are listed in the table below:

number of estimators 100
impurity measure gini
minimum samples per leaf 5
minimum samples per split 22

The test accuracy was 65%. Surprisingly, this is no better than the simple logistic regression and LDA model test accuracy above. This needs to be further investigated.

As expected, the features that were most important were the rankings. The feature importance bar chart is shown below:

The player rankings are by far the most important features. This may explain why adding extra features did not improve the performance - if the rankings are swamping out all other features, then it makes sense that the performance of the model may not improve with extra features.

Support vector machine

I next tried to fit a support vector machine model to the full feature set. At first, I tried to use a cross validated grid search to select the optimal hyperparameter C, but for some reason, the simulation would not terminate. Instead, I ran several different models varying the value of C. The results are summarized below:

C Test error
1 60%
10 58%
100 58%
1000 58%

We see that the test error starts at 60%, but then drops down to 58% for the rest of the values of C. Thus, the svm model performs worse than the simple logistic regression model and the random forest model.

Logistic Regression

I then fit a logistic regression model to the full feature set. I performed a grid search on the regularization constant, C. The optimal value was C = 1, with a test error of 64%

Linear Discriminant Data Analysis

Finally, I performed an LDA analysis on the full feature set. The LDA model returned a test score of 53%


A logistic regression model with two features performed just as well as a random forest model with multiple features. The reason is likely because the rankings are overwhelmingly the most important features of the features I engineered.

Follow up work includes engineering other features that may add predictive value over the rankings and developing a full betting model using the results. For features, using the universal tennis ratings might improve the quality of the predictions.



About Author

Leave a Comment

Google August 28, 2021
Google One of our guests not long ago suggested the following website.
Google January 22, 2021
Google Wonderful story, reckoned we could combine a number of unrelated data, nevertheless really worth taking a search, whoa did 1 study about Mid East has got much more problerms too.
Google December 21, 2020
Google The time to study or stop by the material or internet sites we have linked to beneath.
cbd oil for dogs December 4, 2020
cbd oil for dogs [...]one of our visitors a short while ago suggested the following website[...]
Google November 4, 2020
Google Sites of interest we have a link to.
Google September 30, 2020
Google Every after inside a while we opt for blogs that we read. Listed beneath are the newest internet sites that we choose. September 3, 2020 [...]although web-sites we backlink to below are considerably not related to ours, we really feel they are truly really worth a go by, so possess a look[...]
OnHax August 24, 2020
OnHax [...]we prefer to honor several other online web-sites on the net, even if they arenΒ’t linked to us, by linking to them. Beneath are some webpages worth checking out[...]
Thesis Writing Services July 24, 2020
Thesis Writing Services [...]although internet sites we backlink to beneath are considerably not associated to ours, we really feel they are actually really worth a go as a result of, so possess a look[...]
Homepage November 28, 2019
... [Trackback] [...] Read More: [...]

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI