NYC Data Science Academy | Blog

Analyzing Data to Predict the Outcome of Tennis Matches

Rohit Parthasarathy
Posted on Oct 29, 2019
The skills the author demoed here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

Betting on tennis is becoming increasingly popular. As a first step toward developing a betting strategy, it is necessary to build a model that predicts the outcome of individual tennis matches. The men's professional tennis circuit (the Association of Tennis Professionals, or ATP) hosts many tournaments throughout the year. The ATP also publishes player rankings, which are updated weekly.

The question I sought to answer in this project was whether it is possible to use available data to develop a classification model that predicts the outcome of an individual tennis match. Such a model could then be combined with odds data to develop a full-blown betting strategy.

Dataset

The data was taken from Jeff Sackmann's GitHub repository, https://github.com/JeffSackmann/tennis_atp. It includes nearly all ATP matches from 1968 through part of 2019, along with a number of interesting features, such as the player rankings, the number of ranking points accumulated at the time of the match, and in-match statistics such as the number of aces each player hit. Unfortunately, there are a number of features the data does not include, such as return statistics, and many of the early observations are missing some features.


Preprocessing Data

The first step in the preprocessing was to combine all the individual datasets into one big dataset. Since all the datasets contained the same features, this was straightforward.

The second step was to remove the bias in the dataset. Since the original data labelled all columns "winner" or "loser", depending on whether the data belonged to the winning player or the losing player, it was necessary to relabel all the relevant columns to avoid the bias that might result from using the data as-is. To do so, I randomly assigned player1 to either the winner or the loser, and player2 to the other player. The random assignment resulted in player1 being the winner around half the time.
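This relabelling step can be sketched as follows. This is a minimal illustration, not the post's actual code: it assumes a pandas DataFrame with `winner_rank`/`loser_rank` columns (loosely following the Sackmann dataset's naming) and shows only the ranking columns.

```python
import numpy as np
import pandas as pd

def debias_winner_loser(df, seed=42):
    """Randomly map the winner to player1 or player2 so that the
    label 'player1 won' is roughly 50/50 across the dataset."""
    rng = np.random.default_rng(seed)
    # True -> the winner becomes player2 for this match
    flip = rng.integers(0, 2, size=len(df)).astype(bool)
    return pd.DataFrame({
        "p1_rank": np.where(flip, df["loser_rank"], df["winner_rank"]),
        "p2_rank": np.where(flip, df["winner_rank"], df["loser_rank"]),
        "p1_won": (~flip).astype(int),  # response variable
    })

matches = pd.DataFrame({"winner_rank": [1, 5, 12, 3],
                        "loser_rank": [8, 2, 40, 7]})
relabelled = debias_winner_loser(matches)
print(relabelled)
```

The same flip mask would be applied to every winner/loser column pair (aces, serve points, and so on) so that all of a player's statistics move together.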

The third step was to filter the dataset to include only those observations where both players' rankings were available, since I intuited that ranking would be the strongest predictor. This reduced the number of observations from around 170k to around 90k. Although this was a dramatic reduction, it turned out that most of the discarded observations lacked significant information anyway.

Feature Engineering

Some prior machine learning models used only the rankings of the two players to predict match outcome. This makes sense, since the ranking captures a player's performance over the past year and is likely a strong predictor of his current ability. However, there are many other types of information that might be useful in predicting the outcome of a match. For example, the head-to-head record between player1 and player2 could be extremely relevant, especially the most recent matches. The quality of a player's service game and return game is also likely important.

To better capture the nuances of each player, I decided to compute, for each match, the past head-to-head record of the two players and service metrics from both players' past matches. These data were only available after 1991. To ensure that enough past data was available, I further restricted the dataset to matches after 1999. That way, I would have about 10 years of past match data from which to compute these statistics.

The features I computed included aces per point, double faults per point, head-to-head results between the two players, first serve percentage, second serve percentage, etc. I scaled the serve data by point to avoid the bias that would occur if, for example, I had used the raw number of aces, since a player may have had more opportunities to hit an ace than his opponent.
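The per-point scaling can be sketched with made-up serve totals. The column names here are illustrative, not the dataset's actual names:

```python
import pandas as pd

# Hypothetical per-match serve totals for one player; the real dataset
# stores similar raw counts (aces, double faults, service points).
serves = pd.DataFrame({
    "aces": [11, 3, 7],
    "double_faults": [2, 5, 1],
    "serve_points": [80, 60, 95],  # total service points played
})

# Dividing by service points removes the bias from unequal opportunity:
# a player who served more points had more chances to hit aces.
serves["ace_rate"] = serves["aces"] / serves["serve_points"]
serves["df_rate"] = serves["double_faults"] / serves["serve_points"]
print(serves[["ace_rate", "df_rate"]].round(3))
```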

An Issue with the Features

One issue that arose is that some observations involved new players for whom there was no prior record of performance. One option was to set all statistics for such a player to 0, but that would likely produce biased results: 0 is the lowest possible value, and just because a player has no prior matches in the record does not mean he should be assigned the worst score. I ultimately decided to delete all observations containing 0s, which does not seem like the best solution. This is something to look into in the future.

Finally, most past modelling combined each player1 feature and the corresponding player2 feature into a single feature. For example, for the ranking feature, subtracting the two rankings consolidates them into one. This has the advantage of producing a symmetric model and halving the feature space, but it has the disadvantage of discarding information. I decided not to consolidate any of the features.
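The trade-off can be illustrated with the ranking feature (a toy example, not code from the original analysis):

```python
import numpy as np

p1_rank = np.array([1, 40, 7])
p2_rank = np.array([8, 2, 7])

# Consolidated: a single symmetric feature, half the feature space,
# but the absolute level of the two rankings is lost (a 1-vs-8 match
# and a 101-vs-108 match would look identical).
X_consolidated = (p1_rank - p2_rank).reshape(-1, 1)

# Unconsolidated (the choice made in this post): keep both columns.
X_full = np.column_stack([p1_rank, p2_rank])

print(X_consolidated.ravel())  # [-7 38  0]
print(X_full.shape)            # (3, 2)
```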


Initial Modelling

To get a feel for the data and the effect of the players' rankings on the outcome, I decided to first try a two-feature model using only the rankings. To see whether the data was linearly separable, I first plotted the two rankings along with a color-coded response indicating whether player1 won or lost.

[Figure: player1 ranking vs. player2 ranking, color-coded by match outcome]

Logistic Regression Model

The figure suggests that the data for these two features is roughly linearly separable. Since a linear model looked like it would do a good job on this classification task, I first fit a logistic regression model. I split the data into a 90%/10% train/test split, fit the model, and plotted the decision boundary:
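A sketch of this two-feature fit with scikit-learn, using synthetic rankings in place of the real data (the coefficients and accuracy below come from the synthetic data, not the post's results):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ranking data: a lower rank number is
# better, so player1 tends to win when rank1 < rank2.
rng = np.random.default_rng(0)
n = 5000
ranks = rng.integers(1, 200, size=(n, 2)).astype(float)
p_win = 1.0 / (1.0 + np.exp(0.02 * (ranks[:, 0] - ranks[:, 1])))
y = (rng.random(n) < p_win).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    ranks, y, test_size=0.10, random_state=0)  # 90/10 split, as in the post

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("coefficients:", clf.coef_[0])  # opposite signs, as in the post
print("test accuracy:", round(clf.score(X_te, y_te), 2))
```

The near-equal, opposite-signed coefficients mean the fitted boundary is approximately the line rank1 = rank2, i.e. "predict the higher-ranked player wins."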

[Figure: logistic regression decision boundary over the ranking scatter plot]

The decision boundary is a straight line that appears to pass through the origin. The intercept is -0.017, and the slope coefficients are -0.005 and +0.005. The training accuracy was 66% and the test accuracy was 65%.

For comparison, I decided to fit an LDA model to the data. As might be expected, the LDA model yielded similar results to the logistic regression model. The decision boundary is shown in the figure below.

[Figure: LDA decision boundary over the ranking scatter plot]

Again, the decision boundary is a straight line that appears to pass through the origin. The intercept is -0.012, and the coefficients are -0.004 and +0.004. The training accuracy was 66% and the test accuracy was 65%.

Modeling Using the Full Feature Set

For the next round of modeling, I added all the features. As mentioned above, in addition to the rankings, these included first serve percentage, second serve percentage, ace percentage, double fault percentage, and the head-to-head score. I expected these extra features to improve the accuracy over the two-feature model but, as we shall see, they did not.

To test the models, I first split the data into a 90%/10% train/test split. I used 5-fold cross-validation with grid search to select the best hyperparameters, refit the model with those hyperparameters on the full training set, and evaluated it on the test set.
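This tuning procedure can be sketched with scikit-learn's GridSearchCV. The data and parameter grid here are illustrative stand-ins, not the exact ones used in the post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data; the real pipeline would use the engineered match features.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, random_state=0)  # 90/10 split

param_grid = {
    "n_estimators": [50, 100],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 22],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)  # 5-fold cross-validation
# refit=True (the default) retrains the best model on the full train set.
search.fit(X_tr, y_tr)
print(search.best_params_)
print("test accuracy:", round(search.best_estimator_.score(X_te, y_te), 2))
```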


Random Forest

The first model I tested using all the features was a random forest. The parameters I tuned were the number of estimators, the measure of impurity, the minimum samples per leaf, and the minimum samples per split.

The optimal cross-validated hyperparameters are listed in the table below:

number of estimators: 100
impurity measure: gini
minimum samples per leaf: 5
minimum samples per split: 22

The test accuracy was 65%. Surprisingly, this is no better than the test accuracy of the simple logistic regression and LDA models above. This needs further investigation.

As expected, the most important features were the rankings. The feature importance bar chart is shown below:

The player rankings are by far the most important features. This may explain why adding extra features did not improve performance: if the rankings swamp out all the other features, it makes sense that extra features would not improve the model.
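The swamping effect can be reproduced on synthetic data using the tuned hyperparameters from the table above. Here two informative columns stand in for the rankings and the rest are noise; this is an illustration, not the post's actual feature-importance computation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 2 informative features in columns 0 and 1,
# mimicking a dataset where the rankings dominate and the remaining
# engineered features carry little signal.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            min_samples_leaf=5, min_samples_split=22,
                            random_state=0).fit(X, y)

imp = rf.feature_importances_  # normalized: sums to 1
print(imp.round(3))  # the first two (informative) features dominate
```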

Support Vector Machine

I next tried to fit a support vector machine to the full feature set. At first, I attempted a cross-validated grid search to select the optimal hyperparameter C, but for some reason the search would not terminate. Instead, I ran several models, varying the value of C. The results are summarized below:

C	Test accuracy
1	60%
10	58%
100	58%
1000	58%

We see that the test accuracy starts at 60% for C = 1, then drops to 58% for the larger values of C. Thus, the SVM model performs worse than both the simple logistic regression model and the random forest model.
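A sketch of this manual sweep over C (synthetic data, so the scores will differ from the table above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data for the engineered match features.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, random_state=0)

# Fit one linear-kernel SVM per candidate C and record test accuracy.
scores = {}
for C in [1, 10, 100, 1000]:
    svm = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    scores[C] = svm.score(X_te, y_te)
print(scores)
```

Larger C penalizes margin violations more heavily, so the fit tracks the training data more closely; on noisy data like tennis outcomes, that can hurt test accuracy, which is consistent with the drop seen in the table.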

Logistic Regression

I then fit a logistic regression model to the full feature set, performing a grid search over the regularization constant C. The optimal value was C = 1, with a test accuracy of 64%.

Linear Discriminant Analysis

Finally, I fit an LDA model to the full feature set. It achieved a test accuracy of 53%.

Conclusion

A logistic regression model with two features performed just as well as a random forest model with the full feature set. The likely reason is that the rankings are overwhelmingly the most important of the features I engineered.

Follow-up work includes engineering additional features that may add predictive value beyond the rankings, and developing a full betting model from the results. As for features, using Universal Tennis Ratings might improve the quality of the predictions.
