Data Analyzing Horse Racing

The skills the author demonstrated here can be learned through taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.


I. Introduction

"Founded in 1884, the Hong Kong Jockey Club is one of the most treasured—and lucrative—legacies of Britain’s colonial rule over the city. Its emerald turf attracts about HK$138.8 million (US$17.86 million) per race, more than any other track in the world." - Bloomberg

Successfully predicting even a small percentage of winning horses over a large number of races can lead to an outsized ROI thanks to compounding returns. Most famously, William Benter earned nearly US$1 billion by building a computer program to analyze the horse racing market.

Attracted by the possibility of measuring our data science skills against one metric, and one metric only, ROI, as well as by the challenge of facing off against every participant in the HK horse racing market, our group of five data scientists partnered with RaceQuant, a startup specializing in HK horse betting.


II. The Data

RaceQuant provided us with data for all races held by the Hong Kong Jockey Club in 2015, 2016, 2017, and 2018 (first and second quarters). The data set consisted of 81 features. The objectives of this capstone project were to:

1) create a model to predict the probability that a given horse in a given race wins said race; and

2) use the probabilities output by our model to create a betting strategy maximizing our ROI on a $100,000 betting bankroll when back-testing on 540 races randomly selected from the data set.

Due to an NDA, Parts III and IV, which describe our approaches to data analysis and modeling, will be significantly simplified.


III. Data Analysis

Our raw data contained 2,384 unique races, with a total of 24,863 horse runs across those races. Before creating our model, we performed several data transformations. The data cleaning was performed using Pandas in Python.

Some of the features in our data set had missing values. For example, for jockey and trainer win percentages, we assumed a value of 10% for first-time jockeys and trainers. We assigned 10% because most first-time jockeys and trainers at the HK tracks have participated in other international events and have, on average, performed quite well there.
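A minimal sketch of this imputation with Pandas (the column names here are placeholders, as the real RaceQuant schema is under NDA):

```python
import pandas as pd

# Toy frame with missing win percentages for first-time participants.
races = pd.DataFrame({
    "jockey_win_pct": [0.12, None, 0.08],
    "trainer_win_pct": [None, 0.15, 0.11],
})

# First-time jockeys/trainers have no HK history, so their win
# percentages are missing; impute the assumed 10% prior.
for col in ["jockey_win_pct", "trainer_win_pct"]:
    races[col] = races[col].fillna(0.10)
```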

We also engineered features to better capture certain kinds of information in our data. For example, we created features related to a horse's weight: the percent deviation of a horse's weight from its winning weight, introduced to examine the effect of horse weight on the chances of winning, and the percent deviation of a horse's weight from its average weight over all races, introduced to standardize horse weight.
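The two weight-deviation features can be sketched as follows (the column names and toy data are illustrative, not the actual NDA-covered schema):

```python
import pandas as pd

# Toy race history: one row per horse per race.
df = pd.DataFrame({
    "horse_id": [1, 1, 1, 2, 2],
    "weight":   [1050, 1060, 1040, 980, 1000],
    "won":      [0, 1, 0, 1, 0],
})

# Percent deviation from the horse's average weight over all of its races.
avg_w = df.groupby("horse_id")["weight"].transform("mean")
df["pct_dev_avg_weight"] = (df["weight"] - avg_w) / avg_w

# Percent deviation from the horse's (mean) winning weight.
win_w = df[df["won"] == 1].groupby("horse_id")["weight"].mean()
df["pct_dev_win_weight"] = (df["weight"] - df["horse_id"].map(win_w)) / df["horse_id"].map(win_w)
```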


IV. Modeling

In the racing data we were given, many features did not have a linear relationship with a horse's win probability. We therefore tested multiple models capable of capturing complex relationships between input features and output probability. We ultimately settled on a neural network, which we then fine-tuned to maximize the probability of choosing the winning horse of a given race.
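The actual architecture is under NDA, but as a rough sketch of the idea, one can train a small neural network to score each horse and then renormalize the scores within each race so they form per-race win probabilities. Everything below (data, network size, the renormalization step) is an illustrative assumption, not our production model:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy stand-in data: 100 races x 8 horses x 5 features.
n_races, n_horses, n_feat = 100, 8, 5
X = rng.normal(size=(n_races * n_horses, n_feat))
race_id = np.repeat(np.arange(n_races), n_horses)

# Synthetic labels: the horse with the highest latent score wins its race.
latent = X @ rng.normal(size=n_feat)
y = np.zeros(len(X), dtype=int)
for r in range(n_races):
    idx = np.where(race_id == r)[0]
    y[idx[np.argmax(latent[idx])]] = 1

# A small feed-forward network to capture non-linear feature effects.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
net.fit(X, y)

# Renormalize within each race so each race's win probabilities sum to 1.
p_win = net.predict_proba(X)[:, 1]
for r in range(n_races):
    mask = race_id == r
    p_win[mask] /= p_win[mask].sum()
```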


V. Betting Strategy

Having calculated the probability that a given horse would win its race, we set out to develop a betting strategy to maximize our ROI. To do so, we first explored the Kelly Criterion:

f* = (bp − q) / b

where f* is the fraction of the current bankroll to wager, b is the net odds received for the wager (your return is "b to 1": on a $1 bet you receive $b in addition to getting the $1 back), p is the probability of winning, and q = 1 − p is the probability of losing.
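The Kelly fraction is a one-liner in code; a minimal sketch:

```python
def kelly_fraction(p: float, b: float) -> float:
    """Kelly fraction f* = (b*p - q) / b, where q = 1 - p.

    p: probability of winning; b: net odds ("b to 1").
    A negative result means the bet has no edge and should be skipped.
    """
    q = 1.0 - p
    return (b * p - q) / b

# Example: a 30% win probability at net odds of 4 to 1.
f_star = kelly_fraction(p=0.30, b=4.0)  # (4*0.3 - 0.7) / 4 = 0.125
```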


Several issues came up while applying this methodology:

  • The f values calculated were much too high and often led to early bankruptcy.
  • The Kelly Criterion assumes that the bets made are independent, but a race can have up to 14 horses, i.e. up to 14 possible bets that are mutually exclusive outcomes of one another.


To remedy the first issue, we introduced a multiplicative constant between 0 and 1 for f; we found 0.2 to work best. For the interested reader, the derivation of the Kelly Criterion shows that such a coefficient slows exponential growth but reduces the variance of early return outcomes. For the second issue, we adapted an alternative betting strategy described in Peter Tompkin’s paper “An explicit solution to the problem of optimizing the allocations of a bettor’s wealth when wagering on horse races”. This strategy can be broken down into several steps:

1) Calculate the market odds β_k = 1 / (Q_k + 1), where Q_k are the payoff odds.

2) Calculate the expected revenue rate er_k = D · p_k / β_k, where D = 1 − tt, with tt being the track take (the tax on one's bet).

3) Reorder the expected revenue rates in descending order, such that er_1 is the best bet.

4) Create a set S = Φ, with k = 1 and R(S) = 1. Thus, the best bet, er_1, is considered first in step 5.

5) If er_k > R(S), insert the k-th outcome into the set S and recalculate R(S) = (1 − D · Σ_{i∈S} p_i) / (1 − Σ_{i∈S} β_i), then move on to the next-best bet.

6) Repeat step 5 until its condition is no longer fulfilled; then set S0 = S.

7) Calculate the optimal fraction of bankroll to bet on each horse k in S0 for a given race with f_k = p_k − β_k · R(S0) / D.

Similar to the Kelly Criterion, we extended the longevity of our bankroll by multiplying each fraction f by a fixed ratio, which we found to be 0.03.
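Under our reading of the steps above (with β_k = 1/(Q_k + 1) and er_k = D·p_k/β_k), the allocation can be sketched as below; the function name, default track take, and exact formulas are our assumptions, not code from the project:

```python
def optimal_fractions(p, q_odds, track_take=0.175):
    """Sketch of the multi-horse allocation steps described above.

    p:          model win probabilities for the horses in one race
    q_odds:     net payoff odds Q_k for each horse
    track_take: the track's cut of the pool (0.175 is a placeholder)

    Returns {horse_index: fraction_of_bankroll} for the betting set S0.
    Formulas follow our reading of the paper; treat as illustrative.
    """
    D = 1.0 - track_take
    beta = [1.0 / (Q + 1.0) for Q in q_odds]        # market-implied odds
    er = [D * pk / bk for pk, bk in zip(p, beta)]   # expected revenue rates

    # Consider horses from best to worst expected revenue rate.
    order = sorted(range(len(p)), key=lambda k: er[k], reverse=True)
    S, R = [], 1.0
    for k in order:
        if er[k] <= R:   # the next-best bet is no longer attractive: stop
            break
        S.append(k)
        R = (1.0 - D * sum(p[i] for i in S)) / (1.0 - sum(beta[i] for i in S))

    # Optimal fraction of bankroll for each horse in S0.
    return {k: p[k] - beta[k] * R / D for k in S}
```

As a sanity check, with a single horse and zero track take this reduces to the Kelly fraction: `optimal_fractions([0.5], [4.0], track_take=0.0)` gives 0.375, matching (4·0.5 − 0.5)/4.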


The outcome of our model is summarized in the table below, which holds the key statistics of our betting strategy.

Sample Size (Training and Testing)
Total training + test races: 2,384
Total training + test horses: 24,863
Training races: 1,843
% of training races: 77.31%
Training horses: 18,304
% of training horses: 73.62%
Test races: 541
% of test races: 22.69%
Test horses: 6,559
% of test horses: 26.38%

Bankroll
Initial bankroll: $100,000.00
Minimum bankroll: $91,912.05
Maximum bankroll: $1,418,354.18

Bets
Total number of bets made: 1,953
Average number of bets per race: 3.61
% of total possible bets made: 29.78%
Number of winning bets: 143
% of rank-1 horses predicted: 26.38%
Biggest bet: $28,774.20
Smallest bet: $1.30
Biggest winning bet: $28,774.20
Smallest winning bet: $1.30
Biggest losing bet: $27,241.50
Smallest losing bet: $1.30

Odds
Average net win odds: 5.38
Lowest net win odds: 0.1
Highest net win odds: 17

Returns
Final bankroll: $1,054,003.96
Total spent on bets: $5,333,083.60
ROI as % of amount bet: 19.76%
Total ROI: 1054.00%


VI. Team

This capstone project was completed by Basant Dhital, Tristan Dresbach, Jiwon Cha, SangYeon Choi, and Karim Zaatary in collaboration with RaceQuant through NYC Data Science Academy. Please contact Basant or Tristan via LinkedIn with any questions.

About Authors

Basant Dhital

Basant Dhital is a Physics Ph.D. with an excellent background in Mathematics and Statistics and demonstrated programming skills. During his Ph.D. research, he developed several algorithms to process and analyze NMR and other spectroscopic data. He developed a...
View all posts by Basant Dhital >

Tristan Dresbach

Tristan is an aspiring data scientist with a track record of using data to drive significant and tangible business results in retail and financial services. He has hands on experience in R and Python in web-scraping, data visualization,...
View all posts by Tristan Dresbach >

Karim El Zaatari

Data Scientist and mechanical engineering graduate with a demonstrated record of leadership and problem solving. My data science projects span various topics, including air pollution, carpooling, house pricing, and machine learning in horse racing.
View all posts by karim El Zaatari >
