Horse Races: Can Machine Learning Make a Winning Proposition?

Project GitHub | LinkedIn:   Niki   Moritz   Hao-Wei   Matthew   Oren


For our capstone project, we partnered with RaceQuant, a startup specializing in Hong Kong horse race betting. Our goal was to apply machine learning to horse racing in order to predict the outcome of races held by the Hong Kong Jockey Club more accurately and to advise on an optimal betting strategy. The Hong Kong Jockey Club (HKJC) is world-renowned and a distinct part of the culture in Hong Kong. Its emerald turf attracts about HK$138.8 million (US$17.86 million) per race, more than any other track in the world.

Betting on horse racing is notoriously difficult and is considered by many speculators to be uncrackable. But difficult is not the same as impossible. Our motivation to find an edge and generate profitable models stems from the ground-breaking work of Bill Benter, who is said to have amassed a $1bn fortune over his career doing just that.

We were provided with data for more than 1,600 races held by the Hong Kong Jockey Club during the 2016-2017 and 2017-2018 seasons. To be profitable, we first had to clear the hurdle of the 17.5% track take that the HKJC deducts from the Win Pool on every race.
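To make the size of that hurdle concrete, here is a minimal sketch of win-pool arithmetic. It assumes a simple parimutuel model in which the pool, net of the take, is shared in proportion to the money staked on the winner, and it ignores dividend rounding. The function names and the 20% pool share are illustrative, not taken from the project.

```python
TRACK_TAKE = 0.175  # win-pool deduction quoted in the text


def win_dividend(pool_share, take=TRACK_TAKE):
    """Gross payout per $1 staked on a horse holding `pool_share` of the win pool."""
    return (1.0 - take) / pool_share


def breakeven_probability(pool_share, take=TRACK_TAKE):
    """True win probability at which a $1 bet has zero expected profit."""
    return 1.0 / win_dividend(pool_share, take)


def expected_return(true_prob, pool_share, take=TRACK_TAKE):
    """Expected profit per $1 staked, given the horse's true win probability."""
    return true_prob * win_dividend(pool_share, take) - 1.0


# Hypothetical horse carrying 20% of the money in the win pool.
share = 0.20
print(round(win_dividend(share), 3))
print(round(breakeven_probability(share), 4))
# If the public is exactly right (true probability equals pool share),
# the bettor loses the track take on average.
print(round(expected_return(0.20, share), 3))
```

Under these assumptions, a model only makes money on horses whose true win probability exceeds the pool-implied probability by a factor of more than 1 / (1 - 0.175), or roughly 1.21.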

We began with a deep dive into the Kelly Criterion and an exploration of the data made available to us. Our original inclination was to develop linear models that could predict horse running times, build probability distributions around those predicted times, simulate races, and apply a betting algorithm to the simulated outcomes. After studying quantile-quantile plots of the features and standard errors, and applying various transformations, including the Box-Cox transformation, we found it harder to justify a purely linear modeling path, given the nuances we were observing in the data.
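The kind of normality check described above can be sketched as follows. This is an illustration on synthetic data, not the project's code: the skewed feature is a made-up lognormal sample, and the Q-Q plot is summarized by its correlation coefficient instead of being drawn.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for a right-skewed, strictly positive feature.
x = rng.lognormal(mean=0.0, sigma=0.6, size=500)


def qq_correlation(values):
    """Correlation between ordered sample values and normal quantiles.

    Values close to 1.0 mean the Q-Q plot is close to a straight line.
    """
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r


# Box-Cox requires strictly positive data; lambda is chosen by maximum likelihood.
x_bc, lam = stats.boxcox(x)

r_raw = qq_correlation(x)
r_bc = qq_correlation(x_bc)
print(f"lambda={lam:.3f}  r_raw={r_raw:.4f}  r_boxcox={r_bc:.4f}")
```

When the transformed variable still departs visibly from the straight line, as the text reports for parts of the race data, that is a signal that the assumptions behind a linear model are hard to meet.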

Instead, we opted for logistic regression and other classification-based models, since these relax some of the assumptions that linear regression requires and directly output the winning probabilities we needed to feed our betting model. We engineered several features and imputed missing values on a feature-by-feature basis.

We created several new features to better estimate the probability of a horse winning a race. Based on the assumption that horses that weigh in close to their average winning body weight have a higher likelihood of winning, we created a binary flag to signal that. We likewise engineered a feature that compared a horse’s speed rating (as computed by RaceQuant analysts) with that of a typical winner in the same Class. We created features that measured a horse’s change in weight, the number of days since its last race, whether this was its first race in Hong Kong, and whether the horse won its last race. We also created a composite weighted winning percentage that took the number of recent wins into account.
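A few of the features described above can be sketched with pandas. Everything here is illustrative: the column names, the toy rows, and the 5-unit weight tolerance are assumptions, and a horse's first appearance in the data is used only as a rough proxy for a first race in Hong Kong. Each feature uses information from earlier races only, so the current result does not leak into its own predictors.

```python
import pandas as pd

# Toy data; column names are hypothetical.
runs = pd.DataFrame({
    "horse_id": ["A", "A", "A", "B", "B"],
    "race_date": pd.to_datetime(
        ["2017-01-01", "2017-01-22", "2017-02-12", "2017-01-22", "2017-02-12"]
    ),
    "body_weight": [1100, 1108, 1102, 1050, 1046],
    "finish_position": [1, 4, 2, 3, 1],
})
runs = runs.sort_values(["horse_id", "race_date"]).reset_index(drop=True)

# Days since the horse's previous race and change in body weight.
runs["days_since_last_race"] = runs.groupby("horse_id")["race_date"].diff().dt.days
runs["weight_change"] = runs.groupby("horse_id")["body_weight"].diff()

# Did the horse win its previous race? First appearance in the data?
runs["won"] = (runs["finish_position"] == 1).astype(int)
runs["won_last_race"] = (
    runs.groupby("horse_id")["won"].shift(1).fillna(0).astype(int)
)
runs["first_start_in_data"] = (runs.groupby("horse_id").cumcount() == 0).astype(int)

# Average body weight over the horse's PRIOR wins only.
win_weight = runs["body_weight"].where(runs["won"] == 1)
prior_avg_win_weight = win_weight.groupby(runs["horse_id"]).transform(
    lambda s: s.expanding().mean().shift(1)
)

WEIGHT_TOLERANCE = 5  # assumed threshold for "close to" the winning weight
runs["near_avg_win_weight"] = (
    (runs["body_weight"] - prior_avg_win_weight).abs() <= WEIGHT_TOLERANCE
).astype(int)

print(runs)
```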

We imputed missing values for a horse’s previous ratings, the distance run in previous races, the course over which the horse had previously competed, trackwork and barrier trials, and the win percentage, wins, and mounts of the current and previous jockey. A new horse had no average body weight, so we imputed this feature with its previous weight.
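Feature-by-feature imputation of this kind might look like the sketch below. Only the body-weight rule comes from the text. The median fill for the jockey win percentage and the added missing-value indicator are assumptions made for illustration, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "avg_body_weight": [1100.0, np.nan, 1060.0],   # NaN: new horse, no history
    "previous_weight": [1098.0, 1042.0, 1065.0],
    "jockey_win_pct": [0.12, np.nan, 0.08],
})

# Keep a flag so the model can still see that the value was imputed (assumption).
df["avg_body_weight_missing"] = df["avg_body_weight"].isna().astype(int)

# Rule from the text: fall back to the horse's previous weight.
df["avg_body_weight"] = df["avg_body_weight"].fillna(df["previous_weight"])

# Assumed rule: fill an unknown jockey win percentage with the column median.
df["jockey_win_pct"] = df["jockey_win_pct"].fillna(df["jockey_win_pct"].median())

print(df)
```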

Using correlation matrices, random forest classification, and coefficient analysis on normalized variables, we evaluated the relative predictive power and importance of each feature. From this work, we built models one feature at a time, drawing on the sets of features we had identified as impactful, and evaluated their performance. Our hypothesis was that the original fully featured model, with well over 100 variables, was overloaded with noisy or redundant inputs and was not generating optimal probabilities. This proved correct: in nearly every modeling instance, a reduced model performed better.
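Two of the ranking techniques mentioned above, random forest importances and coefficients on standardized variables, can be sketched on synthetic data. The data set below is invented (two informative features and eight pure-noise features), so it only illustrates the mechanics and says nothing about which racing features mattered.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n, p = 2000, 10
X = rng.normal(size=(n, p))
# Only the first two columns carry signal; the rest are noise.
signal = 1.5 * X[:, 0] + 1.0 * X[:, 1]
y = (signal + rng.normal(size=n) > 1.5).astype(int)
names = [f"feature_{i}" for i in range(p)]

# Ranking 1: impurity-based importances from a random forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_rank = [names[i] for i in np.argsort(forest.feature_importances_)[::-1]]

# Ranking 2: absolute logistic coefficients on standardized inputs,
# which makes the coefficient magnitudes comparable across features.
X_std = StandardScaler().fit_transform(X)
logit = LogisticRegression(max_iter=1000).fit(X_std, y)
coef_rank = [names[i] for i in np.argsort(np.abs(logit.coef_[0]))[::-1]]

print("random forest:", rf_rank[:3])
print("logistic coef:", coef_rank[:3])
```

Features that rank near the bottom under both views are natural candidates to drop when building a reduced model.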

By way of example, below are four sample logistic models we ran with dramatically reduced feature sets (in different combinations). Each of these performed better than our fully featured model. The starting bankroll for each model was $100,000. Although these models appeared to generate strong returns in the seasons on which they were initially trained and tested, further simulation showed reduced performance, and we were concerned by their drawdowns (as measured by minimum bankroll).


[Table: results for Model 1, Model 2, Model 3 and Model 4. Rows: Total Number of Bets; Number of Bets Per Race; ROI (on Betting Amount); Number of Winning Bets; Final Bankroll; Minimum Bankroll; Maximum Bankroll; % of Times Winner Predicted Correctly. Cell values are missing from the source.]

One question we debated as a team related to potential model over-fitting and model bias. Could it be that a certain feature set worked well in one test set while another would perform better in a different season? Similarly, could one type of model perform better in one season and another type in a different season? Given that we only had two seasons for modeling and testing, we addressed this issue by grouping and regrouping the races into different simulated seasons.

In order to best represent the true overall distribution of results for each model, we ran Monte Carlo Simulations. Monte Carlo Simulations test models through repeated random sampling. In our case, this process consisted of repeating random 80/20 splits of our data. For each split of the data, we first trained the model on a random 80% set of races. Then we passed the fitted models obtained from training to our betting algorithm, which was run on the remaining 20%.
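The repeated 80/20 procedure described above can be sketched as follows. This is a simplified stand-in, not the project's pipeline: the races are synthetic, a plain logistic regression plays the role of the model, and the share of test races whose top-rated horse wins replaces the betting algorithm's ending bankroll as the recorded outcome. Splitting is done by race, so all runners in a race land on the same side of the split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(1)
n_races, runners = 300, 10
race_id = np.repeat(np.arange(n_races), runners)

# Synthetic runners: two features drive a latent strength; the strongest wins.
X = rng.normal(size=(n_races * runners, 2))
strength = X @ np.array([1.0, 0.5]) + rng.gumbel(size=n_races * runners)
strength = strength.reshape(n_races, runners)
y = (strength == strength.max(axis=1, keepdims=True)).ravel().astype(int)

# 50 random 80/20 splits at the race level (the text uses 500 simulated seasons).
splitter = GroupShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
hit_rates = []
for train_idx, test_idx in splitter.split(X, y, groups=race_id):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    win_prob = model.predict_proba(X[test_idx])[:, 1]
    test_races = race_id[test_idx]
    test_wins = y[test_idx]
    hits = 0
    for r in np.unique(test_races):
        mask = test_races == r
        # Did the horse with the highest predicted probability win this race?
        hits += int(test_wins[mask][np.argmax(win_prob[mask])] == 1)
    hit_rates.append(hits / len(np.unique(test_races)))

hit_rates = np.array(hit_rates)
print(f"mean={hit_rates.mean():.3f}  min={hit_rates.min():.3f}  max={hit_rates.max():.3f}")
```

The spread between the minimum and maximum across splits is the point of the exercise: a single train/test split can land anywhere in that range.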

Running many instances of simulations for each model and taking the average performance into account allowed us to achieve more accurate estimates of each model’s true performance.

The above histogram shows the ending bankroll for 500 simulated seasons, each consisting of all the races in the testing set. Mean: $151,637, Median: $138,713, Min: $71,062, Max: $388,576.

We ran full and simplified models, and evaluated betting outcomes, for a variety of model types, including standard logistic regression, Random Forest, XGBoost, LightGBM (LGBM), and CatBoost. For each of these models, we recorded the average drawdown, average final bankroll, number of bets, Return on Capital Deployed, and Return on Initial Bankroll across the thousands of simulations we ran.
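The per-simulation metrics listed above can be computed from a sequence of settled bets, as in this sketch. The definitions are assumptions, since the post does not spell them out: Return on Capital Deployed is taken as profit divided by total amount staked, Return on Initial Bankroll as profit divided by the starting bankroll, and drawdown is tracked through the minimum bankroll, as elsewhere in the text. The three sample bets are invented.

```python
def summarize(bets, initial_bankroll=100_000.0):
    """Summarize a chronological list of (stake, gross_payout) pairs."""
    bankroll = initial_bankroll
    path = [bankroll]
    staked = 0.0
    winning_bets = 0
    for stake, payout in bets:
        bankroll += payout - stake      # a losing bet has payout 0
        staked += stake
        winning_bets += int(payout > 0)
        path.append(bankroll)
    profit = bankroll - initial_bankroll
    return {
        "number_of_bets": len(path) - 1,
        "winning_bets": winning_bets,
        "final_bankroll": bankroll,
        "minimum_bankroll": min(path),
        "maximum_bankroll": max(path),
        "return_on_capital_deployed": profit / staked if staked else 0.0,
        "return_on_initial_bankroll": profit / initial_bankroll,
    }


# Hypothetical season of three bets: lose, win, lose.
summary = summarize([(1_000.0, 0.0), (2_000.0, 7_000.0), (1_500.0, 0.0)])
for key, value in summary.items():
    print(key, value)
```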

Our attention then turned to the Kelly betting algorithm that formed the basis of our betting strategy. We experimented with the algorithm and its fractional betting parameters; ultimately, we settled on a 5% fractional allocation to each race. On the Kelly formula itself, we found, consistently, that a very slight modification to the traditional formula resulted in far superior betting outcomes, no matter which probability model we fed in. This modification allows for the inclusion of more consensus bets (i.e., bets at lower odds) than the traditional algorithm, and we found it effective both in the actual seasons and in the thousands of simulated seasons we tested.
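For reference, the traditional Kelly stake for a single win bet can be sketched as below. This is the unmodified textbook formula; the team's modification is not described in the post and is not reproduced here. Treating the 5% figure as a multiplier on the full Kelly stake is one possible reading and is an assumption, as are the function names and the example numbers.

```python
def kelly_fraction(win_prob, net_odds):
    """Textbook Kelly share of bankroll for a bet paying `net_odds` to 1.

    f* = (b * p - q) / b, with b = net odds, p = win probability, q = 1 - p.
    Returns 0.0 when the bet has no positive expected value.
    """
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must be between 0 and 1")
    if net_odds <= 0:
        raise ValueError("net_odds must be positive")
    edge = win_prob * net_odds - (1.0 - win_prob)
    return max(edge / net_odds, 0.0)


def stake(bankroll, win_prob, net_odds, fraction=0.05):
    """Fractional Kelly stake; `fraction` scales down the full Kelly bet (assumed)."""
    return bankroll * fraction * kelly_fraction(win_prob, net_odds)


# Hypothetical horse: model says 30% to win, dividend of 4.0 (net odds 3 to 1).
print(round(kelly_fraction(0.30, 3.0), 4))
print(round(stake(100_000, 0.30, 3.0), 2))
# At the same odds, a 20% chance has negative expected value, so no bet is placed.
print(kelly_fraction(0.20, 3.0))
```

Because the full Kelly stake is very sensitive to errors in the estimated probability, scaling it down trades some expected growth for smaller drawdowns.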

In conclusion, we present a summary of how our models performed. Across 500 simulated seasons, our best returns came from an XGBoost model that generated median and average returns of 13.5% and 14.5%, respectively, with maximum losses relatively well contained, as indicated by our minimum bankroll levels. For further work, we would look to additional feature engineering and hyperparameter tuning, so as to include more race information and improve on our returns.




[Table: summary of model performance. Rows: Total Number of Bets; Number of Bets Per Race; Amount Wagered; ROI on Bankroll; ROI on Betting Amount; Number of Winning Bets; Biggest Bet; Smallest Bet; Initial Bankroll; Final Bankroll; Minimum Bankroll; Maximum Bankroll. Cell values are missing from the source.]

About RaceQuant: RaceQuant was established by experts in the Thoroughbred racing domain who believed that machine learning could be applied successfully to maximize the return on betting investment. The company can be contacted at [email protected].


About Authors

Michael Sankari

Michael is a Certified Data Scientist with experience in R, Python and SQL. Furthermore, he has a strong background in the finance and real estate industries and loves using analytics to make better decisions.

Matthew Rautionmaa

Matthew is an aspiring data scientist with over four years of professional success in leveraging insights from data analysis to generate business impact in the financial services industry. He is experienced in Python, R, Machine Learning, Web Scraping...

Eric Adlard

Eric is an aspiring data scientist with a track record of using data to drive business insights in financial services. He has hands-on experience in R and Python in web-scraping, data visualization, supervised and unsupervised machine learning, as...

David Levy

David Levy completed his BS from the Kelley School of Business at Indiana University. He has eight years of experience across financial services in various data-oriented, quantitative roles. David enjoys applying an analytical mindset and approach to solve...

Marc Hasson

As an investment research professional, much of my work over the last 17 years has centered around developing a deep understanding of businesses based on senior management interactions, financial modeling, forecasting, and primary due diligence. Data has also been...

Leave a Comment

Milind Dalvi October 23, 2019
Interesting Blog! However, it seems like the text focuses more on the design of the betting framework rather than the model itself. Yeah, you can classify for "horse placing" or regress for "finish time" but it seems to me that racing is ranking problem. Did you try XGBoost with ranking objective? I wonder you must have faced difficulties with that imbalance in classification. Also, there is no mention of ensembling models... interesting
