In the Money: Predicting the Outcomes of Horse Races with RaceQuant


RaceQuant is a startup that provides consulting for horse race betting. RaceQuant enlisted our team to use machine learning to predict the outcomes of horse races more accurately, in order to advise betting strategy. They provided three years' worth of Hong Kong Jockey Club (HKJC) horse racing data (2015-2017) from the tracks at Sha Tin and Happy Valley, including public data from the HKJC website and an enhanced dataset with 35 additional variables.

The payout for horse race bets depends on the amount of money bet on each horse, after the HKJC has taken an 18% commission. To turn a profit, your bets must beat this 18% hurdle. Our approach was to model each horse's probability of winning a given race, compare it to the market-implied probability, and recommend bets on horses whose modeled chances exceeded the market's. We discuss this in detail below.

Data Processing

The raw data contained over 29,000 observations, covering 2,384 HKJC races between 2015 and 2017. Before this data could be fed into models, some transformation was necessary. The data cleaning and processing was done using the Pandas library in Python.

Handling Missing Data

Some features did not have data for all horses. Some of these could be reasonably imputed. For example, horses with only one race did not yet have an avg_horse_body_weight; for these, we imputed horse_body_weight (the horse's current body weight). In other cases, features with missing data were replaced altogether by new features. For example, avg_winning_body_weight was dropped in favor of two new features: prev_won (whether the horse has previously won a race) and wtdf_prev_avg_win_wt (the difference between the horse's weight in the current race and its average winning weight).
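As a sketch, the body-weight imputation can be done with a pandas fillna on a toy frame (the column names follow the article; the numbers are made up):

```python
import numpy as np
import pandas as pd

# Toy data: NaN in avg_horse_body_weight marks first-time racers
df = pd.DataFrame({
    "horse_body_weight":     [1050, 1102, 980],
    "avg_horse_body_weight": [1048.0, np.nan, np.nan],
})

# Fall back to the horse's current body weight where no average exists yet
df["avg_horse_body_weight"] = df["avg_horse_body_weight"].fillna(df["horse_body_weight"])
```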

New Features

We created several new features to make the data more manageable and better capture important information about the horses’ previous performance:

  • prev_won - whether the horse has won a previous race.
  • previously_raced - whether the horse has had a previous race (win or lose).
  • wtdf_prev_race - weight difference between the previous race and this one (if the horse has no previous race, this value is 0).
  • wtdf_prev_avg_win_wt - weight difference between the average winning body weight and this race (if the horse has no previous wins, this value is 0).
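A minimal pandas sketch of three of these features on toy data (rows are assumed to be sorted chronologically within each horse; wtdf_prev_avg_win_wt is omitted for brevity):

```python
import pandas as pd

# Toy race history: one row per horse per race, in chronological order
races = pd.DataFrame({
    "horse_id":          ["A", "A", "A", "B"],
    "won":               [0,   1,   0,   0],
    "horse_body_weight": [1050, 1060, 1055, 990],
})

# previously_raced: any earlier row exists for this horse
races["previously_raced"] = races.groupby("horse_id").cumcount() > 0

# prev_won: wins strictly before this race (cumulative wins minus this race's result)
races["prev_won"] = (races.groupby("horse_id")["won"].cumsum() - races["won"]) > 0

# wtdf_prev_race: weight change since the horse's previous race (0 for debuts)
races["wtdf_prev_race"] = races.groupby("horse_id")["horse_body_weight"].diff().fillna(0)
```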

Modeling Approaches

Because of the nature of horse races (many discrete races with 7-14 horses each), it is difficult to build a model that predicts horse rank in a given race outright. Furthermore, many betting strategies rely on predicting the probability of a given horse winning a race and comparing it to the perceived market probability to decide what to bet. Consequently, our approach was to build models that predict horse run times and use these to simulate each race many times, allowing us to extract each horse's win probability for each race. Breaking it down:

  1. Predict the mean run time for a given horse under a given set of conditions.
  2. Predict the variance in run time for this horse/conditions. Combined with the mean prediction, this gives us a distribution of possible times for this horse under these conditions.
  3. Repeat for each horse in the race.
  4. Using the predicted time distributions for each horse in the race, simulate 100,000 races. Treat the fraction of simulated races won by each horse as its probability of winning the actual race (e.g., if horse A wins 20,000 of the 100,000 simulated races, its predicted probability of winning is 0.2).
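The simulation in step 4 can be sketched as follows, assuming (as a simplification) that each horse's run time is normally distributed with the predicted mean and standard deviation; the numbers here are invented:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical predictions for one 4-horse race: mean run time (s) and std dev
mean_times = np.array([83.2, 83.5, 84.1, 83.4])  # from the mean-time model
std_times  = np.array([0.6,  0.4,  0.9,  0.5])   # from the variance model

N = 100_000
# Draw N simulated run times for every horse (shape: N x n_horses)
sims = rng.normal(mean_times, std_times, size=(N, len(mean_times)))

# The winner of each simulated race is the horse with the lowest time
winners = sims.argmin(axis=1)
win_prob = np.bincount(winners, minlength=len(mean_times)) / N
```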

We used several types of models with this approach, including:

Model                       Library
Linear Regression           scikit-learn LinearRegression
Ridge and Lasso Regression  scikit-learn LassoCV, ElasticNetCV, RidgeCV
K-Nearest Neighbors (KNN)   scikit-learn KNeighborsRegressor
Gradient Boosting           XGBoost
Random Forest               scikit-learn RandomForestRegressor
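As an illustration of fitting one of the mean-run-time models, here is a minimal random forest regression on synthetic stand-in data (the real models used the engineered features described above, not these made-up columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in features (e.g., weight diffs, draw) and run times around ~83 s
X = rng.normal(size=(500, 5))
y = 83.0 + 0.4 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)  # predicted run times for held-out races
```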


We took several approaches to measuring the success of our model:

  1. Measuring the percentage of correctly predicted first- and second-place horses in our test set of races.
  2. Measuring the return on investment assuming a flat bet of $10 on only the winner of each race.
  3. Measuring the return on investment using a betting strategy derived from the Kelly Criterion.

Predicting Winners and ROI For $10 Bets

The first measure of success was the number of races for which we correctly predicted the winner. Out of 805 races in our test set, our best model correctly predicted the first-place horse in 17.52% of races (141 races) and the second-place horse in 12.80% of races (103 races). These results beat betting randomly, which would be expected to pick the winner in roughly 9% of races given fields of 7-14 horses.

We then applied a flat betting strategy to these predictions, betting $10 on the horse we predicted to win in each race. The return on investment for this strategy was -1.13% (a loss of $91 on an investment of $8,050). Note that the payout for each race depends on how much the betting population wagered on each horse (a proxy for how likely each horse is perceived to be to win): the more money placed on a horse, the lower its payout. There is therefore less reward for picking horses the market already favors.
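The ROI arithmetic for a flat-bet strategy looks like this; the dividend figure below is hypothetical, not the actual HKJC payout:

```python
# Flat $10 bets on our predicted winner in each race
stake = 10.0
n_races = 805
n_wins = 141                # races where our pick came first
avg_win_dividend = 55.0     # hypothetical average return per winning $10 ticket

total_staked = stake * n_races
total_returned = n_wins * avg_win_dividend
roi = (total_returned - total_staked) / total_staked  # negative means a loss
```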

Return For Kelly Betting Strategy

The Kelly Criterion is a formula used to optimize bet sizing. It weighs the payout for each bet against the probability of winning and recommends what fraction of your bankroll to wager. It can be described with the following equation:

f* = (bp - q) / b

where:
  • f* is the fraction of the current bankroll to wager.
  • b is the net odds received for the wager (a "b to 1" payout: on a winning $1 bet, you receive $b in addition to getting your $1 back).
  • p is the probability of winning.
  • q is the probability of losing (1-p).
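A small helper implementing this formula, clipped at zero so that no bet is placed when the edge is negative:

```python
def kelly_fraction(b: float, p: float) -> float:
    """Fraction of bankroll to wager on a b-to-1 bet won with probability p."""
    q = 1.0 - p
    f = (b * p - q) / b
    return max(f, 0.0)  # negative edge -> do not bet

# e.g. 4-to-1 odds with a 30% modeled win probability -> wager 12.5% of bankroll
stake_frac = kelly_fraction(4.0, 0.30)
```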

Using this approach, our return on investment was -100% (our bankroll hit $0 after 69 races).

Areas for Improvement

There are a variety of areas for improvement for this project:

  • Exploring additional modeling techniques, particularly for modeling the error in race time.
  • Employing a multinomial conditional logistic regression model (see below for more detail).
  • Considering more conservative betting strategies.
  • Adding more historical data to train the models.
  • Additional feature engineering.

Multinomial Conditional Logistic Regression

Multinomial Conditional Logistic Regression (MCLR) is an alternative to our methodology. Instead of modeling run times and then assessing their error, MCLR would directly provide the probability that each horse in a given race finishes in first place, which is precisely our target. While conceptually simpler, we held off on implementing it because of its complexity for our specific use case.

About the Team

This was completed as a capstone project at NYC Data Science Academy. The members of this team are Kevin Cannon, Howard Chang, Julie Levine, Lavanya Gupta, and Max Schoenfeld. Thanks to RaceQuant for working with us!

About Authors

Julie Levine

Julie Levine has a BSE in Chemical and Biomolecular Engineering from The University of Pennsylvania. She has worked in a variety of roles in marketing and product management at tech companies including Factual and Datadog. Currently, she is...

Howard Chang

Howard is currently an NYC Data Science Academy Fellow with an MS in Applied Math and Statistics from Stony Brook University. He has work experience in Portfolio Finance and Margin at a billion-dollar multi-strategy hedge fund and...

Max Schoenfeld

Max is a data scientist pursuing opportunities to use his machine learning expertise in a market-oriented setting such as sports gambling, finance, or general business analysis. He has business experience providing investment professionals with data solutions and recommendations.
