In the Money: Predicting the Outcomes of Horse Races with RaceQuant


RaceQuant is a startup that provides consulting for horse race betting. RaceQuant enlisted our team to use machine learning to predict the outcomes of horse races more accurately and to inform betting strategy. They provided three years' worth of Hong Kong Jockey Club (HKJC) horse racing data (2015-2017) from the tracks at Sha Tin and Happy Valley, including public data from the HKJC website and an enhanced dataset with 35 additional variables.

The payout for horse race bets is based on the amount of money bet on each horse, after the HKJC has taken an 18% commission. To turn a profit, then, your bets must beat this 18% hurdle. Our approach was to model the probability of each horse winning a given race, compare it to the market-implied probability, and recommend bets on horses whose modeled chances exceeded the market's. We discuss this in detail below.
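The bet-selection rule can be sketched in a few lines. The 18% takeout figure comes from the text above; the function names and the simplifying assumption that the takeout is spread proportionally across the pool are ours, not the project's actual pool mechanics.

```python
# Sketch: when would a model recommend a bet? Assumes the 18% takeout is
# spread proportionally across the pool, so implied probabilities of all
# horses in a race sum to 1 (a simplification of real pari-mutuel pools).
TAKEOUT = 0.18

def implied_market_prob(decimal_odds):
    """Win probability implied by pari-mutuel decimal odds after takeout."""
    return (1.0 - TAKEOUT) / decimal_odds

def recommend_bet(model_prob, decimal_odds):
    """Bet only when the modeled probability beats the market-implied one."""
    return model_prob > implied_market_prob(decimal_odds)

# Decimal odds of 5.0 imply a market probability of about 0.164;
# a modeled probability of 0.25 clears that bar, so the bet is recommended.
bet = recommend_bet(0.25, 5.0)
```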

Data Processing

The raw data contained over 29,000 observations, covering 2,384 HKJC races between 2015 and 2017. Before this data could be fed into models, some transformation was necessary. The data cleaning and processing was done using the Pandas library in Python.

Handling Missing Data

Some features were missing data for some horses. Some of these could be reasonably imputed. For example, horses with only one race didn't yet have an avg_horse_body_weight; for these, we imputed horse_body_weight (the horse's current body weight). In other cases, features with missing data were replaced altogether by new features. For example, avg_winning_body_weight was dropped in favor of two new features: prev_won (whether the horse has previously won a race) and wtdf_prev_avg_win_wt (the difference between the horse's weight in the current race and its average winning weight).
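The first imputation can be sketched with pandas. The frame and weights here are toy data; only the two column names come from the dataset.

```python
import pandas as pd

# Toy frame: the second and third horses are first-time racers, so their
# avg_horse_body_weight is missing.
df = pd.DataFrame({
    "horse_body_weight":     [1050, 1102, 980],
    "avg_horse_body_weight": [1045.0, None, None],
})

# Fall back to the horse's current body weight when no average exists yet.
df["avg_horse_body_weight"] = df["avg_horse_body_weight"].fillna(
    df["horse_body_weight"]
)
```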

New Features

We created several new features to make the data more manageable and to better capture the horses' previous performance:

  • prev_won - whether the horse has won a previous race.
  • previously_raced - whether the horse has had a previous race (win or lose).
  • wtdf_prev_race - weight difference between the previous race and this one (if the horse has no previous race, this value is 0).
  • wtdf_prev_avg_win_wt - weight difference between the average winning body weight and this race (if the horse has no previous wins, this value is 0).
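The four features above can be derived with pandas roughly as follows. The data is a toy race history, and every column name other than the four engineered features is a hypothetical stand-in for the real dataset's columns.

```python
import pandas as pd

# Toy race history for two horses, one row per start, in chronological order.
df = pd.DataFrame({
    "horse_id":          ["A", "A", "A", "B"],
    "race_no":           [1, 2, 3, 1],
    "horse_body_weight": [1000, 1010, 1005, 980],
    "won":               [1, 0, 0, 0],
})
df = df.sort_values(["horse_id", "race_no"])
df["_win_wt"] = df["horse_body_weight"] * df["won"]  # weight in winning races only
g = df.groupby("horse_id")

# previously_raced: any earlier start; prev_won: any win strictly before this race.
df["previously_raced"] = g.cumcount() > 0
prior_wins = g["won"].transform(lambda s: s.cumsum().shift(fill_value=0))
df["prev_won"] = prior_wins > 0

# wtdf_prev_race: weight change since the previous race (0 for a debut).
df["wtdf_prev_race"] = (
    df["horse_body_weight"] - g["horse_body_weight"].shift()
).fillna(0)

# wtdf_prev_avg_win_wt: current weight minus the average weight across previous
# winning races (0 when the horse has no prior win).
prior_win_wt = g["_win_wt"].transform(lambda s: s.cumsum().shift(fill_value=0))
avg_prev_win_wt = prior_win_wt / prior_wins  # NaN when prior_wins == 0
df["wtdf_prev_avg_win_wt"] = (
    df["horse_body_weight"] - avg_prev_win_wt
).where(df["prev_won"], 0.0)
```

Each feature looks only at races strictly before the current one (via the shifted cumulative sums), which avoids leaking the current race's outcome into its own features.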

Modeling Approaches

Because of the nature of horse races (many discrete races, each with 7-14 horses), it is difficult to build a model that predicts finishing positions outright. Furthermore, many betting strategies rely on predicting the probability of a given horse winning a race and comparing it to the market-implied probability to decide what to bet. Consequently, our approach was to build models that predict horse run times and use these to simulate each race many times, allowing us to extract a win probability for each horse in each race. Breaking it down:

  1. Predict the mean run time for a given horse under a given set of conditions.
  2. Predict the variance in run time for this horse/conditions. Combined with the mean prediction, this gives us a distribution of possible times for this horse under these conditions.
  3. Repeat for each horse in the race.
  4. Using the predicted time distributions for each horse in the race, simulate 100,000 races. Treat the fraction of races won by each horse as its probability of winning the actual race (e.g., if horse A wins 20,000 of the 100,000 simulated races, its predicted probability of winning the race is 0.2).
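Steps 1-4 can be sketched with NumPy. The means and standard deviations below are made up; the real values would come from the models in steps 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictions for a 4-horse race: mean run time (seconds) and
# its standard deviation, as produced by steps 1 and 2.
mean_times = np.array([70.0, 70.3, 70.5, 71.0])
std_times = np.array([0.6, 0.5, 0.8, 0.7])

N_SIMS = 100_000
# One row per simulated race: draw a run time for every horse, assuming
# times are normally distributed around the model's prediction.
times = rng.normal(mean_times, std_times, size=(N_SIMS, len(mean_times)))

# The lowest time in each row wins; win probability = fraction of wins.
winners = times.argmin(axis=1)
win_prob = np.bincount(winners, minlength=len(mean_times)) / N_SIMS
```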

We used several types of models with this approach, including:

  • Linear Regression: scikit-learn LinearRegression
  • Ridge and Lasso Regression: scikit-learn RidgeCV, LassoCV, ElasticNetCV
  • K-Nearest Neighbors (KNN): scikit-learn KNeighborsRegressor
  • Gradient Boosting: XGBoost
  • Random Forest: scikit-learn RandomForestRegressor
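Each of these models was fit to predict run time from the engineered features. As a sketch with the random forest (the feature matrix and target here are synthetic stand-ins for the real data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for the engineered features and run-time target.
X = rng.normal(size=(500, 6))
y = 70.0 + 0.5 * X[:, 0] + rng.normal(scale=0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)  # predicted run times for held-out race entries
```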


We took several approaches to measuring the success of our model:

  1. Measuring the percent of correctly predicted first and second place horses for our test set of races.
  2. Measuring the return on investment assuming a flat bet of $10 on only the winner of each race.
  3. Measuring the return on investment using a betting strategy derived from the Kelly Criterion.

Predicting Winners and ROI For $10 Bets

The first measure of success we used was the number of races for which we correctly predicted the winner. Out of 805 races in our test set, our best model correctly predicted the first-place horse in 17.52% of races (141 races) and the second-place horse in 12.80% of races (103 races). These results beat betting randomly, which, with fields of 7-14 horses, would be expected to pick the winner in roughly 9% of races.

We then applied a flat betting strategy to these predictions, betting $10 on the horse we predicted to win each race. The return on investment for this strategy was -1.13% (a loss of $91 on an $8,050 investment). Note that the payout on each race depends on how much the betting population wagered on each horse (a proxy for how likely the crowd believes that horse is to win): the more money placed on a horse, the lower its payout. There is therefore less reward for picking horses the market already favors.
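The flat-bet ROI itself is simple arithmetic; as a sketch (the odds and race counts below are illustrative, not the project's actual payouts):

```python
STAKE = 10.0  # flat $10 bet per race

def flat_bet_roi(winner_decimal_odds, n_races):
    """ROI when $10 is staked on one horse per race and only correct picks pay.

    winner_decimal_odds: total decimal return per $1 staked, for each
    race where the pick actually won.
    """
    staked = STAKE * n_races
    returned = sum(STAKE * odds for odds in winner_decimal_odds)
    return (returned - staked) / staked

# 3 correct picks at decimal odds 4.0, 2.5, and 6.0 across 10 races:
roi = flat_bet_roi([4.0, 2.5, 6.0], 10)  # (125 - 100) / 100 = 0.25
```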

Return For Kelly Betting Strategy

The Kelly Criterion is a formula used to optimize bet sizing. It weighs the payout for each bet against the probability of winning and recommends what fraction of your bankroll to wager. It can be described with the following equation:

f* = (bp - q) / b

where:
  • f* is the fraction of the current bankroll to wager.
  • b is the net odds received on the wager (your return is "b to 1": on a $1 bet, you receive $b in winnings plus your $1 back).
  • p is the probability of winning.
  • q is the probability of losing (1-p).
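A minimal sketch of the formula follows; the clamp to zero for negative-edge bets is a standard convention, not something stated above.

```python
def kelly_fraction(b, p):
    """Kelly fraction f* = (b*p - q) / b, where q = 1 - p.

    b: net odds ("b to 1"); p: estimated win probability.
    Returns 0 when the edge is negative (i.e., don't bet).
    """
    q = 1.0 - p
    return max((b * p - q) / b, 0.0)

# Net odds of 4-to-1 with a modeled 30% win probability:
f = kelly_fraction(4.0, 0.30)  # (1.2 - 0.7) / 4 = 0.125
```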

Using this approach, our return on investment was -100% (our bankroll hit $0 after 69 races).

Areas for Improvement

There are a variety of areas for improvement for this project:

  • Exploring additional modeling techniques, particularly for modeling the error in race time.
  • Employing a multinomial conditional logistic regression model (see below for more detail).
  • Considering more conservative betting strategies.
  • Adding additional historical data to train model.
  • Additional feature engineering.

Multinomial Conditional Logistic Regression

The multinomial conditional logistic regression (MCLR) model is an alternative to our approach. Instead of modeling run times and then assessing their error, MCLR would directly estimate the probability that each horse in a given race finishes first, which is precisely our target. While conceptually attractive, we held off on implementing it because of its complexity for our specific use case.

About the Team

This was completed as a capstone project at NYC Data Science Academy. The members of this team are Kevin Cannon, Howard Chang, Julie Levine, Lavanya Gupta, and Max Schoenfeld. Thanks to RaceQuant for working with us!

About Authors

Julie Levine

Julie Levine has a BSE in Chemical and Biomolecular Engineering from The University of Pennsylvania. She has worked in a variety of roles in marketing and product management at tech companies including Factual and Datadog. Currently, she is...

Howard Chang

Howard is currently a NYC Data Science Academy Fellow with a MS in Applied Math and Statistics from Stony Brook University. He has work experience in Portfolio Finance and Margin at a billion dollar multi-strategy hedge fund and...

Max Schoenfeld

Max is a data scientist pursuing opportunities to use his machine learning expertise in a market-oriented setting such as sports gambling, finance, or general business analysis. He has business experience providing investment professionals with data solutions and recommendations.
