Can Machine Learning Make Horse Races A Winning Proposition?

For our capstone project, we partnered with RaceQuant, a startup specializing in Hong Kong horse race betting. Our goal was to apply machine learning to more accurately predict the outcomes of races held by the Hong Kong Jockey Club (HKJC) and to advise on an optimal betting strategy. The HKJC is world-renowned and a distinct part of Hong Kong's culture: its emerald turf attracts about HK$138.8 million (US$17.86 million) in wagers per race, more than any other track in the world.

Betting on horse racing is notoriously difficult and is considered by many speculators to be uncrackable. But difficult is not the same as impossible. Our motivation to find an edge and generate profitable models stems from the ground-breaking work of Bill Benter, who is said to have amassed a $1 billion fortune over his career doing just that.

We were provided with data for more than 1,600 races from the 2016-2017 and 2017-2018 Hong Kong Jockey Club seasons. To be profitable, we first had to clear the hurdle of the 17.5% track take that the HKJC deducts from the Win Pool on every race.

We began with a deep-dive into the Kelly Criterion and an exploration of the data made available to us. Our original inclination was to develop linear models that could predict horse running times, build probability distribution functions around those predicted times, simulate races, and apply a betting algorithm to them. After studying quantile-quantile plots of the features and standard errors, and applying various transformation methods including the Box-Cox Transformation, the departures from normality we observed in the data made a purely linear modeling path hard to justify.
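The normality check described above can be sketched as follows. This is a minimal illustration on synthetic data (the RaceQuant features themselves are not public): a Q-Q fit statistic before and after a Box-Cox transform, using scipy.

```python
# Sketch of the normality check: Q-Q plot fit before/after Box-Cox,
# on synthetic skewed data standing in for horse running times.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
running_times = rng.lognormal(mean=4.3, sigma=0.4, size=1000)  # right-skewed

# probplot returns ((quantiles, ordered values), (slope, intercept, r));
# r close to 1.0 means the data lie close to the normal reference line
(_, _), (_, _, r_raw) = stats.probplot(running_times, dist="norm")

# Box-Cox requires strictly positive data; lambda is fit by maximum likelihood
transformed, lmbda = stats.boxcox(running_times)
(_, _), (_, _, r_bc) = stats.probplot(transformed, dist="norm")

print(f"lambda = {lmbda:.3f}, Q-Q fit before: {r_raw:.4f}, after: {r_bc:.4f}")
```

For lognormal data the fitted lambda lands near zero (a log transform), and the Q-Q fit improves markedly after transformation.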

Instead, we opted to proceed with logistic and classification-based modeling, as this approach relaxed some of those prerequisites and more readily output winning probabilities that we could feed to our betting model. We engineered several features and imputed missing values on a feature-by-feature basis.

We created several new features to better estimate the probability of a horse winning a race. Based on the assumption that horses weighing in close to their average winning body weight have a higher likelihood of winning, we created a binary flag to signal that condition. We likewise engineered a feature that compared a horse's speed rating (as computed by RaceQuant analysts) with that of a typical winner in that Class. We created features that measured a horse's change in weight, the number of days since its last race, whether this was its first race in Hong Kong, and whether the horse won its last race. We also created a composite weighted winning percentage that considered the recent number of wins.
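A few of these features can be sketched in pandas. The column names below (`declared_weight`, `avg_win_weight`, `race_date`, `won`) are stand-ins, not RaceQuant's actual schema; the logic mirrors the features described above.

```python
# Hypothetical sketch of some of the engineered features described in the text.
import pandas as pd

df = pd.DataFrame({
    "horse_id":        ["A", "A", "B", "B"],
    "race_date":       pd.to_datetime(["2017-01-01", "2017-02-01",
                                       "2017-01-01", "2017-03-01"]),
    "declared_weight": [1050, 1062, 980, 1005],
    "avg_win_weight":  [1055, 1055, 990, 990],
    "won":             [1, 0, 0, 1],
}).sort_values(["horse_id", "race_date"])

# Binary flag: horse weighs within 1% of its average winning body weight
df["near_win_weight"] = (
    (df["declared_weight"] - df["avg_win_weight"]).abs()
    / df["avg_win_weight"] <= 0.01
).astype(int)

# Change in weight and days since last race, computed per horse
grp = df.groupby("horse_id")
df["weight_change"] = grp["declared_weight"].diff()
df["days_since_last_race"] = grp["race_date"].diff().dt.days

# First start (no prior race on record) and whether the previous start was a win
df["first_race"] = df["weight_change"].isna().astype(int)
df["won_last_race"] = grp["won"].shift().fillna(0).astype(int)
```

The per-horse `groupby` keeps one horse's history from leaking into another's, which is the main pitfall when computing lagged features like these.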

We imputed missing values for a horse's previous ratings, distance run in previous races, the courses on which the horse had competed in previous races, trackwork and barrier trials, jockey and previous-jockey win percentage, wins, and mounts. If a horse was new, it did not have an average body weight, so we imputed this feature with its previous weight.
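A feature-by-feature imputation pass might look like the sketch below. The column names and fallback rules are illustrative assumptions, not the team's exact choices.

```python
# Hedged sketch of per-feature imputation; "prev_rating", "jockey_win_pct",
# and "avg_body_weight" are stand-in column names.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "horse_id":        ["A", "A", "B"],
    "prev_rating":     [60.0, np.nan, np.nan],
    "jockey_win_pct":  [0.12, np.nan, 0.08],
    "avg_body_weight": [np.nan, 1056.0, np.nan],
    "declared_weight": [1050.0, 1062.0, 980.0],
})

# Carry a horse's last known rating forward; fall back to the field median
df["prev_rating"] = (
    df.groupby("horse_id")["prev_rating"].ffill()
      .fillna(df["prev_rating"].median())
)

# Jockey win percentage: fall back to the overall mean
df["jockey_win_pct"] = df["jockey_win_pct"].fillna(df["jockey_win_pct"].mean())

# New horses have no average body weight; use the most recent weight instead
df["avg_body_weight"] = df["avg_body_weight"].fillna(df["declared_weight"])
```

Keeping each feature's fallback rule explicit, rather than applying one global fill, matches the "feature by feature" approach described in the text.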

Using correlation matrices, random forest classification, and coefficient analysis on normalized variables, we evaluated the relative predictive power and importance of each feature. From this work, we built models one feature at a time, based on sets of features we identified as impactful, and evaluated their performance. Our hypothesis was that our original fully featured model, with well over 100 variables, was over-informed and not generating optimal probabilities. This proved correct: in nearly every modeling instance, a reduced model performed better.
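The screening step can be sketched as a two-stage filter: drop one of any pair of highly correlated features, then rank the survivors by random-forest importance. This runs on synthetic data standing in for the race features.

```python
# Sketch of feature screening: correlation filter + random-forest importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

# Drop the later member of any pair of features with |correlation| > 0.9
corr = np.corrcoef(X, rowvar=False)
redundant = {j for i in range(20) for j in range(i + 1, 20)
             if abs(corr[i, j]) > 0.9}
keep = [i for i in range(20) if i not in redundant]

# Rank the surviving features by random-forest importance
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X[:, keep], y)
ranked = sorted(zip(keep, rf.feature_importances_), key=lambda t: -t[1])
print("top features:", ranked[:5])
```

The ranked list is what drives the "one feature at a time" model building: start from the top of the ranking and stop adding features once held-out performance flattens.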

By way of example, below are four sample logistic models we ran with dramatically reduced feature sets (in different combinations). Each performed better than our fully featured model. Of note, the starting bankroll for each model was $100,000. Additionally, though these models appeared to generate strong returns in the seasons they were initially trained and tested on, further simulation showed reduced performance, and we were concerned by their drawdown rates (as measured by minimum bankroll).

| Metric | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|
| Total Number of Bets | 762 | 700 | 727 | 710 |
| Number of Bets Per Race | 4.7 | 4.3 | 4.5 | 4.4 |
| ROI (on Betting Amount) | 18.9% | 23.5% | 22.5% | 18.3% |
| Number of Winning Bets | 62 | 54 | 60 | 63 |
| Final Bankroll | 246,436 | 289,154 | 294,663 | 269,050 |
| Minimum Bankroll | 72,924 | 68,442 | 69,498 | 67,849 |
| Maximum Bankroll | 327,080 | 316,425 | 340,525 | 370,295 |
| % of Races with Winner Predicted Correctly | 29.8% | 32.3% | 31.7% | 31.7% |

One question we debated as a team related to potential model over-fitting and model bias. Could it be that a certain feature set worked well in one test set while another would perform better in a different season? Another possibility we had to consider: could one type of model perform better in one season and another in a different season? Given that we only had two seasons for modeling and testing, we addressed this issue by grouping and regrouping the races into different season simulations.

In order to best represent the true overall distribution of results for each model, we ran Monte Carlo Simulations. Monte Carlo Simulations test models through repeated random sampling. In our case, this process consisted of repeating random 80/20 splits of our data. For each split of the data, we first trained the model on a random 80% set of races. Then we passed the fitted models obtained from training to our betting algorithm, which was run on the remaining 20%.
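The procedure above can be sketched with scikit-learn. This is a simplified stand-in: it uses synthetic data and log-loss on the held-out 20% in place of the full betting simulation, but the repeated-random-split structure is the same.

```python
# Minimal sketch of the Monte Carlo procedure: repeated random 80/20 splits,
# train on 80%, score on the held-out 20%, average across runs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1600, n_features=10, random_state=0)

scores = []
for seed in range(100):                       # 100 simulated "seasons"
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    probs = model.predict_proba(X_te)[:, 1]   # win probabilities for betting
    scores.append(log_loss(y_te, probs))      # stand-in for bankroll outcome

print(f"mean: {np.mean(scores):.4f}, median: {np.median(scores):.4f}")
```

In the actual project, the scoring step inside the loop would be the betting algorithm run on the held-out races, with ending bankroll recorded per split. One caveat of random splits here: races from the same season appear on both sides of the split, so this measures model stability rather than true out-of-season performance.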

Running many instances of simulations for each model and taking the average performance into account allowed us to achieve more accurate estimates of each model’s true performance.

The histogram above shows the ending bankrolls for 500 simulated seasons, each consisting of all the races in the testing set. Mean: $151,637; Median: $138,713; Min: $71,062; Max: $388,576.

We ran full and simplified models, and evaluated betting outcomes, for a variety of model types, including standard Logistic Regression, Random Forest, XGBoost, LightGBM, and CatBoost. For each of these models, we took note of average drawdown, average final bankroll, number of bets, Return on Capital Deployed, and Return on Initial Bankroll across the thousands of simulations we created.
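Because all of these model families expose the same `fit`/`predict_proba` interface, the comparison reduces to a loop. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost/LightGBM/CatBoost (which are third-party packages with the same interface), and log-loss as a stand-in for the betting metrics.

```python
# Sketch of the model-comparison loop over interchangeable classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logistic":      LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "grad_boost":    GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    results[name] = log_loss(y_te, probs)   # stand-in for the betting metrics
    print(f"{name}: log-loss {results[name]:.4f}")
```

In the project itself, the recorded metrics per model were betting outcomes (drawdown, final bankroll, bet count, returns) rather than a classification score.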

Our attention then turned to the Kelly betting algorithm that formed the basis of our betting strategy. We experimented with the algorithm and fractional betting parameters; ultimately, we zeroed in on a 5% fractional allocation to each race. On the Kelly formula itself, we found, consistently, that a very slight modification to the traditional formula resulted in far superior betting outcomes, no matter which probability model we fed in. This modification allows for the inclusion of more consensus bets (i.e., lower-odds bets) than the traditional algorithm, and we found this to be effective both in the actual seasons and in the thousands of simulated seasons we tested on.
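For reference, the textbook Kelly stake can be sketched as below. The team's exact modification is not specified in the post, so this shows only the standard formula plus a 5%-of-bankroll cap per race as an illustrative assumption.

```python
# Textbook Kelly staking with a fractional cap; the team's actual
# modification to the formula is not reproduced here.
def kelly_fraction(p, decimal_odds):
    """Kelly-optimal bankroll fraction for win probability p at decimal odds."""
    b = decimal_odds - 1.0          # net odds received per unit staked
    f = (b * p - (1.0 - p)) / b     # classic Kelly: f* = (bp - q) / b
    return max(f, 0.0)              # never bet when the edge is negative

def bet_size(bankroll, p, decimal_odds, race_cap=0.05):
    """Stake in dollars, capped at 5% of the bankroll per race."""
    return bankroll * min(kelly_fraction(p, decimal_odds), race_cap)

# Example: model says 30% win chance at decimal odds of 4.0.
# f* = (3 * 0.3 - 0.7) / 3 ≈ 0.067, capped at 5% of a $100,000 bankroll.
print(bet_size(100_000, 0.30, 4.0))   # 5000.0
```

The positive-edge condition `b*p > 1 - p` is what a traditional Kelly rule enforces; a modification that admits "more consensus bets," as described above, would relax or reshape that threshold for low-odds favorites.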

In conclusion, we present a summary of how our models performed. Across 500 simulated seasons, our best returns came from an XGBoost model, which generated median and average returns of 13.5% and 14.5%, respectively, with maximum losses relatively well-contained, as observed in our minimum bankroll levels. For further work, we would pursue additional feature engineering and hyperparameter tuning, so as to include more race information and improve on our returns.

XGBoost Statistics

| Statistic | Mean | Median |
|---|---|---|
| Total Number of Bets | 1,205 | 1,207 |
| Number of Bets Per Race | 7.5 | 7.5 |
| Amount Wagered | $315,247 | $300,164 |
| ROI on Bankroll | 51.6% | 38.7% |
| ROI on Betting Amount | 14.5% | 13.5% |
| Number of Winning Bets | 111 | 111 |
| Biggest Bet | $3,842 | $3,550 |
| Smallest Bet | $10.00 | $10.00 |
| Initial Bankroll | $100,000 | $100,000 |
| Final Bankroll | $151,637 | $138,713 |
| Minimum Bankroll | $91,273 | $93,360 |
| Maximum Bankroll | $158,237 | $144,620 |

About RaceQuant -- RaceQuant was established by experts in the Thoroughbred racing domain who believe that machine learning can be applied successfully to maximize the return on betting investment. RaceQuant can be contacted at [email protected]


About Authors

Michael Sankari

Michael is a Certified Data Scientist with experience in R, Python and SQL. Furthermore, he has a strong background in the finance and real estate industries and loves using analytics to make better decisions.
Matthew Rautionmaa

Matthew is an aspiring data scientist with over four years of professional success in leveraging insights from data analysis to generate business impact in the financial services industry. He is experienced in Python, R, Machine Learning, Web Scraping...
Eric Adlard

Eric is an aspiring data scientist with a track record of using data to drive business insights in financial services. He has hands-on experience in R and Python in web-scraping, data visualization, supervised and unsupervised machine learning, as...
David Levy

David Levy completed his BS from the Kelley School of Business at Indiana University. He has eight years of experience across financial services in various data-oriented, quantitative roles. David enjoys applying an analytical mindset and approach to solve...
Marc Hasson

As an investment research professional, much of my work over the last 17 years has centered around developing a deep understanding of businesses based on senior management interactions, financial modeling, forecasting, and primary due diligence. Data has also been...
