Can Categorical Features Help Improve Horse Racing Models?

Posted on Oct 15, 2020
The skills we demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

LinkedInGitHubEmail | Data | Web App


The  horse racing community has been using quantitative data  to develop betting algorithms for decades.  Indicators including horse bodyweight, age, and previous lap times are all utilized. Along with the domain specific Speed Index to predict future race outcomes. Our team was asked to answer a new question in horse racing: Can subjective data taken by an SME improve the predictive ability of horse racing models?

Our partner in this task was RaceQuant, a horse racing syndicate that makes proprietary wagers on races in Hong Kong’s two venues: Happy Valley and Sha-Tin. Prior to each race, participating horses are walked around an open area next to the track known as the ‘paddock.’ Racing insiders stand by and orate their observations of each horse to a respective betting syndicate.

RaceQuant had contracted an SME to take note of specific features, and One Hot Encoded the observations.  These observations can include subjective takes like ‘confident’ or ‘nervous,’ and seemingly span the entire English vocabulary.  For the study, features were broken down to between 5 and 7 distinct values.  As an example, some horses will have bandaged legs. 

This feature would then be broken down into categories of [front-leg single, front-leg double, rear-leg single, rear-leg double, all] and subsequently One Hot Encoded for data privacy.  The observations spanned one season of racing and 800 individual races.

In addition to the features provided by the SME, all observations also included which one of the 23 primary trainers was the trainer for the horse. Features indicating the rank of each horse by bodyweight in the race were given, as well as columns indicating how the horse’s bodyweight had changed since its prior race.  As with the data provided from the Paddock, the trainer and bodyweight data was One Hot Encoded.

Establishing the Benchmarks

In order to judge the relevance of the provided data, RaceQuant established two benchmarks. The Tips Indexes provided by the Hong Kong Jockey Club and the Odds data for each horse in each race.  The Tips Indexes provided by the Hong Kong Jockey Club are published on its website at before 4:30 pm on the day preceding a race day. As well as before 9:30 am on each race day morning.  For the analysis, the team utilized the race day measure of the Tips Index. 

“A Tips Index is not a "tip" itself but a reference figure based on selections by the racing media and generated by applying a statistical formula. Each starter's Tips Index represents the degree of favor it has among tipsters.  For easy comprehension, a Tips Index is expressed in the form of an odds-like figure. The smaller the figure, the higher is the degree of favor a starter enjoys among tipsters.”  Historically, the Tips Indexes ~24% of the winning horses.

Odds Data

The Odds data was taken five minutes prior to the start of the race.  Odds are subject to change and are fully correlated with the amount of money placed on the horse.  Additionally, the true odds payout is not set until the start of the race. Consequently, the value of a ticket purchased can move up and down accordingly.  Lastly, the racetrack collects ~22% of total wagers as commission.  Therefore, the summed probabilities of winning for the horses racing is 1.22 or 122% as derived from the listed odds.  Historically, the Odds favorite wins ~33% of the time.

RaceQuant had two primary questions regarding the data: First, can the Paddock data alone or in conjunction with the Tips Index and Odds be used to predict a higher percentage of winning horses relative to the established benchmarks on an “unseen” number of races.  Second, would a model using this data provide a higher ROI than the benchmarks after making one unit bets on the model’s favored horse over a given period of time. 

With two benchmarks to compare results, the team focused on probabilistic classification models to decide whether or not a horse is a winner.  A motivating factor in this decision was that the RaceQuant desired full transparency in the models so future efforts could focus on features with statistically significant results.  Given this, along with the limited data, the agreed upon choice was to utilize Logistic Regression models in the research. 

Bolton and Chapman

Bolton and Chapman  had already proven the usefulness of a multinomial logit model for handicapping horse races in their 1986 white paper.  However, given the highly subjective features within the data set the team also developed a Random Forest Classifier. 

An obstacle was the nature of the data, more specifically, that the models lacked the ability to identify the ‘race-by-race’ nature of the data. This meant each horse would effectively be racing against all of the horses in the dataset, including itself if there were > 1 observations. To solve this problem, we pulled the win probabilities each model assigned a given horse, and then ranked the horses in a given race based on the model-assigned probability of winning.

To summarize, two models were trained, a Logistic and Random Forest classifier, for both the Paddock data alone and the combined Paddock and Tips/Odds dataset. For the test set, the probabilities of winning for each horse were determined from the models, and the expected winner of each race was selected as the horse with the highest win probability.

The total number of correct predictions was calculated, and the picks from our models on the test set were compared with our two benchmarks: betting on the horse with the most favorable odds and betting on the horse with the most favorable Tips Index.


When the Logistic Regression and Random Forest were trained and tested on only the Paddock dataset, these models underperformed their benchmarks. Moreover, the return (loss) from utilizing strictly the Paddock data performed no better than random selection.

While all four strategies seem to track closely for the first two months, eventually both models diverge and never recover.  Between November and December of 2019 the random forest outperforms on the test set.  These results may simply be noise,  as this deviation is well within the critical value to have appeared at random.  Nevertheless, there does appear to be structure within the results that tracks in line with the Tips and Odds dataset; specifically in January, late March, and early May.

One of the first decisions on how to proceed with the analysis was whether to use the Tips Index, the Odds favorite or both as comparative indexes.  To answer this, the first step was to determine whether or not there was a statistical difference in the performance of the Odds favorite as compared to the Tips favorite.  Using Fisher’s exact test, the team determined that performance results are dependent on the proxy used with a p-value of 0.00032 and a higher return for the Tips Index.

While strictly utilizing the Tips Index in the model may have led to enhanced initial performance, the decision was made to include both; the team hoped that any divergence between the Tips Index and the Odds favorite may be of value to the model.  

Paddock Data

Promising results were observed when the Paddock data was supplemented with the Odds and Tips Index of each horse.

Over the same timeframe, both the Logistic Regression and Random Forest classifiers made selections that resulted in smaller losses than the benchmarks. Additionally, the win rate of each model had a significant increase. Single unit bets made only on the TIPS favorite accurately predicted the winner in 22.6% of the test races, while betting only on the odds favorite accurately predicted the winner in 26.4% of the test races.

Logistic Regression, however, predicted 28.3% of the winners while Random Forest predicted 28.9%.  Additionally, the models were able to track the better performing index, rising as Tips returns diverged from Odds in November, and increasing again in February as Tips returns turned upwards.

Based on Fisher’s exact test it can be concluded that performance is, in fact, dependent on the selection strategy of the models against both the Tips index and Odds favorite.  While additional data will be needed for a higher level of confidence, it appears that the Paddock data is, in fact, a confounding variable which leads to enhanced prediction accuracy and reduces overall loss under a one unit betting strategy.

Determining Efficacy

To determine the efficacy of the model’s strongest predictions, four thresholds [60%, 70%, 80%, 90%] were implemented, and the model was run against each. Each test required the model’s choice to have a predicted win probability at or above the threshold in order to place a wager.  If no horse in a race exceeded the threshold then no bets were placed.  

Count of expected win percentage for the horse ranked highest in each race

While changing the threshold did not result in a profitable model, the model with > 90% only exhibited loss due to the racetracks commission on each bet placed.  Another interesting takeaway was that the >70% model actually underperformed the base model.  This highlights the value of correctly selecting horses in well-matched races as well as long-shots to drive ROI.

A major concern from this analysis was the distribution of the probability to win for the horses selected by the model to win a respective race.  That is, how frequently a horse with probability to win of x had the highest percentage chance, as predicted by the model, to win its race.  The most frequent observations occurred between ~80% and ~87.5%.  Working backwards, this yields a break-even odds (when using 85%) of 3/17 (1.18 in decimal and -566.67 American/moneyline).  

Win Percentage Predictions

It is clear that the model’s win percentage predictions are substantially higher than in reality.   The primary cause of this problem was the binomial nature of the modeling.  When the win probabilities from the Logistic model are graphed for all horses, a large number of horses are given virtually no chance to win the race.  As the expected win probability increases past 5%, a peak is seen at ~35% followed by a steady drop-off. 

However, even with the steady drop-off sufficient horses exhibit the features of a “winning” horse to receive a high probability.  As the project moves forward and a multinomial is exchanged, these probabilities will decline.

Count of predicted win probability for each horse in the dataset
Count of predicted win probability for each horse in the dataset

Interestingly, the Random Forest classifier had more difficulty in identifying strictly losing horses.  The Random Forest classifier also had fewer ambiguous values (those between 40% and 60%).  While neither of the models produced promising distributions based on implied odds, the logistic model captured more distinctly “losing” characteristics.  Moving forward, the Kelly Criterion will take these probabilities as inputs and allocate funds accordingly.  Until these probabilities are brought in line, no asset allocation can be completed.


The initial challenge faced surrounded the data itself.  While the TIPS Index dataset spanned five years, the Paddock data only included observations for one season of racing.  This limited our observations to only 800 races ~ 8000 observations in total and 1414 distinct horses.

Given the focus on strictly predicting horses that finished first, the team elected to develop a baseline model using only the data provided and compare the results to strategies of selecting winners. Doing this by using the TIPS Index and predictions made by the “wisdom of crowds” inherent in the Odds favorite. This choice was made to a large extent so that the results could be clear to the client. The team would be able to present options of how to remedy issues that were encountered in the data.  As this project would eventually be moved to an internal team transparency around decisions were vital.  

While the model did outperform both indexes, the sensitivity of the models required significant improvement before they could be implemented. The need for more observations was clear.

Several techniques could have been used to increase observations of the minority class in this situation. However, within the horse racing community there are domain based practices of resampling with origins based in the Ranking Choice Theorem developed by Luce and Suppes (1965).  More recently, Chapman and Stalin have described how “the extra information inherent in such rank ordered choice sets may be exploited and provided an“explosion process” that applies to a class of models of which the stochastic utility model is a member. 

Chapman and Stalin Explosion Process Data

By applying the Chapman and Stalin explosion process, it is possible to decompose rank ordered choice sets into ‘d’ statistically independently choice sets where ‘d’ is the depth of explosion. 

Bolton and Chapman’s “Searching For Positive Returns At The Track” provides the example of three such sets: “[horse 4 finished ahead of horse 1, 2, and 3], [horse 2 finished ahead of horse 1 and horse 3], and [horse 1 finished ahead of horse 3], where no ordering is implied among the “non-chosen” horses in each “choice” set.”

These exploded choice sets are statistically independent, so they can be easily added to the dataset and are equivalent to completely independent horse races.  This leads to an increase in the number of independent choice sets available for analysis and, ultimately, would provide more precise parameter estimates. 

“Extensive small-sample Monte Carlo results reported in Chapman and Stalin document the improved precision that can accrue by making use of the explosion process.”  This becomes tremendously valuable due to both the time and cost of obtaining a sufficiently robust dataset to use for modeling.

Boosting Prediction Accuracy

Due to the limited amount of quantitative data, feature engineering appeared to be a prescient path to take in order to bolster prediction accuracy.  Average lengths distance behind the winner appeared to be a good starting point.  For horses that won, this number would be the negative lengths between the horse and second place.  However, the frequency distributions of races ran by a horse is heavily right skewed, as many have run only one race. 

Additionally, this metric would be biased against horses that have never ran before.  In a random forest model, this issue could be slightly mitigated by using a dedicated large number to represent such horses but for the logistic model a solution could not be so easily implemented.  However, there were several remedies available:

  • Utilize available metrics such as Speed Index (when available) to predict the horse’s finish time at the race distance and ensemble the two models
  • Identify the horse in the current dataset with > n finishes most similar to the horse in question using KNN and assume the same value for the variable
  • Average the horse’s past n performances during training over the same distance

Adding Available Information

Given that the data for use did not include quantitative metrics other than the number of lengths that a horse placed behind the winner the only applicable solution was to use KNN.  However, the implementation of such a  technique over all the variables in the dataset might not actually be necessary in this situation.  It has already been discussed that bodyweight, trainer, and jockey are important variables from the Paddock dataset. 

The team’s hypothesis is that adding readily available information such as Speed Index, which is known to have a strong correlation to finish time, might be enough to estimate the average range of time that a horse would take to finish a race.  Testing this theory using AIC or BIC over the aforementioned variables and additional variables from the Paddock dataset would be a very informative next step.

Data on Placement In a Race

After analyzing the average lengths behind the winner data the team saw a fairly frequent trend of high variance in horses that finished below fourth place.  Even winning horses would have runs that had a drastically negative impact on their average distance to the winner. 

The thesis behind this variability was that once a jockey knew his/her horse would not place the jockey would ease off of the horse and reserve as much of the horse’s energy for another race or for training.  When the filter was applied so that only the top four horses were accounted for on each race, the standard deviation fell by over 62%.  Because of this, the team was back to not only the first question of how to manage horses that have not run but also how to manage horses that have not placed in the top four. 

Several approaches were presented on how to manage this issue.  The first was to One Hot Encode various distance gaps and add an additional label to signify a placing below fourth.  To accomplish this task would require a SME who could validate the methodology and additional statistical analysis.

Data Gained from Keeping Track of Performance

Before delving into such an undertaking, the team decided to track whether or not wins-to-races ran as well as places-to-races ran was a strong indicator of future performance.  The team did this by separating the training and testing dataset by race date and then calculating total races won to total races for all horses in the training set.  These numbers were then applied to each horse in the testing set with unseen horses receiving a value of zero.  

The result of this approach was a higher predicted win probability for horses with a high index but a drop-off in total accuracy due to the bias against unseen horses.  Additionally, because of the rules of racing in Hong Kong, dominant horses may have additional weight added in subsequent races. The model was unable to account for these changes.

Overall, the team concluded that additional quantitative information is necessary to effectively utilize feature engineering for the model.  While the Paddock data alone performed no better than a random choice approach, when coupled with the TIPS and Odds data the Paddock observations appear to act as a confounding variable that substantially increase overall model returns.  Nevertheless, due to the lack of observations and subsequent variance observed in the model, the team would need to gather more data to establish a definitive stance.

About Author

Related Articles

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI