Studying Data to Predict Horse Racing Outcomes in India
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Horse racing and data go hand in hand. The vast array of statistics about horses, jockeys, trainers, lineage, and much more is impressive, and applying this data to determine the odds of success is integral to the sport. For centuries, people have worked to understand the relationships within the data in an effort to better predict a horse's success, with the dream of "striking it big".
With so much data and the possibility of immediately applying predictive models, we quickly became enthralled with the idea of building a better model to predict outcomes. Since half of the final project team was from India, and because the data was relatively easy to obtain, we chose to focus on horse racing in India. Our decision to proceed with this project was easy, but it was the last easy step in the process.
Obtaining the Data
The first challenge was finding data. The entities that own the data control it tightly. Results from individual races can be found on the internet, but there is no single location that has all of the data, nor was there a compiled database available to us. Our first task, therefore, was to create a database on which we could train models. We used Beautiful Soup in Python to scrape over 3,500 web pages of data, and were able to build an initial database with records from the 5 major tracks in India, spanning almost 10 years.
- Tracks: 5 Tracks with over 9 years of races
- Horses: Over 13,000 individual horses, running over 200,000 times
- Jockeys: More than 650 jockeys, some of whom rode in more than 3,000 races
- Trainers: 267 trainers, some with horses running over 3,000 times
- Races: Over 23,000 races captured
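A minimal sketch of that scraping step might look like the following. The URL pattern and table layout here are placeholders, not the actual site structure:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_race_page(url):
    """Parse one results page into per-horse rows (placeholder table layout)."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Placeholder selector: the real pages have their own table structure.
    for tr in soup.select("table.results tr")[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows

# Placeholder URL pattern and race identifiers.
all_rows = []
for race_id in range(1, 3501):
    all_rows.extend(scrape_race_page(f"https://example.com/results/{race_id}"))

pd.DataFrame(all_rows).to_csv("raw_races.csv", index=False)
```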
The database was formatted as a single row for each horse appearing in each race, totaling 210,000 runnings. Each record included information about the horse and its lineage, the trainer and jockey, and statistics including the horse's weight, speed rating, race-time odds, gate draw, and age, as well as the variables we might want to predict: the finish place and finish time.
Cleaning & Preparing the Data
The challenges continued when we looked at the data. The following illustration highlights the extent of the missingness.
The race-time odds were missing from nearly 35% of the data. We struggled with how to handle this missingness, debating whether more bias would be introduced by imputing the odds or by dropping these records. Ultimately, the race-time odds proved vital to developing a model with predictive power, and the spread of odds between the favorite and the other horses was sometimes very large. Because of this, we concluded that dropping the records entirely would introduce less bias.
We have to proceed with one large caveat: we do not understand why this data was missing, or how it might affect our conclusions if it were available. We did find, however, that odds information was either missing or complete for entire races; it was never the case that only some horses in a race lacked odds.
The rest of the missingness was more understandable, or could easily be imputed without introducing meaningful bias. For instance, the speed rating was occasionally missing; in those cases we set the horse's speed rating to the mean of all the horses in the race. Several of the features we later engineered also had missing values for various reasons, and we imputed those with the same race-level mean procedure.
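In pandas, this race-level mean imputation comes down to a grouped transform; a minimal sketch with illustrative column names:

```python
import numpy as np
import pandas as pd

# Toy frame with one row per horse per race; column names are illustrative.
df = pd.DataFrame({
    "race_id":      [1, 1, 1, 2, 2],
    "speed_rating": [60.0, np.nan, 70.0, 55.0, np.nan],
})

# Replace a missing rating with the mean rating of the other horses in the same race.
df["speed_rating"] = df.groupby("race_id")["speed_rating"].transform(
    lambda s: s.fillna(s.mean())
)
```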
Data Exploration
Our first look into the cleaned dataset provided some clues about how to proceed. The first thing that stood out was the extent to which the odds alone were predictive. The implied probability of the favorite horse winning the race, derived from the race-time odds, was about 42%. The favorite horse actually won 43% of the time. The mean odds of the favorite were 1.7/1, which meant that a simple "pick the favorite" strategy would have been profitable. Our earlier caveat rears its ugly head at exactly this point: if the missingness had something to do with the odds, this would not hold true.
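For reference, fractional odds of o-to-1 imply a win probability of roughly 1/(o + 1), before accounting for the bookmaker's margin; note that the 42% above is an average of per-race implied probabilities, which is not the same as plugging the mean odds into this formula. A minimal helper:

```python
def implied_win_probability(fractional_odds):
    """Convert fractional odds quoted as o-to-1 into an implied win probability."""
    return 1.0 / (fractional_odds + 1.0)

print(implied_win_probability(4.0))  # a 4/1 horse carries an implied win probability of 0.20
```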
The following chart highlights the implied probability of winning for horses starting in each rank of the favorites table. Also included is the actual win percentage, and the number of observations. For horses beginning a race as the odds-favorite, the mean implied win percent was the aforementioned 42%; the actual win rate was 43% in over 13,000 races.
Mean Odds of Winning
Comparing the mean odds for each odds-rank to the mean odds of the horse that ended in each finish position clued us in to the potential profit opportunity. If the favorite ran at 1.7/1 odds while the horse that ultimately won had odds of 4.7/1, there is roughly $3 of additional profit potential per $1 bet if our model were to predict winners perfectly.
Wins
We also found that particular horses, jockeys, and trainers won a high percentage of their races. The following charts highlight the win rates of the top horses, trainers, and jockeys (ordered by total count of wins).
Winning Percentage Based on Start-Gate
One of the most interesting trends we uncovered was the winning percentage based on the start-gate. Horses starting from lower gate numbers tended to win more often than horses starting from higher ones.
Feature Engineering
Given the persistence of winning for specific horses, trainers, and jockeys, we wanted to make sure that our feature space included the history of each of these groups. We created grouping and filtering methods in Python to build historical records for each horse, trainer, and jockey over all races preceding the current race, and appended them to the appropriate record.
Each record now contained historical run times and win, place, and show percentages for individual horses, as well as win, place, and show percentages for trainers and jockeys. We then grouped within races and added features describing the difference between each horse and the rest of the field, comparing prior race times, speed ratings, and weights. We now had a full, feature-rich dataset with 20 more features than the simple data we had scraped.
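As an illustration of this grouping logic (the column names below are placeholders), a horse's historical win percentage over all prior races and its speed-rating gap to the field can be computed like so:

```python
import pandas as pd

# Toy frame, sorted chronologically, one row per horse per race; columns are illustrative.
df = pd.DataFrame({
    "race_date":    pd.to_datetime(["2007-01-01", "2007-02-01", "2007-02-01", "2007-03-01"]),
    "race_id":      [1, 2, 2, 3],
    "horse_id":     ["A", "A", "B", "A"],
    "won":          [1, 0, 1, 1],
    "speed_rating": [62.0, 58.0, 66.0, 64.0],
}).sort_values("race_date")

# Historical win % for each horse over races strictly before the current one.
df["horse_win_pct"] = (
    df.groupby("horse_id")["won"].transform(lambda s: s.shift().expanding().mean())
)

# How much a horse's speed rating differs from the average of its field.
df["speed_vs_field"] = (
    df["speed_rating"] - df.groupby("race_id")["speed_rating"].transform("mean")
)
```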
Data Preparation
We had the benefit of having scraped 9+ years of data. This allowed us to create three chronologically ordered data sets:
- Training Data (2007-2013) ~67,000 records
- Validation Data (2014) ~14,000 records
- Test Data (2015 & part of 2016) ~25,000 records
We scaled the data prior to splitting the files so that the same files could be used with whichever models we decide to try as we optimize our results in the future.
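A minimal sketch of the split, assuming a race_date column and mirroring the scale-then-split order described above:

```python
from sklearn.preprocessing import StandardScaler

def scale_and_split(df, numeric_cols):
    """Scale numeric features on the full frame (as described above), then split by year."""
    df = df.copy()
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
    year = df["race_date"].dt.year
    train = df[year <= 2013]   # 2007-2013, ~67,000 records
    valid = df[year == 2014]   # 2014, ~14,000 records
    test  = df[year >= 2015]   # 2015 and part of 2016, ~25,000 records
    return train, valid, test
```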
Machine Learning Objectives and Model Details
The primary objective for applying machine learning was to develop a model that could potentially beat a simplistic approach of always betting on the track favorite, based on the odds. However, it is important to mention that the odds represent much more information than just the probabilities a racing establishment puts on a horse to win a race.
The odds also reflect how heavily a horse is being backed by race-goers, so by definition they carry latent information about people's emotions -- fear, greed, and perhaps even some insider knowledge of how a horse may perform on that day. This information is very hard to deconstruct, and it does not sit in any other variable.
We wanted to assess whether we could use the odds in tandem with the information collected and engineered so far, i.e. the past performance of horses, jockeys and trainers along with information about the current race, to predict the outcome of a winning horse better than just using the odds themselves.
This makes it a straightforward classification problem, for which we combined a gradient boosted machine (GBM) model and a neural network (NN) model to predict the outcome.
We sought to optimize both models on the AUC statistic, in an attempt to maximize true positive signals of winners and minimize false positive signals.
Gradient Boosted Machine (GBM) Results
The AUC we got for the GBM model was about 0.83 on the training and validation datasets.
However, when looking at the confusion matrix on the test set (below), only 34% of the win signals the GBM model generated were correct.
The optimal GBM model had the following parameter settings (a rough scikit-learn analogue is sketched after the list):
- Number of trees: 1,000
- Number of terminal nodes: 2
- Min. obs. at terminal nodes: 50
- Learn rate: 0.018
- Column sample rate (per split): 0.65
- Row sample rate (per tree): 0.55
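As one possible reading of these settings, here is a rough scikit-learn equivalent; the library actually used isn't named above, and the feature list, target column, and train/valid frames (from the split sketched earlier) are placeholders:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

features = ["odds", "speed_rating", "horse_win_pct", "speed_vs_field"]  # illustrative subset

gbm = GradientBoostingClassifier(
    n_estimators=1000,    # number of trees
    max_leaf_nodes=2,     # two terminal nodes per tree
    min_samples_leaf=50,  # min. observations at terminal nodes
    learning_rate=0.018,
    max_features=0.65,    # column sample rate per split
    subsample=0.55,       # row sample rate per tree
)
gbm.fit(train[features], train["won"])

valid_probs = gbm.predict_proba(valid[features])[:, 1]
print("Validation AUC:", roc_auc_score(valid["won"], valid_probs))
```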
The variable importance plot (rescaled to sum to 100%) suggests that the main predictive element of the model is the odds themselves. Secondary importance measures relate to jockey past performance and how much better a horse's past statistics are versus those of the field.
Neural Network (NN) Results
The NN model yielded an AUC score about the same as the GBM model's (0.84), but the accuracy among its signals was much higher -- 40%.
The optimal neural network model had the following parameter settings (a Keras-style sketch follows the list). It is important to note, though, that with the sheer number of hyperparameters available to tune, it is easy to leave a neural network under-tuned, so improvements to this base model can likely still be made.
- Number of Hidden Layers / Nodes: 1 / 500
- L1 Regularization: 0.000040
- L2 Regularization: 0.000010
- Input Drop Out Ratio: 0.025
- Learn Rate: 0.007
- Epochs: 100
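For illustration, a comparable single-hidden-layer network can be expressed in Keras like this; Keras is not necessarily the framework behind the original model, and the feature list and data frames again come from the earlier placeholder sketches:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Input(shape=(len(features),)),
    layers.Dropout(0.025),          # input drop out ratio
    layers.Dense(500, activation="relu",
                 kernel_regularizer=regularizers.l1_l2(l1=4e-5, l2=1e-5)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.007),   # learn rate
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auc")],
)
model.fit(train[features], train["won"], epochs=100,
          validation_data=(valid[features], valid["won"]))
```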
Stacking the Models
Individual models don't compare well to an odds-only model:
Going by odds alone: 43% accuracy | GBM: 34% accuracy | NN: 40% accuracy
Stacked models typically yield better predictive strength than individual models. The basic procedure we employed for stacking was as follows (a minimal code sketch appears after the list):
- develop the base models (the GBM and NN described above) on the training set,
- generate predictions from those models on the validation set, and
- use the resulting probabilities as inputs to a final meta-learning (or combiner) model that predicts results on the test set.
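A minimal sketch of that procedure, reusing the two base models from the sketches above and a small GBM as the combiner (the combiner's settings here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def base_probs(data):
    """Stack the win probabilities from the two level-one models into a feature matrix."""
    return np.column_stack([
        gbm.predict_proba(data[features])[:, 1],            # GBM base model
        model.predict(data[features], verbose=0).ravel(),   # NN base model
    ])

# Train the meta learner on validation-set probabilities, then score the test set.
meta = GradientBoostingClassifier(n_estimators=200, max_depth=3)  # illustrative settings
meta.fit(base_probs(valid), valid["won"])
stacked_test_probs = meta.predict_proba(base_probs(test))[:, 1]
```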
Using a GBM as the meta-learning model, we improved our prediction accuracy to 55%. But one important caveat: it generated far fewer signals (one third to one fifth as many as the other models discussed so far).
To compare the true performance of the stacked model to the bet-the-favorite model, we needed to convert them to the same scale, i.e., assess their performance on a per-signal basis.
More specifically, if you bet one dollar on every signal generated by each of the models discussed so far, it would yield an average of (a small helper for this calculation follows the list):
- $0.003 (per dollar) using the GBM-only model
- $0.013 (per dollar) using the Neural Network-only model
- $0.045 (per dollar) using the Stacked model
- $0.047 (per dollar) using the Favorite-only approach
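These per-signal returns come from treating every signal as a one-dollar bet that pays the fractional odds on a win and loses the stake otherwise; a minimal helper with placeholder column names:

```python
def average_return_per_dollar(signals):
    """signals: rows a model flagged as predicted winners, with a 0/1 'won' column
    and fractional 'odds' (o-to-1). Each signal is treated as a $1 bet."""
    profit = signals["won"] * signals["odds"] - (1 - signals["won"])
    return profit.mean()
```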
Conclusion
Even after stacking, our model turns out to be just about as profitable as a bet-the-favorites strategy, which yields an average return of about 4.5% on every bet. While the results are on par with a far simpler approach, this is just the beginning of the modeling process. We have built a good framework that allows us to swap additional models with better-optimized parameters in or out. So far, we have only explored gradient boosted and neural network models for each level of the stacked model; we'd love to extend that to additional algorithms for both the base layer and the meta-learning layer.
Additionally, we may attempt to engineer a few more features. One feature we would immediately like to look at is the spread of odds across all the horses in a particular race. A horse with very low odds compared to the rest of the field is usually more likely to win, and if this phenomenon is separated from the rest of the data, we may be able to focus on predicting winners in closer races, which potentially have higher payouts. We could also optimize models based on dollar payouts rather than a simple accuracy percentage, and we plan to write our own objective function for that further down the road.