Studying Data to Predict Horse Racing Outcomes in India

Posted on Sep 22, 2016

Horse racing and data go hand-in-hand.  The vast array of statistics about horses, jockeys, trainers, lineage of horses, and much more is impressive, and the application of this data in determining odds of success is integral to the sport.  For centuries people have worked to understand the relationships among the data in an effort to better predict the success of a horse with a dream of “striking it big”.

With so much data and the possibility of immediate application of predictive models, we became quickly enthralled with the idea of building a better model to predict outcomes.  Since half of the final project team was from India, and due to the relative ease of obtaining the data, we chose to focus on horse racing in India.  Our decision to proceed with this project was easy, but was the last easy step in the process.

Obtaining the Data

The first challenge was finding data.  The entities that own the data control it tightly.  Results from individual races can be found on the internet, but there is no single location that has all the data, nor was a compiled database available to us.  Our first challenge, therefore, became creating a database upon which we could train models.  We used the Beautiful Soup library in Python to scrape over 3,500 web pages of data.  We were able to build an initial database with records from the 5 major tracks in India, spanning almost 10 years of data.

  • Tracks: 5 Tracks with over 9 years of races
  • Horses: Over 13,000 individual horses, running over 200,000 times
  • Jockeys: More than 650 jockeys, some of whom rode in more than 3,000 races
  • Trainers: 267 trainers, some with horses running over 3,000 times
  • Races: Over 23,000 races captured

The database was formatted as a single row for each horse appearing in each race, totalling 210,000 runnings.  The data in each record included information about the horse and its lineage, the trainer and jockey, and statistics including the horse weight, speed rating, race-time odds, gate draw and age, as well as the variables we might want to predict, the finish place and finish time.

Cleaning & Preparing the Data

The challenge with this project continued when we looked at the data.  The following illustration highlights the extent of the missingness in the data.

[Figure: extent of missingness in the data]

The race-time odds were missing from nearly 35% of the data.  We struggled with how to handle this missingness, debating whether more bias would be introduced by imputing or by dropping these records.  Ultimately, the race-time odds proved to be vital to developing a model with predictive power and the range of odds between the favorite and other horses was sometimes very large.  Because of this, we concluded that it would produce less bias if we dropped the records entirely.  

We have to proceed with one large caveat: we do not understand why this data was missing, or how it might affect the conclusions if it were available.  However, we found that odds information was either missing for an entire race or complete for an entire race; individual horses within a race were never missing odds on their own.

The rest of the missingness was more understandable, or was easily imputable without introducing meaningful bias.  For instance, occasionally the rating would be missing.  In the case of a missing rating, we set that horse’s speed rating to be the mean of all the horses in the race.  We ended up engineering features that had missing data for several reasons, and we were able to use the same procedure of imputing those values with the mean of a race.
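The race-level mean imputation described above might look like the following minimal sketch. The record layout (the `race_id`, `horse` and `rating` keys) and the function name are assumptions made for illustration, not the schema or code used in the project.

```python
from collections import defaultdict

def impute_with_race_mean(records, field):
    """Fill missing (None) values of `field` with the mean of the values
    observed for the other horses in the same race."""
    # Collect the observed (non-missing) values for each race.
    observed = defaultdict(list)
    for rec in records:
        if rec[field] is not None:
            observed[rec["race_id"]].append(rec[field])

    filled = []
    for rec in records:
        new_rec = dict(rec)  # copy so the input rows are left untouched
        values = observed[rec["race_id"]]
        if new_rec[field] is None and values:
            new_rec[field] = sum(values) / len(values)
        filled.append(new_rec)
    return filled

# Hypothetical rows: one per horse per race.
runs = [
    {"race_id": 1, "horse": "A", "rating": 60.0},
    {"race_id": 1, "horse": "B", "rating": None},
    {"race_id": 1, "horse": "C", "rating": 70.0},
    {"race_id": 2, "horse": "D", "rating": 40.0},
]
imputed = impute_with_race_mean(runs, "rating")
```

If every horse in a race is missing the value, this sketch leaves it as `None`; such a race would need a separate rule.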

Data Exploration

Our first look into the cleaned dataset provided us with some clues about how to proceed.  The first thing that stood out was the extent to which the odds alone were predictive.  The implied probability of the favorite horse winning the race was about 42%, derived from the race-time odds.  The favorite horse actually won 43% of the time.  The mean odds of the favorite were 1.7/1, which meant that a simple “pick the favorite” strategy would have been profitable.  Our earlier caveat raises its ugly head at exactly this time… if there was some reason for the missingness that has to do with the odds, then this would not hold true.
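To make the link between odds, implied probability and profitability concrete, here is a minimal sketch. It assumes the quoted odds are fractional (1.7/1 means $1.70 of profit per $1 staked) and ignores any track take or commission; the function names are illustrative only.

```python
def implied_probability(fractional_odds):
    """Win probability implied by fractional odds of `fractional_odds`/1."""
    return 1.0 / (fractional_odds + 1.0)

def expected_profit_per_dollar(win_rate, fractional_odds):
    """Expected profit on a $1 bet: win `fractional_odds` dollars with
    probability `win_rate`, otherwise lose the $1 stake."""
    return win_rate * fractional_odds - (1.0 - win_rate)

def break_even_win_rate(fractional_odds):
    """Win rate at which the expected profit is exactly zero; it equals
    the implied probability."""
    return implied_probability(fractional_odds)

# Illustrative single race: a favorite quoted at 1.7/1 that wins 43% of the time.
p = implied_probability(1.7)
ev = expected_profit_per_dollar(0.43, 1.7)
```

A bet has positive expected value only when the true win rate exceeds the implied probability. Note that averaging implied probabilities over many races is not the same as converting the average odds, so a single-race figure like `p` need not match the 42% aggregate quoted above.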

The following chart highlights the implied probability of winning for horses starting in each rank of the favorites table.  Also included is the actual win percentage, and the number of observations.  For horses beginning a race as the odds-favorite, the mean implied win percent was the aforementioned 42%; the actual win rate was 43% in over 13,000 races.


Mean Odds of Winning

Comparing the mean odds for each odds-rank with the mean odds of the horse that ended in each finish position clued us in to the potential profit opportunity.  If the favorite ran with 1.7/1 odds and the horse that ultimately won had odds of 4.7/1, there would be an additional $3 of profit potential per dollar bet if our model were to perfectly predict winners.











We also found that there were particular horses, jockeys, and trainers with a high percentage of wins.  The following charts highlight the win rates of the top horses, trainers, and jockeys (ordered by total count of wins).





Winning Percentage Based on Start-Gate

One of the most interesting trends we uncovered was the winning percentage based on the start-gate.  Horses starting from lower-numbered gates tended to win more often than horses starting from higher-numbered gates.


Feature Engineering

Given the persistence of winning for specific horses, trainers, and jockeys, we wanted to make sure that our feature-space included the history of each of these groups.  We wrote grouping and filtering routines in Python to build historical records for each horse, trainer and jockey from all races preceding the current race, and appended them to the appropriate record.
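One way to build such history features without letting a race see its own result is sketched below. The field names (`date`, `finish`, `horse`, `jockey`) and the function are assumptions for illustration; the project's actual grouping code is not shown in this post.

```python
from collections import defaultdict
from itertools import groupby

def add_prior_win_rate(records, key):
    """Add `<key>_prior_starts` and `<key>_prior_win_pct` to each record,
    computed only from races run on strictly earlier dates."""
    starts = defaultdict(int)
    wins = defaultdict(int)
    out = []
    # ISO date strings sort chronologically.
    ordered = sorted(records, key=lambda r: r["date"])
    for _, day in groupby(ordered, key=lambda r: r["date"]):
        day = list(day)
        # First pass: features use totals from earlier dates only.
        for rec in day:
            n = starts[rec[key]]
            new_rec = dict(rec)
            new_rec[key + "_prior_starts"] = n
            new_rec[key + "_prior_win_pct"] = wins[rec[key]] / n if n else None
            out.append(new_rec)
        # Second pass: fold this date's results into the running totals.
        for rec in day:
            starts[rec[key]] += 1
            if rec["finish"] == 1:
                wins[rec[key]] += 1
    return out

# Hypothetical rows: one per horse per race.
runs = [
    {"date": "2015-01-03", "horse": "A", "jockey": "J1", "finish": 1},
    {"date": "2015-01-03", "horse": "B", "jockey": "J2", "finish": 2},
    {"date": "2015-02-10", "horse": "A", "jockey": "J2", "finish": 3},
    {"date": "2015-03-01", "horse": "A", "jockey": "J1", "finish": 1},
]
with_history = add_prior_win_rate(runs, "horse")
```

Calling the same function with `"jockey"` (or a trainer key) would give the corresponding history for those groups. Updating the totals only after a whole date has been processed keeps same-day results out of the features.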

Each record now contained historical run times, win, place, and show percentages for individual horses, as well as win, place, and show percentages for trainers and jockeys.  We then applied procedures to group within races, and added features that described the difference between each horse and the rest of the field - comparing prior race times, speed ratings and weights. We now had a full, feature-rich dataset that included 20 more features than the simple data we had scraped.
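The within-race comparison features can be sketched as follows. This is a minimal illustration that assumes each row carries a `race_id` and defines "the rest of the field" as the other horses in the same race; the exact definition used in the project may differ.

```python
from collections import defaultdict

def add_field_difference(records, field):
    """Add `<field>_vs_field`: the horse's value minus the mean value of
    the other horses in the same race."""
    by_race = defaultdict(list)
    for rec in records:
        by_race[rec["race_id"]].append(rec)

    out = []
    for rec in records:
        # Values for every other runner in this race.
        others = [r[field] for r in by_race[rec["race_id"]] if r is not rec]
        new_rec = dict(rec)
        new_rec[field + "_vs_field"] = (
            rec[field] - sum(others) / len(others) if others else 0.0
        )
        out.append(new_rec)
    return out

# Hypothetical race with three runners and their carried weights.
race = [
    {"race_id": 1, "horse": "A", "weight": 54.0},
    {"race_id": 1, "horse": "B", "weight": 56.0},
    {"race_id": 1, "horse": "C", "weight": 58.0},
]
relative = add_field_difference(race, "weight")
```

The same helper could be applied to prior race times or speed ratings once missing values have been imputed.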

Data Preparation

We had the benefit of having scraped 9+ years of data. This allowed us to create three chronologically ordered data sets:

  • Training Data (2007-2013): ~67,000 records
  • Validation Data (2014): ~14,000 records
  • Test Data (2015 & part of 2016): ~25,000 records

We scaled the data prior to splitting the files, so that the data could work with any model we decide to use to optimize our results in the future.
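A chronological split along the year boundaries listed above can be sketched in a few lines. The `year` field and the function name are assumptions for illustration.

```python
def chronological_split(records, train_end=2013, valid_year=2014):
    """Split rows by race year: training through `train_end`, validation
    in `valid_year`, and everything later as the test set."""
    train = [r for r in records if r["year"] <= train_end]
    valid = [r for r in records if r["year"] == valid_year]
    test = [r for r in records if r["year"] > valid_year]
    return train, valid, test

# Hypothetical rows, one per run, tagged with the race year.
rows = [{"year": y} for y in (2007, 2013, 2014, 2015, 2016)]
train_set, valid_set, test_set = chronological_split(rows)
```

Splitting by time rather than at random means the models are always evaluated on races that took place after the ones they were trained on.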

Machine Learning Objectives and Model Details

The primary objective for applying machine learning was to develop a model that could potentially beat a simplistic approach of always betting on the track favorite, based on the odds. However, it is important to mention that the odds represent much more information than just the probabilities a racing establishment puts on a horse to win a race.

The odds also represent information about how much a horse is being backed by race-goers, so by definition it also represents latent information of people's emotions -- fear, greed and maybe even some insider knowledge of how a horse may perform on that day. This information is really hard to de-construct and it doesn't sit in any existing variables anywhere else.

We wanted to assess whether we could use the odds in tandem with the information collected and engineered so far, i.e. the past performance of horses, jockeys and trainers along with information about the current race, to predict the outcome of a winning horse better than just using the odds themselves.

So it is a straightforward classification problem, for which we combined a gradient boosted machine model and a neural network model to predict the outcome.

We sought to optimize both models on the AUC statistic, in an attempt to maximize true positive signals of winners and minimize false positive signals.
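For readers unfamiliar with the AUC statistic, the sketch below computes it directly from its pairwise definition. It is a plain-Python illustration with made-up labels and scores, not the evaluation code used for the models here, and its quadratic loop is only suitable for small inputs.

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    winner is scored above a randomly chosen non-winner (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one winner and one non-winner")
    favorable = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                favorable += 1.0
            elif p == n:
                favorable += 0.5
    return favorable / (len(pos) * len(neg))

# Hypothetical outcomes (1 = won) and model scores for five runners.
labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.3, 0.6, 0.5, 0.1]
value = auc(labels, scores)
```

Because AUC only measures how well winners are ranked above non-winners, a model can score well on it and still produce many incorrect win signals at a given threshold.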

Gradient Boosted Machine (GBM) Results

The AUC we got for the GBM model was about 0.83 on the training and validation datasets.


However, the confusion matrix on the test set (below) shows that, of all the win signals the GBM model generated, only 34% were correct.
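The share of correct signals quoted above reads like what is usually called precision: true positives divided by all predicted positives. Under that reading (an interpretation, since the confusion matrix itself is not reproduced here), it could be computed as in this small sketch with made-up data.

```python
def signal_precision(actual, predicted):
    """Share of predicted winners (signals) that actually won."""
    tp = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 0)
    signals = tp + fp
    return tp / signals if signals else None

# Hypothetical outcomes (1 = won) and model signals (1 = predicted to win).
actual = [1, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 0]
precision = signal_precision(actual, predicted)
```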



The optimal GBM model had the following parameter settings:

  • Number of trees: 1,000
  • Number of terminal nodes: 2
  • Min. obs. at terminal nodes: 50
  • Learn rate: 0.018
  • Column sample rate (per split): 0.65
  • Row sample rate (per tree): 0.55

The variable importance plot (re-scaled to sum to 100%) suggests that the main predictive element of the model is the odds themselves. Secondary importance measures relate to jockey past performance and how much better a horse's past statistics are than those of the field.


Neural Network (NN) Results

The NN model yielded about the same AUC as the GBM model (0.84), but the accuracy of predictions from among the signals was much higher -- 40%.


The optimal Neural Network model had the following parameter settings. It is important to note though that it is easy to underfit Neural Networks because of the sheer number of parameters available to tune, so improvements to this base model can likely still be made.

  • Number of Hidden Layers / Nodes: 1 / 500
  • L1 Regularization: 0.000040
  • L2 Regularization: 0.000010
  • Input Drop Out Ratio: 0.025
  • Learn Rate: 0.007
  • Epochs: 100

Stacking the Models

Individual models don't compare well to an odds-only model:
Going by odds alone → 43% Accuracy | GBM → 34% Accuracy | NN → 40% Accuracy

Stacked models typically yield better prediction strength than individual models.  The basic procedure employed for stacking (as outlined below) was:

  • we developed base models (the GBM and NN already mentioned) on the training set
  • generated predictions from those models on the validation set, and
  • used the resulting probabilities as inputs to a final meta learning (or combiner) model that would predict results on the test set
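The three steps above can be sketched end to end in plain Python. To keep the example self-contained, two tiny logistic regressions stand in for the GBM and the neural network, a third one acts as the meta learner, and the data is synthetic; every name here is hypothetical and none of this is the project's actual code.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Tiny batch-gradient-descent logistic regression (stand-in learner)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, X):
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def make_data(n, rng):
    """Synthetic rows: two features, with a label that depends on both."""
    X, y = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        X.append([x1, x2])
        y.append(1 if x1 + x2 + rng.gauss(0, 0.5) > 0 else 0)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_valid, y_valid = make_data(100, rng)
X_test, y_test = make_data(100, rng)

# Step 1: fit the base models on the training set. Each stand-in sees one
# feature only, purely so that the two make different errors.
base_a = fit_logistic([[x[0]] for x in X_train], y_train)
base_b = fit_logistic([[x[1]] for x in X_train], y_train)

def base_probabilities(X):
    pa = predict_proba(base_a, [[x[0]] for x in X])
    pb = predict_proba(base_b, [[x[1]] for x in X])
    return [[a, b] for a, b in zip(pa, pb)]

# Step 2: score the validation set with the base models.
meta_X_valid = base_probabilities(X_valid)

# Step 3: fit the meta learner on those probabilities, then run the whole
# stack on the test set.
meta = fit_logistic(meta_X_valid, y_valid)
test_probs = predict_proba(meta, base_probabilities(X_test))
```

Fitting the meta learner on validation-set predictions, rather than on the training set the base models have already seen, keeps it from simply learning to trust whichever base model overfit the most.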


Using a GBM model as the meta learning model, we improved our prediction accuracy to 55%. One important caveat, however, is that it generated far fewer signals (one-third to one-fifth as many as the other models discussed so far).

In order to compare the true performance of the stacked model to the bet-the-favorite model, we need to put them on the same scale, i.e. assess their performance on a per-signal basis.

More specifically, if we assumed that you bet one dollar for every signal that arose from each of the models discussed so far, it would yield an average of:

  • $0.003 (per dollar) using the GBM-only model
  • $0.013 (per dollar) using the Neural Network-only model
  • $0.045 (per dollar) using the Stacked model
  • $0.047 (per dollar) using the Favorite-only approach
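A per-signal return of this kind can be computed as in the sketch below. It assumes a flat $1 stake on every signal and fractional odds (profit per dollar staked on a win); the bets shown are invented for illustration and are not the project's results.

```python
def average_return_per_dollar(bets):
    """Mean profit per $1 staked over a list of (fractional_odds, won) bets.

    A winning bet earns `fractional_odds` dollars of profit; a losing bet
    loses the $1 stake."""
    if not bets:
        return None
    profit = sum(odds if won else -1.0 for odds, won in bets)
    return profit / len(bets)

# Hypothetical signals: (fractional odds of the horse backed, did it win?)
signals = [(1.5, True), (2.0, False), (1.2, True), (3.0, False)]
avg = average_return_per_dollar(signals)
```

Dividing by the number of signals is what makes models comparable even when one of them bets far less often than another.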


Even after stacking, our profitability, an average return of about 4.5% on every bet, is roughly the same as that of the bet-the-favorite strategy (about 4.7%). While the results are on par with a far simpler approach, this is just the beginning of the modeling process. We have so far built up a good framework that allows additional models with better optimized parameters to be swapped in or out. Thus far, we have only explored gradient boosted and neural network models for each of the levels of the stacked model. We'd love to try extending that to additional algorithms for the base layer as well as the meta learning layer.

Additionally, we may attempt to engineer a few more features. One such feature we would immediately like to look at is the spread of odds across all the horses in a particular race. Usually a horse with very low odds compared to other horses is more likely to win, and if this phenomenon is separated from the rest of the data, we may be able to focus on predicting winners for closer races, potentially ones having higher payouts. We could also optimize models based on dollar payouts rather than a simple accuracy % and we plan to write our own objective function for that further down the road.






About Authors

Ben Townson

Ben Townson graduated from the New York City Data Science Academy 12-week Data Science Bootcamp on September 23. At NYCDSA he has mastered machine learning and data analysis techniques, complementing more than ten years spent in the finance...

Leave a Comment

Mark Littlewood May 11, 2018
Could I suggest that you are focusing too much on predicting winners and not enough on predicting profit. The two are not always the same and its seems that you are falling into the trap that most no betting academics fall into when starting out on this problem. There also appears to be something wrong with your back the fav and make a profit discovery. This cannot be correct, are you sure that the 1.71 are not decimal odds. If backing the fav makes a profit then the Indian pari mutual or bookmakers are going to go bust
Kush Shukla June 10, 2017
Beautiful piece of work! An extension to this work could be to evaluate the performance over different types of betting allowed. That will add profit sense to work. Would you mind sharing the dataset you developed for this project?
Regis January 8, 2017
Hello Very interesting. Would it be possible to share the code in order to test it? Regards Régis
Peter Jameson January 7, 2017
An excellent write up, well done!
