Data Analyzing Horse Racing
"Founded in 1884, the Hong Kong Jockey Club is one of the most treasured—and lucrative—legacies of Britain’s colonial rule over the city. Its emerald turf attracts about HK$138.8 million (US$17.86 million) per race, more than any other track in the world." - Bloomberg
Successfully predicting even a small percentage of winning horses over a large number of races can lead to an absurd ROI thanks to compounding returns. Most famously, William Benter earned nearly US$1 billion by creating a computer program to analyze the horse racing market.
Attracted by the possibility of measuring up our data science skills to one metric, and one metric only, ROI, as well as the challenge of facing off against all HK horse racing market participants, our group (consisting of 5 data scientists) partnered with RaceQuant, a startup specializing in HK horse-betting.
RaceQuant provided us with data for all races held by the Hong Kong Jockey Club for the years 2015, 2016, 2017, and 2018 (1st and 2nd quarters). The data set consisted of 81 different features. The objective of this capstone project was to:
1) create a model to predict the probability of a given horse winning a given race; and
2) use the probabilities outputted by our model to create a betting strategy to maximize our ROI based on a $100,000 betting bankroll when back-testing for 540 races randomly selected from the data set.
Due to an NDA, parts III and IV, which describe our approaches to data analysis and modeling, are significantly simplified.
II. FEATURE CLEANING AND ENGINEERING
Our raw data contained 2384 unique races, with a total of 24863 horses having run those races. Before building our model, we performed several data transformations. The data cleaning was done with pandas in Python.
Some of the features in our data set had missing values. For example, jockey and trainer win percentages were missing for first-time jockeys and trainers. We assumed a value of 10% in those cases, since most first-time jockeys and trainers at the HK tracks have participated in other international events and have performed quite well on average.
We also did some feature engineering to better capture certain types of information from our data. For example, we created features related to the horses' weight. The percent deviation of a horse's weight from its winning weight was introduced to examine the effect of horse weight on the chances of winning. Similarly, the percent deviation of a horse's weight from its average weight over all races was introduced to standardize horse weight.
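A minimal sketch of this kind of imputation in pandas is shown below. The column names and sample values are hypothetical and only illustrate the idea; they are not the actual RaceQuant schema.

```python
import pandas as pd

# Hypothetical columns: win rates are missing for first-time jockeys/trainers.
races = pd.DataFrame({
    "jockey_win_pct": [0.12, None, 0.08],
    "trainer_win_pct": [None, 0.15, 0.09],
})

# Default win rate assumed for first-time jockeys and trainers (10%).
DEFAULT_WIN_PCT = 0.10

win_cols = ["jockey_win_pct", "trainer_win_pct"]
races[win_cols] = races[win_cols].fillna(DEFAULT_WIN_PCT)
```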
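The two weight features could be computed roughly as follows. This is only a sketch: the column names are hypothetical, and "winning weight" is assumed here to mean the horse's mean weight across the races it won, which the text does not spell out.

```python
import pandas as pd

# Hypothetical run-level data: one row per horse per race.
runs = pd.DataFrame({
    "horse_id": ["A", "A", "A", "B", "B"],
    "weight": [1000.0, 1020.0, 980.0, 1100.0, 1120.0],
    "won": [True, False, False, False, True],
})

# Percent deviation from the horse's average weight over all of its races.
avg_weight = runs.groupby("horse_id")["weight"].transform("mean")
runs["pct_dev_from_avg_weight"] = 100 * (runs["weight"] - avg_weight) / avg_weight

# Assumed definition: winning weight = mean weight over the horse's wins.
# Non-winning rows are masked to NaN, which the group mean ignores.
win_weight = (
    runs["weight"].where(runs["won"])
    .groupby(runs["horse_id"])
    .transform("mean")
)
runs["pct_dev_from_win_weight"] = 100 * (runs["weight"] - win_weight) / win_weight
```

In a real back-test, these per-horse averages would need to be computed only from races that took place before the race being predicted, to avoid leaking future information into the features.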
III. MODELING APPROACH
In the racing data we were given, we found that many features did not have a linear relationship with a horse's win probability. Therefore, we tested multiple models capable of capturing complex relationships between the input features and the output probability. We settled on a neural network, which we then fine-tuned to maximize its chances of picking the winning horse of a given race.
IV. BETTING STRATEGY
Having calculated the probability that a given horse would win each race, we set out to develop a betting strategy to maximize our ROI. To do so, we first explored the Kelly Criterion:
f* = (bp - q) / b
Where: f* is the fraction of the current bankroll to wager; b is the net odds received for the wager (your return is "b to 1": on a winning $1 bet you receive $b in addition to getting the $1 back); p is the probability of winning; and q is the probability of losing (1 - p).
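As an illustration, the Kelly fraction f* = (bp - q) / b can be computed with a few lines of Python. The function name and the optional `fraction` argument (a multiplier for scaled-down stakes) are our own choices for this sketch, not part of the project code.

```python
def kelly_fraction(p: float, b: float, fraction: float = 1.0) -> float:
    """Return the share of bankroll to wager on a single bet.

    p: estimated probability of winning
    b: net odds ("b to 1")
    fraction: optional multiplier in (0, 1] for a scaled-down Kelly stake
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if b <= 0:
        raise ValueError("net odds b must be positive")
    q = 1.0 - p
    f_star = (b * p - q) / b
    # A negative f* means the bet has negative expected value: do not bet.
    return fraction * max(0.0, f_star)
```

For example, with p = 0.25 and net odds of 5 to 1, `kelly_fraction(0.25, 5)` suggests wagering 10% of the bankroll, while a horse with p = 0.10 at the same odds yields a stake of zero.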
Several issues came up while applying this methodology:
- The f values calculated were much too high and often led to an early bankruptcy.
- The Kelly Criterion assumes that the bets made are independent. In a race there can be up to 14 horses, and therefore up to 14 possible bets whose outcomes are mutually exclusive.
To remedy the first issue, we multiplied f by a constant between 0 and 1, and found that 0.2 worked best. For interested readers, we recommend looking up the derivation of the Kelly Criterion, which shows that such a coefficient slows the bankroll's exponential growth but reduces the variance of early outcomes. To address the second issue, we adapted an alternative betting strategy described in Peter Tompkin’s paper “An explicit solution to the problem of optimizing the allocations of a bettor’s wealth when wagering on horse races”. This strategy can be broken down into several steps:
2) Calculate the expected revenue rate of each bet, where D = 1 - tt and tt is the track take (the tax on one's bet).
3) Reorder the expected revenue rates in descending order, so that er1 is the best bet.
4) Initialize the set S = ∅, k = 1, and R(S) = 1. The best bet, er1, is therefore the first erk considered in step 5.
6) Repeat step 5 until its condition is no longer fulfilled, then set S0 = S.
As with the Kelly Criterion, the longevity of the bankroll can be improved by multiplying f by a fixed ratio, which we found to work best at 0.03.
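The sketch below illustrates the general shape of this kind of multi-horse allocation, following the commonly published form of the algorithm for mutually exclusive outcomes. It is not our project code, and several details are assumptions: the expected revenue rate of a horse is taken as p × (1 + net odds), with the posted net odds assumed to already reflect the track take; the reserve rate is taken as R(S) = (1 - Σ p_i) / (1 - Σ 1/(1 + odds_i)) over the horses in S; and the stake on each selected horse is taken as p_i - R(S) / (1 + odds_i). The function name, the `scale` argument, and the defensive guard are ours.

```python
def multi_horse_kelly(probs, net_odds, scale=1.0):
    """Allocate bankroll fractions across mutually exclusive bets in one race.

    probs:    model win probabilities, one per horse
    net_odds: posted net odds ("b to 1"), assumed to be after track take
    scale:    fixed ratio applied to every stake (1.0 = unscaled)
    """
    gross = [1.0 + q for q in net_odds]            # payout per $1 staked
    er = [p * g for p, g in zip(probs, gross)]     # expected revenue rates

    # Consider horses from the best expected revenue rate to the worst.
    order = sorted(range(len(probs)), key=lambda i: er[i], reverse=True)

    chosen = []
    prob_sum = 0.0   # sum of p_i over the chosen set S
    inv_sum = 0.0    # sum of 1 / (1 + odds_i) over the chosen set S
    reserve = 1.0    # R(S) for the empty set

    for i in order:
        if er[i] <= reserve:
            break  # this horse and all worse ones are not worth adding
        new_inv = inv_sum + 1.0 / gross[i]
        if new_inv >= 1.0:
            break  # defensive guard (our addition): avoid dividing by <= 0
        chosen.append(i)
        prob_sum += probs[i]
        inv_sum = new_inv
        reserve = (1.0 - prob_sum) / (1.0 - inv_sum)

    fractions = [0.0] * len(probs)
    for i in chosen:
        fractions[i] = scale * max(0.0, probs[i] - reserve / gross[i])
    return fractions
```

As a usage example, `multi_horse_kelly([0.5, 0.3, 0.2], [1.5, 2.0, 2.0])` stakes roughly 20% of the bankroll on the first horse, 5% on the second, and nothing on the third. When only one horse qualifies, the result reduces to the single-bet Kelly fraction for that horse.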
V. DATA OUTCOME
The outcome of our model is summarized in the table below, which holds the key statistics of our betting strategy.

| Statistic | Value |
|---|---|
| Sample Size -- Training and Testing | |
| Total training + test races | 2384 |
| Total training + test horses | 24863 |
| % of training races | 77.31% |
| % of training horses | 73.62% |
| Number of races in test | 541 |
| % of test races | 22.69% |
| Number of horses in test | 6559 |
| % of test horses | 26.38% |
| Initial Bankroll $ | $100,000.00 |
| Minimum Bankroll $ | $91,912.05 |
| Maximum Bankroll $ | $1,418,354.18 |
| Total number of bets made | 1953 |
| Average number of bets made per race | 3.61 |
| % of total bets possible made | 29.78% |
| Number of winners bet | 143 |
| % of rank 1 horses predicted | 26.38% |
| Biggest bet $ | $28,774.20 |
| Smallest bet $ | $1.30 |
| Biggest winning bet $ | $28,774.20 |
| Smallest winning bet $ | $1.30 |
| Biggest losing bet $ | $27,241.50 |
| Smallest losing bet $ | $1.30 |
| Average net win odds | 5.38 |
| Lowest net win odds | 0.1 |
| Highest net win odds | 17 |
| Final Bankroll $ | $1,054,003.96 |
| Total spend on bets made $ | $5,333,083.60 |
| ROI as betting % | 19.76% |
| Total ROI % | 1054.00% |
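As a quick arithmetic check, several of the derived rows can be recomputed from the raw figures in the table. The variable names below are ours; the numbers are copied from the table.

```python
# Raw figures taken from the summary table.
total_races, test_races = 2384, 541
test_horses = 6559
bets_made = 1953
initial_bankroll = 100_000.00
final_bankroll = 1_054_003.96
total_staked = 5_333_083.60

pct_test_races = round(100 * test_races / total_races, 2)        # 22.69
bets_per_race = round(bets_made / test_races, 2)                 # 3.61
pct_bets_made = round(100 * bets_made / test_horses, 2)          # 29.78

# The two ROI rows appear to be ratios of the FINAL bankroll, not net profit.
total_roi_pct = round(100 * final_bankroll / initial_bankroll, 2)    # 1054.0
roi_on_stakes_pct = round(100 * final_bankroll / total_staked, 2)    # 19.76

# Net profit, for comparison with the ratios above.
net_profit = round(final_bankroll - initial_bankroll, 2)             # 954003.96
```

On these figures, "Total ROI %" matches the final bankroll expressed as a percentage of the initial bankroll, and "ROI as betting %" matches the final bankroll divided by the total amount staked.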
This capstone project was completed by Basant Dhital, Tristan Dresbach, Jiwon Cha, SangYeon Choi, and Karim Zaatary in collaboration with RaceQuant through NYC Data Science Academy. Please contact Basant or Tristan via LinkedIn with any questions.