Numerai Hedge Fund Competition
Numerai is a hedge fund that uses a machine learning competition to crowdsource trade predictions. The competition is based on proprietary hedge fund data collected and curated by Numerai. The data is encrypted before it is made public because it is highly valuable and proprietary, and its quality gives Numerai a competitive edge. This lets Numerai obtain machine learning predictions on private data without ever exposing the underlying data.
Homomorphic, structure-preserving encryption is used to transform and encrypt the data. The competition data is also scaled and normalized, which leaves limited room for feature engineering, so participants have to rely on strong algorithms to succeed. This makes it an algorithm-versus-algorithm competition rather than one in which competitors spend endless hours on feature engineering.
Numerai argues that high-quality proprietary data is expensive to collect and provides a significant competitive advantage. Hedge funds and other financial institutions are best placed to collect and curate this data, but they have a strong incentive to keep it private. At the same time, these institutions employ only a small percentage of the world's machine learning talent. Numerai further argues that this makes financial markets inefficient with respect to machine learning.
This article at Wired.com provides a good background on Numerai.
With a competition-style format on encrypted data, Numerai is able to obtain machine learning predictions from a larger pool of data scientists, giving it a further competitive edge. Once individual participants submit their predictions, Numerai uses them to construct a meta-model, and it uses the predictions made by this meta-model for live trading. Cross-entropy between the meta-model and user predictions determines the leaderboard rankings for participants. The rationale for using a meta-model comes from the literature on ensemble learning.
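As a rough illustration of how a cross-entropy score can rank submissions, here is a minimal sketch. The exact formula and reference values Numerai uses are not described here, so the function names (`binary_cross_entropy`, `rank_participants`), the participant names and the toy numbers are all assumptions made for the example.

```python
import numpy as np


def binary_cross_entropy(reference, predicted, eps=1e-15):
    """Mean binary cross-entropy of `predicted` probabilities against `reference`.

    `reference` may hold hard 0/1 labels or probabilities
    (for example, the output of a meta-model).
    """
    reference = np.asarray(reference, dtype=float)
    # Clip predictions so that log(0) never occurs.
    predicted = np.clip(np.asarray(predicted, dtype=float), eps, 1 - eps)
    losses = reference * np.log(predicted) + (1 - reference) * np.log(1 - predicted)
    return float(-np.mean(losses))


def rank_participants(reference, submissions):
    """Return participant names sorted from lowest (best) to highest loss."""
    scores = {name: binary_cross_entropy(reference, p) for name, p in submissions.items()}
    return sorted(scores, key=scores.get)


# Toy, made-up numbers purely for illustration (hypothetical participants).
reference = np.array([1, 0, 1, 1, 0])
submissions = {
    "confident_and_right": np.array([0.9, 0.1, 0.8, 0.7, 0.2]),
    "coin_flip": np.full(5, 0.5),
    "confident_and_wrong": np.array([0.1, 0.9, 0.2, 0.3, 0.8]),
}
ranking = rank_participants(reference, submissions)
print(ranking)
```

A constant prediction of 0.5 always scores ln 2 ≈ 0.6931 under this measure. That is a useful baseline: the cross-entropy values of about 0.692 reported in the table later in this article sit only slightly below it.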
The construction of the meta-model and its bias over time can also be compared to financial markets. Financial markets are driven by decisions made by individuals, which, when combined, determine the market direction. A similar process is at work in the movement of the meta-model: it is built on the predictions of individual participants, and they determine its direction. Additionally, by using a meta-model formed out of a bag of predictions, Numerai is able to take a portfolio-theory approach to predictions. The meta-model is made up of individual bets placed by many data scientists. Averaging these bets significantly reduces individual systematic bias and model variance. The bias that remains can be interpreted as learning deduced from the data.
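The averaging argument can be checked with a small simulation. This is only a sketch under stated assumptions: Numerai's actual meta-model construction is not public in this article, so a plain mean stands in for it, and the signal range, bias scale and noise scale are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows, n_models = 10_000, 50

# Assumed weak "true" signal, close to 0.5 as is typical of noisy financial data.
true_prob = rng.uniform(0.45, 0.55, size=n_rows)

# Each hypothetical participant = true signal + own constant bias + independent noise.
biases = rng.normal(0.0, 0.02, size=(n_models, 1))
noise = rng.normal(0.0, 0.05, size=(n_models, n_rows))
predictions = np.clip(true_prob + biases + noise, 0.0, 1.0)

# Stand-in meta-model: the simple average of all participants' predictions.
meta_model = predictions.mean(axis=0)

# Mean squared error against the true signal, per participant and for the average.
individual_mse = np.mean((predictions - true_prob) ** 2, axis=1)
meta_mse = float(np.mean((meta_model - true_prob) ** 2))

print(f"best individual MSE: {individual_mse.min():.6f}")
print(f"meta-model MSE:      {meta_mse:.6f}")
```

The reduction only applies to errors that are independent across participants. A bias shared by every model survives averaging, which is consistent with reading the leftover bias as something learned from the data.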
Our Approach
We tried dozens of algorithms to gauge their effectiveness on the data sets. Algorithms with high time or memory complexity, or with poor learning ability, were eliminated step by step. Our aim in model selection was to maintain a quick turnaround time so that we could quickly benchmark results, tweak knobs and resubmit.
The following table shows the time taken, accuracy, loss (cross-entropy) and F1-score of multiple models. All the algorithms were executed on Google Compute High-CPU instances (64 cores, 60 GB memory, Ubuntu 16.04).
Algorithm | Time (in secs) | Accuracy | Cross-entropy | F1-score |
Stochastic Gradient Descent with log loss | 7.0169 | 0.5172 | 0.6924 | 0.52039 |
Random Forests Classifier | 2.3170 | 0.5130 | 0.6926 | 0.4905 |
Gradient Boosting Classifier | 110.1550 | 0.5140 | 0.6924 | 0.5074 |
Multilayer Perceptron | 101.9041 | 0.5117 | 0.6926 | 0.5235 |
XGBoost Classifier | 18.5751 | 0.5117 | 0.6925 | 0.5035 |
Extra Trees Classifier | 26.2257 | 0.5165 | 0.6924 | 0.5221 |
Decision Trees Classifier | 5.0928 | 0.5147 | 0.6927 | 0.4695 |
Logistic Regression Classifier | 20.3553 | 0.5143 | 0.6928 | 0.5116 |
Keras Classifier | 75.5020 | 0.5078 | 0.6929 | 0.4035 |
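A benchmark table of this shape can be produced with a small harness like the one below. It is a sketch, not the code used for the numbers above: the data is a synthetic stand-in generated with scikit-learn, only three of the model families are shown, the hyperparameters are arbitrary, and the timing here covers fit plus predict, which is an assumption about what "time taken" means.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the competition data (assumption: binary target).
X, y = make_classification(n_samples=2000, n_features=21, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Arbitrary illustrative settings, not the tuned models from the article.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}


def benchmark(model, X_train, y_train, X_test, y_test):
    """Fit one model and return timing plus the three metrics from the table."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    elapsed = time.perf_counter() - start
    labels = (proba >= 0.5).astype(int)  # threshold probabilities into class labels
    return {
        "time_secs": elapsed,
        "accuracy": accuracy_score(y_test, labels),
        "cross_entropy": log_loss(y_test, proba),
        "f1": f1_score(y_test, labels),
    }


results = {name: benchmark(m, X_train, y_train, X_test, y_test) for name, m in models.items()}

print("Algorithm | Time (in secs) | Accuracy | Cross-entropy | F1-score |")
for name, r in results.items():
    print(f"{name} | {r['time_secs']:.4f} | {r['accuracy']:.4f} | "
          f"{r['cross_entropy']:.4f} | {r['f1']:.4f} |")
```

Keeping every model behind the same `benchmark` function makes it cheap to add or drop candidates, which suits the quick-turnaround goal described above.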
Results
We optimized and calibrated individual models using Bayesian hyperparameter optimization and came up with three ensemble models that consistently produced low variance and low bias. The models were ensembled by soft voting, by hard voting, and by using predictions from previous models as additional features. A deep neural network in Keras was used as the meta-classifier. The same algorithms were used for predictions on the following week's dataset.
The soft voting ensemble model consistently ranked in the top 5 in both weeks. The other two models consistently ranked between 5th and 20th position.
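The three ensembling strategies described above can be sketched with scikit-learn alone. This is an illustration under several assumptions: the data is synthetic, the base estimators and their settings are arbitrary, and scikit-learn's `MLPClassifier` is swapped in for the Keras deep neural network meta-classifier to keep the example self-contained. `passthrough=True` feeds the original features alongside the base-model predictions to the final estimator, which is one way to use earlier predictions as additional features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=1500, n_features=21, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


def base_estimators():
    """Fresh, untuned base models (illustrative choices only)."""
    return [
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ]


# Soft voting averages predicted probabilities; hard voting takes a majority of labels.
soft = VotingClassifier(base_estimators(), voting="soft")
hard = VotingClassifier(base_estimators(), voting="hard")

# Stacking: out-of-fold base predictions plus the original features feed a small
# neural network (a stand-in for the Keras meta-classifier).
stacked = StackingClassifier(
    base_estimators(),
    final_estimator=MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    passthrough=True,
    cv=3,
)

for model in (soft, hard, stacked):
    model.fit(X_train, y_train)

soft_loss = log_loss(y_test, soft.predict_proba(X_test)[:, 1])
stacked_loss = log_loss(y_test, stacked.predict_proba(X_test)[:, 1])
hard_accuracy = hard.score(X_test, y_test)  # hard voting yields labels, not probabilities

print(f"soft voting cross-entropy: {soft_loss:.4f}")
print(f"stacking cross-entropy:    {stacked_loss:.4f}")
print(f"hard voting accuracy:      {hard_accuracy:.4f}")
```

Because hard voting outputs class labels only, it cannot be scored with cross-entropy directly, so accuracy is reported for it instead.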
Give us a shout if you want to chat about additional details.
LinkedIn - Kamal Sandhu and Abhishek Desai
Future
In the near future, we would like to build a deep learning model using transfer learning, with full generalization and automation from week to week. An implementation using PySpark (for parallelization) and MongoDB (for parameter tracking) is also in the works.
Tools Used
Python, Anaconda, Jupyter Notebook, PyCharm, Pandas, JSON, Scikit-learn, XGBoost, Keras, Hyperopt, Scikit-optimize, Mlxtend, Google Compute, Linux