Numerai Hedge Fund Competition

Posted on Mar 30, 2017

Numerai is a hedge fund that uses a machine learning competition to crowdsource trade predictions. The competition is based on proprietary hedge fund data collected and curated by Numerai. Because the data is highly valuable and its quality gives Numerai a competitive edge, it is encrypted before being made public. This lets Numerai obtain machine learning predictions on private data without ever exposing the data itself.

Structure-preserving homomorphic encryption is used to transform and encrypt the data. The competition data is also scaled and normalized, which leaves limited room for feature engineering, so participants have to rely on strong algorithms to succeed. This makes it an algorithm-versus-algorithm competition rather than one in which competitors spend endless hours on feature engineering.

Numerai argues that high-quality proprietary data is expensive to collect and provides a significant competitive advantage. Hedge funds and other financial institutions are well placed to collect and curate this data, but they have a strong incentive to keep it private. At the same time, these institutions employ only a small percentage of the world's machine learning talent. Numerai further argues that this makes financial markets inefficient with respect to machine learning.

This article provides good background on Numerai.

With a competition-style format on encrypted data, Numerai is able to obtain machine learning predictions from a larger pool of data scientists, which gives it a further competitive edge. Once individual participants submit their predictions, Numerai uses them to construct a meta-model, and it uses the predictions made by this meta-model for live trading. Cross-entropy between the meta-model and user predictions determines the leaderboard rankings for participants. The rationale for using a meta-model comes from the literature on ensemble learning.
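
The scoring idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Numerai's actual meta-model construction is not described here, so a plain average stands in for it, and the participant names and probabilities are hypothetical.

```python
import math


def meta_model(user_predictions):
    """Average each row's predictions across users (an assumed aggregation rule)."""
    n_users = len(user_predictions)
    return [sum(row) / n_users for row in zip(*user_predictions.values())]


def cross_entropy(reference, predicted, eps=1e-15):
    """Mean binary cross-entropy of `predicted` measured against `reference`."""
    total = 0.0
    for p, q in zip(reference, predicted):
        q = min(max(q, eps), 1 - eps)  # clip to avoid log(0)
        total -= p * math.log(q) + (1 - p) * math.log(1 - q)
    return total / len(reference)


# Hypothetical probabilities from three participants for four rows.
submissions = {
    "user_a": [0.52, 0.48, 0.55, 0.50],
    "user_b": [0.50, 0.47, 0.53, 0.49],
    "user_c": [0.60, 0.40, 0.70, 0.35],
}

meta = meta_model(submissions)
scores = {name: cross_entropy(meta, preds) for name, preds in submissions.items()}

# Lower cross-entropy means closer agreement with the meta-model.
ranking = sorted(scores, key=scores.get)
print(ranking)
```

In this toy setup a participant whose predictions sit far from the consensus ends up at the bottom of the ranking.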

The construction of the meta-model and its bias over time can also be compared to financial markets. Financial markets are based on decisions made by individuals which, when combined, determine the market direction. A similar process is at work in the movement of the meta-model: it is built on the predictions of the individual participants, and its direction is determined by them. Additionally, by using a meta-model formed from a bag of predictions, Numerai is able to take a portfolio theory approach to predictions. The meta-model is made up of individual bets placed by many data scientists. Averaging these bets significantly reduces individual systematic bias and model variance. The bias that remains can be interpreted as what has been learned from the data.
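
The variance-reduction effect of averaging can be illustrated with a small simulation. It is a sketch under a simplifying assumption that is ours, not Numerai's: each participant's error is independent, zero-mean Gaussian noise around a hypothetical "true" probability. Errors shared by all participants would not average out in this way.

```python
import random
import statistics

random.seed(0)

TRUE_PROBABILITY = 0.55  # hypothetical "correct" prediction for one row
NOISE_STD = 0.05         # hypothetical spread of an individual participant
N_MODELS = 50            # hypothetical number of participants in the average
N_TRIALS = 2000


def noisy_prediction():
    """One participant's prediction: the truth plus independent noise."""
    return TRUE_PROBABILITY + random.gauss(0.0, NOISE_STD)


# Predictions from a single participant, repeated over many trials.
single = [noisy_prediction() for _ in range(N_TRIALS)]

# Meta-model style predictions: the mean of N_MODELS participants per trial.
averaged = [
    statistics.fmean([noisy_prediction() for _ in range(N_MODELS)])
    for _ in range(N_TRIALS)
]

single_var = statistics.pvariance(single)
averaged_var = statistics.pvariance(averaged)
print(f"single: {single_var:.6f}  averaged: {averaged_var:.6f}")
```

With independent errors the variance of the average shrinks roughly in proportion to the number of participants, which is the same argument that motivates diversification in portfolio theory.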

Our Approach

We tried dozens of algorithms to gauge their effectiveness on the data sets. Algorithms with high time or memory complexity, or with poor learning ability, were eliminated step by step. Our aim in model selection was to keep the turnaround time short so that we could quickly benchmark results, tweak knobs and resubmit.

The following table shows the time taken, accuracy, cross-entropy loss and F1-score of several models. All algorithms were run on Google Compute High-CPU instances (64 cores, 60 GB memory, Ubuntu 16.04).

Algorithm                                 | Time (s) | Accuracy | Cross-entropy | F1-score
Stochastic Gradient Descent with log loss |   7.0169 |   0.5172 |        0.6924 |  0.52039
Random Forests Classifier                 |   2.3170 |   0.5130 |        0.6926 |   0.4905
Gradient Boosting Classifier              | 110.1550 |   0.5140 |        0.6924 |   0.5074
Multilayer Perceptron                     | 101.9041 |   0.5117 |        0.6926 |   0.5235
XGBoost Classifier                        |  18.5751 |   0.5117 |        0.6925 |   0.5035
Extra Trees Classifier                    |  26.2257 |   0.5165 |        0.6924 |   0.5221
Decision Trees Classifier                 |   5.0928 |   0.5147 |        0.6927 |   0.4695
Logistic Regression Classifier            |  20.3553 |   0.5143 |        0.6928 |   0.5116
Keras Classifier                          |  75.5020 |   0.5078 |        0.6929 |   0.4035
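
A benchmark of this kind can be sketched with scikit-learn as follows. This is not our original harness: the data is a synthetic stand-in generated with make_classification (the tournament data is not reproduced here), and the three models and their settings are illustrative choices. The constant 0.5 baseline is included because its cross-entropy is ln 2, about 0.6931, which is a useful reference point when reading the loss column.

```python
import math
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic, deliberately noisy stand-in for the tournament data (assumption).
X, y = make_classification(n_samples=2000, n_features=21, flip_y=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative subset of models; hyperparameters are not the ones we used.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

results = []
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start

    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    labels = model.predict(X_test)
    results.append(
        {
            "model": name,
            "seconds": elapsed,
            "accuracy": accuracy_score(y_test, labels),
            "cross_entropy": log_loss(y_test, proba),
            "f1": f1_score(y_test, labels),
        }
    )

# Reference point: always predicting 0.5 gives a cross-entropy of ln 2.
baseline = log_loss(y_test, [0.5] * len(y_test))

for row in results:
    print(row)
print("constant 0.5 baseline:", baseline, "ln 2:", math.log(2))
```

Timing only the fit call keeps the comparison focused on training cost; prediction time could be measured the same way if it matters for the turnaround.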


We optimized and calibrated the individual models using Bayesian hyperparameter optimization and arrived at three ensemble models that consistently showed low variance and low bias. The models were ensembled by soft voting, by hard voting, and by using predictions from previous models as additional features. A deep neural network in Keras was used as a meta-classifier. The same algorithms were used for predictions on the following week's data set.
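
The three ensembling styles can be sketched with scikit-learn as below. This is an approximation rather than our actual pipeline: the base models and data are illustrative, scikit-learn's VotingClassifier and StackingClassifier are used for the sketch, and a small MLPClassifier stands in for the Keras network as the meta-classifier. Setting passthrough=True feeds the original features to the meta-classifier alongside the base models' predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption); the tournament data is not used here.
X, y = make_classification(n_samples=1500, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative base models; each ensemble below fits its own clones of these.
base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
]

# Soft voting averages predicted probabilities; hard voting takes a majority
# vote over predicted labels, so it does not produce probabilities.
soft = VotingClassifier(estimators=base, voting="soft")
hard = VotingClassifier(estimators=base, voting="hard")

# Stacking: base model predictions become extra features for a meta-classifier.
stacked = StackingClassifier(
    estimators=base,
    final_estimator=MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=500, random_state=0
    ),
    passthrough=True,
    cv=3,
)

accuracy = {}
for name, model in [("soft", soft), ("hard", hard), ("stacked", stacked)]:
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

# Cross-entropy needs probabilities, so it is reported only where available.
cross_entropy = {
    "soft": log_loss(y_test, soft.predict_proba(X_test)[:, 1]),
    "stacked": log_loss(y_test, stacked.predict_proba(X_test)[:, 1]),
}

print(accuracy)
print(cross_entropy)
```

Because hard voting returns labels only, a submission scored on cross-entropy would need probabilities from another source, which is one practical reason to prefer soft voting in a probability-scored competition.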

The soft voting ensemble ranked in the top 5 in both weeks. The other two models ranked between the 5th and 20th positions.

Give us a shout out if you want to chat about additional details.

LinkedIn - Kamal Sandhu and Abhishek Desai


In the near future, we would like to build a deep learning model using transfer learning, with full generalization and automation from week to week. An implementation that uses PySpark for parallelization and MongoDB for parameter tracking is also in the works.

Tools Used

Python, Anaconda, Jupyter Notebook, PyCharm, Pandas, JSON, Scikit-learn, XGBoost, Keras, Hyperopt, Scikit-optimize, Mlxtend, Google Compute, Linux

About Authors

Kamal Sandhu

Kamal Sandhu is a finance professional keenly interested in the potential of data science in combination with financial and management theory. He is working towards the Chartered Financial Analyst (CFA) program and the Financial Risk Manager (FRM) program....

Abhishek Desai

I'm interested in all things mechanical, but particularly the ability to use machine learning and algorithm design to locate the areas of development where efficiency can be harnessed to advance business interests. With 10+ years of experience...