Numerai Hedge Fund Competition

and
Posted on Mar 30, 2017

Numerai is a hedge fund that uses a machine learning competition to crowd source trade predictions. Competition is based on proprietary hedge fund data collected and curated by Numerai. Data is encrypted before being made public because it is highly valuable, proprietary and its quality provides a competitive edge for Numerai. This enables Numerai to obtain machine learning predictions on private data without ever making it public.

Homomorphic structure preserving encryption is used to transform and encrypt the data. Additionally, competition data is scaled and normalized. This scaling and normalization leaves limited room for feature engineering and participants have to rely on strong algorithms to achieve success. This makes it an algorithm vs algorithm competition rather than competitors spending endless hours feature engineering.

Numerai argues that high quality proprietary data is expensive to collect and provides a significant competitive advantage. Hedge funds and other financial institutions are in an optimal place to collect and curate this data but they have a strong incentive to keep it private and guard it. But these institutions only employ a small percent of the world's machine learning talent pool. It further argues that this makes financial markets machine learning inefficient.

This article at Wired.com provides a good background on Numerai.

With a competition style format on encrypted data, Numerai is able to obtain machine learning predictions from a larger pool of data scientist giving it a further competitive edge. Once individual participants submit their predictions, Numerai uses them to construct a meta-model. It uses the predictions made by this meta-model for live trading. Cross-entropy between the meta-model and user predictions determine the leader-board rankings for participants. Rationale for using a meta-model comes from the literature on ensemble learning.

Meta model construction and its bias over time can also be compared to financial markets. Financial markets are based on decisions made by individuals, which, when combined determine the market direction. Similar processes are at work with the movement of this meta model. It is build on the predictions of the individual participants and the direction is determined by them. Additionally, by using a meta-model formed out of a bag of predictions, Numerai is able to take a portfolio theory approach to predictions. Meta-model is made of individual bets made by many data scientists. Averaging of these bets significantly reduces the individual systematic bias and model variance. Resulting leftover bias can be interpreted as learning deduced from the data.

Our Approach

We tried dozens of algorithms to gauge their effectiveness on the data sets. Algorithms with high time, memory complexity or poor learning ability were eliminated step by step. Our aim for model selection was to maintain a quick turnaround time so that we can quickly benchmark results, twerk knobs and resubmit.

Following table shows the time taken, accuracy, loss and f1-score of multiple models. All the algorithms were executed on Google Compute High-CPU (64 core, 60 GB memory and Ubuntu 16.04) instances.

Algorithm Time (in secs) Accuracy Cross-entropy F1-score
Stochastic Gradient Descent with log optimizer 7.0169 0.5172 0.6924 0.52039
Random Forests Classifier 2.3170 0.5130 0.6926 0.4905
Gradient Boosting Classifier 110.1550 0.5140 0.6924 0.5074
Multilevel Perceptron 101.9041 0.5117 0.6926 0.5235
XGBoost Classifier 18.5751 0.5117 0.6925 0.5035
Extra Trees Classifier 26.2257 0.5165 0.6924 0.5221
Decision Trees Classifier 5.0928 0.5147 0.6927 0.4695
Logistic Regression Classifier 20.3553 0.5143 0.6928 0.5116
Keras Classifier 75.5020 0.5078 0.6929 0.4035

Results

We optimized and calibrated individual models using bayesian hyperparameter optimization and came up with three ensemble models that consistently produced low variance and low bias. Models were ensemble by soft voting, hard voting and using predictions from previous models as additional features. Deep neural network in Keras was used as a meta classifier. These same algorithms were used for predictions on the following week’s data-set. 

Soft voting ensemble model consistently ranked among top 5 for both the weeks. Other two models consistently ranked between the 5th and the 20th position.

Give us a shout out if you want to chat about additional details.

LinkedIn - Kamal Sandhu and Abhishek Desai

Future

In the near future, we would like to build a deep learning model using transfer learning along with full generalization and automation from week to week. Implementation using PySpark (for parallelizing) and by incorporating MongoDB (parameter tracking) will also be in the works.

Tools Used

Python, Anaconda, Jupyter Notebook, Pycharm, Pandas, JSON, Scikit-learn, Xgboost, Keras, Hyperopt, Scikit-optimize, Mlxtend, Google Compute, Linux

About Authors

Kamal Sandhu

Kamal Sandhu is a finance professional keenly interested in the potential of data science in combination with financial and management theory. He is working towards the Chartered Financial Analyst (CFA) program and the Financial Risk Manager (FRM) program....
View all posts by Kamal Sandhu >

Abhishek Desai

I'm interested in all things mechanical, but particularly the ability to use machine learning and algorithm design to to locate the areas of development where efficiency can be harnessed to advance business interests. With 10+ years of experience...
View all posts by Abhishek Desai >

Related Articles

Leave a Comment

No comments found.

View Posts by Categories


Our Recent Popular Posts


View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI