
Bitcoin LSTM Directional Prediction: 75% Accuracy Rate

Jayce Jiang
Posted on Sep 30, 2019

Introduction

The purpose of this capstone project was to evaluate the efficacy of a neural net-based approach to high-frequency trading in the cryptocurrency market. The test case considers the Deribit perpetual futures contract limit order book. Fifty signals were taken as predictors of a directional (buy/hold/sell) outcome. The final model is a Long Short-Term Memory ("LSTM") network that achieved an accuracy rate in excess of 75%.

Background

Trading models must be dynamic and flexible to be successful. Models may rely upon a multitude of variables, both fundamental and technical. In either case, it is well established that an information lag is apparent in either type of predictor; one need only consider the frequent revisions to past-performance indicators to substantiate this claim. It follows that any trading model worth its salt relies upon time-series data and so is rife with the spectre of autocorrelation. The choice of predictors, whether fundamental or technical, then comes into play. We assert that while technical signals exhibit greater noise than their fundamental counterparts, the market activity underlying that noise is of greater predictive import. This greater fluidity, together with the time constraints involved, led us to base our model on technical indicators alone.

Limit order book data is taken from April to May 2019 from the Deribit exchange, which has its principal offices in Ermelo, Netherlands. The choice of exchange was one not strictly of convenience but of prudence; Deribit is widely regarded as being less subject to the market manipulation that has plagued other exchanges. The limit order book consists of full order book depth snapshots with incremental updates, tick-by-tick trades, open interest, funding, liquidations, and quotes. The perpetual futures contract is a derivative product that is similar to a traditional futures contract but has a few differing specifications to minimize basis, including index-linkage to mimic a margin-based spot market and no expiry. The Deribit BTC Index, in particular, is measured as the exponentially weighted average price of the following six (6) constituents: Bitstamp, Gemini, Bitfinex, itBit, Coinbase, and Kraken. Limit orders receive a 0.025% rebate, while market orders are charged 0.05%. The following chart illustrates the LOB best bid from April 1 – April 15, 2019.

The raw-form data is, however, inconsistent in time; there are random intervals between observations. In order to structure the data so that it is suitable for analysis, a timestamp engineering mechanism was employed whereby data was sampled in 20-millisecond steps where available and imputed as the last known value where not. Subsequently, to yield a generator to feed the neural network, the data was aggregated over one-minute intervals, averaging, summing, or otherwise imputing as necessary.

The fifty (50) technical indicators are broadly classifiable by type and include measures of (i) order and return, (ii) trend, (iii) momentum, and (iv) volatility. Indicators in the main were sourced from the Technical Analysis Library in Python. Certain indicators were constructed, however, to leverage the data structure of the limit order book; examples include the number of canceled orders, volume oscillator(s), and instant volatility. Source literature is available upon request to the authors. [Detail to be added re lookback, etc.]

Feature Engineering

Timestamp Engineering

The raw-form data is inconsistent in time, arriving at random intervals ranging from 0 to 500 milliseconds between observations. To create consistent data, we tried two different approaches.

Approach 1: 20-Millisecond Systematic Time Sampling

Our first approach was to sample the data every 20 milliseconds, recording the data point closest to each 20-millisecond mark. If no data point was found within the time frame, we copied the last known row of data and repeated it. For the final step, we measured the price difference between the two time frames. To recap: we use the data from the previous timestamp as our X, and the price difference 20 ms later, at the current timestamp, as our Y. The data came in the form of a generator that can be fed directly into Keras's neural network framework. A minimal sketch of this sampling step appears below.
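Here is a minimal, hypothetical sketch of the sampling step with pandas. The column name and timestamps are invented, and the last-observation-in-bin rule used here is a simplification of the nearest-to-mark matching described above:

```python
import pandas as pd

# Hypothetical raw LOB snapshots arriving at irregular times.
raw = pd.DataFrame(
    {"mid_price": [5000.00, 5000.50, 5000.25, 5001.00]},
    index=pd.to_datetime([
        "2019-04-01 00:00:00.003", "2019-04-01 00:00:00.017",
        "2019-04-01 00:00:00.041", "2019-04-01 00:00:00.095",
    ]),
)

# Sample onto a fixed 20 ms grid; when no observation falls inside a
# step, carry the last known row forward.
sampled = raw.resample("20ms").last().ffill()

# X is the state at the previous timestamp; Y is the price difference
# 20 ms later (the current timestamp).
sampled["y_price_diff"] = sampled["mid_price"].diff().shift(-1)
```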

However, we soon realized this approach did not perform well across the three different models: the time difference was too small for the scikit-learn and Keras pipelines to exploit effectively, and most of the time there was no noticeable change in price. This made the models consistently predict zero price change, biasing our predictions.

Approach 2: 1-Minute Time Aggregation

In the next iteration of our project, we decided to aggregate the data over one-minute intervals, either averaging or summing each column as its signal calculation demands. Since the dataset size was thereby reduced from a few million rows for one month of data to around 22,000 data points, we saved the resulting dataset to a CSV file for convenience and general testing purposes. We also changed our Y from a regression to a classification problem: instead of trying to predict the magnitude of the change, we are just trying to predict the direction of Bitcoin's price. A sketch of this aggregation follows.
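Continuing the previous sketch, this is one plausible way to do the aggregation; the column names, aggregation mapping, and dead-band threshold for the "nothing" class are assumptions, not the project's exact code:

```python
# raw: the irregular LOB DataFrame from the previous sketch, with a
# traded-volume column added here for illustration.
raw["volume"] = [0.5, 1.2, 0.0, 2.1]

# Average price-like columns, sum flow-like columns, per one-minute bar.
bars = raw.resample("1min").agg({"mid_price": "mean", "volume": "sum"})

# Turn the regression target into a 3-class direction label:
# -1 = sell, 0 = nothing, +1 = buy (the dead band is an assumption).
move = bars["mid_price"].diff().shift(-1)
band = 0.5  # hypothetical dead band, in USD
bars["y"] = 0
bars.loc[move > band, "y"] = 1
bars.loc[move < -band, "y"] = -1

bars.to_csv("btc_1min_features.csv")  # ~22,000 rows for one month
```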

Signal Engineering

Here are some of the signals we use in our program; a sketch of how such signals can be computed follows the lists below:

Volume

  • Accumulation/Distribution Index (ADI)
  • On-Balance Volume (OBV)
  • Chaikin Money Flow (CMF)
  • Force Index (FI)
  • Ease of Movement (EoM, EMV)
  • Volume-price Trend (VPT)
  • Negative Volume Index (NVI)

Volatility

  • Instant Volatility (IV)
  • Average True Range (ATR)
  • Bollinger Bands (BB)
  • Keltner Channel (KC)
  • Donchian Channel (DC)

Trend

  • Moving Average Convergence Divergence (MACD)
  • Average Directional Movement Index (ADX)
  • Vortex Indicator (VI)
  • Trix (TRIX)
  • Mass Index (MI)
  • Commodity Channel Index (CCI)
  • Detrended Price Oscillator (DPO)
  • KST Oscillator (KST)
  • Ichimoku Kinkō Hyō (Ichimoku)

Momentum

  • Money Flow Index (MFI)
  • Relative Strength Index (RSI)
  • True Strength Index (TSI)
  • Ultimate Oscillator (UO)
  • Stochastic Oscillator (SR)
  • Williams %R (WR)
  • Awesome Oscillator (AO)
  • Kaufman's Adaptive Moving Average (KAMA)

Others

  • Cancel Signal (CS)
  • 15 Interval Return (15IR)
  • Daily Return (DR)
  • Daily Log Return (DLR)
  • Cumulative Return (CR)
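Most of these come straight from the Technical Analysis Library in Python (the `ta` package), while the LOB-native signals were built by hand. A hedged sketch, assuming one-minute OHLCV bars with the column names shown:

```python
import numpy as np
import pandas as pd
import ta  # Technical Analysis Library in Python (pip install ta)

bars = pd.read_csv("btc_1min_bars.csv", index_col=0, parse_dates=True)

# Appends the library's full indicator set (ADI, OBV, CMF, MACD, RSI,
# Bollinger Bands, ATR, Ichimoku, ...) as new columns.
bars = ta.add_all_ta_features(
    bars, open="open", high="high", low="low",
    close="close", volume="volume", fillna=True,
)

# Hand-built LOB signals, e.g. an instant-volatility proxy: the rolling
# standard deviation of one-minute log returns (the window is an assumption).
log_ret = np.log(bars["close"]).diff()
bars["instant_vol"] = log_ret.rolling(window=15).std()
```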

Machine Learning Modeling

Step 1: Testing Individual Models

Base Model: Logistic Regression Classification

For classification problems, Logistic Regression is a good base model against which to compare performance. However, it has some limitations:

  1. It requires the observations to be independent of each other. In a continuous time-series, this is almost never the case.
  2. It demands little to no multicollinearity among the independent variables. Most of our signals, such as common moving-average or momentum signals, are derived from the same underlying variables (bid price, ask price, volume), so they are not completely independent (a quick check for this is sketched after this list).
  3. The independent variables need to be linearly related to the log odds.
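Point 2 can be checked directly with variance inflation factors; here is a hedged sketch, with hypothetical signal column names:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = (pd.read_csv("btc_1min_features.csv", index_col=0)
       [["bid_price", "ask_price", "volume", "rsi", "macd"]]  # hypothetical columns
       .dropna())

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # values far above ~10 indicate severe multicollinearity
```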

Since our dataset breaks many of these assumptions of Logistic Regression, we did not expect this model to perform well. After performing hyperparameter tuning using Bayesian optimization, here are the confusion matrix of the Logistic Regression model and its accuracy.

Logistic Regression score across different time periods: 0.453 - 0.6496

Here is one of the outputs of the Logistic Regression model.

Y_Pred:

  • Sell: 12348
  • Nothing: 5
  • Buy: 9925

Y_Actual:

  • Sell: 15551
  • Nothing: 773
  • Buy: 5954

As you can see from the results, the LR model was very inconsistent, and most of the time it was just barely better than a coin flip. So we decided to move to a model that can handle the non-linearity and complexity of our signals.
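For reference, the tuning step can be reproduced in spirit with scikit-optimize's BayesSearchCV over a Logistic Regression; the search space, iteration count, and split counts here are assumptions rather than the project's settings:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from skopt import BayesSearchCV          # scikit-optimize
from skopt.space import Real

bars = pd.read_csv("btc_1min_features.csv", index_col=0).dropna()
X, y = bars.drop(columns="y"), bars["y"]

search = BayesSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": Real(1e-3, 1e3, prior="log-uniform")},  # regularization strength
    n_iter=25,
    cv=TimeSeriesSplit(n_splits=5),      # respect temporal ordering
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```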

Advanced Base Model: Support Vector Machine Classification

The next algorithm we tried was the SVM, a non-linear, non-parametric classification technique. SVM employs the kernel trick and maximal-margin concepts, so in theory it performs much better on non-linear and high-dimensional tasks. However, there is "no free lunch" for using SVM.

Benefits of SVM:

  • Flexibility: Capable of fitting a large number of functional forms.
  • Power: No assumptions or weak assumptions about the underlying function.
  • Performance on complex problems: Can result in higher-performing models for prediction.

Limitations:

  • Time-consuming: Finding the right kernel function is not easy.
  • Slower: Kernelized SVMs require computing a distance function between each pair of points in the dataset, which is the dominating cost of O(n_features × n_observations²), resulting in long training times for large datasets.
  • Overfitting: There is a greater risk of overfitting the training data, and it is harder to explain why specific predictions are made.

After performing hyperparameter tuning using Bayesian optimization, here are the confusion matrix of the SVM classifier and its accuracy.

SVM score across different time periods: 0.513 - 0.639

Here is one of the outputs of the SVM model.

Y_Pred:

  • Sell: 19904
  • Nothing: 2345
  • Buy: 29

Y_Actual:

  • Sell: 15551
  • Nothing: 773
  • Buy: 5954

The SVM consistently does much better than the LR model, which is not a surprise. However, as the dataset grows, training time increases roughly quadratically with the number of observations, making testing of different models impractical: every model took around 4 to 8 hours to train, depending on the kernel type. Due to the long training times, we decided to use a neural network to further increase performance and handle a large amount of data.
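A minimal sketch of this SVM stage, with an RBF kernel and feature scaling; the hyperparameter values are placeholders rather than the tuned ones:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

bars = pd.read_csv("btc_1min_features.csv", index_col=0).dropna()
X, y = bars.drop(columns="y"), bars["y"]

# shuffle=False keeps the temporal split: train on the past, test on the future.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

svm_clf = make_pipeline(
    StandardScaler(),                    # distance-based methods need scaling
    SVC(kernel="rbf", C=10.0, gamma="scale"),
)
svm_clf.fit(X_train, y_train)            # roughly O(n^2): hours on ~22k rows
print(svm_clf.score(X_test, y_test))
```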

Super Advanced Model: Long Short-Term Memory Neural Network

While the models thus far achieved some level of accuracy, they are crucially inadequate for time-series data. We needed a model that could accommodate time, as well as the non-linearity and size of our data. Given these constraints, we decided to use a Long Short-Term Memory (LSTM) neural network. It is a variant of the Recurrent Neural Network (RNN), which feeds previous states into the current one, thus accommodating the timestep nature of our data. One of the primary concerns with an RNN is that its weight updates change too quickly from state to state, losing underlying meaning beyond a certain horizon. An LSTM addresses this deficiency by implementing a 'permanent memory cell' whose weight matrix is independent of each layer's matrix.

We then added a dropout layer, using the standard value of 0.2, to discourage overfitting on our training data. A dropout layer omits the desired portion of neurons, at random, during training. This type of technique is called regularization, and it manages the bias-variance tradeoff. We also used L1 regularization on our LSTM layer, which drives the weight matrices toward 0, resulting in the omission of certain variables. Given that markets change regimes at unknown intervals, we wanted to make sure that our model could handle a test data set with entirely different conditions.

Our final layer was a Dense layer with a softmax activation function. Softmax converts a vector into a probability distribution, which allows us to interpret the probability of the next direction. We then compiled the model with categorical cross-entropy loss, which measures the log loss between the predicted and actual outputs and is minimized during training.

We then began training the model. After some tuning, we realized something critical about our data: accurately predicting direction (up or down) was more important than predicting that the price would stay the same between two periods. Thus, we ended up weighting our classes of [sell, hold, buy] as [1, 0.2, 1]; that is, we drastically reduced the importance of accurately predicting hold. An improvement upon this weighting would be to scale it with the volatility of the asset being traded.
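A minimal Keras sketch of the network described above. The layer width, lookback window, batch size, optimizer, and stand-in data are assumptions; the 0.2 dropout, L1 penalty on the LSTM layer, softmax output, categorical cross-entropy loss, and [1, 0.2, 1] class weights follow the text:

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l1

n_timesteps, n_features = 60, 50         # lookback window is an assumption

model = Sequential([
    LSTM(64, input_shape=(n_timesteps, n_features),
         kernel_regularizer=l1(1e-4)),   # L1 drives weights toward zero
    Dropout(0.2),                        # omit 20% of units at random
    Dense(3, activation="softmax"),      # P(sell), P(hold), P(buy)
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Down-weight the "hold" class as described: [sell, hold, buy] -> [1, 0.2, 1].
class_weight = {0: 1.0, 1: 0.2, 2: 1.0}

# Stand-in data for illustration; in practice, sliding windows over the
# one-minute feature bars and one-hot direction labels.
X_seq = np.random.rand(1000, n_timesteps, n_features).astype("float32")
y_onehot = np.eye(3)[np.random.randint(0, 3, 1000)]

model.fit(X_seq, y_onehot, epochs=50, batch_size=128,
          validation_split=0.2, class_weight=class_weight)
```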

When it came to model tuning, we noticed that validation loss increased with time while validation accuracy dropped. This is symptomatic of an insufficiently large data set, and enlarging it is one of the next steps in our project. Our model converged on the training and validation datasets rather quickly (under 50 epochs), and we obtained 74% average accuracy across a number of runs. We want to note that although this accuracy is high, there are still many steps to take before this black box can be used effectively in execution.

Future Work

  • Build in additional features and adjust model parameters to capture the impact of exogenous factors and develop model intuition.
  • Evaluate liquidity characteristics by exchange, including the impact of funding, via a thorough analysis of exchange funding algorithms.
  • Develop a multi-exchange framework to capture potential arbitrage opportunities.
  • Evaluate the nuances of optimizing loss versus accuracy.
  • Build additional models including negative affirmation trading, price magnitude sensitivity, and intra-market cross-currency correlation models.

Project GitHub Repository || Jayce Jiang LinkedIn Profile

About Author

Jayce Jiang

Jayce Jiang is a former NYC Data Science Fellow and Data Engineer with dual bachelor's degrees in Aerospace and Mechanical Engineering from the University of Florida. He is currently the founder of Strictly By The Numbers, www.strictlybythenumbers.com, and...
