Bitcoin LSTM Directional Prediction: 75% Accuracy Rate
Introduction
The purpose of this capstone project was to evaluate the efficacy of a neural-network-based approach to high-frequency trading in the cryptocurrency market. The test case considers the limit order book of the Deribit perpetual futures contract. Fifty signals were taken as predictors of a directional outcome. The final model is a Long Short-Term Memory (“LSTM”) network that achieved an accuracy rate in excess of 75%.
Background
Trading models must be dynamic and flexible to be successful. They may rely upon a multitude of variables, both fundamental and technical. In either case, it is well established that an information lag affects both types of predictor; one need only consider the frequent revisions to past-performance indicators to substantiate this claim. It follows that any trading model worth its salt relies upon time-series data and must therefore contend with the spectre of autoregression. The choice between fundamental and technical predictors then comes into play. We assert that while technical signals exhibit greater noise than their fundamental counterparts, the underlying source of that noise, market activity itself, is of greater predictive import. This greater fluidity, together with exogenous and project time constraints, led us to base our model on technical indicators alone.
Limit order book data is taken from [April to May 2019] from the Deribit exchange, which has principal offices in Ermelo, Netherlands. The choice of exchange was not strictly one of convenience but of prudence; Deribit is widely regarded as being less subject to the market manipulation that has plagued other exchanges. The limit order book data consists of full order book depth snapshots with incremental updates, tick-by-tick trades, open interest, funding, liquidations, and quotes. The perpetual futures contract is a derivative product similar to a traditional futures contract, but with a few differing specifications intended to minimize basis, including index linkage to mimic a margin-based spot market and no expiry. The Deribit BTC Index, in particular, is measured as the exponentially weighted average price of the following six (6) constituents: Bitstamp, Gemini, Bitfinex, Itbit, Coinbase, and Kraken. Limit orders receive a 0.025% rebate while market orders are charged 0.05%. The following chart illustrates the best bid from the LOB data from April 1 – April 15, 2019.
The raw data is, however, inconsistent in time; there are irregular intervals between observations. To structure the data so that it is suitable for analysis, a timestamp engineering mechanism was employed whereby data was sampled in 20-millisecond steps where available and otherwise imputed as the last known value. To yield a generator to feed the neural network, the data was subsequently aggregated over one-minute intervals, averaging, summing, or otherwise imputing values as necessary.
The fifty (50) technical indicators are broadly classifiable by type and include measures of (i) order and return, (ii) trend, (iii) momentum, and (iv) volatility. Indicators in the main were sourced from the Technical Analysis Library in Python. Certain indicators were constructed, however, to leverage the data structure of the limit order book; examples include the number of cancelled orders, volume oscillators, and instant volatility. Source literature is available on request from the authors. [Detail to be added re lookback, etc.]
Feature Engineering
Timestamp Engineering
The raw data is inconsistent in time, arriving at random intervals ranging from 0 to 500 milliseconds between observations. To create a consistent series, we tried two different approaches.
Approach 1: Systematic Time Sampling at 20 Milliseconds
The first approach was to sample the data every 20 milliseconds, recording the data point closest to each 20-millisecond mark. If no data point fell within the window, we copied the last known row of data forward. As a final step, we measured the price difference between the two time steps. To recap, we use the data from the previous timestamp as our X, and the price difference 20 ms later, at the current timestamp, as our Y. The data took the form of a generator that could be fed directly into Keras's neural network framework.
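A minimal sketch of this sampling and generator logic, assuming the raw ticks sit in a pandas DataFrame (here called `ticks`) indexed by timestamp; the column names, batch size, and target definition are illustrative rather than the exact pipeline we used:

```python
import pandas as pd

def sample_20ms(ticks: pd.DataFrame) -> pd.DataFrame:
    """Snap irregular ticks onto a 20 ms grid; where no tick falls in a
    window, carry the last known row forward."""
    grid = ticks.resample("20ms").last().ffill()
    # Target: price change over the following 20 ms step.
    grid["y"] = grid["mid_price"].shift(-1) - grid["mid_price"]
    return grid.dropna()

def batch_generator(frame: pd.DataFrame, batch_size: int = 256):
    """Yield (X, y) batches that Keras can consume directly."""
    X = frame.drop(columns=["y"]).to_numpy(dtype="float32")
    y = frame["y"].to_numpy(dtype="float32")
    while True:  # Keras expects a generator that loops indefinitely
        for start in range(0, len(frame) - batch_size, batch_size):
            yield X[start:start + batch_size], y[start:start + batch_size]
```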
However, we soon realized this approach did not perform well in any of the three models: at this horizon the price changes were too small for the scikit-learn and Keras models to learn from effectively, and most of the time there was no noticeable change in price at all. This made the models consistently predict a price change of zero, biasing our predictions.
Approach 2: One-Minute Time Aggregation
In the next iteration of our project, we decided to aggregate the data over one-minute intervals and, depending on the type of column, either average or sum the data as each signal calculation demands. Since the dataset size was now reduced from a few million rows for one month of data to around 22,000 data points, we saved the resulting dataset to a CSV file for convenience and general testing purposes. We also changed our Y from a regression to a classification problem; instead of trying to predict the magnitude of the change, we now only try to predict the direction of Bitcoin's price.
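A minimal sketch of this aggregation step, continuing from the same `ticks` DataFrame as above; the column-to-aggregation mapping and file name are illustrative assumptions:

```python
import numpy as np

# Sum quantities such as traded volume; average quotes and prices.
# The exact mapping depends on what each signal calculation demands.
agg_rules = {"volume": "sum", "trade_count": "sum",
             "bid_price": "mean", "ask_price": "mean", "mid_price": "mean"}

bars = ticks.resample("1min").agg(agg_rules).ffill()

# Classification target: the direction of the next one-minute move.
move = bars["mid_price"].shift(-1) - bars["mid_price"]
bars["y"] = np.sign(move)        # -1 = sell, 0 = nothing, +1 = buy
bars = bars.dropna()

bars.to_csv("deribit_1min.csv")  # roughly 22,000 rows for one month of data
```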
Signal Engineering
Here are some of the signals we use in our program; a sketch of how a few of them can be computed follows the list.
Volume
- Accumulation/Distribution Index (ADI)
- On-Balance Volume (OBV)
- Chaikin Money Flow (CMF)
- Force Index (FI)
- Ease of Movement (EoM, EMV)
- Volume-price Trend (VPT)
- Negative Volume Index (NVI)
Volatility
- Instant Volatility (IV)
- Average True Range (ATR)
- Bollinger Bands (BB)
- Keltner Channel (KC)
- Donchian Channel (DC)
Trend
- Moving Average Convergence Divergence (MACD)
- Average Directional Movement Index (ADX)
- Vortex Indicator (VI)
- Trix (TRIX)
- Mass Index (MI)
- Commodity Channel Index (CCI)
- Detrended Price Oscillator (DPO)
- KST Oscillator (KST)
- Ichimoku Kinkō Hyō (Ichimoku)
Momentum
- Money Flow Index (MFI)
- Relative Strength Index (RSI)
- True strength index (TSI)
- Ultimate Oscillator (UO)
- Stochastic Oscillator (SR)
- Williams %R (WR)
- Awesome Oscillator (AO)
- Kaufman's Adaptive Moving Average (KAMA)
Others
- Cancel Signal (CS)
- 15 Interval Return (15IR)
- Daily Return (DR)
- Daily Log Return (DLR)
- Cumulative Return (CR)
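Most of the indicators above come from the Technical Analysis Library (`ta`) mentioned earlier. Below is a sketch of how a few of them could be computed on the one-minute bars; parameter names such as `window` vary between `ta` versions, and the custom limit-order-book signals are shown only as simple illustrative stand-ins.

```python
import pandas as pd
import ta  # Technical Analysis Library in Python

bars = pd.read_csv("deribit_1min.csv", parse_dates=["timestamp"],
                   index_col="timestamp")

# Library-provided indicators (a small subset of the fifty signals).
bars["rsi"] = ta.momentum.RSIIndicator(close=bars["mid_price"], window=14).rsi()
bars["macd"] = ta.trend.MACD(close=bars["mid_price"]).macd()
bb = ta.volatility.BollingerBands(close=bars["mid_price"], window=20, window_dev=2)
bars["bb_high"], bars["bb_low"] = bb.bollinger_hband(), bb.bollinger_lband()
bars["obv"] = ta.volume.OnBalanceVolumeIndicator(
    close=bars["mid_price"], volume=bars["volume"]).on_balance_volume()

# Custom order-book signals are derived straight from the aggregated columns,
# e.g. a rolling estimate of instant volatility from one-minute returns.
bars["instant_vol"] = bars["mid_price"].pct_change().rolling(15).std()
```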
Machine Learning Modeling
Step 1: Testing Individual Models
Base Model: Logistic Regression Classification
For classification problems, Logistic Regression is a good baseline model against which to compare performance. However, it has some limitations:
- It requires the observations to be independent of each other. In a continuous time-series, this is almost never the case.
- It demands little to no multicollinearity among the independent variables. Most of our signals, for example common moving-average or momentum signals, are derived from the same underlying variables such as bid price, ask price, or volume, so these features are not completely independent.
- The independent variables need to be linearly related to the log odds.
Since our dataset breaks many of these assumptions, we do not expect this model to perform well. We tuned its hyperparameters using Bayesian optimization; a sketch of a comparable setup is shown below, followed by the model accuracy and the predicted versus actual class counts.
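One way such Bayesian tuning could be set up is with scikit-optimize's `BayesSearchCV`; the search space, scaling, and variable names (`X_train`, `y_train`, and so on) are illustrative assumptions rather than our exact configuration.

```python
from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_train holds the 50 one-minute signals; y_train the direction labels.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

search = BayesSearchCV(
    pipe,
    {"logisticregression__C": Real(1e-3, 1e3, prior="log-uniform")},
    n_iter=30, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```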
Logistic Regression scores over different time periods: 0.453 - 0.6496
Here is one of the outputs of the Logistic Regression model.
Y_Pred:
- Sell: 12,348
- Nothing: 5
- Buy: 9,925
Y_Actual:
- Sell: 15551
- Nothing: 773
- Buy: 5954
As the results show, the LR model was very inconsistent and, most of the time, only barely better than a coin flip. We therefore decided to move to a model that can handle the non-linearity of the data and the complexity of its signals.
Advanced Base Model: Support Vector Machine Classification
The next algorithm we tried was the SVM, a non-linear, non-parametric classification technique. SVM employs the kernel trick and the maximal-margin concept, so in theory it performs much better on non-linear and high-dimensional tasks. However, there is "no free lunch" when using SVM.
Benefits of SVM:
- Flexibility: Capable of fitting a large number of functional forms.
- Power: No assumptions or weak assumptions about the underlying function.
- Performance on complex problems: Can result in higher-performing models for prediction.
Limitations:
- Time consuming: Finding the right kernel function is not easy.
- Slower: Kernelized SVMs require computing a distance function between points in the dataset, which is the dominating cost of O(n_features × n_observations²), resulting in long training times for large datasets.
- Overfitting: There is a greater risk of over-fitting the training data, and it is harder to explain why specific predictions are made.
We again tuned hyperparameters using Bayesian optimization; a sketch of a comparable setup is shown below, followed by the SVM classifier's accuracy and its predicted versus actual class counts.
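The same style of Bayesian search can be applied to a kernelized SVM; again, the search space and variable names are illustrative assumptions.

```python
from skopt import BayesSearchCV
from skopt.space import Categorical, Real
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Kernelized SVM: training cost scales roughly with n_features x n_observations^2,
# which drives the multi-hour training times noted below.
pipe = make_pipeline(StandardScaler(), SVC())

search = BayesSearchCV(
    pipe,
    {"svc__C": Real(1e-2, 1e2, prior="log-uniform"),
     "svc__gamma": Real(1e-4, 1e0, prior="log-uniform"),
     "svc__kernel": Categorical(["rbf", "poly"])},
    n_iter=25, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
```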
SVM scores over different time periods: 0.513 - 0.639
Here is one of the outputs for the SVM Model.
Y_Pred:
- Sell: 19,904
- Nothing: 2,345
- Buy: 29
Y_Actual:
- Sell: 15551
- Nothing: 773
- Buy: 5954
The SVM consistently does much better than the LR model, which is not a surprise. However, as the dataset size grew, training time increased sharply, making testing of different models very impractical: each model took around 4 to 8 hours to train, depending on the kernel type. Because of these long training times, we decided to use a neural network to further increase performance and handle a large amount of data.
Super Advanced Model: Long Short-Term Memory Neural Network
While the models thus far achieved some level of accuracy, they are crucially inadequate for time-series data. We needed a model that could accommodate time, as well as the non-linearity and size of our data. Given these constraints, we decided to use a Long Short-Term Memory (LSTM) neural network. It is a variant of the Recurrent Neural Network (RNN), which feeds previous states into the current one and thus accommodates the timestep nature of our data. One of the primary concerns with an RNN is that its weight matrices change too quickly from state to state, losing some of the underlying meaning beyond a certain horizon. An LSTM addresses this deficiency by implementing a ‘permanent memory cell’, whose weight matrix is independent of each layer's matrix.
We then added a dropout layer, using the standard value of 0.2, to discourage overfitting on our training data. A dropout layer omits the desired portion of neurons at random during training. This kind of technique is called regularization and is one way of managing the bias-variance tradeoff. We also used l1 regularization on our LSTM layer, which drives the weight matrices towards 0, resulting in the omission of certain variables. Given that markets change regimes at unknown intervals, we wanted to make sure that our model could handle a test set with entirely different conditions.
Our final layer was a Dense layer with a softmax activation function. Softmax takes a vector and converts it into a probability distribution, which allows us to interpret the probability of the next direction. We then compiled the model with categorical cross-entropy, which measures and minimizes the log loss between prediction and real output.
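A minimal sketch of this architecture in Keras; the layer width, lookback window, and optimizer are illustrative choices rather than the exact configuration we settled on.

```python
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l1

n_timesteps, n_features = 30, 50   # illustrative lookback and signal count

model = Sequential([
    # LSTM layer; its cell state carries information across timesteps, and
    # l1 regularization drives the weight matrices towards zero.
    LSTM(64, input_shape=(n_timesteps, n_features), kernel_regularizer=l1(1e-4)),
    # Randomly drop 20% of the units during training to discourage overfitting.
    Dropout(0.2),
    # Probability distribution over the three directions: sell, hold, buy.
    Dense(3, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```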
We then began training the model. After some tuning, we realized something critical about our data: accurately predicting direction (up or down) was more important than predicting that the price would stay the same between two periods. We therefore weighted our classes of [sell, hold, buy] as [1, 0.2, 1], drastically reducing the importance of accurately predicting hold. An improvement upon this weighting would be to scale it with the volatility of the asset class being traded.
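In Keras this weighting can be passed through the `class_weight` argument of `fit`; the class-index mapping, array names, and training settings below are assumptions for illustration.

```python
# Assumed class indices: 0 = sell, 1 = hold/nothing, 2 = buy.
class_weight = {0: 1.0, 1: 0.2, 2: 1.0}

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50, batch_size=64,
                    class_weight=class_weight)
```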
When it came to model tuning, we noticed that validation loss increased with time while validation accuracy dropped. This is symptomatic of an insufficiently large dataset; expanding it is one of the next steps in our project. Our model converged on the training and validation datasets rather quickly (under 50 epochs), and we obtained 74% average accuracy across a number of runs. We want to note that although this accuracy is high, there are still many steps to take before this black box can be used effectively in execution.
Future Work
- Build in additional features and adjust model parameters to capture the impact of exogenous factors and develop model intuition.
- Evaluate liquidity characteristics by exchange, including the impact of funding, via a thorough analysis of exchange funding algorithms.
- Develop a multi-exchange framework to capture potential arbitrage opportunities.
- Evaluate the nuances of optimizing loss versus accuracy.
- Build additional models including negative affirmation trading, price magnitude sensitivity, and intra-market cross-currency correlation models.