Scraping StockTwits for Sentiment Analysis
Contributed by Kyle Szela. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place from April 11th to July 1st, 2016. This post is based on his third class project, web scraping (due in the 6th week of the program).
Introduction:
For those who don't know, StockTwits is a platform similar to Twitter, except for stock traders. It has similar restrictions on messages, although one key difference is that traders can tag their Twits as "Bearish" or "Bullish" to convey their opinion that the stock is going to fall or rise soon, respectively. I set out to take these Twits and analyze them against various other indicators from the market. I wanted to see whether Twit sentiment and Bearish/Bullish tagging showed any pattern of similarity with the movement of the implied volatility of options and with the stock price itself. I was also able to procure news sentiment analysis data from Quandl. The particular stock that I chose for this analysis is AAPL (Apple, Inc.).
Scraping & Saving:
In order to get the Twit data, I needed to scrape the website. StockTwits has two versions of its API: one that gives you only the most basic data for the last 30 Twits, which excludes the Bearish and Bullish tagging, and another that includes the tagging but is only available to developers. Since I was not able to acquire developer status for StockTwits, scraping was the only option. I ended up writing a small Python script to scrape the 15 most recent Twits regarding AAPL. The script also runs sentiment analysis through TextBlob, which returns a polarity value between -1 (negative) and 1 (positive) for each Twit. The script is run 4 times every 10 minutes so that it captures as many of the Twits as possible. Each time it encounters a Twit, it runs the above analysis and then saves the Twit object to a Parse cloud database. Before saving, though, the TwitId is checked against all other Twits in the database to make sure that no Twit is saved twice. (A Parse cloud code script continually erases Twits that are older than 24 hours.)
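The score-then-deduplicate step described above can be sketched roughly as follows. This is not the original script: the function names, the dictionary keys ("id", "body") and the in-memory set standing in for the Parse lookup are all assumptions made for illustration. The only library call used is TextBlob's `TextBlob(text).sentiment.polarity`, which is imported lazily so that a different scorer can be swapped in.

```python
from typing import Callable, Iterable, List, Set


def textblob_polarity(text: str) -> float:
    """Score text in [-1, 1] with TextBlob (needs `pip install textblob`)."""
    from textblob import TextBlob  # lazy import: only required for this scorer
    return TextBlob(text).sentiment.polarity


def process_twits(
    scraped: Iterable[dict],
    seen_ids: Set[int],
    score: Callable[[str], float] = textblob_polarity,
) -> List[dict]:
    """Return only Twits not seen before, each annotated with a polarity.

    `seen_ids` is a hypothetical stand-in for the TwitId check against the
    database; it is updated in place so that repeats are skipped.
    """
    new_twits = []
    for twit in scraped:
        if twit["id"] in seen_ids:
            continue  # already saved on an earlier run
        seen_ids.add(twit["id"])
        new_twits.append({**twit, "polarity": score(twit["body"])})
    return new_twits
```

Because the script runs several times per 10 minutes and each run sees the 15 most recent Twits, most Twits will be encountered more than once, which is why the ID check happens before anything is saved.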
AAPL's stock data is also collected, by scraping Yahoo Finance. The news sentiment data and the implied volatility data are both obtained through the Quandl API.
Once a Twit is saved to the cloud database, two additional objects need to be updated. The first is a simple Tally object that I created in order to collect the Twits from the last hour. Every time a new Twit is added, its polarity and its Bearish or Bullish tag are added to the running totals in the Tally object. Then, at the end of every hour, a new Tally object is created and the previous Tally object's data is added to the DailyAverage object. The DailyAverage object does much the same as the Tally object, just over the period of a day. New DailyAverage objects are created, you guessed it, daily, but in such a way that a trading day is defined as running from the beginning of trading on a given day (Open) to the beginning of trading on the next day. So, a DailyAverage object will include some Twits posted outside of trading hours, up until trading begins on the next day.
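As a rough illustration of the Tally and DailyAverage bookkeeping, here is a minimal sketch. The class and field names are hypothetical, and the 9:30 market open is an assumption (the post does not state the time or time zone used); timestamps are taken to be naive datetimes in exchange-local time.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta
from typing import Optional

MARKET_OPEN = time(9, 30)  # assumption: regular US market open, local time


@dataclass
class Tally:
    """Running totals for a window of Twits (an hour or a trading day)."""
    count: int = 0
    polarity_sum: float = 0.0
    bullish: int = 0
    bearish: int = 0

    def add(self, polarity: float, tag: Optional[str]) -> None:
        """Fold one Twit into the totals."""
        self.count += 1
        self.polarity_sum += polarity
        if tag == "Bullish":
            self.bullish += 1
        elif tag == "Bearish":
            self.bearish += 1

    def merge(self, other: "Tally") -> None:
        """Fold a finished hourly tally into a daily one."""
        self.count += other.count
        self.polarity_sum += other.polarity_sum
        self.bullish += other.bullish
        self.bearish += other.bearish

    @property
    def mean_polarity(self) -> float:
        return self.polarity_sum / self.count if self.count else 0.0


def trading_day(ts: datetime, market_open: time = MARKET_OPEN) -> date:
    """Map a timestamp to the open-to-next-open day it belongs to."""
    if ts.time() >= market_open:
        return ts.date()
    return ts.date() - timedelta(days=1)
```

With this definition, a Twit posted at 09:29 is counted toward the previous day's DailyAverage, while one posted at 09:30 starts the new day. Weekends and holidays are not handled here.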
Every day, yet another object, the Daily object, is created. It aggregates the last 230 days of trading and matches up the news sentiment data, implied volatility data, and stock data by date. Days on which there was no trading are rolled into the previous day. This Python script is also run on a Heroku server.
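One way the date matching could look is sketched below. The post does not say how values from non-trading days are combined with the previous day, so averaging is an assumption here, as are the function names and the dictionary-based inputs. Trading days are inferred from the dates that have a closing price.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, List


def roll_to_trading_days(
    values: Dict[date, float], trading_days: List[date]
) -> Dict[date, float]:
    """Assign each value to the latest trading day on or before its date."""
    days = sorted(trading_days)
    buckets = defaultdict(list)
    for day, value in values.items():
        idx = bisect_right(days, day) - 1
        if idx < 0:
            continue  # earlier than the first known trading day
        buckets[days[idx]].append(value)
    # Assumption: rolled-in values are averaged with the trading day's own.
    return {day: mean(vals) for day, vals in buckets.items()}


def build_daily(
    closes: Dict[date, float],
    news: Dict[date, float],
    implied_vol: Dict[date, float],
    window: int = 230,
) -> List[dict]:
    """Join the three series by date over the last `window` trading days."""
    all_days = sorted(closes)
    news_rolled = roll_to_trading_days(news, all_days)
    iv_rolled = roll_to_trading_days(implied_vol, all_days)
    return [
        {
            "date": day,
            "close": closes[day],
            "news_sentiment": news_rolled.get(day),
            "implied_vol": iv_rolled.get(day),
        }
        for day in all_days[-window:]
    ]
```

For example, news sentiment values dated on a Saturday and Sunday end up in the preceding Friday's row, and a trading day with no news or volatility value simply gets `None` for that field.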
Lastly, every hour, the last 700 Twits in the database are taken and analyzed for word frequency. The analysis covers four groups: Bearish Twits, Bullish Twits, positive Twits, and negative Twits. Each time it runs, a new object is created in the Parse database that holds the frequency information for the top 50 words in each group. This Python script is run on a Heroku server.
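A minimal sketch of the per-group word count, using only the standard library, might look like this. The tokenizer (lowercase runs of letters and apostrophes), the sign-of-polarity rule for positive and negative Twits, and the field names are assumptions; the original script may well have filtered stop words or tokenized differently.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

WORD_RE = re.compile(r"[a-z']+")  # assumption: a deliberately simple tokenizer


def top_words(
    twits: Iterable[dict], n: int = 50
) -> Dict[str, List[Tuple[str, int]]]:
    """Count words per group and keep the `n` most frequent in each."""
    groups = {
        "bullish": Counter(),
        "bearish": Counter(),
        "positive": Counter(),
        "negative": Counter(),
    }
    for twit in twits:
        words = WORD_RE.findall(twit["body"].lower())
        # A Twit can fall into a tag group and a polarity group at once.
        if twit.get("tag") == "Bullish":
            groups["bullish"].update(words)
        elif twit.get("tag") == "Bearish":
            groups["bearish"].update(words)
        if twit["polarity"] > 0:
            groups["positive"].update(words)
        elif twit["polarity"] < 0:
            groups["negative"].update(words)
    return {name: counts.most_common(n) for name, counts in groups.items()}
```

The resulting lists of (word, count) pairs are the kind of data a word cloud needs, one list per group.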
Results:
In order to show the results graphically, I made a Shiny app that talks to the Parse cloud database through HTTP requests and retrieves the word frequency object as well as the Daily object. The first tab, shown below, plots the news sentiment data against the implied volatility data and the daily stock closes. Unfortunately, there aren't many discernible trends across the three types of data. In the future, I would like to obtain more of the Twit data for sentiment and Bearish/Bullish tagging.
I also displayed the data that I was able to collect from scraping the Twits:
And the hourly variation of different Twit metrics:
And lastly, the different word clouds from the four mentioned groups. For a given day, there aren't usually many Bearish Twits, and since the Twits themselves are restricted to a few words, the corresponding word cloud is somewhat sparse:
Summary:
In conclusion, I'd really have liked to obtain more Twit data. At the time of finishing the project, I had only been able to collect about a week's worth of Twit data, and I don't believe that was sufficient to establish any observable trends.