Scraping Stocktwits for Sentiment Analysis

Kyle Szela
Posted on Jun 5, 2016

Contributed by Kyle Szela. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from April 11th to July 1st, 2016. This post is based on his third class project, web scraping (due in the 6th week of the program).


For those who don't know, Stocktwits is a platform similar to Twitter, except for stock traders.  It has similar restrictions on messages, although one key difference is that traders can tag their Twits as "Bearish" or "Bullish" to convey their opinion that the stock is going to fall or rise soon, respectively.  I set out to take these Twits and analyze them against various other indicators from the market.  I wanted to see whether Twit sentiment and Bearish/Bullish tagging showed any pattern of similarity with the movement of the implied volatility of options and of the stock price itself.  I was also able to procure news sentiment analysis data from Quandl.  The particular stock that I chose for this analysis is AAPL (Apple, Inc.).

Scraping & Saving:

In order to get the Twit data, I needed to scrape the website.  StockTwits has two versions of its API: one that gives you the most basic data for the last 30 Twits, which excludes the Bearish and Bullish tagging, and another that includes the tagging but is only available to developers.  Since I was not able to acquire developer status for StockTwits, scraping was the only option.  I ended up writing a small Python script to scrape the 15 most recent Twits regarding AAPL.  The script also runs sentiment analysis through TextBlob, which returns a value between -1 and 1 for the negativity or positivity of each Twit.  The script is run 4 times every 10 minutes so that it captures as many of the Twits as possible.  Each time it encounters a Twit, it runs the above analysis and then saves the Twit object to a Parse cloud database.  Before saving, though, the TwitId is checked against all other Twits in the database to make sure that no Twit is saved twice.  (A Parse cloud code script continuously erases Twits older than 24 hours, so this check only has to cover the last day.)
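
The check-then-score step described above can be sketched roughly as follows. This is a minimal illustration rather than the original script: the `Twit` record, the `process_batch` helper, and the in-memory `seen_ids` set (standing in for the Parse lookup) are hypothetical names, and the scorer is passed in as a function so that TextBlob's `TextBlob(text).sentiment.polarity` could be plugged in where the toy scorer is used here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class Twit:
    """Hypothetical record for one scraped message."""
    twit_id: int
    body: str
    tag: Optional[str] = None   # "Bullish", "Bearish", or None if untagged
    polarity: float = 0.0       # filled in by the scorer, in [-1, 1]


def process_batch(
    scraped: Iterable[Twit],
    seen_ids: Set[int],
    score: Callable[[str], float],
) -> List[Twit]:
    """Score and return only the Twits whose IDs have not been stored yet.

    `seen_ids` stands in for the database lookup; it is updated in place
    so that overlapping scrapes do not produce duplicates.
    """
    fresh = []
    for twit in scraped:
        if twit.twit_id in seen_ids:
            continue            # already saved by an earlier run
        twit.polarity = score(twit.body)
        seen_ids.add(twit.twit_id)
        fresh.append(twit)
    return fresh


def toy_score(text: str) -> float:
    """Placeholder scorer for this example only (not TextBlob's algorithm).

    With TextBlob installed, the scorer could instead be
    lambda text: TextBlob(text).sentiment.polarity
    """
    words = text.lower().split()
    positive = sum(w in {"great", "good", "up"} for w in words)
    negative = sum(w in {"bad", "weak", "down"} for w in words)
    total = positive + negative
    return 0.0 if total == 0 else (positive - negative) / total


seen: Set[int] = set()
first_run = process_batch(
    [Twit(1, "AAPL looks great", "Bullish"), Twit(2, "weak guidance", "Bearish")],
    seen, toy_score,
)
# The second scrape overlaps with the first; only Twit 3 is new.
second_run = process_batch(
    [Twit(2, "weak guidance", "Bearish"), Twit(3, "holding for now")],
    seen, toy_score,
)
```

Because consecutive scrapes of the 15 most recent Twits overlap heavily, most of each batch is skipped by the ID check and only the new messages are scored and saved.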

Also scraped or pulled from APIs is AAPL's stock data (via Yahoo Finance scraping).  The news sentiment data and the implied volatility data are both obtained through the Quandl API.

Once a Twit is saved to the cloud database, two additional objects need to be updated.  The first is a simple Tally object that I created to collect the Twits from the last hour.  Every time a new Twit is added, its polarity and its Bearish or Bullish tag are added to the current totals in the Tally object.  Then, at the end of every hour, a new Tally object is created and the previous Tally object's data is added to the DailyAverage object.  The DailyAverage object does much the same as the Tally object, just over the period of a day.  New DailyAverage objects are created, you guessed it, daily, but in such a way that a trading day is defined as running from the beginning of trading on a given day (Open) to the beginning of trading on the next day.  So, a DailyAverage object will have some Twits from before trading began on a given day.
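
A rough sketch of the hourly-into-daily roll-up and the open-to-open day boundary is given below. The field names, the `merge` method, and the `window_start` helper are illustrative assumptions rather than the actual Parse schema; the 9:30 opening time is assumed, one class is reused for both the hourly and the daily totals for brevity, and weekends and holidays are ignored.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta
from typing import Optional

MARKET_OPEN = time(9, 30)   # assumed opening time, in exchange-local time


@dataclass
class Tally:
    """Running totals for one period (an hour, or a whole trading day)."""
    count: int = 0
    polarity_sum: float = 0.0
    bullish: int = 0
    bearish: int = 0

    def add_twit(self, polarity: float, tag: Optional[str]) -> None:
        """Add one newly saved Twit to the running totals."""
        self.count += 1
        self.polarity_sum += polarity
        if tag == "Bullish":
            self.bullish += 1
        elif tag == "Bearish":
            self.bearish += 1

    def merge(self, other: "Tally") -> None:
        """Fold a finished hourly tally into this (daily) one."""
        self.count += other.count
        self.polarity_sum += other.polarity_sum
        self.bullish += other.bullish
        self.bearish += other.bearish

    @property
    def mean_polarity(self) -> float:
        return self.polarity_sum / self.count if self.count else 0.0


def window_start(ts: datetime) -> date:
    """Return the date whose open begins the open-to-open window holding ts.

    Labeling a window by its opening date is an assumption made here;
    a Twit posted before the open falls into the previous date's window.
    """
    if ts.time() >= MARKET_OPEN:
        return ts.date()
    return ts.date() - timedelta(days=1)


hourly = Tally()
hourly.add_twit(0.5, "Bullish")
hourly.add_twit(-0.25, "Bearish")
hourly.add_twit(0.0, None)

daily = Tally()
daily.merge(hourly)   # done at the end of each hour
hourly = Tally()      # start a fresh hourly tally
```

Storing sums and counts rather than averages means the hourly totals can be folded into the daily object without reweighting.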

Every day, another object, the Daily object, is created.  It aggregates the last 230 days of trading and matches up the news sentiment data, implied volatility data, and stock data by date.  Days on which there was no trading are rolled into the previous day.  This Python script is also run on a Heroku server.
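
One way the rolling and date matching might look is sketched below. The post does not say how values from non-trading days are combined with the previous day's value, so averaging is an assumption here, as are the function name and the made-up sample numbers.

```python
from bisect import bisect_right
from datetime import date
from statistics import mean
from typing import Dict, Iterable, List


def roll_into_trading_days(
    values: Dict[date, float],
    trading_days: Iterable[date],
) -> Dict[date, float]:
    """Assign each dated value to the latest trading day on or before it.

    Values that land on the same trading day are averaged (an assumption;
    the original script may combine them differently).
    """
    ordered = sorted(trading_days)
    buckets: Dict[date, List[float]] = {d: [] for d in ordered}
    for day, value in values.items():
        idx = bisect_right(ordered, day) - 1   # last trading day <= day
        if idx < 0:
            continue   # earlier than the first trading day; nothing to roll into
        buckets[ordered[idx]].append(value)
    return {d: mean(v) for d, v in buckets.items() if v}


# Made-up values for illustration: Friday through Monday, with a weekend between.
news_sentiment = {
    date(2016, 6, 3): 0.2,
    date(2016, 6, 4): 0.4,    # Saturday, rolled into Friday
    date(2016, 6, 5): 0.6,    # Sunday, rolled into Friday
    date(2016, 6, 6): -0.1,
}
closes = {date(2016, 6, 3): 100.0, date(2016, 6, 6): 101.0}

rolled = roll_into_trading_days(news_sentiment, closes.keys())

# Match the series by date, keeping only days present in both.
rows = [(d, rolled[d], closes[d]) for d in sorted(rolled.keys() & closes.keys())]
```

Here the set of trading days is taken from the dates that have a closing price, which avoids maintaining a separate holiday calendar.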

Lastly, every hour, the last 700 Twits in the database are taken and analyzed for word frequency.  The analysis covers four groups: the Bearish Twits, the Bullish Twits, the positive Twits, and the negative Twits.  Each time it runs, a new object is created in the Parse database that holds the frequency information for the top 50 words in each group.  This Python script is also run on a Heroku server.
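
The per-group counting can be sketched with `collections.Counter`. The tuple layout, the function names, and the rule that "positive" and "negative" mean polarity above and below zero are assumptions for illustration; the sketch also does no stop-word filtering, which the original script may or may not have done.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

GROUPS = ("bullish", "bearish", "positive", "negative")


def groups_for(tag: Optional[str], polarity: float) -> List[str]:
    """Return every group a Twit belongs to (it can be in up to two)."""
    groups = []
    if tag in ("Bullish", "Bearish"):
        groups.append(tag.lower())
    if polarity > 0:
        groups.append("positive")
    elif polarity < 0:
        groups.append("negative")
    return groups


def top_words(
    twits: Iterable[Tuple[str, Optional[str], float]],
    n: int = 50,
) -> Dict[str, List[Tuple[str, int]]]:
    """Count words per group and keep the n most frequent in each.

    Each Twit is a (body, tag, polarity) tuple.
    """
    counters = {g: Counter() for g in GROUPS}
    for body, tag, polarity in twits:
        words = re.findall(r"[a-z']+", body.lower())
        for group in groups_for(tag, polarity):
            counters[group].update(words)
    return {g: c.most_common(n) for g, c in counters.items()}


sample = [
    ("AAPL to the moon", "Bullish", 0.5),
    ("moon soon", "Bullish", 0.0),
    ("selling AAPL", "Bearish", -0.3),
]
freq = top_words(sample, n=2)
```

The resulting dictionary of (word, count) lists is the kind of structure that could be stored as a single object and later turned into one word cloud per group.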


In order to show the results graphically, I made a Shiny app that talks to the Parse cloud database through HTTP requests and fetches the word frequency object as well as the Daily object.  The first tab, shown below, plots the news sentiment data against the implied volatility data and the daily stock closes.  Unfortunately, there aren't many discernible trends across the three types of data.  In the future, I would like to obtain more of the Twit data for sentiment and Bearish/Bullish tagging.

Screen Shot 2016-06-05 at 6.43.28 PM

I also displayed the data that I was able to collect from scraping the Twits:

Screen Shot 2016-06-05 at 6.43.37 PM

I also plotted the hourly variation of different Twit metrics:

Screen Shot 2016-06-05 at 6.43.50 PM

Lastly, the app shows the word clouds for the four groups mentioned above.  For a given day, there usually aren't many Bearish Twits, and since the Twits themselves are restricted to a few words, the corresponding word cloud is somewhat sparse:

Screen Shot 2016-06-05 at 6.43.59 PM

Screen Shot 2016-06-05 at 6.44.09 PM

In conclusion, I would really have liked to obtain more Twit data.  At the time of finishing the project, I had only been able to collect about a week's worth of Twit data, and I don't believe that was sufficient to establish any observable trends.

About Author

Kyle Szela

A recent graduate of Northwestern University with a B.S. in Computer Science, Kyle has a strong background in computer engineering and programming concepts. His previous work and academic studies cover a panoply of topics including machine learning, artificial...