Event-Driven Stock Prediction

Posted on Apr 5, 2017


Contributed by:

Scott Edenbaum and Xu Gao

 

Project Topic

We conducted research into machine learning techniques for financial modeling. We found that the following deep learning techniques are widely used in finance: shallow factor models, default probabilities, and event studies. As a result, we decided to follow the nontraditional route of analyzing news content to predict the price movement of a stock. With regard to news content, one preliminary finding is that many news titles do not contain impactful information about a company; rather, they are written to attract readers and increase page views and ad revenue.

 

Preliminary Research

Our project is based on "Deep Learning for Event-Driven Stock Prediction" by Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. In their research, they use a neural tensor network to transform word embeddings of news headlines into event embeddings, and a convolutional neural network to predict the price trend over one day, week, or month.

Data Hurdles

Finding consistent news data was surprisingly difficult, mostly due to our need for ~5-10 years of historical data to train our model. In addition, many sites had anti-scraping measures in place to trip up web scraping programs, such as CAPTCHA requests and inconsistent XPath structure.

Data Sources

For the news data, we were able to scrape news flow dating back to ~2008 from SeekingAlpha by using a combination of Selenium, to 'scroll down' and load new web content, and BeautifulSoup, to grab the content from the webpage. To avoid having to generate "split-adjusted" pricing data and keep track of ticker changes ourselves, we gathered our pricing data from a Bloomberg terminal. For the sake of simplicity, we chose to make the naive assumption that a stock's open price is equal to the previous day's close price, so we only needed to keep track of the close price for our model.
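
Under the open-equals-previous-close assumption, a daily up/down label depends only on consecutive closing prices. The following is a minimal sketch of that idea, not the project's actual code: the function names and the sample prices are made up for illustration, and labeling flat days as "down" is an arbitrary choice.

```python
def direction_labels(closes):
    """Return 1 if the close rose versus the previous close, else 0.

    Assumes open[t] == close[t-1], so close-to-close moves stand in
    for the full daily move. Flat days are labeled 0 (arbitrary choice).
    """
    return [1 if today > prev else 0 for prev, today in zip(closes, closes[1:])]


def daily_returns(closes):
    """Simple close-to-close returns; one fewer element than `closes`."""
    return [(today - prev) / prev for prev, today in zip(closes, closes[1:])]


# Made-up closing prices, for illustration only.
closes = [10.0, 10.5, 10.5, 10.0, 11.0]
labels = direction_labels(closes)
returns = daily_returns(closes)
```

Each label lines up with the day on which the move completes, so any news features would need to be dated no later than the previous close to avoid look-ahead.
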

Data Pre-processing

The Java tool ReVerb was an essential ingredient in our model's recipe. ReVerb extracts binary relations (two arguments linked by a relation phrase) from sentences and normalizes the words to a single form. For example, the words "is," "was," and "be" are all transformed into "be," allowing for a much more robust and less fragmented analysis of the news content.

ReVerb Output

The output from ReVerb includes a confidence score, such as 0.8225 in the above example. This score is ReVerb's estimate of how likely the extraction is to be correct; it does not measure the impact the news will have on the stock.
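
When a news item yields several extractions, this score can be used to keep just one of them, which is what the dataset description later in this post does for news with multiple sub-headlines. The sketch below assumes the ReVerb output has already been parsed into simple records; the `Extraction` type, its field names, and the sample values are hypothetical and much simpler than ReVerb's real output.

```python
from typing import NamedTuple, Optional, Sequence


class Extraction(NamedTuple):
    """One ReVerb-style extraction (hypothetical, simplified record)."""
    confidence: float
    arg1: str
    relation: str
    arg2: str


def best_extraction(extractions: Sequence[Extraction]) -> Optional[Extraction]:
    """Return the extraction with the highest score, or None if there are none."""
    return max(extractions, key=lambda e: e.confidence, default=None)


# Made-up extractions for a single news item.
candidates = [
    Extraction(0.41, "analysts", "be cautious about", "the sector"),
    Extraction(0.82, "the company", "raise", "its dividend"),
]
best = best_extraction(candidates)
```
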

Data Transformation

We used a variety of models to transform our text into an input for the neural network. Although Doc2Vec performed rather well, there was no clear 'winner' across all the stocks in our analysis.

Model Development

Our basic model is a neural network with two hidden layers and a binary output layer. After tuning the parameters, we found the best performance with hidden layers of 300 and 50 nodes, respectively, both using hyperbolic tangent activation functions.
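
To make the architecture concrete, here is a rough forward-pass sketch of a 300/50 tanh network in plain Python. It is illustrative only: the weights are random and untrained, the input width `n_features` and the sigmoid output unit are assumptions (the post only says the output is binary), and a real model would be built and trained with a deep learning library.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation):
    """Apply one dense layer: activation(W @ x + b)."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_up_probability(x, layers):
    """Forward pass: tanh -> tanh -> sigmoid (assumed output activation)."""
    h1 = dense(x, layers[0], math.tanh)   # 300 hidden nodes
    h2 = dense(h1, layers[1], math.tanh)  # 50 hidden nodes
    return dense(h2, layers[2], sigmoid)[0]


rng = random.Random(0)
n_features = 20  # assumed input width; the post does not state it
layers = [init_layer(n_features, 300, rng),
          init_layer(300, 50, rng),
          init_layer(50, 1, rng)]
x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
p_up = predict_up_probability(x, layers)
```

A value of `p_up` above 0.5 would be read as "predict up" and anything else as "predict down".
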

Dataset Development

In event-driven research, usually only a portion of the news events have an impact on a given stock's return, so we use that stock's market data as a baseline dataset. By adding new factors derived from news content, we get new datasets on which to build a prediction model.

  • Dataset 1: Market Data
    • Stock: 5-day lag time series, S&P 500, NASDAQ Composite, and NYSE Volume
  • Dataset 2: Dataset 1 + Sentiment Polarity (i.e., positive, neutral, negative)
    • Sentiment polarity is generated with the TextBlob package in Python
  • Dataset 3: Dataset 1 + Word Embeddings (ReVerb)
    • For news with multiple sub-headlines, we chose the sentence with the highest confidence level
      • Subject, Object, and Verb "tuple" form word embeddings
  • Dataset 4: Dataset 1 + Word Embeddings (all news content)
    • Entire news content is converted to a single vector using Doc2Vec in the Gensim package
      • All information is preserved, unlike in Dataset 3
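
As a rough illustration of how Datasets 1 and 2 could be assembled, the sketch below builds 5-day lag rows from a single series and buckets a polarity score into the three sentiment classes. Both helpers are hypothetical: the reading of "5-day lag" as "the previous five daily values" and the `threshold` cutoff are assumptions, and the polarity score is taken to be a float in [-1, 1] such as the one TextBlob exposes as `sentiment.polarity`.

```python
def lag_features(series, n_lags=5):
    """Build rows of the previous `n_lags` values for each predictable day.

    Row i holds series[i : i + n_lags]; its target is series[i + n_lags].
    """
    rows, targets = [], []
    for i in range(len(series) - n_lags):
        rows.append(series[i:i + n_lags])
        targets.append(series[i + n_lags])
    return rows, targets


def polarity_label(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    The threshold is an assumed cutoff, not a value from the post.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Made-up daily values, for illustration only.
closes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
rows, targets = lag_features(closes)
```

The same `lag_features` helper could be applied to each market series (the stock itself, the indices, and volume) and the resulting columns concatenated, with the sentiment label appended for Dataset 2.
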

Results 1

Our results were fairly consistent across the 4 models (note that "Doc2Vec ALL" includes news content from all stocks). Not too surprisingly, the inclusion of 'excessive' and unrelated information negatively impacts model performance.

 

Results:

  • Aggressive Strategy:
    • Predict up -> long 1 unit
    • Predict down -> short 1 unit
  • Protective Strategy:
    • Predict up -> long 1 unit
    • Predict down -> close position (sell long)
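
The two strategies above can be sketched as a simple backtest loop. This is a simplified illustration rather than the project's evaluation code: the function name, the 1/0 encoding of predictions, and the sample prices are assumptions, and transaction costs and slippage are ignored.

```python
def strategy_pnl(closes, predictions, strategy):
    """Cumulative P&L for one unit traded on daily up/down predictions.

    closes:      closing prices, length n + 1
    predictions: n values, 1 = predict up, 0 = predict down, for the move
                 from closes[t] to closes[t + 1]
    strategy:    "aggressive" (long on up, short on down) or
                 "protective" (long on up, flat on down)
    """
    if strategy not in ("aggressive", "protective"):
        raise ValueError(f"unknown strategy: {strategy}")
    if len(predictions) != len(closes) - 1:
        raise ValueError("need exactly one prediction per price change")

    pnl = 0.0
    for t, pred in enumerate(predictions):
        change = closes[t + 1] - closes[t]
        if pred == 1:
            position = 1    # long 1 unit
        elif strategy == "aggressive":
            position = -1   # short 1 unit
        else:
            position = 0    # position closed
        pnl += position * change
    return pnl


# Made-up prices and predictions, for illustration only.
closes = [100.0, 102.0, 101.0, 103.0]
predictions = [1, 0, 0]
aggressive = strategy_pnl(closes, predictions, "aggressive")
protective = strategy_pnl(closes, predictions, "protective")
```

In this toy example the last "down" call is wrong, which costs the aggressive strategy 2.0 on its short while the protective strategy simply sits out the move.
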

AA results

Our prediction matched the actual stock movement for AA very closely.

AA trading

BAC trading

BAC results

AAPL trading

AAPL results

Our model's predictions for AAPL were weaker than for some of the other stocks, perhaps because of the much higher frequency of news content generated for AAPL.

Conclusions

The main conclusion from our project is that, even with a far-from-ideal news source, there is an enormous amount of content available that can assist in accurately modeling stock price movement without any fundamental or technical analysis. I believe that a professional news source (such as Reuters or Bloomberg), coupled with corporate actions calendars (dividends, splits, earnings releases, etc.), would lead to significant improvements over our current model.

 

Complete PDF Presentation Slides: EDSP-presentation

GitHub code

About Authors

Scott Edenbaum

Scott Edenbaum is a recent graduate from the NYC Data Science Academy. He was hired by the Academy to assist in buildout of the learning management system and seeks to pursue a career as a Data Scientist. Scott's...

Xu

Xu is a Master of Financial Engineering student at New York University. He received a Bachelor of Economics from the University of International Business and Economics. Xu has solid experience with machine learning and pair trading systems. Besides, he...
