Headline Sentiment Distribution and Web Scraping
Project GitHub | LinkedIn: Niki Moritz Hao-Wei Matthew Oren
The skills we demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
My web scraping project explored the distribution of headline sentiment by news source. To do this, I scraped the Nasdaq latest market headlines page and applied sentiment analysis to the retrieved text. It should be noted that I only scraped one web page, but this page aggregates headlines from multiple sources. I wanted to see whether the language used by different news sources varies significantly as this information could be relevant to baseline sentiment scores in a trading strategy.
I used the Scrapy web crawling framework in Python to scrape the headlines. Because the page I scraped only contains recent business headlines, I knew I needed to scrape repeatedly for several days in order to obtain sufficient data. To expedite this process, I created a Python wrapper script to do the actual scraping and time stamp the resulting CSV file. After a week, I read all the data into a single Pandas DataFrame and conduced sentiment analysis with the TextBlob library. The subsequent data visualization and analysis was done with ggplot in R.
The sentiment analysis algorithm in TextBlob returns two metrics for each headline: polarity and subjectivity. Polarity quantifies the emotion, positive or negative in a sentence. A polarity score of 1 signifies a very positive headline where as -1 corresponds to a very negative headline. Subjectivity quantifies the degree to which a headline expresses a personal feeling or opinion. For example, “I don’t like spicy food” is very subjective whereas “The cat is brown” is not.
The charts below show distributions for these metrics for six of the most common new sources in my scraped data. Clearly there are differences in the distributions. The sources “MT Newswires” and “InvestorPlace Media” are perhaps at the opposite ends of the spectrum. “MT Newswires” has almost entirely neutral headlines and no subjectivity. On the other hand, “InvestorPlace Media” frequently publishes headlines with very positive polarity scores and high subjectivity.
I was not familiar with either “InvestorPlace Media” nor “MT Newswires” prior to this project. However, my first impressions of these news sources based on their respective websites aligns with the distributions that I found. “InvestorPlace Media” has a more sensationalist and “click bait-y” feel than “MT Newswires” which gives a somewhat academic impression. Additional data and domain knowledge would be required to truly draw a conclusion, but I was encouraged by this initial result.
As touched upon above, this project also lays the foundations for more ambitious analysis. For example, Zhang & Skiena describe and demonstrate the success of a sentiment analysis-based trading strategy in this paper. To be clear, Zhang & Skiena work with years of data and draw upon a far greater variety of sources (for example twitter and blogs) in their paper than I have done. That said, their work suggests potential business uses for the type of scraping done in this project.
The project code can be found here on Github.