Distribution of Headline Sentiment

Graeme Keleher
Posted on Jul 15, 2019

My web scraping project explored the distribution of headline sentiment by news source. To do this, I scraped the Nasdaq latest market headlines page and applied sentiment analysis to the retrieved text. I scraped only one web page, but that page aggregates headlines from multiple sources. I wanted to see whether the language used by different news sources varies significantly, since this information could be relevant to baseline sentiment scores in a trading strategy.

I used the Scrapy web crawling framework in Python to scrape the headlines. Because the page I scraped only contains recent business headlines, I knew I needed to scrape repeatedly for several days in order to obtain sufficient data. To expedite this process, I created a Python wrapper script to do the actual scraping and timestamp the resulting CSV file. After a week, I read all the data into a single Pandas DataFrame and conducted sentiment analysis with the TextBlob library. The subsequent data visualization and analysis was done with ggplot2 in R.
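A wrapper of this kind could look like the sketch below. It is not the project's actual script: the spider name "headlines", the "data" output directory, and the file naming pattern are all assumptions made for illustration. The only Scrapy feature it relies on is the command-line form `scrapy crawl <spider> -o <file>`, which writes the scraped items to a file whose format is inferred from the extension.

```python
import subprocess
from datetime import datetime
from pathlib import Path
from typing import List, Optional


def build_scrape_command(spider: str, out_dir: Path,
                         now: Optional[datetime] = None) -> List[str]:
    """Build a `scrapy crawl` command that writes to a timestamped CSV.

    The spider name and the file naming pattern are hypothetical.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    out_file = out_dir / f"headlines_{stamp}.csv"
    # Each run gets its own file name, so earlier scrapes are never overwritten.
    return ["scrapy", "crawl", spider, "-o", str(out_file)]


def run_scrape(spider: str = "headlines", out_dir: Path = Path("data")) -> None:
    """Run one scrape; must be called from inside the Scrapy project directory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the crawl exits with an error.
    subprocess.run(build_scrape_command(spider, out_dir), check=True)
```

Calling `run_scrape()` once per scheduled run (for example from cron) would leave one CSV per scrape in the output directory, ready to be combined later.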

The sentiment analysis algorithm in TextBlob returns two metrics for each headline: polarity and subjectivity. Polarity quantifies the emotion, positive or negative, in a sentence. A polarity score of 1 signifies a very positive headline, whereas -1 corresponds to a very negative headline. Subjectivity quantifies the degree to which a headline expresses a personal feeling or opinion. For example, “I don’t like spicy food” is very subjective whereas “The cat is brown” is not.
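As a rough illustration of how these two scores can be attached to scraped headlines, here is a minimal sketch. `TextBlob(text).sentiment` is the library's own call and returns a polarity in [-1, 1] and a subjectivity in [0, 1]. The `Sentiment` tuple, the `score_headlines` helper, and the (source, headline) row layout are hypothetical names chosen for this example, not taken from the project code.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple, Tuple


class Sentiment(NamedTuple):
    polarity: float      # -1.0 (very negative) to 1.0 (very positive)
    subjectivity: float  # 0.0 (objective) to 1.0 (subjective)


def textblob_sentiment(text: str) -> Sentiment:
    """Score one string with TextBlob's default sentiment analyzer."""
    # Imported lazily so the rest of the module loads without TextBlob installed.
    from textblob import TextBlob
    result = TextBlob(text).sentiment
    return Sentiment(result.polarity, result.subjectivity)


def score_headlines(
    rows: Iterable[Tuple[str, str]],
    analyze: Callable[[str], Sentiment] = textblob_sentiment,
) -> List[Dict[str, object]]:
    """Attach polarity and subjectivity to each (source, headline) pair."""
    scored = []
    for source, headline in rows:
        sentiment = analyze(headline)
        scored.append({
            "source": source,
            "headline": headline,
            "polarity": sentiment.polarity,
            "subjectivity": sentiment.subjectivity,
        })
    return scored
```

The returned list of dictionaries can be passed directly to `pandas.DataFrame(...)` to get one row per headline, which is a convenient shape for grouping by source.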

The charts below show distributions for these metrics for six of the most common news sources in my scraped data. Clearly there are differences in the distributions. The sources “MT Newswires” and “InvestorPlace Media” are perhaps at the opposite ends of the spectrum. “MT Newswires” has almost entirely neutral headlines and no subjectivity. On the other hand, “InvestorPlace Media” frequently publishes headlines with very positive polarity scores and high subjectivity.

I was not familiar with either “InvestorPlace Media” or “MT Newswires” prior to this project. However, my first impressions of these news sources, based on their respective websites, align with the distributions that I found. “InvestorPlace Media” has a more sensationalist and “click bait-y” feel than “MT Newswires”, which gives a somewhat academic impression. Additional data and domain knowledge would be required to truly draw a conclusion, but I was encouraged by this initial result.

As touched upon above, this project also lays the foundations for more ambitious analysis. For example, Zhang & Skiena describe and demonstrate the success of a sentiment analysis-based trading strategy in this paper. To be clear, Zhang & Skiena work with years of data and draw upon a far greater variety of sources (for example, Twitter and blogs) than I have in this project. That said, their work suggests potential business uses for the type of scraping done here.

The project code can be found here on GitHub.
