Headline Sentiment Distribution and Web Scraping

Posted on Jul 15, 2019

Project GitHub | LinkedIn:   Niki   Moritz   Hao-Wei   Matthew   Oren


My web scraping project explored how headline sentiment is distributed across news sources. To do this, I scraped the Nasdaq latest market headlines page and applied sentiment analysis to the retrieved text. I scraped only this one page, but it aggregates headlines from multiple sources. I wanted to see whether the language used by different news sources varies significantly, since that information could help set baseline sentiment scores in a trading strategy.

I used the Scrapy web crawling framework in Python to scrape the headlines. Because the page contains only recent business headlines, I needed to scrape it repeatedly over several days to collect enough data. To streamline this process, I wrote a Python wrapper script that runs the scrape and timestamps the resulting CSV file. After a week, I read all the data into a single Pandas DataFrame and conducted sentiment analysis with the TextBlob library. The subsequent data visualization and analysis were done with ggplot2 in R.
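As a rough illustration of the wrapper idea, the sketch below builds a timestamped CSV name and hands it to the standard `scrapy crawl <spider> -o <file>` command. The spider name `nasdaq_headlines`, the `data` folder, and the `headlines_` file prefix are assumptions made for this example, not names taken from the project code.

```python
"""Hypothetical wrapper: run a Scrapy spider and save the items to a timestamped CSV."""
import subprocess
from datetime import datetime
from pathlib import Path
from typing import List

SPIDER_NAME = "nasdaq_headlines"  # assumed spider name, not from the original project
OUTPUT_DIR = Path("data")         # assumed output folder


def timestamped_csv_path(now: datetime, output_dir: Path = OUTPUT_DIR) -> Path:
    """Build a path such as data/headlines_20190708_093000.csv."""
    return output_dir / f"headlines_{now:%Y%m%d_%H%M%S}.csv"


def build_scrapy_command(spider_name: str, output_path: Path) -> List[str]:
    """Return the `scrapy crawl` command that writes scraped items to output_path."""
    return ["scrapy", "crawl", spider_name, "-o", str(output_path)]


def run_scrape() -> Path:
    """Run one scrape; call this from inside the Scrapy project directory."""
    output_path = timestamped_csv_path(datetime.now())
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the crawl exits with an error
    subprocess.run(build_scrapy_command(SPIDER_NAME, output_path), check=True)
    return output_path
```

Calling `run_scrape()` once per scheduled run (for example from cron or a task scheduler) would leave one CSV per scrape, and the timestamps in the file names keep later runs from overwriting earlier ones.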

The sentiment analysis algorithm in TextBlob returns two metrics for each headline: polarity and subjectivity. Polarity quantifies the emotion, positive or negative, in a sentence. It ranges from -1 to 1: a score of 1 signifies a very positive headline, whereas -1 corresponds to a very negative one. Subjectivity, which ranges from 0 to 1, quantifies the degree to which a headline expresses a personal feeling or opinion. For example, “I don’t like spicy food” is very subjective whereas “The cat is brown” is not.
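The two metrics can be read from TextBlob's `sentiment` property, as in the minimal sketch below. The helper names `textblob_score` and `score_headlines` and the dictionary layout are assumptions for illustration, not the project's actual code. The scorer is passed in as a parameter so the loop can be exercised without TextBlob installed.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Score = Tuple[float, float]  # (polarity, subjectivity)


def textblob_score(text: str) -> Score:
    """Score one headline with TextBlob's default sentiment analyzer."""
    # Imported lazily so the module loads without TextBlob (pip install textblob)
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def score_headlines(
    headlines: Iterable[str],
    scorer: Callable[[str], Score] = textblob_score,
) -> List[Dict[str, object]]:
    """Return one row per headline with its polarity and subjectivity."""
    rows = []
    for headline in headlines:
        polarity, subjectivity = scorer(headline)
        rows.append(
            {"headline": headline, "polarity": polarity, "subjectivity": subjectivity}
        )
    return rows
```

With TextBlob installed, `score_headlines(["I don't like spicy food", "The cat is brown"])` would score the two example sentences above. The resulting list of dictionaries can be passed directly to `pandas.DataFrame`.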

The charts below show the distributions of these metrics for six of the most common news sources in my scraped data. Clearly there are differences in the distributions. The sources “MT Newswires” and “InvestorPlace Media” are perhaps at opposite ends of the spectrum. “MT Newswires” has almost entirely neutral headlines and no subjectivity. On the other hand, “InvestorPlace Media” frequently publishes headlines with very positive polarity scores and high subjectivity.
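A numeric companion to charts like these is a per-source summary of each metric. The sketch below is one way to compute it with the standard library. The row layout (a `source` key plus one key per metric) and the function name `summarize_by_source` are assumptions, and the original analysis was done in R rather than with this code.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List


def summarize_by_source(
    rows: Iterable[dict], metric: str = "polarity"
) -> Dict[str, Dict[str, float]]:
    """Group scored headlines by source and summarize one metric per source."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        grouped[row["source"]].append(row[metric])
    return {
        source: {
            "n": len(values),
            "mean": mean(values),
            "spread": pstdev(values),  # population standard deviation
        }
        for source, values in grouped.items()
    }


# Made-up rows with placeholder source names, only to show the call shape
example_rows = [
    {"source": "Source A", "polarity": 0.0},
    {"source": "Source A", "polarity": 0.0},
    {"source": "Source B", "polarity": 0.5},
    {"source": "Source B", "polarity": 1.0},
]
summary = summarize_by_source(example_rows)
```

A source whose headlines are nearly all neutral would show a mean and a spread close to zero, while a source that often publishes strongly positive headlines would show a higher mean.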

I was not familiar with either “InvestorPlace Media” or “MT Newswires” prior to this project. However, my first impressions of these news sources, based on their respective websites, align with the distributions that I found. “InvestorPlace Media” has a more sensationalist and “click bait-y” feel than “MT Newswires”, which gives a somewhat academic impression. Additional data and domain knowledge would be required to truly draw a conclusion, but I was encouraged by this initial result.

As touched upon above, this project also lays the foundation for more ambitious analyses. For example, Zhang & Skiena describe and demonstrate the success of a sentiment analysis-based trading strategy in this paper. To be clear, Zhang & Skiena work with years of data and draw upon a far greater variety of sources (for example, Twitter and blogs) than I have. That said, their work suggests potential business uses for the type of scraping done in this project.

The project code can be found here on GitHub.
