Webscraping the WSJ

Posted on Nov 21, 2016

There are many newspapers available in the New York City area that cater to different segments of the population. This project focuses on the Wall Street Journal (WSJ), an international newspaper with a high circulation in the New York area. For this project, I scraped details of articles over a three-week period in order to analyze some basic metrics of the newspaper as well as some of the topics that the WSJ focuses on.

Using Python and Scrapy, I scraped several metrics for each article:

  1. title
  2. subtitle
  3. section(s)
  4. author(s)
  5. time the article was published
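
As a rough illustration, the five fields above could be held in a small record type. This is a minimal sketch and not the project's actual Scrapy item: the class name `Article`, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Article:
    """One scraped article. Field names are illustrative, not taken from the project."""
    title: str
    subtitle: Optional[str] = None
    sections: List[str] = field(default_factory=list)  # an article can sit in several sections
    authors: List[str] = field(default_factory=list)   # and can have several authors
    published: Optional[datetime] = None


# Made-up sample values, only to show the shape of one record.
example = Article(
    title="Example headline",
    subtitle="Example subtitle",
    sections=["Markets"],
    authors=["A. Writer"],
    published=datetime(2016, 8, 10, 9, 30),
)
```

Keeping sections and authors as lists reflects the "(s)" in the list above: a single article may carry more than one of each.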

Below is an example of an article from the WSJ that I scraped, with the individual metrics highlighted:


In total, the project covered 4,126 articles from three consecutive weeks in August 2016. Several interesting findings emerged from these metrics. First, the number of articles published online varied by day of the week. The count grows during the week, peaking on Wednesday and Thursday, while Saturday and Sunday have the lowest number of articles on average. This may reflect reader demand, as readers may be more likely to check news sources during the week than on the weekend. There may therefore also be fewer writers working on the weekend and fewer articles being published.
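
A weekday breakdown like the one described above could be computed along these lines. This is a sketch under assumptions: the function name `average_articles_per_weekday` is hypothetical, the timestamps are made up, and dates with no articles at all are simply not counted.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def average_articles_per_weekday(timestamps):
    """Average number of articles per calendar day, grouped by weekday."""
    per_day = Counter(ts.date() for ts in timestamps)  # articles per calendar date
    totals = Counter()     # articles summed per weekday index (0 = Monday)
    days_seen = Counter()  # number of distinct dates observed per weekday index
    for day, n in per_day.items():
        totals[day.weekday()] += n
        days_seen[day.weekday()] += 1
    return {WEEKDAYS[i]: totals[i] / days_seen[i] for i in sorted(totals)}


# Made-up publication times: two Mondays and one Saturday in August 2016.
sample = [
    datetime(2016, 8, 1, 8, 0),
    datetime(2016, 8, 1, 12, 0),
    datetime(2016, 8, 8, 9, 0),
    datetime(2016, 8, 6, 10, 0),
]
averages = average_articles_per_weekday(sample)
```

Averaging per weekday, rather than summing, keeps the comparison fair if the scraped window does not contain the same number of each weekday.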


Additionally, the number of articles varies by section. The largest section actually consists of articles syndicated from the Associated Press (AP), which do not usually show up on the WSJ home page and include short articles on sports scores and winning lottery numbers. The next largest sections by number of articles are Business and Markets. As its name indicates, the WSJ focuses heavily on markets and business news, which is reflected in the number of articles published in these sections.


To get more specific than the section, we can look at the topics discussed by examining the words that appear in article titles. During this three-week period, "Trump" was the most common word across all published article titles, appearing close to 150 times (over 3.5% of titles). By comparison, "Clinton" appeared around 60 times (1.5%). Another popular topic by word count during this period was "China" (around 80 times). The discrepancy between "Trump" and "Clinton" could be a result of Trump being part of more newsworthy events during this period, or a result of the WSJ focusing on one candidate more than the other.
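
The title word counts above can be sketched as follows. The tokenizer (lowercase, letters and apostrophes only), the function name `title_word_stats`, and the sample titles are assumptions for illustration; the post does not say how titles were split into words. The last two lines check the quoted percentages against the 4,126-article total, using the approximate counts of 150 and 60.

```python
import re
from collections import Counter


def title_word_stats(titles):
    """Return (occurrences per word, number of titles containing each word)."""
    occurrences = Counter()
    titles_with_word = Counter()
    for title in titles:
        words = re.findall(r"[a-z']+", title.lower())
        occurrences.update(words)
        titles_with_word.update(set(words))  # count each word once per title
    return occurrences, titles_with_word


# Made-up titles, only to exercise the function.
sample_titles = [
    "Trump Speaks on Trade",
    "Clinton and Trump Debate",
    "China Trade Data Surprise",
]
occurrences, titles_with_word = title_word_stats(sample_titles)
trump_share = titles_with_word["trump"] / len(sample_titles)

# Shares implied by the approximate counts quoted in the post.
trump_pct = round(100 * 150 / 4126, 1)
clinton_pct = round(100 * 60 / 4126, 1)
```

Tracking titles containing a word separately from raw occurrences matters when a word can appear twice in one title; the percentage of titles should use the former.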


To further explore the WSJ's coverage of both candidates, we can look at the correlations between mentions of the candidates' names and other words found in the same title.


As seen above, both candidates are most correlated with their first names. After his first name, Trump is most often associated with Trump Tower, immigration or immigrants, and his campaign. Clinton, on the other hand, is most often associated with the Clinton Foundation, Chelsea, Huma Abedin, and emails. Some of these associations may have more negative connotations, but it is hard to measure whether the newspaper is merely transmitting the news or focusing disproportionately on certain negative events for one candidate over the other (such as "emails" for Hillary). By looking at multiple newspapers during the same period, we would be able to compare which newspapers focus relatively more or less on news events that boost or hurt a candidate. That way, we might be able to estimate whether one newspaper leans towards a candidate more than another newspaper does.
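
The post does not state which correlation measure was used. One plausible choice, shown here purely as an assumption, is the phi coefficient: the Pearson correlation between two yes/no indicators, where each indicator records whether a given word appears in a title. The function name `phi` and the sample titles are hypothetical.

```python
import re
from math import sqrt


def phi(titles, word_a, word_b):
    """Phi coefficient between 'title contains word_a' and 'title contains word_b'."""
    n11 = n10 = n01 = n00 = 0  # 2x2 contingency table of title counts
    for title in titles:
        words = set(re.findall(r"[a-z']+", title.lower()))
        a, b = word_a in words, word_b in words
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If either word appears in every title or in none, the correlation is undefined;
    # return 0.0 in that case.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# Made-up titles, only to exercise the function.
sample_titles = [
    "Donald Trump Visits Ohio",
    "Trump Campaign Shifts",
    "Clinton Foundation Questions",
    "Markets Rally",
]
trump_donald = phi(sample_titles, "trump", "donald")
clinton_foundation = phi(sample_titles, "clinton", "foundation")
```

A value near 1 means the two words almost always appear in the same titles; a value near 0 means the presence of one says little about the other.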

Lastly, sentiment analysis on WSJ article titles proved largely fruitless. The classification algorithm was unable to interpret many of the words in the titles, possibly due to the significant number of names and places that often appear. The words that could be categorized ended up being skewed to the positive, which many would argue is not the emotion most often evoked by newspaper titles.
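
The post does not say which classifier was used, so the following is only an illustration of why titles can be hard to score. It assumes a simple word-lexicon approach with a tiny made-up lexicon (`LEXICON`) and a hypothetical function `score_title`, and it reports what fraction of a title's words the lexicon recognizes at all.

```python
import re

# Tiny made-up lexicon for illustration only; real sentiment lexicons are far larger.
LEXICON = {"gain": 1, "rally": 1, "win": 1, "loss": -1, "crash": -1, "fear": -1}


def score_title(title, lexicon=LEXICON):
    """Return (sentiment score, share of the title's words found in the lexicon)."""
    words = re.findall(r"[a-z']+", title.lower())
    matched = [w for w in words if w in lexicon]
    score = sum(lexicon[w] for w in matched)
    coverage = len(matched) / len(words) if words else 0.0
    return score, coverage


# Made-up title: names and places ("Trump", "Ohio") are not in the lexicon.
score, coverage = score_title("Stocks Rally as Trump Visits Ohio")
```

In this example only one of the six words is recognized, so the score rests on very little of the title, which is the kind of low coverage the paragraph above describes.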


Future research on other newspapers, such as the New York Times or New York Post, could provide additional insight into the topics that different newspapers focus on and whether some newspapers lean politically to one side on a topic compared to others. Additionally, this project was unable to look at the popularity of an article as measured by comment count, due to the structure of the WSJ's website. Using Selenium, we might be able to glean more color on article popularity and look at full article text for additional text analysis.

As always, please feel free to reach out to me with any comments, criticism, or other feedback on this project. Thank you!

About Author

Joseph van Bemmelen

Joseph van Bemmelen worked in equity research for Stifel Nicolaus, a mid-sized investment bank, for close to two years before joining NYCDSA. In his role, he wrote reports on publicly traded companies and worked extensively with financial models...

Leave a Comment

Bernardo Lares December 22, 2016
Would you be so kind and share your scrapping and sentiment codes?
