Webscraping the WSJ

Joseph van Bemmelen
Posted on Nov 21, 2016

There are many newspapers available in the New York City area that cater to different segments of the population. This project focuses on the Wall Street Journal (WSJ), an international newspaper with a high circulation in the New York area. For this project, I scraped details of articles over a three-week period in order to analyze some basic metrics of the newspaper as well as some of the topics that the WSJ focuses on.

Using Python and Scrapy, I scraped several metrics for each article:

  1. title
  2. subtitle
  3. section(s)
  4. author(s)
  5. publication time

Below is an example of an article from the WSJ that I scraped, with the individual metrics highlighted:

sample_article2
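As a rough illustration of what one scraped record might look like, here is a minimal sketch that uses only the Python standard library. The class name `Article`, its field names, and the sample values are assumptions made for this example; they are not taken from the original Scrapy project.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Article:
    """One scraped article. Field names are illustrative assumptions."""
    title: str
    subtitle: str
    sections: List[str]   # an article can belong to more than one section
    authors: List[str]    # and can have more than one author
    published: datetime   # publication timestamp


# A made-up record, only to show the shape of the data.
example = Article(
    title="Example Headline",
    subtitle="Example subtitle",
    sections=["Markets"],
    authors=["Jane Doe"],
    published=datetime(2016, 8, 10, 9, 30),
)
print(example.published.strftime("%Y-%m-%d"), example.sections)
```

Keeping sections and authors as lists reflects the "(s)" in the field list above: a single article may carry several of each.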

 

In total, the project covered 4,126 articles published over three consecutive weeks in August 2016. Several interesting findings emerged from these metrics. First, the number of articles published online varied by day of the week. The count appears to grow during the week, peaking on Wednesday and Thursday, while Saturday and Sunday have the fewest articles on average. This may reflect reader demand, since readers may be more likely to check news sources during the week than on the weekend. As a result, there may also be fewer writers working on the weekend and fewer articles being published.

articles_by_day
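The per-weekday averages could be computed along the following lines. This is a sketch under assumptions: the function name `mean_articles_per_weekday` and the sample dates are invented for illustration, and the input is assumed to be one publication date per article.

```python
from collections import Counter, defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_articles_per_weekday(pub_dates):
    """Average number of articles per weekday, given one date per article."""
    per_date = Counter(pub_dates)            # articles per calendar date
    by_weekday = defaultdict(list)
    for day, n_articles in per_date.items():
        by_weekday[WEEKDAYS[day.weekday()]].append(n_articles)
    return {name: sum(counts) / len(counts)
            for name, counts in by_weekday.items()}


# Made-up dates: two Mondays (2 and 4 articles) and one Wednesday (5 articles).
sample = [date(2016, 8, 1)] * 2 + [date(2016, 8, 8)] * 4 + [date(2016, 8, 3)] * 5
print(mean_articles_per_weekday(sample))
# {'Monday': 3.0, 'Wednesday': 5.0}
```

Note that a calendar date with no articles at all never appears in the input, so it is not averaged in as a zero.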

Additionally, the number of articles varies by section. The largest section actually consists of articles syndicated from the Associated Press (AP), which do not usually show up on the WSJ home page and include short pieces on sports scores and winning lottery numbers. The next largest sections by number of articles are Business and Markets. As its name indicates, the WSJ focuses heavily on markets and business news, which is reflected in the number of articles published in these sections.

articles_by_section

To get more specific than the section level, we can look at the topics discussed by examining the words that appear in article titles. During this three-week period, "Trump" was the most common word across all published article titles, appearing close to 150 times (over 3.5% of titles). By comparison, "Clinton" appeared around 60 times (1.5%). Another popular topic by word count during this period was "China" (around 80 times). The discrepancy between "Trump" and "Clinton" could be a result of Trump being part of more newsworthy events during this period, or a result of the WSJ focusing on one candidate more than the other.

words
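A title word count of this kind can be sketched with the standard library. The function name `title_word_counts`, the simple regular-expression tokenizer, and the sample titles are assumptions for illustration; the original analysis may have tokenized or filtered words differently.

```python
import re
from collections import Counter


def title_word_counts(titles):
    """Count, for each word, the number of titles that contain it."""
    counts = Counter()
    for title in titles:
        # Lowercase and deduplicate so a word counts once per title.
        words = set(re.findall(r"[a-z']+", title.lower()))
        counts.update(words)
    return counts


# Made-up titles, for illustration only.
titles = [
    "Trump Visits Ohio",
    "Clinton and Trump Debate",
    "China Trade Slows",
    "Trump, Trump and More Trump",
]
counts = title_word_counts(titles)
share = counts["trump"] / len(titles)
print(counts["trump"], f"{share:.1%}")
```

Counting each word once per title makes the share of titles meaningful. With the figures quoted above, 150 / 4,126 is about 3.6% and 60 / 4,126 is about 1.5%.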

To further explore the WSJ's coverage of both candidates, we can look at the correlations between mentions of the candidates' names and other words found in the same title.

word_correlations

As seen above, each candidate's last name is most strongly correlated with his or her first name. After his first name, Trump is most often associated with Trump Tower, immigration or immigrants, and his campaign. Clinton, on the other hand, is most often associated with the Clinton Foundation, Chelsea, Huma Abedin, and emails. Some of these associations may carry more negative connotations, but it is hard to tell whether the paper is merely reporting the news or focusing disproportionately on certain negative events for one candidate over the other (such as "emails" for Clinton). Looking at multiple newspapers over the same period would let us compare how much each one focuses on events that boost or hurt a candidate. That way, we might be able to estimate whether one newspaper leans towards a candidate more than another newspaper does.
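The post does not say which correlation measure was used. One common choice for two words that are each either present or absent in a title is the phi coefficient, which is the Pearson correlation of the two presence indicators. The sketch below assumes that measure; the function names and sample titles are hypothetical.

```python
import math
import re


def tokenize(title):
    """Lowercased set of words in a title."""
    return set(re.findall(r"[a-z']+", title.lower()))


def phi(titles, word_a, word_b):
    """Phi coefficient between the presence of two words across titles."""
    n11 = n10 = n01 = n00 = 0
    for title in titles:
        words = tokenize(title)
        has_a, has_b = word_a in words, word_b in words
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If either word appears in every title or in none, phi is undefined;
    # return 0.0 in that case.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# Made-up titles, for illustration only.
titles = [
    "Donald Trump Speaks",
    "Donald Trump Tower Plan",
    "Hillary Clinton Emails",
    "Markets Rally",
]
print(phi(titles, "trump", "donald"))   # words always appear together
print(phi(titles, "trump", "clinton"))  # words never appear together
```

A value near 1 means the two words tend to appear in the same titles, a value near 0 means no association, and a negative value means they tend to appear apart.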

Lastly, sentiment analysis of WSJ article titles proved largely fruitless. The classification algorithm was unable to interpret a large number of the words in the titles, possibly because of the many names and places that appear in them. The words that could be categorized skewed positive, which many would argue is not the emotion most often evoked by newspaper titles.

sentiment_analysis
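The coverage problem described above can be illustrated with a toy lexicon-based scorer. The classifier actually used is not specified in the post, so this is only a sketch: the tiny `LEXICON`, the function name `score_titles`, and the sample titles are all made up for this example.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"gain": 1, "win": 1, "strong": 1,
           "loss": -1, "fear": -1, "weak": -1}


def score_titles(titles, lexicon=LEXICON):
    """Count positive and negative words and the share of words recognized."""
    total = matched = positive = negative = 0
    for title in titles:
        for word in re.findall(r"[a-z']+", title.lower()):
            total += 1
            polarity = lexicon.get(word)
            if polarity is None:
                continue  # names, places, and other words outside the lexicon
            matched += 1
            if polarity > 0:
                positive += 1
            else:
                negative += 1
    coverage = matched / total if total else 0.0
    return {"coverage": coverage, "positive": positive, "negative": negative}


# Made-up titles, for illustration only.
titles = ["Trump Tower Plan", "Strong Gain for Markets", "Fear of Loss in China"]
print(score_titles(titles))
```

Reporting coverage next to the positive and negative counts makes it clear how much of the text the sentiment result actually rests on.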

Future research on other newspapers, such as the New York Times or New York Post, could provide additional insight into the topics that different newspapers focus on and whether some papers lean politically to one side on a topic compared to others. Additionally, this project was unable to measure an article's popularity by comment count because of the WSJ's website structure. Using Selenium, we might be able to learn more about article popularity and collect full article text for additional text analysis.

As always, please feel free to reach out to me with any comments, criticism, or other feedback on this project. Thank you!

 

About Author

Joseph van Bemmelen

Joseph van Bemmelen worked in equity research for Stifel Nicolaus, a mid-sized investment bank, for close to two years before joining NYCDSA. In his role, he wrote reports on publicly traded companies and worked extensively with financial models...

Bernardo Lares December 22, 2016
Would you be so kind and share your scrapping and sentiment codes?
