Web-scraping Wall Street Journal articles for sentiment analysis

Philippe Heitzmann
Posted on Sep 21, 2020

Link to Shiny App: https://philippe1.shinyapps.io/WSJApp2/

Online reader engagement has become a growing focus[1] of news publications seeking to foster reader loyalty and thereby bolster web advertising revenues.[2] As the share of U.S. newspaper advertising revenue generated digitally grew steadily from an estimated 17 percent in 2011 to nearly 35 percent in 2018,[3] how newspapers engage with the nearly 70 million U.S. adults who say they prefer to get their daily news online has drawn increasing attention from most publications.[4] While the number of monthly unique visitors is one closely watched metric that informs auction prices for digital news advertising, the percentage of a website’s traffic made up of ‘power users’ – those who visit a website 10 times a month or more and spend more than an hour there over that time – is increasingly monitored as well: some of the most successful news publications, such as CNN (17.8% power user rate), Fox News (16.4%) and Yahoo (14.0%), post the highest power user readings in the online news landscape today.[5] Because commenting behavior is, perhaps unsurprisingly, positively correlated with repeat website visits and time spent on a website, online comments and inter-user discussion increasingly drive the value of online news platforms.[6]

Opinion pieces and news articles presenting a subjective and at times emotionally charged version of the facts have been shown to drive user engagement by confirming the political or ideological leanings of their readership.[7] This bias can result in a higher number of comments and shares for those articles, in turn yielding higher digital advertising revenues.[8] To investigate a possible relationship between an article's emotionality, subjectivity and positivity/negativity on the one hand and user engagement on the other, I scraped news and opinion articles from the Wall Street Journal (“WSJ”) and set out to answer the following questions:

  1. Question 1: Can a statistically significant, causal relationship be demonstrated between a WSJ article's degree of subjectivity/objectivity and positivity/negativity in its writing (as defined by widely used Python sentiment analysis libraries), and the number of online comments posted by readers for that article?
  2. Question 2: Given that the WSJ enjoys a wide readership in the financial world and that previous findings in the literature[9] indicate a weak linkage between objectivity, emotionality, frequency of media coverage and financial market fluctuations, can a statistically significant causal relationship be demonstrated between the WSJ's coverage of financial news, specifically WSJ articles’ degree of subjectivity/objectivity and positive/negative polarity on a given day t, and price movements of the S&P 500 Index on day t + n for 0 <= n <= 1?

Data & Scraping

To answer my first question, I scraped 22,772 full-text WSJ articles published between Jan-19 and July-20 from the Wall Street Journal’s news archives (https://www.wsj.com/news/archive/years). For each article, I scraped the article text, headline, sub-headline, publication date, author name, number of comments and rubric (section) name in order to gather as much text data as possible for sentiment analysis. Below is a descriptive table from the R Shiny App I built showing the different variables scraped, followed by a sample view of where this text data is located on a WSJ article page:

I found Selenium to be the best framework for scraping this data for a number of reasons, chief among them that (i) the WSJ website requires users to log in, which Selenium handles rather easily by locating the username and password fields by ID, and (ii) each article page needs time to load all of its elements before they can be located, which Selenium accommodates with a simple explicit WebDriverWait call (full code here). The login process using Selenium can be seen below:

The Exploratory Data Analysis (“EDA”) tab of the R Shiny app includes a word cloud that presents some of the most common keywords in the text dataset. The user can adjust the input slider at the top to choose how many words to display in the word cloud. Interestingly, and perhaps unsurprisingly in an election year, some of the most common keywords relate to politics, with democrats, political, Trump, Biden, and Schumer among the words appearing most frequently in the dataset. Also perhaps unsurprisingly for a national newspaper such as the WSJ that focuses on global economic and political news, more generic words such as public, rule, federal, law, and policy are among the other highest-frequency words. The EDA tab also shows a bar plot ranking WSJ sections by average number of comments posted per article. As expected, Politics and Opinion, two of the typically most polarizing sections, exhibit the highest average number of comments per article, with Politics even boasting more than double the average number of comments of the third highest-commented section, U.S.

Data Cleaning and Preprocessing

As the end goal for our S&P regression analysis was to concatenate all paragraph text for a given day into a single cell value linked to that day, empty text observations did not have to be deleted, which reduced data processing time considerably. Paragraphs published on the same day were concatenated using groupby and a lambda join function. To answer our second research question regarding a regression analysis with the S&P 500, the resulting dataset of 232 unique days was inner-joined with a dataframe of same-day stock prices and trading volumes from finance.yahoo.com. Because equity markets are closed on weekends and holidays, the resulting dataframe contains 158 unique observations, or about 68 percent of the original 232. That is approximately three percentage points less than five sevenths (about 71 percent), which makes sense given the trading schedule of equity markets.
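The concatenation and join described above could look roughly like the sketch below. The column names (`date`, `text`, `close`, `volume`) and all values are made up for illustration and are not taken from the project's data.

```python
import pandas as pd

# Made-up paragraph-level rows; the column names are assumptions.
paragraphs = pd.DataFrame({
    "date": ["2020-07-10", "2020-07-10", "2020-07-11", "2020-07-13"],
    "text": ["Stocks rose.", "", "Weekend feature.", "Markets reopened."],
})

# Made-up price rows: trading days only, so Saturday 2020-07-11 is absent.
prices = pd.DataFrame({
    "date": ["2020-07-10", "2020-07-13"],
    "close": [100.0, 101.0],
    "volume": [1_000, 1_200],
})

# One row per day: join every paragraph published that day into one string.
# Empty strings are harmless to the join, so they need not be dropped first.
daily_text = (
    paragraphs.groupby("date")["text"]
    .apply(lambda parts: " ".join(parts))
    .reset_index()
)

# The inner join keeps only days present in both frames, i.e. trading days.
merged = daily_text.merge(prices, on="date", how="inner")
print(merged)
```

Here the Saturday row disappears after the merge, which is the same mechanism that reduces 232 calendar days to 158 trading days.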

Python Sentiment Analysis

Two commonly used Python sentiment analysis frameworks, namely Valence Aware Dictionary and sEntiment Reasoner (“VADER”) and TextBlob, were used to perform sentiment analysis on the combined data. The first, VADER, is a lexicon- and rule-based sentiment analysis model available through the Python nltk package that outputs polarity (positive/negative) and intensity of emotion scores. Specifically, the four sentiment analysis scores output by VADER and used in these regression models are:

(i) VADER Variables

Negative: float in the range [0,1] representing negativity score

Neutral: float in the range [0,1] representing neutrality score

Positive: float in the range [0,1] representing positivity score

Compound: float in the range [-1,1], computed by summing the valence scores of the words in the text, adjusting them according to VADER's rules and normalizing the sum.
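The normalization step behind the compound score can be sketched as follows. As far as can be told from the reference VADER implementation, the rule-adjusted sum of word valences x is mapped to x / sqrt(x^2 + alpha) with alpha = 15; the sample inputs below are made up.

```python
import math


def normalize(score, alpha=15):
    """Map an unbounded summed valence score into the range [-1, 1]."""
    norm_score = score / math.sqrt(score * score + alpha)
    # Clamp defensively so rounding can never push the result out of range.
    return max(-1.0, min(1.0, norm_score))


# Larger absolute sums approach -1 or 1 but never exceed them.
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize(raw), 4))
```

With nltk, the four scores are returned together by `SentimentIntensityAnalyzer().polarity_scores(text)` as a dict with the keys `neg`, `neu`, `pos` and `compound`, once the `vader_lexicon` resource has been downloaded.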

Similarly, TextBlob is a Python library for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation. The two sentiment analysis scores computed by TextBlob are:

(ii) TextBlob Variables

Polarity: float in the range [-1,1], where 1 means a positive statement and -1 means a negative statement

Subjectivity: float in the range [0,1], where 1 means a subjective statement and 0 means an objective statement.

Visualizations & Data Manipulation

TextBlob - Interestingly, and contrary to what our previous EDA might suggest (Politics and Opinion, two sections with typically polarizing content, scored the highest in average number of comments per article), at first glance the TextBlob polarity of an article does not seem to have much explanatory power for the number of comments generated. Polarity scores for WSJ articles also seem to be very tightly distributed around the mean. Similarly, the TextBlob subjectivity score of an article seems to explain little of the variation in the number of comments posted on that article. Subjectivity scores also appear to vary more widely than polarity scores and seem to contain more outliers.

Looking at the distribution of the number of comments for the 12,329 articles in my dataset, it became apparent that the many low-comment articles were disproportionately contributing to this flattish relationship, and that outliers in the data were also skewing my results. Indeed, a quick filter search revealed that 2,497 of the 12,329 unique article observations in the DataFrame had 5 or fewer comments posted. Boxplot and barplot visualizations of the data (see below) both further showed how strongly right-skewed the comment distribution was, with most articles clustered at low comment counts and a long tail of heavily commented ones. The dataset was therefore filtered down to include only observations with more than five comments. Furthermore, to control for the effect of outliers, I calculated an interquartile range of 147 comments for this filtered dataset and removed outlier articles with more than the 3rd quartile (165 comments) + 1.50 * 147 comments, i.e. 385.5, or 386 comments after rounding. The resulting dataset contained 8,578 unique articles with more than 5 and at most 386 comments.
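The two filtering steps can be sketched as below. The helper name `filter_by_comments` and the sample counts are made up, and pandas' default (linear-interpolation) quantile method is an assumption, since the exact quantile calculation used is not stated.

```python
import pandas as pd


def filter_by_comments(counts, min_exclusive=5, k=1.5):
    """Drop low-comment rows, then drop rows above Q3 + k * IQR of what remains."""
    kept = counts[counts > min_exclusive]
    q1, q3 = kept.quantile(0.25), kept.quantile(0.75)
    upper_fence = q3 + k * (q3 - q1)
    return kept[kept <= upper_fence], upper_fence


# The figures quoted above: Q3 = 165 and IQR = 147 give a fence of 385.5.
article_fence = 165 + 1.5 * 147

# Made-up comment counts to exercise the helper.
counts = pd.Series([0, 3, 5, 6, 10, 20, 30, 40, 1000])
filtered, fence = filter_by_comments(counts)
print(article_fence, fence, filtered.tolist())
```

Note that the quartiles are computed after the low-comment rows are removed, matching the order of operations described above.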

With this new filtered dataset in hand, I proceeded to do a similar visual regression plot analysis using the seaborn library. This time, in addition to looking at the relationship between number of comments and TextBlob Polarity and Subjectivity scores of the article paragraph text, I also examined the relationship between number of comments and TextBlob Polarity and Subjectivity scores of the headline text. Similar to our preliminary visualizations of the relationship between these variables in the full dataset, at first glance these plots (rather disappointingly) also show no real relationship amongst these variables.



VADER – The observed relationship between VADER positivity score and number of comments for the full dataset also appears to be flattish on the whole. This relationship seems weaker than I expected as I would have thought that more optimistic and feel-good articles would be shared more widely, while the below graph seems to show there is close to no relationship between VADER positivity score and number of comments posted on that article.

A seemingly high positive correlation between VADER negativity score and number of comments for the full dataset was perhaps the most interesting and surprising finding of this visual analysis, as I would have initially theorized that more pessimistic and negative articles would be read and shared less and therefore lead to a lower number of posted comments for that article.

To control for the effect of outliers on our visualizations, I also calculated the same VADER sentiment analysis scores for the paragraph text of the filtered dataset of 8,578 unique articles. Interestingly, while VADER positivity score and number of comments maintained a flat relationship, the positive relationship between VADER negativity score and number of comments lessened somewhat, indicating that outlier scores were likely driving much of that relationship in the full dataset. The next step in the process was to use linear regression to quantify these relationships between sentiment analysis scores and number of comments, however slight they may be.

Simple Linear Regression Analysis - Results

Number of comments - I regressed the number of comments on the sentiment analysis variables in a multiple linear regression in order to better quantify the relationship between these variables on the full dataset of 12,329 unique article observations. As evidenced by the very low adj R^2 of 0.014, a linear regression of number of comments against TextBlob polarity, TextBlob subjectivity, VADER positivity, and VADER negativity is a poor model for explaining the variance in number of comments posted on WSJ articles. Based on the Prob (F-stat) of 0.2045, we cannot reject the null hypothesis that the beta coefficients of the polarity, subjectivity, positivity and negativity variables are all zero. Interestingly, however, the regression below shows the VADER negativity score as statistically significant at the 1% level in predicting the number of comments posted on WSJ articles.

To further investigate the explanatory power of VADER negativity score on number of comments, a simple linear regression with VADER negativity score as the only independent variable was conducted. Perhaps unsurprisingly, a standalone linear regression model with just the negativity score as a predictor yields a model with a similarly low Adj R^2 value of 0.014.
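For readers unfamiliar with the Adj R^2 figure quoted above, the sketch below fits an ordinary least squares model and computes R^2 and adjusted R^2 by hand. The source does not say which library produced the regression tables, so plain numpy is used here; the `ols_adjusted_r2` helper and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def ols_adjusted_r2(X, y):
    """Fit y = b0 + X @ b by least squares; return (coefficients, R^2, adjusted R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    n, p = X.shape  # n observations, p predictors (intercept excluded)

    design = np.column_stack([np.ones(n), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of predictors used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return coef, r2, adj_r2


# Synthetic data: a weak positive link between negativity and comments, plus noise.
rng = np.random.default_rng(0)
negativity = rng.uniform(0.0, 0.3, size=200)
comments = 50 + 100 * negativity + rng.normal(0, 60, size=200)

coef, r2, adj_r2 = ols_adjusted_r2(negativity, comments)
print(f"slope={coef[1]:.1f}  R^2={r2:.3f}  adj R^2={adj_r2:.3f}")
```

Because the adjustment only ever lowers R^2, a predictor can be individually significant while the model as a whole still explains very little of the variance, which is the situation described above.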

Although this would still be a poor model for predicting the number of comments on WSJ articles, its simplicity makes it preferable to the linear model with all sentiment analysis variables included. One possible explanation for this positive and significant relationship between number of comments and negativity score is that WSJ articles with higher negativity scores are more likely to report events that prompt public expressions of grief or support. Such tragic events, like the death of a public figure or some other calamity affecting large numbers of people, could therefore generate more comments on that article. Further investigation is needed to pinpoint the reasons driving this relationship.

Same-day S&P 500 % Change - Lastly, to answer our second research question, a total of four linear regression analyses of the sentiment analysis variables against (i) same-day percentage change in the S&P 500 and (ii) following-day percentage change in the S&P 500 were run. The first three models respectively regressed same-day S&P 500 percentage change against (i) TextBlob variables only, (ii) VADER variables only, and (iii) all variables combined. One can observe that the Adj R^2 values of ~0.01 and the high p-values across the board show these are poor models with low predictive power, even with the VADER neutral values and SPX Volume numbers added as extra independent variables. As expected, our results show WSJ sentiment analysis has low predictive power in relation to same-day SPX moves, although, quite interestingly, the TextBlob polarity of WSJ articles is significant at the 10% level. This finding is interesting in the context of our previous EDA of polarity vs number of article comments, which showed a quasi-flat relationship. It may be that articles that score higher on the polarity index fall on more volatile days in the stock market, leading to more emotionally polarized articles, although further investigation is required to better ascertain this.
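The two dependent variables (n = 0 and n = 1) could be built along the lines of the sketch below. It assumes close-to-close percentage changes, which the source does not specify, and uses made-up prices and column names.

```python
import pandas as pd

# Made-up closing prices indexed by trading day.
closes = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.to_datetime(["2020-07-10", "2020-07-13", "2020-07-14"]),
    name="close",
)

returns = pd.DataFrame({"close": closes})

# n = 0: percentage change from the previous trading day's close to day t's close.
returns["same_day_pct"] = closes.pct_change() * 100

# n = 1: the change realized on day t + 1, aligned back onto day t so that
# day t's sentiment scores sit in the same row as the following day's move.
returns["next_day_pct"] = returns["same_day_pct"].shift(-1)

print(returns)
```

The first row has no same-day change and the last row has no following-day change, so those rows would drop out of the respective regressions.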

Following-day S&P 500 % Change – A linear regression model with following-day S&P 500 % change as the dependent variable similarly shows low predictive power. Indeed, a Prob(F-Statistic) of 0.315 for this model means that, if all the beta coefficients were truly zero, an F-statistic at least this large would still be observed about 31.5% of the time, so we cannot reject the null hypothesis that the coefficients are jointly zero. Furthermore, the low Adj R^2 value of 0.008 also indicates low explanatory power. Interestingly, the VADER compound variable, with a p-value of 0.028, is statistically significant at the 5% level; its beta coefficient of 0.36 suggests that a 1 unit increase in the VADER compound score is associated with a +0.36% change in the S&P 500 the following day. Nevertheless, given the weak Prob(F-stat) and Adj R^2 values as well as the relatively low number of observations in the dataset, further investigation and data collection are necessary in order to better ascertain some of the interactions between these variables.

My email is [email protected] if you would like to discuss this project. Thanks for reading!



[1] Masullo Chen, G., Ng, Y. M. M., Riedl, M. J., & Chen, V. Y. (2020). Exploring how online political quizzes boost interest in politics, political news, and political engagement. Journal of Information Technology & Politics, 17(1), 33-47.

[2] Farhi, P. (2007). Salvation? The embattled newspaper business is betting heavily on web advertising revenue to secure its survival. But that wager is hardly a sure thing. American Journalism Review, 29(6), 18-24.

[3] (2018). Share of newspaper advertising revenue coming from digital advertising. Pew Research Center.

[4] Geiger, A. W. (2019). Key findings about the online news landscape in America. Pew Research Center.

[5] Olmstead, K. & Rosenstiel, T. (2011). Navigating News Online: Where People Go, How They Get There and What Lures Them Away. Pew Research Center.

[6] Ziegele, M., Weber, M., Quiring, O., & Breiner, T. (2018). The dynamics of online news discussions: Effects of news articles and reader comments on users’ involvement, willingness to participate, and the civility of their contributions. Information, Communication & Society, 21(10), 1419-1435.

[7] Bakir, V., & McStay, A. (2018). Fake news and the economy of emotions: Problems, causes, solutions. Digital Journalism, 6(2), 154-175.

[8] Bakir, V., & McStay, A. (2018). Fake news and the economy of emotions: Problems, causes, solutions. Digital Journalism, 6(2), 154-175.

[9] Aman, H. (2013). An analysis of the impact of media coverage on stock price crashes and jumps: Evidence from Japan. Pacific-Basin Finance Journal, 24, 22-38.



