Online News Popularity
Contributed by Bin Lin. Bin is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place between January 11th and April 1st, 2016. This post is based on his third class project, web scraping (due in the 6th week of the program).
Introduction:
The digital media website mashable.com provides online news. The goals of this project were:
- Use Python to web scrape the web pages of a list of online news articles.
- Explore the information on the web page of each news article.
- Build machine learning models to predict the popularity of online news, classifying popular articles as "High" and the rest as "Low".
Web Scraping:
Python Code:
Mashable.com has a REST API that returns a list of recent news: http://mashable.com/stories.json?hot_per_page={}&new_per_page={}&rising_per_page={}&new_after={}&hot_after={}&rising_after={}. The response of the API call contains a list of news items in JSON format. An example of a single news item is below:
{ "id": "56cd7a54b589e4723c000002", "author": "Rachel Thompson", "channel": "Lifestyle", "content": "content of the article....", "link": "http://mashable.com/2016/02/24/helping-parents-after-a-death/", "post_date": "2016-02-24T09:39:15+00:00", "shares": 695, "title": "How I helped my father grieve when his mother died", "type": "new" }
With the URLs given in the list, web requests were made to download the web page of each news article. Python's BeautifulSoup library was used to parse the pages and extract useful information about each article, such as the title, author, total number of shares, number of external links, number of images, number of videos, etc.
bs = BeautifulSoup(articleFile, 'html.parser')

# get total share count
shareNode = bs.find('div', {'class': 'total-shares'})
if shareNode:
    article.shares = shareNode.get_text().replace('\n', '').replace('Shares', '')
else:
    shareNode = bs.find(lambda tag: tag.has_attr('data-shares'))
    article.shares = shareNode.get('data-shares')
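The other counts mentioned above (external links, images, videos) can be extracted in a similar way. The selectors in the sketch below are assumptions about Mashable's page markup at the time, not necessarily the exact ones used:

# Assumed markup: the article body sits in a container like
# <section class="article-content">; the selectors here are illustrative.
body = bs.find('section', {'class': 'article-content'})
if body:
    links = body.find_all('a', href=True)
    article.num_hrefs = len(links)
    # self-references are links that point back to mashable.com
    article.num_self_hrefs = sum(1 for a in links if 'mashable.com' in a['href'])
    article.num_imgs = len(body.find_all('img'))
    article.num_videos = len(body.find_all(['video', 'iframe']))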
For convenience, a class named "Article" was created to store the information of these news articles.
class Article(object):
    def __init__(self):
        self.id = None
        self.link = None
        self.post_date = None
        self.title = None
        self.author = None
        self.shares = 0
        self.channel = None
        self.type = None
        self.content = None
        self.timedelta = None
        self.n_tokens_title = 0
        self.n_tokens_content = 0
        self.num_hrefs = 0
        self.num_self_hrefs = 0
        self.num_imgs = 0
        self.num_videos = 0
        self.num_keywords = 0
        self.topics = None
        self.content_sentiment_polarity = None
        self.content_subjectivity = None
        self.title_sentiment_polarity = None
        self.title_subjectivity = None
The data was stored in a JSON file and later converted to a CSV file when R and Shiny were used for data visualization.
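A sketch of that persistence step, assuming the scraped Article objects are held in a list called articles (the file names are illustrative):

import json
import pandas as pd

# 'articles' is assumed to be the list of scraped Article objects.
records = [a.__dict__ for a in articles]

# Save to JSON...
with open('articles.json', 'w') as f:
    json.dump(records, f)

# ...and convert to CSV for the R/Shiny visualizations.
pd.DataFrame(records).to_csv('articles.csv', index=False)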
Data:
- The date of web scraping was 02/24/2016.
- For the purpose of this project, I only requested news articles published between 02/08/2016 and 02/24/2016.
- There were 1,747 news articles collected for this project.
- Data dimensions: 1,747 rows x 22 columns
Data Visualization Analysis:
What Was Hot in Topics and Titles:
Based on the labeled topics and the titles of the news, the hot words among topics and titles during the period 02/08/2016 - 02/24/2016 were: entertainment, video, grammys, valentines, etc.
Figure: Hot Topics
Figure: Hot Titles
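Those hot words can be surfaced with a simple word-frequency count over the scraped titles; a rough sketch, assuming the articles list from the scraping step and a small, illustrative stop-word list:

from collections import Counter

# A small, illustrative stop list; 'articles' is the list of scraped Article objects.
stop_words = {'the', 'a', 'an', 'of', 'to', 'in', 'and', 'is', 'for', 'on', 'with', 'how'}

title_words = Counter()
for a in articles:
    for word in a.title.lower().split():
        if word not in stop_words:
            title_words[word] += 1

# Most frequent title words -- candidates for the "Hot Titles" figure.
print(title_words.most_common(20))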
Channels:
Mashable.com categorizes its news into seven channels: Watercooler, World, Entertainment, Tech, Lifestyle, Business, and Social Media. The Watercooler channel is for stories that mashable.com thinks are cool. I don't know how they define "cool", but of course the number of articles in the "Watercooler" channel is the highest, followed by "World", "Entertainment", and "Tech" news. See the "Figure: Compare Number of Stories among Categories".
Figure: Compare Number of Stories among Categories
Authors:
The articles I collected were published by more than 50 authors. I ranked the authors by the number of published articles and plotted the top 10 in a box plot.
- Most of them have similar distributions of share counts.
- Brian Koerber has more articles and a higher number of shares than the rest.
- Emily Blake is ranked No. 2 in the number of articles, but her number of shares seems lower than the others'.
Figure: Top Authors by Number of Published Articles
Distribution of Social Shares:
Even though the collection period was only two weeks, the distribution of the number of shares can still give a sense of article popularity.
- Range of number of shares: 0 - 50,000
- Majority of articles: 700 - 1,000 shares
Figure: Histogram and Density of Number of Shares
Correlation Tile Map:
It is always important to look at the correlations among the variables. The correlation tile map shows that there were no highly correlated numeric variables.
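The visualizations in this post were done in R and Shiny, but for completeness, here is a rough Python sketch of how such a tile map can be produced, assuming the scraped data is in a DataFrame called df and that seaborn is available:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 'df' is assumed to hold the scraped data; correlate the numeric columns only.
corr = df.select_dtypes(include=[np.number]).corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='RdBu_r', vmin=-1, vmax=1, square=True)
plt.title('Correlation Tile Map')
plt.show()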
Data Transformation:
Sentiment Analysis:
Some of the text-based variables, such as "content" and "title", are free text and cannot be categorized. In order to use them in my prediction model, the TextBlob library was used to apply sentiment analysis to these variables. TextBlob is a Python library for processing textual data. It provides common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, etc.
Through sentiment analysis, three new variables were added for each text field to represent:
- how positive or negative the text is
- how subjective the text is
- how many meaningful words are in the text
Thus the output of the transformation:
- The "content" column is transformed to: n_tokens_content, content_sentiment_polarity, content_subjectivity
- The "title" column is transformed to: n_tokens_title, title_sentiment_polarity, title_subjectivity
contentBlob = TextBlob(article.content)

# Number of words in the content
article.n_tokens_content = len(contentBlob.words)

# Content sentiment
article.content_sentiment_polarity = contentBlob.sentiment.polarity
article.content_subjectivity = contentBlob.sentiment.subjectivity

titleBlob = TextBlob(article.title)

# Number of words in the title
article.n_tokens_title = len(titleBlob.words)

# Title sentiment
article.title_sentiment_polarity = titleBlob.sentiment.polarity
article.title_subjectivity = titleBlob.sentiment.subjectivity
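For a quick sanity check, running TextBlob on the sample title from the API example above shows the shape of the output:

from textblob import TextBlob

# Sample title from the API example earlier in this post.
sample = TextBlob("How I helped my father grieve when his mother died")
print(len(sample.words))   # token count used for n_tokens_title
print(sample.sentiment)    # Sentiment(polarity=..., subjectivity=...)
# polarity ranges from -1.0 (negative) to 1.0 (positive);
# subjectivity ranges from 0.0 (objective) to 1.0 (subjective).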
Categorical Data Transformation:
Since the machine learning algorithms in scikit-learn only take numeric input, the variable "channel" was converted to binary dummy variables using the get_dummies() function in the Pandas library.
def dummify(df, cate_variables):
    '''
    @Summary: convert the categorical variables to numeric variables
              by using dummies (binary). Old categorical variables will be dropped.
    @return: A copy of the old dataframe with new converted numeric variables.
    '''
    # make a copy before creating dummies
    df_new = df.copy()

    # convert each categorical column to dummies
    for var_name in cate_variables:
        dummies = pd.get_dummies(df[var_name], prefix=var_name)
        # Drop the original variable and append the dummy columns
        # (the first dummy level is dropped as the baseline).
        df_new = pd.concat([df_new.drop(var_name, 1), dummies.iloc[:, 1:]], axis=1)
    return df_new
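Applied to this dataset, the call looks like the sketch below ("channel" is the categorical column converted here; the printed column names are illustrative):

# 'df' is assumed to hold the scraped data; 'channel' is the categorical
# feature converted here. The first dummy level is dropped as the baseline.
df_numeric = dummify(df, ['channel'])
print(df_numeric.filter(like='channel_').columns.tolist())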
Prediction Model:
Since prediction with machine learning algorithms was not the primary goal of this web scraping project, only the Random Forest algorithm was used, for practice.
I chose the Random Forest algorithm for the following reasons:
- One of the best-performing classification algorithms, able to classify large amounts of data accurately
- Gives estimates of what variables are important in the classification
- Easy to learn and use
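Before running the grid search described next, the shares column has to be turned into the "High"/"Low" target and the data split into training and testing sets. That step is not shown in this post, so the sketch below is a minimal version of it, assuming the median share count as the cutoff and an illustrative feature list:

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Assumption: articles above the median share count are labeled "High",
# the rest "Low". The exact cutoff used in the project is not stated.
shares = df_numeric['shares'].astype(float)  # scraped values may be strings
threshold = shares.median()
y = (shares > threshold).map({True: 'High', False: 'Low'})

# Drop the target plus identifier/free-text columns; the exact feature list
# here is an assumption.
x = df_numeric.drop(['shares', 'id', 'link', 'title', 'content',
                     'post_date', 'author', 'topics', 'type'], axis=1)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42)

randomForest = RandomForestClassifier(random_state=42)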
Grid search was used to fine-tune the parameters to find the best Random Forest model. Three parameters were tuned. Grid search can take a long time to finish if many parameters are being tuned and the range of each parameter is wide. Because this was a relatively small dataset, I supplied relatively wide ranges for the max_depth and n_estimators parameters.
The values for the three parameters that I fine-tuned are:
- criterion: ['gini', 'entropy']
- max_depth: integer values from 2 to 19
- n_estimators: integer values from 10 to 90, in increments of 10
import sklearn.grid_search as gs

grid_para_forest = {'criterion': ['gini', 'entropy'],
                    'max_depth': range(2, 20),
                    'n_estimators': range(10, 100, 10)}

grid_search_forest = gs.GridSearchCV(randomForest, grid_para_forest,
                                     cv=5, scoring='accuracy').fit(x_train, y_train)

print 'The best score is %.4f' % grid_search_forest.best_score_
print 'The best parameters are %s' % grid_search_forest.best_params_
print 'The training error is %.4f' % (1 - grid_search_forest.score(x_train, y_train))
print 'The testing error is %.4f' % (1 - grid_search_forest.score(x_test, y_test))
The output of the grid search gave the best model, with a testing error of 0.3150 (accuracy 0.6850).
The best score is 0.7113
The best parameters are {'n_estimators': 60, 'criterion': 'entropy', 'max_depth': 16}
The training error is 0.0154
The testing error is 0.3150
The best Random Forest model from the grid search also provided the feature importances. See Figure: Feature Importance.
Figure: Feature Importance
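For reference, a sketch of how such a chart can be pulled out of the fitted grid search object (the plotting details here are my own, not necessarily those used for the figure):

import pandas as pd
import matplotlib.pyplot as plt

# Pull the fitted forest out of the grid search and rank its features.
best_forest = grid_search_forest.best_estimator_
importances = pd.Series(best_forest.feature_importances_, index=x_train.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 10))
plt.title('Feature Importance')
plt.show()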
Conclusion:
Collecting data through web scraping can be challenging if a web page is not well coded or formatted. I ran into this kind of challenge because news articles on Mashable.com have different formats for different types of articles (for example, video articles vs. text articles).
Also, online news popularity is hard to predict because it is hard to know which elements affect users' sharing behavior. Furthermore, I believe that the content of an article should affect how likely users are to share it. However, I was not able to do a deeper analysis of the content itself, even though simple sentiment analysis was applied. The Random Forest algorithm only gave me 0.6850 accuracy.
For future work, I would like to try other algorithms and compare them to the Random Forest algorithm. I would also like to see if there is a better way to analyze the article content. I found that there was actually a past Kaggle project on mashable.com online news popularity, in which participants were given a dataset of articles from 2015. I would love to dig into it and see what other Kagglers have done.