Online News Popularity
Contributed by Bin Lin. Bin is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place between January 11th and April 1st, 2016. This post is based on his third class project, web scraping (due in the 6th week of the program).
Introduction:
The digital media website mashable.com provides online news. The goals of this project were:
- Use Python to web scrape the web pages of a list of online news articles.
- Explore the information on the web page of each news article.
- Build machine learning models to predict the popularity of online news, classifying popular articles as "High" and the rest as "Low".
Web Scraping:
Python Code:
Mashable.com has a REST API that returns a list of recent news: http://mashable.com/stories.json?hot_per_page={}&new_per_page={}&rising_per_page={}&new_after={}&hot_after={}&rising_after={}. The response of the API call contains a list of news items in JSON format. An example of a single news item is below:
{ "id": "56cd7a54b589e4723c000002", "author": "Rachel Thompson", "channel": "Lifestyle", "content": "content of the article....", "link": "http://mashable.com/2016/02/24/helping-parents-after-a-death/", "post_date": "2016-02-24T09:39:15+00:00", "shares": 695, "title": "How I helped my father grieve when his mother died", "type": "new" }
With the URLs given in the list, web requests were made to download the web page of each news article. Python's BeautifulSoup library was used to parse the pages and extract useful information about each article, such as the title, author, total number of shares, number of external links, number of images, number of videos, etc.
bs = BeautifulSoup(articleFile, 'html.parser')

# get total share count
shareNode = bs.find('div', {'class': 'total-shares'})
if shareNode:
    article.shares = shareNode.get_text().replace('\n', '').replace('Shares', '')
else:
    shareNode = bs.find(lambda tag: tag.has_attr('data-shares'))
    article.shares = shareNode.get('data-shares')
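The other counts mentioned above (external links, images, videos) can be extracted in a similar way. The selectors in the sketch below are assumptions about Mashable's page markup at the time, not necessarily the exact ones used:

# Assumed markup: the article body sits in a container like
# <section class="article-content">; the selectors here are illustrative.
body = bs.find('section', {'class': 'article-content'})
if body:
    links = body.find_all('a', href=True)
    article.num_hrefs = len(links)
    # self-references are links that point back to mashable.com
    article.num_self_hrefs = sum(1 for a in links if 'mashable.com' in a['href'])
    article.num_imgs = len(body.find_all('img'))
    article.num_videos = len(body.find_all(['video', 'iframe']))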
For convenience, a class named "Article" was created to store the information of these news articles.
class Article(object):
    def __init__(self):
        self.id = None
        self.link = None
        self.post_date = None
        self.title = None
        self.author = None
        self.shares = 0
        self.channel = None
        self.type = None
        self.content = None
        self.timedelta = None
        self.n_tokens_title = 0
        self.n_tokens_content = 0
        self.num_hrefs = 0
        self.num_self_hrefs = 0
        self.num_imgs = 0
        self.num_videos = 0
        self.num_keywords = 0
        self.topics = None
        self.content_sentiment_polarity = None
        self.content_subjectivity = None
        self.title_sentiment_polarity = None
        self.title_subjectivity = None
The data was stored in a JSON file and later converted to a CSV file when R and Shiny were used for data visualization.
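A sketch of that persistence step, assuming the scraped Article objects are held in a list called articles (the file names are illustrative):

import json
import pandas as pd

# 'articles' is assumed to be the list of scraped Article objects.
records = [a.__dict__ for a in articles]

# Save to JSON...
with open('articles.json', 'w') as f:
    json.dump(records, f)

# ...and convert to CSV for the R/Shiny visualizations.
pd.DataFrame(records).to_csv('articles.csv', index=False)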
Data:
- The date of web scraping was 02/24/2016.
- For the purpose of this project, I only requested news articles published between 02/08/2016 and 02/24/2016.
- There were 1,747 news articles collected for this project.
- Data dimensions: 1,747 rows x 22 columns
Data Visualization Analysis:
What Was Hot in Topics and Titles:
Based on the labeled topics and the titles of the news, the hot words among topics and titles during the period 02/08/2016 - 02/24/2016 were: entertainment, video, grammys, valentines, etc.
Figure: Hot Topics
Figure: Hot Titles
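Those hot words can be surfaced with a simple word-frequency count over the scraped titles; a rough sketch, assuming the articles list from the scraping step and a small, illustrative stop-word list:

from collections import Counter

# A small, illustrative stop list; 'articles' is the list of scraped Article objects.
stop_words = {'the', 'a', 'an', 'of', 'to', 'in', 'and', 'is', 'for', 'on', 'with', 'how'}

title_words = Counter()
for a in articles:
    for word in a.title.lower().split():
        if word not in stop_words:
            title_words[word] += 1

# Most frequent title words -- candidates for the "Hot Titles" figure.
print(title_words.most_common(20))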
Channels:
Mashable.com categorizes its news into seven channels: Watercooler, World, Entertainment, Tech, Lifestyle, Business, and Social Media. The Watercooler channel is for stories that mashable.com thinks are cool. I don't know how they define "cool", but of course the number of articles in the "Watercooler" channel is the highest, followed by "World", "Entertainment", and "Tech" news. See the "Figure: Compare Number of Stories among Categories".
Figure: Compare Number of Stories among Categories
Authors:
The articles I collected were published by more than 50 authors. I ranked the authors by the number of published articles and plotted the top 10 in a box plot.
- Most of them have similar distributions of share counts.
- Brian Koerber has more articles and a higher number of shares than the rest.
- Emily Blake is ranked No. 2 in the number of articles, but her number of shares seems lower than the others'.
Figure: Top Authors by Number of Published Articles
Distribution of Social Shares:
Even though the collection period was only two weeks, the distribution of the number of shares can still give a sense of article popularity.
- Range of number of shares: 0 - 50,000
- Majority of articles: 700 - 1,000 shares
Figure: Histogram and Density of Number of Shares
Correlation Tile Map:
It is always important to look at the correlations among the variables. The correlation tile map shows that there were no highly correlated numeric variables.
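The visualizations in this post were done in R and Shiny, but for completeness, here is a rough Python sketch of how such a tile map can be produced, assuming the scraped data is in a DataFrame called df and that seaborn is available:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 'df' is assumed to hold the scraped data; correlate the numeric columns only.
corr = df.select_dtypes(include=[np.number]).corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='RdBu_r', vmin=-1, vmax=1, square=True)
plt.title('Correlation Tile Map')
plt.show()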
Data Transformation:
Sentiment Analysis:
Some of the text-based variables, such as "content" and "title", are free text and cannot be categorized. In order to use them in my prediction model, the TextBlob library was used to apply sentiment analysis to these variables. TextBlob is a Python library for processing textual data. It provides common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, etc.
Through sentiment analysis, three new variables were added for each text field to represent:
- how positive or negative the text is
- how subjective the text is
- how many meaningful words are in the text
Thus the output of the transformation:
- The "content" column is transformed to: n_tokens_content, content_sentiment_polarity, content_subjectivity
- The "title" column is transformed to: n_tokens_title, title_sentiment_polarity, title_subjectivity
contentBlob = TextBlob(article.content)

# Number of words in the content
article.n_tokens_content = len(contentBlob.words)

# Content sentiment
article.content_sentiment_polarity = contentBlob.sentiment.polarity
article.content_subjectivity = contentBlob.sentiment.subjectivity

titleBlob = TextBlob(article.title)

# Number of words in the title
article.n_tokens_title = len(titleBlob.words)

# Title sentiment
article.title_sentiment_polarity = titleBlob.sentiment.polarity
article.title_subjectivity = titleBlob.sentiment.subjectivity
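For a quick sanity check, running TextBlob on the sample title from the API example above shows the shape of the output:

from textblob import TextBlob

# Sample title from the API example earlier in this post.
sample = TextBlob("How I helped my father grieve when his mother died")
print(len(sample.words))   # token count used for n_tokens_title
print(sample.sentiment)    # Sentiment(polarity=..., subjectivity=...)
# polarity ranges from -1.0 (negative) to 1.0 (positive);
# subjectivity ranges from 0.0 (objective) to 1.0 (subjective).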
Categorical Data Transformation:
Since the machine learning algorithms in scikit-learn only take numeric input, the variable "channel" was converted to binary dummy variables using the get_dummies() function in the Pandas library.
def dummify(df, cate_variables):
    '''
    @Summary: convert the categorical variables to numeric variables
              by using dummies (binary). Old categorical variables will be dropped.
    @return: A copy of the old dataframe with new converted numeric variables.
    '''
    # make a copy before creating dummies
    df_new = df.copy()

    # convert each categorical column to dummies
    for var_name in cate_variables:
        dummies = pd.get_dummies(df[var_name], prefix=var_name)
        # Drop the original variable and append the dummy columns
        # (the first dummy level is dropped as the baseline).
        df_new = pd.concat([df_new.drop(var_name, 1), dummies.iloc[:, 1:]], axis=1)
    return df_new
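Applied to this dataset, the call looks like the sketch below ("channel" is the categorical column converted here; the printed column names are illustrative):

# 'df' is assumed to hold the scraped data; 'channel' is the categorical
# feature converted here. The first dummy level is dropped as the baseline.
df_numeric = dummify(df, ['channel'])
print(df_numeric.filter(like='channel_').columns.tolist())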
Prediction Model:
Since prediction with machine learning algorithms was not the primary goal of this web scraping project, only the Random Forest algorithm was used, for practice.
I chose the Random Forest algorithm for the following reasons:
- One of the best-performing classification algorithms, able to classify large amounts of data accurately
- Gives estimates of what variables are important in the classification
- Easy to learn and use
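Before running the grid search described next, the shares column has to be turned into the "High"/"Low" target and the data split into training and testing sets. That step is not shown in this post, so the sketch below is a minimal version of it, assuming the median share count as the cutoff and an illustrative feature list:

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Assumption: articles above the median share count are labeled "High",
# the rest "Low". The exact cutoff used in the project is not stated.
shares = df_numeric['shares'].astype(float)  # scraped values may be strings
threshold = shares.median()
y = (shares > threshold).map({True: 'High', False: 'Low'})

# Drop the target plus identifier/free-text columns; the exact feature list
# here is an assumption.
x = df_numeric.drop(['shares', 'id', 'link', 'title', 'content',
                     'post_date', 'author', 'topics', 'type'], axis=1)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42)

randomForest = RandomForestClassifier(random_state=42)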
Grid search was used to fine-tune the parameters to find the best Random Forest model. Three parameters were tuned. Grid search can take a long time to finish if many parameters are being tuned and the range of each parameter is wide. Because this was a relatively small dataset, I supplied relatively wide ranges for the max_depth and n_estimators parameters.
The values for the three parameters that I fine-tuned are:
- criterion: ['gini', 'entropy']
- max_depth: integer values from 2 to 19
- n_estimators: integer values from 10 to 90, in increments of 10
import sklearn.grid_search as gs

grid_para_forest = {'criterion': ['gini', 'entropy'],
                    'max_depth': range(2, 20),
                    'n_estimators': range(10, 100, 10)}

grid_search_forest = gs.GridSearchCV(randomForest, grid_para_forest,
                                     cv=5, scoring='accuracy').fit(x_train, y_train)

print 'The best score is %.4f' % grid_search_forest.best_score_
print 'The best parameters are %s' % grid_search_forest.best_params_
print 'The training error is %.4f' % (1 - grid_search_forest.score(x_train, y_train))
print 'The testing error is %.4f' % (1 - grid_search_forest.score(x_test, y_test))
The output of the grid search gave the best model, with a testing error of 0.3150 (accuracy 0.6850).
The best score is 0.7113
The best parameters are {'n_estimators': 60, 'criterion': 'entropy', 'max_depth': 16}
The training error is 0.0154
The testing error is 0.3150
The best Random Forest model from the grid search also provided the feature importances. See Figure: Feature Importance.
Figure: Feature Importance
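For reference, a sketch of how such a chart can be pulled out of the fitted grid search object (the plotting details here are my own, not necessarily those used for the figure):

import pandas as pd
import matplotlib.pyplot as plt

# Pull the fitted forest out of the grid search and rank its features.
best_forest = grid_search_forest.best_estimator_
importances = pd.Series(best_forest.feature_importances_, index=x_train.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 10))
plt.title('Feature Importance')
plt.show()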
Conclusion:
Collecting data through web scraping can be challenging if a web page is not well coded or formatted. I ran into this kind of challenge because news articles on Mashable.com have different formats for different types of articles (for example, video articles vs. text articles).
Also, online news popularity is hard to predict because it is hard to know which elements affect users' sharing behavior. Furthermore, I believe that the content of an article should affect how likely users are to share it. However, I was not able to do a deeper analysis of the content itself, even though simple sentiment analysis was applied. The Random Forest algorithm only gave me 0.6850 accuracy.
For future work, I would like to try other algorithms and compare them to the Random Forest algorithm. I would also like to see if there is a better way to analyze the article content. I found that there was actually a past Kaggle project on mashable.com online news popularity, in which participants were given a dataset of articles from 2015. I would love to dig into it and see what other Kagglers have done.