Online News Popularity

Posted on Mar 21, 2016

Contributed by Bin Lin. Bin is currently in the NYC Data Science Academy 12 week full time Data Science Bootcamp program taking place from January 11th to April 1st, 2016. This post is based on his third class project - web scraping (due in the 6th week of the program).

Introduction:

The digital media website mashable.com provides online news. The goals of this project were to:

  • Use Python to web scrape the pages of a list of online news articles.
  • Explore the information on the web page of each article.
  • Build machine learning models to predict the popularity of online news, classifying popular articles as "High" and the rest as "Low".

Web Scraping:

Python Code:

Mashable.com has a REST API that returns a list of recent news: http://mashable.com/stories.json?hot_per_page={}&new_per_page={}&rising_per_page={}&new_after={}&hot_after={}&rising_after={}. The response of the API call contains a list of news articles in JSON format. An example of one article's information in JSON format is below:

{
"id": "56cd7a54b589e4723c000002",
"author": "Rachel Thompson",
"channel": "Lifestyle",
"content": "content of the article....",
"link": "http://mashable.com/2016/02/24/helping-parents-after-a-death/",
"post_date": "2016-02-24T09:39:15+00:00",
"shares": 695,
"title": "How I helped my father grieve when his mother died",
"type": "new"
}
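
For reference, a minimal sketch of how this endpoint might be queried with the requests library is below. The page-size values and the assumption that the response parses into a list of records shaped like the example above are mine, not taken from the post.

import requests

# Sketch only: page sizes are placeholder values, and the response is
# assumed to parse into an iterable of article records like the one above.
url = ('http://mashable.com/stories.json'
       '?hot_per_page=10&new_per_page=10&rising_per_page=10')

response = requests.get(url)
response.raise_for_status()
stories = response.json()

for record in stories:
    print(record['link'])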

With the URLs given in the list, web requests were made to download the web page of each article. Python's BeautifulSoup library was then used to parse the pages and extract useful information about each article, such as the title, author, total number of shares, number of external links, number of images, number of videos, etc.

from bs4 import BeautifulSoup

bs = BeautifulSoup(articleFile, 'html.parser')

# get total shares
shareNode = bs.find('div', {'class': 'total-shares'})
if shareNode:
    article.shares = shareNode.get_text().replace('\n', '').replace('Shares', '')
else:
    # fall back to the 'data-shares' attribute used by some article layouts
    shareNode = bs.find(lambda tag: tag.has_attr('data-shares'))
    article.shares = shareNode.get('data-shares')
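
The other counts mentioned above can be pulled from the same parsed page. The sketch below is illustrative only: the 'article-content' selector and the iframe-based video count are assumptions, not the exact selectors used in the project.

# Illustrative sketch: selector names are guesses, not the project's exact ones.
content_node = bs.find('section', {'class': 'article-content'})
if content_node:
    links = content_node.find_all('a', href=True)
    article.num_hrefs = len(links)
    article.num_self_hrefs = sum(1 for a in links if 'mashable.com' in a['href'])
    article.num_imgs = len(content_node.find_all('img'))
    article.num_videos = len(content_node.find_all('iframe'))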

For convenience, a class named "Article" was created to store the information for these news articles.

class Article(object):
    def __init__(self):
        self.id = None
        self.link = None
        self.post_date = None
        self.title = None
        self.author = None
        self.shares = 0
        self.channel = None
        self.type = None
        self.content = None
        self.timedelta = None
        self.n_tokens_title = 0
        self.n_tokens_content = 0
        self.num_hrefs = 0
        self.num_self_hrefs = 0
        self.num_imgs = 0
        self.num_videos = 0
        self.num_keywords = 0
        self.topics = None
        self.content_sentiment_polarity = None
        self.content_subjectivity = None
        self.title_sentiment_polarity = None
        self.title_subjectivity = None

The data was stored in a JSON file and converted to a CSV file when R and Shiny were used for data visualization.
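
A minimal sketch of that step, assuming `articles` is the list of Article objects and using placeholder file names:

import json
import pandas as pd

# Dump the scraped Article objects to JSON, then flatten to CSV
# for the R/Shiny visualizations (file names are placeholders).
with open('articles.json', 'w') as f:
    json.dump([vars(a) for a in articles], f)

pd.read_json('articles.json').to_csv('articles.csv', index=False)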

Data:

  • The date of web scraping was 02/24/2016.
  • For the purpose of this project, I only requested a list of news articles that were published between 02/08/2016 - 02/24/2016.
  • There were 1,747 news articles collected for this project.
  • Data dimension: 1,747 rows x 22 columns

Data Visualization Analysis:

What were Hot in Topics and Titles:

Based on the labeled topics and the titles of the news, during the period 02/08/2016 - 02/24/2016, the hot words among topics and titles were: entertainment, video, grammys, valentines, etc.

Figure: Hot Topics and Hot Titles word clouds
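
The word clouds in the post were produced with R/Shiny; as a rough Python equivalent, a title cloud could be generated along these lines (assuming `articles` holds the Article objects and the wordcloud package is installed):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join all titles into one string and render a word cloud from it.
title_text = ' '.join(a.title for a in articles if a.title)
cloud = WordCloud(background_color='white').generate(title_text)

plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()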

Channels:

Mashable.com has categorized the news into channels: Watercooler, World, Entertainment, Tech, Lifestyle, Business, and Social Media. The "Watercooler" category is for stories that mashable.com thinks are cool. I don't know how they define "cool", but of course the number of articles in the "Watercooler" category is the highest. It is followed by "World", "Entertainment", and "Tech" news. See "Figure: Compare Number of Stories among Categories".

Figure: Compare Number of Stories among Categories


Authors:

The articles I collected were published by more than 50 authors. I ranked the authors by the number of published articles and plotted the top 10 in a box plot.

  • Most of them have similar distributions of the number of shares.
  • Brian Koerber has more articles and a higher number of shares than the rest.
  • Emily Blake is ranked No. 2 in number of articles, but her numbers of shares seem lower than the others'.

Figure: Top Authors by Number of Published Articles


Distribution of Social Shares:

Even though the collected articles only covered a two-week period, the distribution of the number of shares can still give a sense of article popularity.

  • Range of number of shares: 0 - 50,000
  • Most articles had 700 - 1,000 shares

Figure: Histogram and Density of Number of Shares


Correlation Tile Map:

It is always important to look at the correlation among the variables. From the correlation tile map, there were no highly correlated numeric variables.

Figure: Correlation Tile Map
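
The tile map itself was drawn in R/Shiny; an equivalent check in Python might look like this, assuming `df` is the article DataFrame:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns, drawn as a heat map.
corr = df.corr()
sns.heatmap(corr, cmap='RdBu_r', center=0)
plt.title('Correlation Tile Map')
plt.show()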

Data Transform:

Sentiment Analysis:

Some of the text-based variables, such as "content" and "title", are free-text variables and cannot be categorized. In order to use them in my prediction model, the TextBlob library was used to apply sentiment analysis to these variables. TextBlob is a Python library for processing textual data. It provides common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, etc.

Through sentiment analysis, new variables were added to represent:

  • how positive or negative the text is
  • how subjective the text is
  • how many meaningful words are in the text

Thus the output of the transformation is:

  • the "content" column is transformed to: n_tokens_content, content_sentiment_polarity, content_subjectivity
  • the "title" column is transformed to: n_tokens_title, title_sentiment_polarity, title_subjectivity
                
from textblob import TextBlob

contentBlob = TextBlob(article.content)

# Number of words in the content
article.n_tokens_content = len(contentBlob.words)

# article sentiment
article.content_sentiment_polarity = contentBlob.sentiment.polarity
article.content_subjectivity = contentBlob.sentiment.subjectivity

titleBlob = TextBlob(article.title)

# Number of words in the title
article.n_tokens_title = len(titleBlob.words)

# title sentiment
article.title_sentiment_polarity = titleBlob.sentiment.polarity
article.title_subjectivity = titleBlob.sentiment.subjectivity

Categorical Data Transform:

Since the machine learning algorithms in scikit-learn only take numeric input, the variable "channel" was converted to binary (dummy) variables using the get_dummies() function in the Pandas library.

import pandas as pd

def dummify(df, cate_variables):
    '''
    @Summary: convert the categorical variables to numeric variables by using dummies (binary).
    Old categorical variables will be dropped.
    @return: A copy of the old dataframe with new converted numeric variables. 
    '''
    # make a copy before creating dummies
    df_new = df.copy()
    
    # convert each categorical column to dummy columns
    for var_name in cate_variables:
        dummies = pd.get_dummies(df[var_name], prefix=var_name)

        # Drop the original column and append the dummy columns
        # (the first dummy level is dropped to avoid redundancy).
        df_new = pd.concat([df_new.drop(var_name, axis=1), dummies.iloc[:, 1:]], axis=1)
    
    return df_new
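
For example, applied to the scraped data with the "channel" column (the df_model name below is just for illustration):

# Hypothetical call: replace the "channel" column with channel dummies.
df_model = dummify(df, ['channel'])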

Prediction Model:

Since predicting with a machine learning algorithm was not the primary goal of this web scraping project, only the Random Forest algorithm was used, for practice purposes.

I chose the Random Forest algorithm for practice for the following reasons:

  • It is one of the best classification algorithms - able to classify large amounts of data with accuracy.
  • It gives estimates of which variables are important in the classification.
  • It is easy to learn and use.
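
Before fitting the model, the share counts had to be turned into the "High"/"Low" label described in the introduction, and the data split into training and test sets. The post does not state the exact cutoff, so the sketch below assumes the median number of shares as the threshold and keeps only numeric predictors:

import numpy as np
from sklearn.cross_validation import train_test_split  # model_selection in newer sklearn

# Assumed threshold: articles at or above the median share count are "High".
threshold = df_model['shares'].astype(float).median()
df_model['popularity'] = np.where(df_model['shares'].astype(float) >= threshold, 'High', 'Low')

# Drop the target columns and keep only numeric predictors.
y = df_model['popularity']
x = df_model.drop(['popularity', 'shares'], axis=1).select_dtypes(include=[np.number])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)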

Grid search was used to fine-tune the parameters to find the best Random Forest model. Three parameters were fine-tuned. Grid search can take a long time to finish if many parameters are being tuned and the range of each parameter is wide. Because this was a relatively small dataset, I supplied relatively wide ranges for the max_depth and n_estimators parameters.

The values for the three parameters that I fine-tuned are:

  • criterion: ['gini', 'entropy']
  • max_depth: integer values from 2 - 19: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
  • n_estimators: integer values from 10 - 90 (incremented by 10): 10, 20, 30, 40, 50, 60, 70, 80, 90

import sklearn.grid_search as gs
from sklearn.ensemble import RandomForestClassifier

# the classifier that grid search will tune
randomForest = RandomForestClassifier()

grid_para_forest = {'criterion': ['gini', 'entropy'], 'max_depth': range(2, 20), "n_estimators": range(10, 100, 10)}
grid_search_forest = gs.GridSearchCV(randomForest, grid_para_forest, cv=5, scoring='accuracy').fit(x_train, y_train)

print 'The best score is %.4f' %grid_search_forest.best_score_
print 'The best parameters are %s' %grid_search_forest.best_params_
print 'The training error is %.4f' %(1 - grid_search_forest.score(x_train, y_train))
print 'The testing error is %.4f' %(1 - grid_search_forest.score(x_test, y_test))

The output of the grid search gave the best model. It gives a testing error of 0.3150 (accuracy 0.6850).

The best score is 0.7113
The best parameters are {'n_estimators': 60, 'criterion': 'entropy', 'max_depth': 16}
The training error is 0.0154
The testing error is 0.3150

The best Random Forest model from my grid search also gave the feature importances. See "Figure: Feature Importance".

Figure: Feature Importance

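
As a sketch, the importances behind that figure could be pulled out of the fitted grid search like this (assuming the x_train DataFrame from the split above):

import pandas as pd

# Rank features by importance from the best estimator found by the grid search.
best_forest = grid_search_forest.best_estimator_
importances = pd.Series(best_forest.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False).head(10))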

Conclusion:

Collecting data through web scraping can be challenging if a web page is not well coded or formatted. I ran into this kind of challenge because the news articles on Mashable.com have different formats for different types of articles (for example, video-type articles vs. text-type articles).

Also, online news popularity is hard to predict because it is hard to know which elements affect users' sharing behavior. Furthermore, I believe that the content of an article should influence how much users like it. However, I was not able to do a deeper analysis of the content itself, even though simple sentiment analysis was applied. The Random Forest algorithm only gave me 0.6850 accuracy.

For future work, I would like to try other algorithms and compare them to the Random Forest algorithm. I would also like to see if there is a better way to analyze the article content. I found that there was actually a past Kaggle project regarding mashable.com online news popularity; they were given a dataset with all the articles from the year 2015. I would love to dig into it to see what other Kagglers have done.

About Author

Bin Lin

Bin is a former professional in software development, now positioned for a Data Scientist role. With both strong programming skills and newly acquired machine learning skills, Bin is able to apply scientific thinking to predict or uncover insights...
