Headlines in the News: a Data Visualization Project

Frederick Cheung
Posted on Oct 23, 2016

Hello!

For my first project we were tasked with performing EDA (Exploratory Data Analysis). The purpose of EDA, as at the start of any work with data, is to see whether visual inspection turns up insights that prompt further questions.

Our eyes pick up on patterns far more quickly from a glance at a chart than from a wall of text in a multi-page report.

The dataset I chose was the "News Aggregator Data Set." I chose it because it has 8 features for each row and a total of more than 400k observations. My concerns in this initial project were first, not having a large enough data set to challenge myself, and second, not having enough data with which to work.

Let's talk about the dataset.

Because this was my first project, I learned a few very important lessons along the way. One of the most important was to understand the dataset.

Seven main parameters in this data set were of interest to me: the headline, the cluster id, the time stamp, the publisher, the category, the host name, and the URL.

The cluster id identifies a specific news event. A war in another country reported in two different papers might have two different headlines, but in this data set those two headlines would share the same cluster id.

My first thought on dealing with a large dataset was to separate the observations into more manageable chunks.

I did so by creating a bar chart of the four categories assigned to the observations: Business, Entertainment, Health and Technology.
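A bar chart like this rests on a simple per-category count of headlines. The following is a minimal sketch in Python, not the code behind the chart in this post. It assumes a tab-separated file with a `CATEGORY` column holding single-letter codes (`b`, `e`, `m`, `t`); the column name, the codes, and the sample headlines are assumptions for illustration and should be checked against the dataset's Readme.

```python
import csv
import io
from collections import Counter

# Assumed mapping from category codes to names; verify against the Readme.
CATEGORY_NAMES = {
    "b": "Business",
    "e": "Entertainment",
    "m": "Health",
    "t": "Technology",
}


def count_by_category(rows, category_field="CATEGORY"):
    """Count headlines per category from an iterable of dict rows."""
    counts = Counter(row[category_field] for row in rows)
    # Unknown codes are kept as-is rather than dropped.
    return {CATEGORY_NAMES.get(code, code): n for code, n in counts.items()}


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "TITLE\tCATEGORY\n"
    "Fed holds rates steady\tb\n"
    "New phone released\tt\n"
    "Award show recap\te\n"
    "Movie sequel announced\te\n"
)
counts = count_by_category(csv.DictReader(sample, delimiter="\t"))

# Print a crude text bar chart, largest category first.
for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14}{'#' * n} {n}")
```

To run this on the real file, the `io.StringIO` sample could be replaced with an opened file handle passed to `csv.DictReader`.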

[Image: bar graph of headline counts by category]

 

At first glance...

you will notice that the Entertainment category accounts for more than the other three categories combined. I thought the distinction would be better clarified in the context of the time stamps in my data set. My hope was that the time stamps would show how long it took for a news event to propagate through the news networks; essentially, I would be able to see how long it took for a story to go viral. My initial plan was to measure this by cluster id.

https://gist.github.com/ggionx/b7720ecc741219463e2a9ee15a741a37
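Independently of the linked gist, the idea behind this plan can be sketched in a few lines of Python: group the observations by cluster id and subtract the earliest time stamp from the latest. The function name, the sample ids, and the millisecond time stamps below are hypothetical and only illustrate the calculation.

```python
def propagation_spans(rows):
    """Return {cluster_id: latest_timestamp - earliest_timestamp}."""
    first_last = {}
    for cluster_id, timestamp in rows:
        lo, hi = first_last.get(cluster_id, (timestamp, timestamp))
        first_last[cluster_id] = (min(lo, timestamp), max(hi, timestamp))
    return {cid: hi - lo for cid, (lo, hi) in first_last.items()}


# Made-up (cluster id, time stamp in milliseconds) pairs.
rows = [
    ("story-a", 1_000),
    ("story-a", 4_500),
    ("story-b", 2_000),
    ("story-b", 2_000),
]
spans = propagation_spans(rows)
print(spans)
```

A span of zero for a cluster means every headline in it carries the same time stamp, so nothing can be said about how the story spread.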

Problems

This is what ended up happening...

https://gist.github.com/ggionx/c961a2608840a7bd739eea36157bd5d7

Remember how I said that I learned some important lessons? The biggest one was to make sure you understand your data set. In my case, I should have taken a small sample of my data to see whether there was in fact a time differential between the time stamps. From what I was able to discern from the Readme file, the aggregation was taken from a single session and not over a period of time, hence a difference of 0.
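A quick sanity check of this kind could look like the sketch below: draw a random sample of cluster ids and report what share of them have only one distinct time stamp. This is an illustration in Python under assumed inputs (pairs of cluster id and time stamp), not the check I actually ran; the function name and sample data are hypothetical.

```python
import random
from collections import defaultdict


def zero_span_share(rows, sample_size, seed=0):
    """Sample cluster ids and return the share whose time stamps are all identical."""
    stamps_by_cluster = defaultdict(set)
    for cluster_id, timestamp in rows:
        stamps_by_cluster[cluster_id].add(timestamp)

    cluster_ids = sorted(stamps_by_cluster)
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    sampled = rng.sample(cluster_ids, min(sample_size, len(cluster_ids)))
    if not sampled:
        return 0.0

    zero = sum(1 for cid in sampled if len(stamps_by_cluster[cid]) == 1)
    return zero / len(sampled)


# Made-up data: two clusters with identical time stamps, one with a spread.
rows = [
    ("story-a", 100), ("story-a", 100),
    ("story-b", 200), ("story-b", 200),
    ("story-c", 300), ("story-c", 900),
]
share = zero_span_share(rows, sample_size=10)
print(f"{share:.0%} of sampled clusters show no time differential")
```

A share close to 100% on a small sample would be an early warning that a propagation-time analysis has nothing to measure.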

 

Word Cloud

With time running short before my deadline, I decided to settle on a word cloud. My data set is particularly suited to a word cloud because headlines are often built on attention-grabbing key words. Plotting the most-used words shows what publishers most relied on to gain readers' attention.
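Underneath any word cloud is a word-frequency table. The sketch below shows one way to build it in Python; it is not the code used for the clouds in this post. The stop-word list is deliberately tiny and the headlines are made up, so both are assumptions for illustration only.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "on", "is", "with"}


def top_words(headlines, n=10):
    """Return the n most frequent non-stop-words across the headlines."""
    counts = Counter()
    for headline in headlines:
        # Lowercase and keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", headline.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Made-up headlines standing in for one category of the data set.
headlines = [
    "New phone launch draws crowds",
    "Phone maker unveils new tablet",
    "The new tablet is a hit",
]
print(top_words(headlines, n=3))
```

The resulting (word, count) pairs are the input a word cloud renderer scales its font sizes by, so running this once per category would give one table per cloud.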

Business

[Image: technologywordcloud2]

Entertainment

[Image: entertainmentwordcloud2]

Health

[Image: healthwordcloud2]

Technology

[Image: businesswordcloud2]

Final words, Future work

I would submit to you that my first graph, though only a bar graph, is the most interesting. I did not know how much published content was entertainment. However, because of my unfamiliarity with this data set, I was not able to give as complete an analysis as I had hoped.

My future plans for this project include a Shiny app that would let me select a publisher and show its relative distribution across the four categories as a bar graph, allowing readers to focus on publishers whose distribution best fits their interests.
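The calculation such an app would need for each selected publisher is a set of per-category shares. Here is a minimal Python sketch of that step alone, with no user interface; the publisher names and rows are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter


def category_shares(rows, publisher):
    """Return {category: share of the publisher's headlines}; empty if none found."""
    counts = Counter(cat for pub, cat in rows if pub == publisher)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}


# Made-up (publisher, category) pairs.
rows = [
    ("Example Daily", "Business"),
    ("Example Daily", "Business"),
    ("Example Daily", "Health"),
    ("Sample Times", "Entertainment"),
]
shares = category_shares(rows, "Example Daily")
for category, share in sorted(shares.items()):
    print(f"{category:<14}{share:.0%}")
```

Using shares rather than raw counts keeps large and small publishers comparable on the same bar graph.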

Personal Thoughts

In this election season there has been a lot of talk about how the media has greatly influenced the current Presidential election. We are in an era of extreme political division in America, and I have heard that this may be due in part to the echo chamber we encounter on the internet, where we are only recommended things and ideas that cater to our current tastes. As I delved into this project, I could not help but wonder about the implications of the media, in their traditional or modern forms, and how their interests as businesses may shape how we perceive events.

About Author

Frederick Cheung

Hi my name is Fred. Although my educational background is an M.S. in Medical Science, my professional experience is with Small Business management, operations and sustainable business practices. I’ve recently completed a Data Science program working with languages...