Headlines in the News: a Data Visualization Project

Posted on Oct 23, 2016


For my first project, we were tasked with performing exploratory data analysis (EDA). As at the start of any work with data, the purpose is to inspect the data visually and see whether any insights emerge that might prompt further questions.

Our eyes pick up on patterns far more quickly from a simple glance at a chart than from a wall of text in a multi-page report.

The dataset I chose was the "News Aggregator Data Set." I chose it because it has 8 features for each row and a total of more than 400k observations. My concerns going into this initial project were, first, not having a large enough data set to challenge myself, and second, not having enough data with which to work.

Let's talk about the dataset.

As this was my first project, I learned a few very important lessons along the way. One of the most important was to understand the dataset.

Seven main parameters in this data set were of interest to me: the headline, the cluster id, the time stamp, the publisher, the category, the host name, and the URL.

The cluster id identifies a specific news event. A war in another country reported by two different papers might appear under two different headlines, but in this data set those two headlines would share the same cluster id. My first thought on dealing with a large dataset was to separate the observations into more manageable chunks.

I chose to do so by creating a bar chart based on the 4 categories provided by the observations: Business, Entertainment, Health and Technology.
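The counting step behind such a bar chart can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`headline`, `category`) and the sample rows are hypothetical and are not taken from the real data set.

```python
from collections import Counter

# Hypothetical rows for illustration only; not taken from the real data set.
rows = [
    {"headline": "Markets rally on jobs report", "category": "Business"},
    {"headline": "Studio announces sequel", "category": "Entertainment"},
    {"headline": "Award show draws record audience", "category": "Entertainment"},
    {"headline": "New phone unveiled", "category": "Technology"},
    {"headline": "Flu season arrives early", "category": "Health"},
]


def count_by_category(rows):
    """Return a Counter mapping each category to its number of headlines."""
    return Counter(row["category"] for row in rows)


category_counts = count_by_category(rows)

# Text rendering of a bar chart: one '#' per headline in the category.
for category, count in category_counts.most_common():
    print(f"{category:<14}{'#' * count} ({count})")
```

On the full data set the same counts would feed a plotting library instead of the text bars used here.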



At first glance...

you will notice that the Entertainment category accounts for more than the other three categories combined. I thought the distinction would be clearer when placed in the context of the time stamps in my data set. My hope was that the time stamps would show how long it took for a news event to propagate through the news networks; essentially, I would be able to see how long it took for a story to go viral. My initial plan was to measure this using the cluster id.
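A minimal sketch of that plan, assuming each row carries a cluster id and a numeric time stamp: the spread for a story is the latest time stamp in its cluster minus the earliest. The field names (`cluster_id`, `timestamp`), the unit (seconds), and the rows are all hypothetical.

```python
# Hypothetical rows for illustration; time stamps are assumed to be in seconds.
rows = [
    {"cluster_id": "A", "timestamp": 1000},
    {"cluster_id": "A", "timestamp": 4600},
    {"cluster_id": "B", "timestamp": 2000},
    {"cluster_id": "B", "timestamp": 2000},
]


def spread_by_cluster(rows):
    """Return, per cluster id, the latest time stamp minus the earliest."""
    first_seen = {}
    last_seen = {}
    for row in rows:
        cid, ts = row["cluster_id"], row["timestamp"]
        first_seen[cid] = min(ts, first_seen.get(cid, ts))
        last_seen[cid] = max(ts, last_seen.get(cid, ts))
    return {cid: last_seen[cid] - first_seen[cid] for cid in first_seen}


spreads = spread_by_cluster(rows)
print(spreads)
```

A cluster whose headlines all share one time stamp, like cluster "B" here, has a spread of 0.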


This is what ended up happening...

Remember how I said that I learned some important lessons? The biggest one was to make sure you understand your data set. In my case, I should have taken a small sample of my data to check whether there was in fact any difference between the time stamps. From what I was able to discern from the Readme file, the aggregation was taken from a single session and not over a period of time, hence a difference of 0.
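The kind of quick check I have in mind could look like the sketch below: pick a few clusters at random, keep all of their rows, and count the distinct time stamps in each. The field names and rows are hypothetical, and sampling whole clusters rather than individual rows is my own assumption about what would make the check meaningful.

```python
import random

# Hypothetical rows for illustration; every cluster shares a single time stamp.
rows = [
    {"cluster_id": "A", "timestamp": 1000},
    {"cluster_id": "A", "timestamp": 1000},
    {"cluster_id": "B", "timestamp": 2000},
    {"cluster_id": "B", "timestamp": 2000},
    {"cluster_id": "C", "timestamp": 3000},
]


def sample_clusters(rows, k, seed=0):
    """Keep every row belonging to k randomly chosen cluster ids."""
    cluster_ids = sorted({row["cluster_id"] for row in rows})
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    chosen = set(rng.sample(cluster_ids, min(k, len(cluster_ids))))
    return [row for row in rows if row["cluster_id"] in chosen]


def distinct_timestamps_per_cluster(rows):
    """Return, per cluster id, how many different time stamps it contains."""
    seen = {}
    for row in rows:
        seen.setdefault(row["cluster_id"], set()).add(row["timestamp"])
    return {cid: len(stamps) for cid, stamps in seen.items()}


sample = sample_clusters(rows, k=2)
timestamp_counts = distinct_timestamps_per_cluster(sample)
print(timestamp_counts)
```

If every sampled cluster reports exactly one distinct time stamp, there is no time differential to plot.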


Word Cloud

With time running short before my delivery date, I decided to settle on a word cloud. My data set is particularly well suited to a word cloud because headlines are often built around key attention-grabbing words. Plotting the most frequently used words shows what publishers most hoped would capture readers' attention.
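The numbers underneath a word cloud are simple word frequencies. The sketch below shows that counting step only, not the drawing; the headlines and the short stop-word list are hypothetical, and a real analysis would likely use a fuller stop-word list.

```python
import re
from collections import Counter

# A deliberately tiny, assumed stop-word list for illustration.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "on", "for", "is", "at"}

# Hypothetical headlines; not taken from the real data set.
headlines = [
    "Studio announces sequel to hit movie",
    "Movie sequel breaks box office record",
    "New phone breaks sales record",
]


def word_frequencies(headlines, stopwords=STOPWORDS):
    """Count lower-cased words across all headlines, skipping stop words."""
    counts = Counter()
    for headline in headlines:
        words = re.findall(r"[a-z']+", headline.lower())
        counts.update(word for word in words if word not in stopwords)
    return counts


frequencies = word_frequencies(headlines)
print(frequencies.most_common(4))
```

In a word cloud, each word's font size would be scaled by its count.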










Final words, Future work

I would submit to you that my first graph, though only a bar graph, is the most interesting. I had not known how much published content was entertainment. However, because of my unfamiliarity with this data set, I was not able to give as complete an analysis as I had hoped.

My future plans for this project include a Shiny app that would let me select publishers and show each one's bar-graph distribution across the four categories, allowing readers to focus on publishers whose distribution best fits their interests.
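The calculation such an app would need is each publisher's share of headlines per category. Here is a rough sketch in Python rather than Shiny; the publisher names, field names, and rows are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical rows for illustration; not taken from the real data set.
rows = [
    {"publisher": "Publisher A", "category": "Entertainment"},
    {"publisher": "Publisher A", "category": "Entertainment"},
    {"publisher": "Publisher A", "category": "Business"},
    {"publisher": "Publisher A", "category": "Health"},
    {"publisher": "Publisher B", "category": "Technology"},
    {"publisher": "Publisher B", "category": "Business"},
]


def category_shares(rows):
    """Return, per publisher, the fraction of its headlines in each category."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["publisher"]][row["category"]] += 1
    shares = {}
    for publisher, counter in counts.items():
        total = sum(counter.values())
        shares[publisher] = {cat: n / total for cat, n in counter.items()}
    return shares


shares = category_shares(rows)
for publisher, distribution in shares.items():
    print(publisher, distribution)
```

Selecting a publisher in the app would then amount to looking up its entry in `shares` and drawing those fractions as bars.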

Personal Thoughts

In this election season there has been a lot of talk about how greatly the media has influenced the current presidential election. We are in an era of intense political division in America, and I have heard that this may be due in part to the echo chamber we encounter on the internet, where we are recommended only things and ideas that cater to our current tastes. As I delved into this project, I could not help but wonder about the implications of the media, in their traditional or modern forms, and how their interests as businesses may shape how we perceive events.

About Author

Frederick Cheung

Hi my name is Fred. Although my educational background is an M.S. in Medical Science, my professional experience is with Small Business management, operations and sustainable business practices. I’ve recently completed a Data Science program working with languages...