Headlines in the News: a Data Visualization Project
Hello!
For my first project, we were tasked with performing EDA (Exploratory Data Analysis). The purpose of EDA, as at the beginning of any type of work with data, is to see whether visual inspection reveals any insights that might prompt further questions.
Our eyes pick up on patterns much more quickly from a simple glance at a chart than from a wall of text in a multi-page report.
The dataset I chose was the "News Aggregator Data Set." I chose it because it has 8 features for each row and more than 400k observations in total. My concerns in this initial project were, first, not having a data set large enough to challenge myself and, second, not having enough data to work with.
Let's talk about the dataset.
As this was my first project, I learned a few very important lessons along the way. One of the most important was to understand the dataset.
Seven parameters in this data set were of interest to me: the headline, the cluster id, the time stamp, the publisher, the category, the host name, and the URL.
The cluster id identifies a specific real-world event. A war in another country reported by two different papers might run under two different headlines, but in this data set those two headlines would share the same cluster id. My first thought on dealing with a large dataset was to separate the observations into more manageable chunks.
I chose to do so by creating a bar chart based on the 4 categories provided by the observations: Business, Entertainment, Health and Technology.
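As a rough illustration of the counting behind such a bar chart, here is a minimal Python sketch. The rows are made up, and the single-letter category codes and their mapping are assumptions for this example rather than something confirmed above. It prints a plain-text bar chart so that it runs without any plotting library.

```python
from collections import Counter

# Assumed mapping from single-letter category codes to display names.
CATEGORY_NAMES = {
    "b": "Business",
    "e": "Entertainment",
    "m": "Health",
    "t": "Technology",
}

# Made-up rows for illustration only: (headline, category code).
rows = [
    ("Stocks rally after earnings report", "b"),
    ("New phone released this week", "t"),
    ("Actor wins award", "e"),
    ("Singer announces tour", "e"),
    ("Study links sleep and memory", "m"),
]


def count_by_category(rows):
    """Return a Counter mapping category name to number of headlines."""
    return Counter(CATEGORY_NAMES[code] for _, code in rows)


def text_bar_chart(counts, width=20):
    """Render counts as horizontal text bars, largest category first."""
    largest = max(counts.values())
    lines = []
    for name, n in counts.most_common():
        bar = "#" * max(1, round(width * n / largest))
        lines.append(f"{name:<14}{bar} {n}")
    return "\n".join(lines)


counts = count_by_category(rows)
print(text_bar_chart(counts))
```

With the full data set, the same counts could be handed to any plotting tool to draw the actual bars.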
At first glance...
you will notice that the Entertainment category accounts for more than the other three categories combined. I thought the distinction would become clearer when placed in the context of the time stamps in my data set. My hope was that the time stamps would show how long it took for a news event to propagate through the news networks; essentially, I would be able to see how long it took for a story to go viral. This was my initial plan, built on the cluster id.
Problems
This is what ended up happening...
Remember how I said that I learned some important lessons? The biggest one was to make sure you understand your data set. In my case, I should have taken a small sample of my data first to check whether there was in fact any difference between the time stamps within a cluster. From what I could discern from the Readme file, the aggregation was taken in a single session and not over a period of time, hence a difference of 0.
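The sanity check I wish I had run first can be sketched in a few lines of Python. This is only an illustration: the rows, the cluster names, and the integer time stamps are all made up, and the function names are hypothetical. It groups time stamps by cluster id and reports the gap between the earliest and latest one in each cluster.

```python
import random
from collections import defaultdict

# Made-up sample rows: (cluster_id, timestamp as an integer).
rows = [
    ("story-a", 1_000),
    ("story-a", 4_500),
    ("story-b", 2_000),
    ("story-b", 2_000),
    ("story-c", 9_000),
]


def spread_per_cluster(rows):
    """Map each cluster id to (latest - earliest) timestamp among its rows."""
    stamps = defaultdict(list)
    for cluster_id, ts in rows:
        stamps[cluster_id].append(ts)
    return {cid: max(ts) - min(ts) for cid, ts in stamps.items()}


def sample_clusters(rows, k, seed=0):
    """Keep all rows belonging to k randomly chosen cluster ids."""
    ids = sorted({cid for cid, _ in rows})
    chosen = set(random.Random(seed).sample(ids, min(k, len(ids))))
    return [row for row in rows if row[0] in chosen]


spreads = spread_per_cluster(rows)
print(spreads)
# If every spread is 0, the time stamps cannot show how a story propagated.
print(all(s == 0 for s in spreads.values()))
```

Running something like this on a small sample of clusters would have shown right away whether the propagation idea was workable.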
Word Cloud
With time running short before my deadline, I decided to settle on a word cloud. My data set is particularly well suited to a word cloud because headlines are often built around attention-grabbing key words. Plotting the most-used words shows which ones the publishers relied on most to catch readers' attention.
[Word clouds by category: Business, Entertainment, Health, Technology]
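A word cloud is essentially a picture of word frequencies, so the counting step can be sketched without any word cloud library. The headlines and the tiny stop-word list below are made up for illustration and are not the ones used for the images above.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "on", "for", "is", "after"}

# Made-up headlines for illustration only.
headlines = [
    "Stocks rally after earnings report",
    "Earnings report lifts the market",
    "Market slips on earnings fears",
]


def word_frequencies(headlines):
    """Count lowercase words across headlines, skipping stop words."""
    counts = Counter()
    for headline in headlines:
        words = re.findall(r"[a-z']+", headline.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


freqs = word_frequencies(headlines)
# The most frequent words are the ones a word cloud would draw largest.
print(freqs.most_common(3))
```

Running this once per category gives the four sets of frequencies that four separate word clouds would be drawn from.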
Final words, Future work
I would submit to you that my first graph, though only a bar graph, is the most interesting. I did not know how much published content was devoted to entertainment. However, because of my unfamiliarity with this data set, I was not able to give as complete an analysis as I had hoped.
My future plans for this project include a Shiny app that would let me select publishers and show, as bar graphs, how each one's articles are distributed among the four categories, allowing readers to focus on publishers whose distribution best fits their interests.
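The calculation such an app would need can be sketched ahead of time. The example below is in Python rather than Shiny, uses made-up publisher names and rows, and is only one possible way to compute each publisher's share per category.

```python
from collections import Counter, defaultdict

CATEGORIES = ["Business", "Entertainment", "Health", "Technology"]

# Made-up rows for illustration only: (publisher, category).
rows = [
    ("Daily Example", "Business"),
    ("Daily Example", "Business"),
    ("Daily Example", "Technology"),
    ("Example Gazette", "Entertainment"),
    ("Example Gazette", "Health"),
]


def category_shares(rows):
    """Map each publisher to the fraction of its headlines in each category."""
    counts = defaultdict(Counter)
    for publisher, category in rows:
        counts[publisher][category] += 1
    shares = {}
    for publisher, c in counts.items():
        total = sum(c.values())
        shares[publisher] = {cat: c[cat] / total for cat in CATEGORIES}
    return shares


shares = category_shares(rows)
# Show the distribution for one selected publisher.
for cat, share in shares["Daily Example"].items():
    print(f"{cat:<14}{share:.0%}")
```

Using shares instead of raw counts keeps a small publisher comparable with a large one.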
Personal Thoughts
In this election season there has been a lot of talk about how the media has greatly influenced the current Presidential election. We are living through an era of intense political division in America, and I have heard that this may be due in part to the echo chamber we encounter on the internet, where we are only recommended things and ideas that cater to our current tastes. As I delved into this project, I could not help but wonder about the implications of the media, in their traditional or modern forms, and how their interests as businesses may shape how we perceive events.