Television Trends as a Social Indicator

Emil Parikh
Posted on Feb 19, 2017

Contributed by Emil Parikh. He is currently in the NYC Data Science Academy 12-week, full-time Data Science Bootcamp program taking place from January 9th to March 31st, 2017. This post is based on his second class project - Web Scraping.

Links:   GitHub   |   App

 

Introduction

Disciplines such as economics and politics have various indicators that measure the state of different aspects of their fields. Events around the country in the past few years have caused people to question the state of the US and to express surprise about "who this country is." Given that, I am surprised there is no indicator that can tell us who we are and where we are going socially as a country. There is a collection of indicators that describe the social environment in terms of such things as poverty, obesity, and suicide rates, but these largely describe outcomes and consequences rather than preferences and personality.

Spoiler Alert! A full solution to such a complicated task is beyond the scope of this project; a full solution would require multiple scraping projects and continued feedback from professionals in social psychology. I will address this again in the next steps section. Instead, I used this time to take a first step in building a social indicator by scraping and visualizing information about television shows.

 

Data Collection

I used Scrapy and IMDbPY to gather television data from Wikipedia and IMDb, respectively. Some information was available only from Wikipedia and some only from IMDb.

[Figure: Venn diagram of the information available from Wikipedia and from IMDb]

While show titles could be found in both, I needed to scrape them from Wikipedia in order to:

  • retrieve the Wikipedia URLs for the shows, which I needed to get the network information, and
  • specify in IMDbPY which shows I wanted information for.
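The title-and-URL step can be sketched as follows. This is only an illustration and not the author's scraper (the actual Scrapy code is in the gist linked further down); it uses Python's standard `html.parser` instead of Scrapy so that it runs without extra dependencies. The sample HTML, the `ShowLinkParser` class, and the rule "keep links under /wiki/ that contain no colon" are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"


class ShowLinkParser(HTMLParser):
    """Collect (title, absolute URL) pairs from article links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.shows = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: article links start with /wiki/ and have no
            # namespace prefix such as "File:" or "Category:".
            if href.startswith("/wiki/") and ":" not in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:
                self.shows.append((title, urljoin(BASE_URL, self._href)))
            self._href = None


# Made-up markup resembling a "list of TV shows" page.
sample_html = """
<ul>
  <li><i><a href="/wiki/Cheers">Cheers</a></i> (NBC)</li>
  <li><i><a href="/wiki/I_Love_Lucy">I Love Lucy</a></i> (CBS)</li>
  <li><a href="/wiki/File:Logo.png">logo</a></li>
</ul>
"""

parser = ShowLinkParser()
parser.feed(sample_html)
shows = parser.shows
```

Each collected title can then serve as the IMDb search term, and each URL as the next page to visit for network information.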

 

Screenshots of two Wikipedia pages I scraped TV show titles and URLs from:

[Screenshots: showlist1, showlist2]

 

Screenshots of a Wikipedia show page from which I retrieved information:

[Screenshots: showinfo, showinfo2]

 

For fields common to both Wikipedia and IMDb, such as genre and start/end date, I still retrieved the information from Wikipedia. Once the scraping was finished, I filled in any missing data by collecting the same information from IMDb, along with the IMDb rating and number of votes.
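The fill-in rule described above (Wikipedia first, IMDb only for gaps, plus the IMDb-only fields) might look something like the sketch below. The field names, the `fill_missing` helper, and the example records are hypothetical and chosen for illustration; the real records in the project may be shaped differently.

```python
def fill_missing(wiki_record, imdb_record,
                 shared_fields=("genres", "start_year", "end_year")):
    """Prefer Wikipedia values; fall back to IMDb where Wikipedia has none."""
    merged = dict(wiki_record)
    for field in shared_fields:
        # Treat None, empty string, and empty list as "missing".
        if merged.get(field) in (None, "", []):
            merged[field] = imdb_record.get(field)
    # Rating and vote count only exist on the IMDb side.
    merged["imdb_rating"] = imdb_record.get("rating")
    merged["imdb_votes"] = imdb_record.get("votes")
    return merged


# Illustrative records with made-up values.
wiki = {"title": "Example Show", "network": "NBC",
        "genres": ["Sitcom"], "start_year": 1990, "end_year": None}
imdb = {"genres": ["Comedy"], "start_year": 1990, "end_year": 1995,
        "rating": 7.5, "votes": 1200}

merged = fill_missing(wiki, imdb)
```

Here the Wikipedia genre is kept because it is present, while the missing end year is taken from the IMDb record.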

 

A sample of my Wikipedia TV show scraper:

https://gist.github.com/eparikh/a188fa71f279bc04b14175e671512079

 

Using IMDbPY to get information about TV shows using show titles gathered from Wikipedia as the search term:

https://gist.github.com/eparikh/36e08f304cba67428c56345ea28bd93d

 

Visualization and Analysis

In the app, I have visualizations of:

  • count of new shows created
  • median IMDb rating of new shows
  • median number of years shows ran for
  • total number of votes on IMDb

This information is displayed for each year from the 1940s until 2016 by genre and by network.
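As a rough sketch of how these four measures could be computed per year and genre, the example below groups show records with the standard library only. The record layout, the `summarize_by_year_and_genre` function, and the sample values are assumptions for illustration; the app itself may compute them differently. The same grouping applies to networks by swapping the genre key for a network key.

```python
from collections import defaultdict
from statistics import median


def summarize_by_year_and_genre(shows):
    """Group shows by (start year, genre) and compute the four measures."""
    groups = defaultdict(list)
    for show in shows:
        # A show with several genres counts once under each genre.
        for genre in show["genres"]:
            groups[(show["start_year"], genre)].append(show)

    summary = {}
    for key, members in groups.items():
        ratings = [s["rating"] for s in members if s["rating"] is not None]
        run_lengths = [s["end_year"] - s["start_year"]
                       for s in members if s["end_year"] is not None]
        summary[key] = {
            "new_shows": len(members),
            "median_rating": median(ratings) if ratings else None,
            "median_years_run": median(run_lengths) if run_lengths else None,
            "total_votes": sum(s["votes"] or 0 for s in members),
        }
    return summary


# Made-up records for illustration only.
sample_shows = [
    {"title": "Show A", "genres": ["Comedy"], "start_year": 2000,
     "end_year": 2004, "rating": 7.0, "votes": 100},
    {"title": "Show B", "genres": ["Comedy", "Drama"], "start_year": 2000,
     "end_year": 2002, "rating": 8.0, "votes": 300},
    {"title": "Show C", "genres": ["Drama"], "start_year": 2001,
     "end_year": None, "rating": None, "votes": None},
]

summary = summarize_by_year_and_genre(sample_shows)
```

Shows with no rating or end year are still counted as new shows but are left out of the corresponding medians.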

Screenshots of some of the visualizations:

Count of new shows by genre from 1940s to 2016:

[Plots: comedy, drama, reality]

Count of new shows by network from 1940s to 2016:

[Plots: ABC, CBS, NBC]

 

The genre plots suggest that networks and show creators believe audiences want more comedies and reality shows (shows that tend to require less thinking). New dramas have not spiked as much. While the number of shows created in these genres has risen consistently, the number of shows created by the major networks has been declining since the mid-1980s. I will need to look into this further.

Next Steps

TV show data alone is not enough to answer "who are we as a society?", especially without viewership data. Some future steps I would take to build upon this project are:

  • Scrape more lists of TV shows; the lists I scraped do not appear to be complete for 2015 and 2016.
  • Obtain audience-side numbers, such as viewership of shows, genres, and networks, to get a better sense of audience preferences rather than just the creators' and networks' predictions of those preferences.
  • Compare data such as genre and viewership between traditional networks and streaming services such as Netflix, as this may give a sense of demographic contributors.
  • Include movies, music, books, magazines, news, etc. in the analysis, since no single medium will capture society.
  • Expand beyond entertainment, for example by including trends in degrees and jobs.

