Web Scraping and EDA - Uncovering Predictors of Netflix Paid Membership from Netflix Originals
Introduction and Motivation
Netflix is a fantastic company. It has sustained long-run growth, and its share price has risen steadily over the past decade, growing more than 40-fold since 2012. A $1,000 investment made in January 2007 would have been worth more than $110,000 as of April 2019.
As the industry leader in video streaming, Netflix invested a whopping $13 billion in streaming content in 2018, around 85% of its total spending. The sustained ability to produce high-quality original content is one of the core competitive advantages that keeps Netflix ahead of other market players, and original content has brought the company significant returns. That gave me the incentive to collect and explore data on Netflix Originals.
According to Netflix's first-quarter 2019 report on its official website, revenue from paid memberships made up 98.22% of total revenue in that quarter, and the share was similar in past quarters and years because the company's profit model relies on a relatively single source of income. Being able to forecast paid memberships would therefore be highly valuable: it can be a reliable indicator of Netflix's revenue and profit, which directly influence the stock price that investors care about.
Data Resources
The data were collected with Scrapy (Python) from three sources: IMDb - Shows sorted by Netflix as distributor, Wikipedia - List of Netflix Original programming and film, and Netflix Media Center - Upcoming shows. The variables include Title, Genre, Premiere of each season, Length, Language, Distribution, Number of reviews, Count of ratings, Average rating, etc.
After cleaning and merging, I had 565 rows and ran a series of analyses, including the relationship between the independent variables and the dependent variable (paid memberships).
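The cleaning and merging step can be sketched roughly as follows. This is only an illustration using the standard library, not the code used for this project: the field names, the sample rows, and the helper functions normalize_title and merge_sources are all made up, and it assumes that shows from different sources can be matched on a normalized title.

```python
import re


def normalize_title(title):
    """Lowercase, drop punctuation, and collapse whitespace so titles match across sources."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return " ".join(cleaned.split())


def merge_sources(wiki_rows, imdb_rows):
    """Inner-join two lists of dicts on the normalized title (hypothetical helper)."""
    imdb_by_title = {normalize_title(r["title"]): r for r in imdb_rows}
    merged = []
    for row in wiki_rows:
        match = imdb_by_title.get(normalize_title(row["title"]))
        if match is None:
            continue  # no rating data for this show, so drop it
        extra = {k: v for k, v in match.items() if k != "title"}
        merged.append({**row, **extra})
    return merged


# Made-up sample rows standing in for the scraped data.
wiki_rows = [
    {"title": "Example Show: One", "genre": "Thriller", "premiere": "2018-03-01"},
    {"title": "Example Show Two", "genre": "Comedy", "premiere": "2018-04-15"},
]
imdb_rows = [
    {"title": "example show one", "avg_rating": 7.9, "rating_count": 1200},
]

merged = merge_sources(wiki_rows, imdb_rows)
print(len(merged), merged[0]["title"], merged[0]["avg_rating"])
```

An inner join keeps only the shows found in both sources, which is one reason a merged table can end up with fewer rows than either input.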
Key Findings
Quarterly paid memberships show a steady growth pattern that closely tracks quarterly revenue, indicating that paid membership does act as a primary driver of revenue, which supports my earlier assumption and motivation.
Looking at the pairwise correlations between the independent and dependent variables, I found a positive, nearly linear relationship between the number of shows released per quarter and paid memberships in that quarter, which makes the release count a potentially strong predictor.
Since Netflix announces the shows coming in the following months and quarters at its Media Center, we can use this predictor together with other relevant predictors to build models that predict the company's revenue or profit. A friend of mine who works at a hedge fund told me that this is done in practice there.
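To make the correlation check concrete, here is a minimal sketch that counts releases per quarter and computes a Pearson correlation against quarterly paid memberships. The premiere dates and membership figures are made up for illustration and are not the scraped data, and quarter_of and pearson are hypothetical helper names.

```python
from collections import Counter
from datetime import date
from math import sqrt


def quarter_of(d):
    """Label a date with its calendar quarter, e.g. 2018Q2."""
    return f"{d.year}Q{(d.month - 1) // 3 + 1}"


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up premiere dates, one entry per released show.
premieres = [
    date(2018, 1, 5), date(2018, 2, 9),
    date(2018, 4, 20), date(2018, 5, 4), date(2018, 6, 1),
    date(2018, 7, 13), date(2018, 8, 3), date(2018, 9, 7), date(2018, 9, 21),
]
releases = Counter(quarter_of(d) for d in premieres)

# Made-up paid memberships per quarter, in millions.
memberships = {"2018Q1": 100.0, "2018Q2": 112.0, "2018Q3": 120.0}

quarters = sorted(memberships)
xs = [releases[q] for q in quarters]
ys = [memberships[q] for q in quarters]
r = pearson(xs, ys)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 corresponds to the positive, nearly linear pattern described above. With only a handful of quarters, the coefficient should be read as suggestive rather than conclusive.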
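As a rough sketch of how such a model could start, the snippet below fits a one-variable least-squares line to the number of shows released per quarter and uses it to forecast paid memberships for a quarter with a known number of upcoming shows. All numbers are made up, fit_line is a hypothetical helper, and a real model would combine several predictors and be validated on held-out quarters.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up history: shows released per quarter and paid memberships (millions).
shows_released = [10, 20, 30, 40]
paid_members = [105.0, 125.0, 145.0, 165.0]

slope, intercept = fit_line(shows_released, paid_members)

# Assumed number of upcoming shows for the next quarter (illustrative only).
upcoming_shows = 50
forecast = intercept + slope * upcoming_shows
print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast={forecast:.1f}")
```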
Other explorations:
Urban fantasy, political, thriller, and science fiction/thriller are the most popular genres of Netflix Originals. If upcoming shows fall into these popular genres, we can consider giving them a higher weight in the prediction.
English dominates, followed by Spanish and Hindi. Judging by language, Spanish and Hindi may be the submarkets Netflix is investing in.
March, April, and May are the most productive months.
August, October, and June are the months with the most reviews. Students may watch Netflix more often during summer vacation months such as June and August. In future work, I may look for the popular and productive months for different genres. For instance, animation and cartoons may be more prevalent during school vacations, and more of their reviews may come from parents.
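The explorations above are mostly frequency counts, which can be sketched with the standard library as below. The rows are made up, and the sketch assumes that "productive" means the number of premieres in a month and that reviews are attributed to a show's premiere month. Both are assumptions for illustration, not something the scraped data confirms.

```python
from collections import Counter, defaultdict
from datetime import date

# Made-up rows standing in for the merged dataset.
shows = [
    {"genre": "Thriller", "premiere": date(2018, 3, 2), "reviews": 120},
    {"genre": "Thriller", "premiere": date(2018, 4, 13), "reviews": 80},
    {"genre": "Comedy", "premiere": date(2018, 3, 30), "reviews": 40},
    {"genre": "Political", "premiere": date(2018, 8, 10), "reviews": 300},
]

# Most common genres.
genre_counts = Counter(s["genre"] for s in shows)

# Number of premieres per calendar month (1 = January).
releases_by_month = Counter(s["premiere"].month for s in shows)

# Total reviews per premiere month.
reviews_by_month = defaultdict(int)
for s in shows:
    reviews_by_month[s["premiere"].month] += s["reviews"]

top_genre = genre_counts.most_common(1)[0][0]
busiest_month = releases_by_month.most_common(1)[0][0]
most_reviewed_month = max(reviews_by_month, key=reviews_by_month.get)
print(top_genre, busiest_month, most_reviewed_month)
```

Grouping the same counts by genre first would be one way to look for the genre-specific popular months mentioned above.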
Future Work
I would like to find more meaningful variables and test their correlation with paid membership, build a prediction model based on them, improve the web scraping code to capture more complete datasets, and do more data analysis.