Web Scraping - Uncovering Predictors of Netflix Paid Membership from Netflix Originals
Introduction and Motivation
Netflix is a fantastic company. Its shares have risen steadily over the past ten years, growing more than 40-fold since 2012. A $1,000 investment made in January 2007 would have been worth more than $110,000 by April 2019.
As an industry leader in video streaming, Netflix invested a whopping $13 billion in streaming content in 2018, around 85% of its total spending. Maintaining high-quality original content is one of the core capabilities that keeps Netflix ahead of its competition, and its investment in original content has paid off. Wanting to look further into what contributes to Netflix’s phenomenal success was the motivation for this project, which collects and explores the relevant data on Netflix Originals.
Netflix's first-quarter 2019 report, published on its official website, shows that revenue from paid memberships makes up 98.22% of total revenue, a share that is much the same in every quarter. That indicates that Netflix’s profit model centers on a single source. Being able to forecast paid memberships would therefore be highly valuable, since it can be a reliable indicator of Netflix's revenue and profit, which can directly influence the stock price.
The data were collected using Scrapy (Python) from three sources: IMDb (shows listed with Netflix as the distributor), Wikipedia (lists of Netflix original programming and films), and the Netflix Media Center (upcoming shows). The variables include Title, Genre, Premiere date of each season, Length, Language, Distribution, Number of reviews, Count of ratings, Average rating, etc.
After cleaning and merging, I had 565 rows and ran a series of analyses, including an examination of the relationship between the independent and dependent variables.
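As a rough illustration of the cleaning and merging step, the sketch below joins records from two sources on a normalized title. It uses only the standard library. The field names (`title`, `avg_rating`, `genre`, and so on), the helper names, and the sample rows are assumptions made up for this example, not the project's actual schema or data.

```python
import re


def normalize_title(title):
    """Lowercase, drop punctuation, and collapse whitespace so that
    the same show scraped from different sites gets the same key."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return " ".join(cleaned.split())


def merge_by_title(imdb_rows, wiki_rows):
    """Inner-join two lists of dicts on the normalized title.
    Rows without a match in the other source are dropped."""
    wiki_index = {normalize_title(row["title"]): row for row in wiki_rows}
    merged = []
    for row in imdb_rows:
        key = normalize_title(row["title"])
        if key in wiki_index:
            # Fields from the first source win when both define the same key.
            merged.append({**wiki_index[key], **row, "title_key": key})
    return merged


# Placeholder rows, invented for illustration only.
imdb_rows = [
    {"title": "Example Show: Part One", "avg_rating": 7.5, "rating_count": 1200},
    {"title": "Unmatched Title", "avg_rating": 6.0, "rating_count": 300},
]
wiki_rows = [
    {"title": "example show part one ", "genre": "Thriller", "premiere": "2018-03-02"},
]

merged = merge_by_title(imdb_rows, wiki_rows)
print(merged)
```

An inner join keeps only shows found in both sources, which is one plausible reason the merged table ends up smaller than the raw scrapes; a different join policy would be equally defensible.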
Quarterly paid memberships show a steady growth pattern that closely matches quarterly revenue, indicating that paid membership does act as a primary driver of revenue, which supports my hypothesis.
Looking at the pairwise correlations between the independent and dependent variables, I found a positive and nearly linear correlation between the number of shows released and paid memberships per quarter, which suggests the release count can be a strong predictor.
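For readers who want to see what such a correlation check looks like, here is a minimal sketch of the Pearson correlation coefficient written with the standard library only. The two lists are placeholder numbers invented for the example, not the scraped data, and `pearson_r` is a helper name chosen for this sketch.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values: shows released per quarter and paid memberships
# (in millions) for the same quarters. Not real data.
shows_per_quarter = [10, 20, 30, 40]
paid_memberships = [100.0, 110.0, 125.0, 135.0]

r = pearson_r(shows_per_quarter, paid_memberships)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 indicates a strong positive linear relationship. Because both series trend upward over time, a high coefficient alone does not show that releases cause membership growth.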
Since Netflix announces the shows it will release in the following months and quarters at its media center, we can use this predictor together with other relevant predictors to build models that forecast the company’s profit. According to a friend of mine who works at a hedge fund, this is the kind of approach hedge funds use to make predictions.
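One simple way to turn the release count into a forecast is an ordinary least-squares line with a single predictor. The sketch below is an assumption about how such a model could look, not the model used by any fund or by this project, and the numbers are placeholders invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Placeholder history: shows released per quarter and paid memberships
# (in millions). Not real data.
shows_per_quarter = [10, 20, 30, 40]
paid_memberships = [100.0, 110.0, 125.0, 135.0]

slope, intercept = fit_line(shows_per_quarter, paid_memberships)

# Hypothetical number of upcoming shows listed for the next quarter.
upcoming_shows = 50
forecast = predict(slope, intercept, upcoming_shows)
print(f"forecast paid memberships: {forecast:.1f} million")
```

A realistic model would add the other predictors mentioned above and be validated on held-out quarters, since a single trending variable can easily overfit.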
The other explorations:
Urban fantasy, political, thriller, and science fiction/thriller are the most popular genres of Netflix Originals. If the upcoming releases include these popular genres, we can consider assigning them greater weight in the prediction.
English dominates, followed by Spanish and Hindi. Judging by language, the Spanish and Hindi markets may be the submarkets Netflix is investing in.
March, April, and May are the most productive months.
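The month-level tally in the list above comes down to counting premieres by calendar month. A minimal sketch follows, assuming premiere dates are stored as ISO `YYYY-MM-DD` strings; the dates shown are placeholders, and `releases_per_month` is a helper name chosen for this example.

```python
from collections import Counter
from datetime import date


def releases_per_month(premiere_dates):
    """Count premieres by calendar month (1 = January ... 12 = December)."""
    return Counter(date.fromisoformat(d).month for d in premiere_dates)


# Placeholder premiere dates, invented for illustration only.
premieres = ["2018-03-02", "2018-03-16", "2018-04-06", "2017-11-17"]

counts = releases_per_month(premieres)
# most_common() sorts months from most to fewest releases.
for month, n in counts.most_common():
    print(month, n)
```

The same `Counter` pattern works for tallying genres or languages by swapping the key that is counted.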
In future work, I may look for the most popular and productive months for each genre. For instance, animation and cartoons may be more prevalent during school vacations, and more of their reviews may come from parents.
I would also look for more meaningful variables, test their correlation with paid membership, and build a prediction model based on them. Finally, I would improve the web scraping code to capture more complete datasets and do more data analysis.