Web Scraping - Uncovering Predictors of Netflix Paid Membership from Netflix Originals

Posted on Feb 1, 2020


Introduction and Motivation

Netflix is a fantastic company. Its shares have risen steadily over the past ten years, growing more than 40-fold since 2012. A $1,000 investment made in January 2007 would have been worth more than $110,000 by April 2019.

As an industry leader in video streaming, Netflix invested a whopping $13 billion in streaming content in 2018, comprising around 85% of its total spending. Maintaining high-quality original content is one of the core capabilities that keeps Netflix ahead of its competition, and its investment in original content has paid off. Looking further into what contributes to Netflix’s phenomenal success was the motivation for this project, which collects and explores relevant data on Netflix Originals.

According to Netflix's first-quarter 2019 report on its official website, revenue from paid memberships made up 98.22% of total revenue, a share that is much the same in every quarter. That indicates that Netflix’s profit model centers on a single source. Being able to forecast paid memberships would therefore be highly valuable, since membership is a reliable indicator of Netflix's revenue and profit, which can directly influence the stock price.

Data Resources

The data were collected with Scrapy (Python) from three sources: IMDb (shows listing Netflix as distributor), Wikipedia (the lists of Netflix original programming and films), and the Netflix Media Center (upcoming shows). The variables include title, genre, premiere date of each season, length, language, distribution, number of reviews, rating count, and average rating.

After cleaning and merging, I had 565 rows, on which I ran a series of analyses, including the relationship between the independent variables and the dependent variable.
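The cleaning and merging step could look roughly like the sketch below. It is not the project's actual code: the field names, the title-normalization rule, and the sample rows are all assumptions made for illustration, and it uses only the standard library rather than whatever tooling the project used.

```python
import re

def normalize_title(title):
    """Lowercase, drop punctuation, and collapse whitespace so titles match across sites."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return " ".join(cleaned.split())

def merge_sources(*sources):
    """Merge lists of row dicts on normalized title; earlier sources win on conflicts."""
    merged = {}
    for rows in sources:
        for row in rows:
            key = normalize_title(row["title"])
            record = merged.setdefault(key, {})
            for field, value in row.items():
                # Keep the first non-empty value seen for each field.
                if value not in (None, "") and field not in record:
                    record[field] = value
    return list(merged.values())

# Placeholder rows for illustration only, not scraped data.
imdb_rows = [{"title": "Example Show!", "avg_rating": 8.1}]
wiki_rows = [{"title": "example  show", "genre": "Thriller", "language": "English"}]

shows = merge_sources(imdb_rows, wiki_rows)
```

Matching on a normalized title is the simplest possible join key. Real titles collide (remakes, regional versions), so a more careful merge would probably add the premiere year to the key.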

Key Findings

Quarterly paid memberships show a steady growth pattern that closely tracks quarterly revenue, indicating that paid membership is a primary driver of revenue, which supports my hypothesis.

Looking at the pairwise correlations between the independent and dependent variables, I found a positive and nearly linear relationship between the number of shows released in a quarter and paid memberships in that quarter, which makes the release count a potentially strong predictor.
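As a rough illustration of that check, the sketch below buckets premiere dates into quarters, counts releases per quarter, and computes the Pearson correlation against paid memberships. The dates and membership figures are placeholders chosen to be perfectly linear, not values from the scraped dataset, and the helper names are my own.

```python
from collections import Counter
from datetime import date
from math import sqrt

def quarter_of(day):
    """Return a label such as '2018Q3' for a premiere date."""
    return f"{day.year}Q{(day.month - 1) // 3 + 1}"

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Placeholder premiere dates and membership figures, for illustration only.
premieres = [
    date(2018, 1, 5), date(2018, 2, 9),
    date(2018, 4, 20), date(2018, 5, 4), date(2018, 6, 1),
    date(2018, 7, 6), date(2018, 8, 17), date(2018, 9, 7), date(2018, 9, 21),
]
paid_members = {"2018Q1": 100.0, "2018Q2": 110.0, "2018Q3": 120.0}

releases_per_quarter = Counter(quarter_of(d) for d in premieres)
quarters = sorted(paid_members)
x = [releases_per_quarter[q] for q in quarters]  # releases in each quarter
y = [paid_members[q] for q in quarters]          # memberships in each quarter
r = pearson(x, y)
```

With only a handful of quarters, a high coefficient says little on its own. Both series also trend upward over time, so part of any correlation may simply reflect shared growth.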

Since Netflix announces the shows coming in the following months and quarters at its Media Center, we can use this predictor together with other relevant predictors to build models that forecast the company’s profit. According to a friend of mine who works at a hedge fund, this is an approach hedge funds use to make predictions.
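A minimal version of such a model, assuming the release count is the only predictor and the relationship is linear, is an ordinary least squares fit. No model was built in this project, so the sketch below is purely illustrative: the numbers are placeholders, and it forecasts paid memberships rather than profit directly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Placeholder history (releases per quarter, paid memberships), for illustration only.
past_releases = [10, 20, 30, 40]
past_members = [50.0, 70.0, 90.0, 110.0]

intercept, slope = fit_line(past_releases, past_members)

# Hypothetical count of upcoming shows announced for the next quarter.
upcoming_releases = 50
forecast = intercept + slope * upcoming_releases
```

A real forecast would need more predictors and some out-of-sample validation before anyone relied on it.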

The other explorations:

Urban fantasy, political, thriller, and science fiction/thriller are the most popular genres among Netflix Originals. If upcoming releases include these popular genres, we can consider giving them greater weight in the prediction.

English dominates, followed by Spanish and Hindi. Judging by language, Spanish- and Hindi-speaking audiences may be the submarkets Netflix is investing in.

March, April, and May are the most productive months.
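Tallies like the ones above can be produced with simple frequency counts. The sketch below assumes each cleaned row carries a genre, a language, and a premiere date. Those field names and the sample rows are placeholders, not the project's data.

```python
from collections import Counter
from datetime import date

# Placeholder rows for illustration only, not scraped data.
shows = [
    {"genre": "Thriller", "language": "English", "premiere": date(2018, 3, 2)},
    {"genre": "Thriller", "language": "Spanish", "premiere": date(2018, 4, 13)},
    {"genre": "Science fiction", "language": "English", "premiere": date(2019, 3, 15)},
]

genre_counts = Counter(s["genre"] for s in shows)
language_counts = Counter(s["language"] for s in shows)
# Count premieres by calendar month number (1 = January), pooled across years.
month_counts = Counter(s["premiere"].month for s in shows)

top_genres = genre_counts.most_common(2)
busiest_month = month_counts.most_common(1)[0][0]
```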

In future work, I may look for the popular and productive months for different genres. For instance, animation and cartoons may be more prevalent during school vacations, and more of their reviews may come from parents.

Future Work

I would like to find more meaningful variables and test their correlation with paid membership, build a prediction model based on them, improve the web scraping code to capture more complete datasets, and do more data analysis.


About Author

Fred (Lefan) Cheng

Fred Cheng is a certified data scientist working as a data science consultant at Zenon. He holds a Master’s Degree in Management and Systems from New York University and a bachelor’s in business management from The...