Netflix: Scraping & Uncovering Predictors of Netflix Members

Posted on Feb 1, 2020
The skills I demonstrate here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction and Motivation

Netflix is a fantastic company. Its shares have risen steadily over the past ten years, growing more than 40-fold since 2012. A $1,000 investment made in January 2007 would have been worth more than $110,000 in April 2019.

As an industry leader in video streaming, Netflix invested a whopping $13 billion in streaming content in 2018, around 85% of its total spending. Consistently producing high-quality original content is one of the core capabilities that keeps Netflix ahead of its competition, and that investment has paid off. The motivation for this project was to look further into what contributes to Netflix’s phenomenal success by collecting and exploring the relevant data on Netflix Originals.

According to Netflix's report for the first quarter of 2019, published on its official website, revenue from paid memberships made up 98.22% of total revenue, and that share is much the same in every quarter. This indicates that Netflix's business model centers on a single revenue source. Being able to forecast paid memberships would therefore be highly valuable, since it can be a reliable indicator of Netflix's revenue and profit, which can directly influence the stock price.

Netflix Data Resources

The data were collected using Scrapy (Python) from three sources: IMDb (shows listing Netflix as a distributor), Wikipedia (the lists of Netflix original programming and films), and the Netflix Media Center (upcoming shows). The variables include Title, Genre, Premiere date of each season, Length, Language, Distribution, Number of reviews, Count of ratings, and Average rating, among others.

After cleaning and merging, I had 565 rows and ran a series of analyses, including the relationships between the independent and dependent variables.
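
As a rough illustration of the cleaning and merging step, here is a minimal sketch that joins records from different sources on a normalized title. The function names, field names, and sample rows are all hypothetical; the original project code may match shows differently.

```python
import re

def normalize_title(title):
    """Lowercase a title and collapse punctuation/whitespace so sources can be matched."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def merge_sources(*sources):
    """Merge lists of show records (dicts with a 'title' key) into one record per title.

    Earlier sources win: later sources only fill in fields that are still missing.
    """
    merged = {}
    for records in sources:
        for record in records:
            key = normalize_title(record["title"])
            target = merged.setdefault(key, {})
            for field, value in record.items():
                if value not in (None, "") and field not in target:
                    target[field] = value
    return list(merged.values())

# Made-up sample rows, only to show the shape of the data (not scraped values).
imdb_rows = [{"title": "Example Show", "avg_rating": 7.9, "rating_count": 1200}]
wiki_rows = [
    {"title": "Example Show ", "genre": "Thriller", "language": "English"},
    {"title": "Another Example", "genre": "Comedy", "language": "Spanish"},
]

shows = merge_sources(imdb_rows, wiki_rows)
```

Matching on a normalized title is a simplification: remakes or shows that share a name would need an extra key such as the premiere year.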

Netflix Key Findings

Quarterly paid memberships show a steady growth pattern that closely tracks quarterly revenue, indicating that paid membership does act as the primary driver of revenue, which supports my hypothesis.

Looking at the pairwise correlations between the independent and dependent variables, I found a positive and nearly linear relationship between the number of shows released in a quarter and the paid memberships in that quarter, which makes the release count a potentially strong predictor.
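
The correlation described above can be checked with a Pearson coefficient. The sketch below is a plain-Python version; the quarterly numbers are placeholders I made up to show the calculation, not the scraped data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Placeholder values: shows released per quarter and paid memberships (millions).
shows_released = [10, 12, 15, 19, 22]
paid_members_m = [100, 105, 109, 118, 125]

r = pearson_r(shows_released, paid_members_m)
```

A value of r close to 1 indicates a strong positive linear relationship. Because both series grow over time, a high r on its own does not show that releases cause membership growth.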


Since Netflix announces its upcoming shows for the following months and quarters at the media center, we can use this predictor together with other relevant predictors to build models that forecast the company’s profit. According to a friend of mine who works at a hedge fund, this is the kind of approach hedge funds use to make predictions.
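
To make the forecasting idea concrete, here is a minimal sketch of a one-variable least-squares model that predicts paid memberships from the number of announced shows. It is not the model from this project (no model was built yet), and the numbers are made-up placeholders.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Predicted y for a new x under the fitted line."""
    return intercept + slope * x

# Placeholder history: shows released per quarter and paid memberships (millions).
shows_released = [10, 12, 15, 19, 22]
paid_members_m = [100, 105, 109, 118, 125]

slope, intercept = fit_line(shows_released, paid_members_m)

# Hypothetical count of shows announced for the next quarter.
forecast = predict(slope, intercept, 25)
```

A real model would add other predictors and be validated on held-out quarters before anyone relied on the forecast.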

Other Netflix explorations:

Urban fantasy, political, thriller, and science fiction/thriller are the most popular genres among Netflix originals. If the upcoming shows include these popular genres, we can consider giving them greater weight in the prediction.
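
One simple way to give popular genres more weight is to count each upcoming show as more than one release when its genre is on the popular list. This is only a sketch of that idea; the weight of 1.5 is an arbitrary assumption, and the function name and sample list are hypothetical.

```python
# Genres named as most popular in the exploration above.
POPULAR_GENRES = {"urban fantasy", "political", "thriller", "science fiction/thriller"}

def weighted_show_count(genres, popular_weight=1.5):
    """Sum of per-show weights: popular genres count as popular_weight, others as 1."""
    total = 0.0
    for genre in genres:
        if genre.strip().lower() in POPULAR_GENRES:
            total += popular_weight
        else:
            total += 1.0
    return total

# Hypothetical genres of the shows announced for an upcoming quarter.
upcoming = ["Thriller", "Comedy", "Political", "Documentary"]
score = weighted_show_count(upcoming)
```

The weighted count could then replace the raw number of releases as the predictor, with the weight tuned against past quarters.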


English dominates, followed by Spanish and Hindi. Based on language, Spanish- and Hindi-speaking audiences may be the submarkets that Netflix is investing in.

March, April, and May are the most productive months.
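
A monthly tally like this can be produced by counting premiere dates per calendar month. The sketch below assumes the premiere dates have already been parsed into `date` objects; the sample dates are made up.

```python
from collections import Counter
from datetime import date

def releases_by_month(premieres):
    """Count premieres per calendar month (1 = January ... 12 = December)."""
    return Counter(d.month for d in premieres)

# Made-up premiere dates, pooled across years.
premieres = [
    date(2019, 3, 1),
    date(2019, 3, 15),
    date(2019, 4, 5),
    date(2018, 12, 14),
]

counts = releases_by_month(premieres)
busiest_month, busiest_count = counts.most_common(1)[0]
```

Grouping on `(genre, d.month)` instead of `d.month` alone would give the per-genre breakdown.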


In future work, I may look for the most popular and productive months for different genres. For instance, animation and cartoons may be more prevalent during school vacations, and more of their reviews may come from parents.


Future Work

I would like to find more meaningful variables, test their correlation with paid membership, and build a prediction model based on them. I would also improve the web scraping code to capture more complete datasets and carry out further data analysis.


About Author

Fred (Lefan) Cheng - 程乐帆

Fred Cheng is a certified data scientist who works as a data science consultant at Zenon. He holds a Master’s Degree in Management and Systems from New York University and a bachelor’s in business management from The...