Who's Behind IMDb's Ratings?

Stephen Shafer
Posted on Jun 11, 2018

Introduction

IMDb (Internet Movie Database) is one of the most frequented movie rating sites, and for many people its ratings dictate whether a movie is even worth watching. The rating system isn't based on hand-selected critics; it revolves around the site's users, and anyone can become one. These anonymous users hold great power over which movies are deemed must-watch and which are deemed not worth the bother. So the question is: who are the individuals behind these ratings? With this question in mind, I decided to scrape IMDb's demographic rating breakdown to look more closely at who's behind the scenes. The answer is below.

For more information on what I used for scraping and analysis, please visit my GitHub Repo.

 

Data Scraping

My scraping tool of choice for this project was Scrapy. Below are the criteria I used to populate the list of movies I worked with, as well as a sample of the Rating By Demographic breakdown that can be found with each movie.

 

Analysis

The first (and most obvious) thing I looked into was the breakdown of votes by gender. It is extremely clear that the majority of IMDb's users are male, which was an extremely surprising discovery for me.

 

Next I looked into how each demographic votes. I was able to break gender down further by including the user's age range as well. As you can see, the younger the users, the more likely they were to give a movie a higher rating. The users most likely to give a lower rating are men aged 45 and over. Please keep in mind that there were far fewer ratings by users under 18, which could be a major factor in why their average rating is higher. With more votes, the mean would likely sink a little.
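
One way to compute a per-demographic average like this is to weight each movie's group rating by the number of votes that group cast, and to report the vote total next to the mean so that thinly sampled groups stand out. The sketch below is only an illustration: the record layout (group, rating, votes) and the numbers are made up, and it is not the code or data from this project.

```python
from collections import defaultdict

# Hypothetical scraped rows: one per (movie, demographic group).
# Field names and numbers are illustrative, not taken from the project.
rows = [
    {"movie": "A", "group": "Males under 18", "rating": 8.0, "votes": 100},
    {"movie": "B", "group": "Males under 18", "rating": 7.0, "votes": 300},
    {"movie": "A", "group": "Males 45+", "rating": 6.0, "votes": 1000},
    {"movie": "B", "group": "Males 45+", "rating": 7.0, "votes": 3000},
]


def weighted_mean_by_group(rows):
    """Return {group: (vote-weighted mean rating, total votes)}."""
    rating_sum = defaultdict(float)  # sum of rating * votes per group
    vote_sum = defaultdict(int)      # sum of votes per group
    for row in rows:
        rating_sum[row["group"]] += row["rating"] * row["votes"]
        vote_sum[row["group"]] += row["votes"]
    # Skip groups with no votes to avoid dividing by zero.
    return {
        group: (rating_sum[group] / votes, votes)
        for group, votes in vote_sum.items()
        if votes > 0
    }


summary = weighted_mean_by_group(rows)
for group, (mean, votes) in summary.items():
    print(f"{group}: {mean:.2f} from {votes} votes")
```

Printing the vote total beside each mean makes it easy to see when a high average rests on comparatively few ratings, as with the under-18 group.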

To further demonstrate how ratings differ by user gender and age, below is a heatmap showing how each demographic rates each genre.
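
A heatmap of this kind needs a genre-by-demographic table of average ratings. The sketch below shows one way such a table could be built with the standard library. The tuple layout and values are hypothetical, and weighting by votes is my assumption rather than something stated above.

```python
from collections import defaultdict

# Hypothetical rows: (genre, demographic group, rating, votes).
records = [
    ("Drama", "Females 18-29", 7.5, 200),
    ("Drama", "Females 18-29", 6.5, 200),
    ("Drama", "Males 45+", 6.0, 500),
    ("Horror", "Females 18-29", 5.0, 100),
]


def pivot_mean_rating(records):
    """Build {genre: {group: vote-weighted mean rating}} for a heatmap."""
    # (genre, group) -> [sum of rating * votes, sum of votes]
    totals = defaultdict(lambda: [0.0, 0])
    for genre, group, rating, votes in records:
        cell = totals[(genre, group)]
        cell[0] += rating * votes
        cell[1] += votes

    table = defaultdict(dict)
    for (genre, group), (weighted, votes) in totals.items():
        if votes:  # leave the cell empty when a group cast no votes
            table[genre][group] = weighted / votes
    return {genre: dict(columns) for genre, columns in table.items()}


heat = pivot_mean_rating(records)
for genre, columns in heat.items():
    print(genre, columns)
```

Cells with no data are simply absent from the nested dictionary, so a plotting library can render them as blanks. The result can be handed to a heatmap function in a library such as seaborn after loading it into a pandas DataFrame.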

 

I also looked into how users have been voting by year. As we can see below, the divide between male and female users has not narrowed, even in recent years. We can also see that the number of ratings has dropped in the past few years. This may be because it takes a few years for a given year's movies to be widely seen, or perhaps because IMDb eliminated bots that had been inflating movie ratings.
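
Because total volume and the gender split can move independently, it helps to track the female share of votes alongside the raw counts for each year. The following is a minimal sketch that assumes votes are grouped by each movie's release year. The tuple layout and figures are hypothetical and are not the scraped data.

```python
from collections import defaultdict

# Hypothetical rows: (release year, male votes, female votes) per movie.
movies = [
    (2015, 9000, 3000),
    (2015, 3000, 1000),
    (2016, 6000, 2000),
]


def votes_by_year(movies):
    """Return {year: {"male": n, "female": n, "female_share": fraction}}."""
    totals = defaultdict(lambda: {"male": 0, "female": 0})
    for year, male, female in movies:
        totals[year]["male"] += male
        totals[year]["female"] += female

    result = {}
    for year in sorted(totals):
        male = totals[year]["male"]
        female = totals[year]["female"]
        total = male + female
        result[year] = {
            "male": male,
            "female": female,
            # Share is undefined for a year with no votes at all.
            "female_share": female / total if total else None,
        }
    return result


yearly = votes_by_year(movies)
for year, stats in yearly.items():
    print(year, stats)
```

In this made-up example the total number of votes halves from 2015 to 2016 while the female share stays at one quarter, which is the pattern described above: fewer ratings overall, with no change in the divide.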

 

Lastly, below is a graph showing the difference in ratings by Lead Actor. The actors were randomly selected.

As we can see, female users tend to rate higher on average than their male counterparts. However, certain lead actors (and movies) are exceptions.

 

With the information I gathered (much of which I didn't post here), I would like to look further into the written reviews left by users and use NLP to analyze how they differ. That's just one path. Another that interests me is analyzing which movies perform best based on lead actor, genre, and director. It would be extremely valuable to predict a movie's potential ROI from what I was able to scrape.

 

Conclusion 

IMDb is clearly dominated by male users, and it would be ideal if the site could somehow attract more female users to make its ratings less male-biased. All in all, I greatly enjoyed working on this project, as I'm a huge movie buff, and this was a nice peek inside one of the review sites I frequent most often. Thanks for reading!

About Author

Stephen Shafer

BS in Accounting with a concentration in Management Information Systems (MIS) at Binghamton University. Previous FinTech sales experience has allowed me to more clearly understand where true value lies in data, and how it can be directly translated...