Who's Behind IMDb's Ratings?

Posted on Jun 11, 2018

Introduction

IMDb (Internet Movie Database) is one of the most frequented movie rating sites, and for many people, its ratings dictate whether or not a movie is even worth watching. The rating system isn't based on hand-selected critics; it revolves around the site's users, and anyone can become one. These anonymous users hold great power over which movies are deemed a must-watch and which are dismissed as not worth bothering with. So the question is, WHO are the individuals behind these ratings? With this question in mind, I decided to scrape IMDb's demographic rating breakdown to look a little more closely at who's behind the scenes. The answer is below.

For more information on what I used for scraping and analysis, please visit my GitHub Repo.

 

Data Scraping

My scraping tool of choice for this project was Scrapy. Below are the criteria I used to populate the list of movies I worked with, as well as a sample of the Rating By Demographic breakdown that can be found with each movie.

 

Analysis

The first (and most obvious) thing I looked into was the breakdown of votes by gender. It is extremely clear that the majority of IMDb's users are male, which was an extremely surprising discovery for me.

 

Next I decided to look into how each demographic votes. I was able to break gender down even further by including the user's age range. As you can see, the younger the user, the more likely they were to give a movie a higher rating. The users most likely to give a lower rating are men aged 45 and over. Please keep in mind that there were far fewer ratings by users under 18, which could be a major factor in why their average rating is higher. With more votes, the mean would likely sink a little.
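The caveat about small vote counts can be made concrete with a rough sketch. It first computes a vote-weighted mean rating for a demographic, then applies a simple shrinkage toward the overall mean, a common way to temper averages built on few votes. This is not the method used in the analysis above: the function names, the `prior_votes` parameter, and every number are hypothetical and for illustration only.

```python
def vote_weighted_mean(rows):
    """Mean rating across movies, weighting each movie by its vote count.

    rows is a list of (rating, votes) pairs for one demographic.
    """
    total_votes = sum(votes for _, votes in rows)
    if total_votes == 0:
        raise ValueError("cannot average a demographic with no votes")
    return sum(rating * votes for rating, votes in rows) / total_votes


def shrunk_mean(group_mean, group_votes, overall_mean, prior_votes):
    """Pull a group's mean toward the overall mean.

    Groups with few votes relative to prior_votes move the most.
    """
    weight = group_votes / (group_votes + prior_votes)
    return weight * group_mean + (1 - weight) * overall_mean


# Hypothetical (rating, votes) pairs, NOT scraped data.
under_18 = [(8.0, 50), (9.0, 150)]
males_45_plus = [(6.0, 4000), (7.0, 6000)]

OVERALL_MEAN = 7.0   # assumed site-wide mean rating
PRIOR_VOTES = 1000   # assumed strength of the pull toward the overall mean

for label, rows in [("Under 18", under_18), ("Males 45+", males_45_plus)]:
    raw = vote_weighted_mean(rows)
    votes = sum(v for _, v in rows)
    adjusted = shrunk_mean(raw, votes, OVERALL_MEAN, PRIOR_VOTES)
    print(f"{label}: raw={raw:.2f} votes={votes} adjusted={adjusted:.2f}")
```

With these made-up numbers, the under-18 average drops noticeably after shrinkage while the 45+ male average barely moves, which is the effect described above: an average backed by few votes is less stable than one backed by many.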

To further demonstrate how ratings differ by user gender and age, below is a heatmap showing how each demographic rates each genre.
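For readers curious how the data behind such a heatmap might be arranged, here is a minimal sketch that pivots (genre, demographic, rating) rows into a genre-by-demographic grid of mean ratings. The function name, the demographic labels, and the sample rows are assumptions for illustration, not the scraped dataset or the code used for the plot.

```python
from collections import defaultdict
from statistics import mean


def genre_demographic_matrix(rows):
    """Pivot (genre, demographic, rating) rows into a grid of mean ratings.

    Returns (genres, demographics, matrix), where matrix[i][j] is the mean
    rating for genres[i] and demographics[j], or None if no rows exist.
    """
    cells = defaultdict(list)
    for genre, demographic, rating in rows:
        cells[(genre, demographic)].append(rating)

    genres = sorted({g for g, _ in cells})
    demographics = sorted({d for _, d in cells})
    matrix = [
        [mean(cells[(g, d)]) if (g, d) in cells else None for d in demographics]
        for g in genres
    ]
    return genres, demographics, matrix


# Hypothetical rows, NOT scraped data.
sample_rows = [
    ("Drama", "Females 18-29", 8.0),
    ("Drama", "Females 18-29", 7.0),
    ("Drama", "Males 45+", 6.5),
    ("Horror", "Males 45+", 5.5),
]

genres, demographics, matrix = genre_demographic_matrix(sample_rows)
for genre, row in zip(genres, matrix):
    print(genre, dict(zip(demographics, row)))
```

A grid in this shape can be passed to most plotting libraries' heatmap functions once the `None` cells are handled, for example by masking them.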

 

I also looked into how users have been voting by year. As we can see below, the divide between male and female users has not narrowed, even in recent years. We can also see that the number of ratings has diminished in the past few years. This may be because it takes a few years for audiences to catch up on a given year's movies, or perhaps because IMDb eliminated bots that may have been responsible for inflating movie ratings.

 

Lastly, below is a graph analyzing the difference in rating based on who the Lead Actor is. These were randomly selected.

As we can see, on average, female users tend to rate higher than their male counterparts. However, certain lead actors (and movies) are exceptions.
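One way to surface those exceptions is to compute, for each lead actor, the average gap between the female and male ratings of their movies and flag the actors where the gap is negative. The sketch below assumes each record carries hypothetical `lead_actor`, `female_rating`, and `male_rating` fields. The actor names and numbers are placeholders, not results from the scrape.

```python
from collections import defaultdict
from statistics import mean


def rating_gap_by_actor(records):
    """Map each lead actor to the mean (female rating - male rating) gap.

    A positive gap means female users rated that actor's movies higher.
    """
    gaps = defaultdict(list)
    for rec in records:
        gaps[rec["lead_actor"]].append(rec["female_rating"] - rec["male_rating"])
    return {actor: mean(values) for actor, values in gaps.items()}


def male_leaning_actors(gap_by_actor):
    """Actors whose movies male users rated higher on average."""
    return sorted(actor for actor, gap in gap_by_actor.items() if gap < 0)


# Hypothetical records, NOT scraped data.
sample_records = [
    {"lead_actor": "Actor A", "female_rating": 7.5, "male_rating": 7.0},
    {"lead_actor": "Actor A", "female_rating": 8.0, "male_rating": 7.0},
    {"lead_actor": "Actor B", "female_rating": 6.0, "male_rating": 7.0},
]

gap_by_actor = rating_gap_by_actor(sample_records)
print(gap_by_actor)
print(male_leaning_actors(gap_by_actor))
```

Averaging the per-movie gap, rather than subtracting two overall averages, keeps each movie's female and male ratings paired, so an actor with one unusually rated movie does not distort the comparison as much.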

 

With the information that I gathered (much of which I didn't post here), I would like to look further into the actual reviews left by each user and use NLP to analyze the difference. That's just one path. Another one that interests me is analyzing what movies perform best based on lead actor, genre, and director. It would be extremely valuable to predict a movie's potential ROI based on what I was able to scrape.

 

Conclusion

IMDb is clearly dominated by male users, and it would be ideal if the site could somehow attract more female users to make its ratings less male-biased. All in all, I greatly enjoyed working on this project as I'm a huge movie buff, and this was a nice peek inside one of the review sites I frequent most often. Thanks for reading!

About Author

Stephen Shafer

BS in Accounting with a concentration in Management Information Systems (MIS) at Binghamton University. Previous FinTech sales experience has allowed me to more clearly understand where true value lies in data, and how it can be directly translated...