Predict Movie Rating

Chuan Sun

Posted on Aug 22, 2016

(The dataset described in this post is currently publicly available on Kaggle.)

1 Question

How can we tell how great a movie is before it is released in cinemas? This question puzzled me for a long time, since there is no universal way to judge the quality of a movie. Many people rely on critics to gauge the quality of a film, while others use their instincts. But it takes time to gather a reasonable number of critic reviews after a movie is released, and human instinct is sometimes unreliable.

Given that thousands of movies are produced each year, is there a better way to tell how great a movie is without relying on critics or our own instincts?

2 Scraping 5000+ movies from IMDB

The tool I used for scraping all 5000+ movies is a Python library called "scrapy". Below are the steps in brief. The source code and documentation can be found on the GitHub page here.

Use scrapy in Python to obtain a list of 5043 movie titles from "the-numbers" website. (code)
Save the titles into a JSON file.
Search for those titles on the IMDB website to get the real IMDB movie links. (code)
Send an HTTP request to each movie page using the links, then scrape the page and get all data. (code)
Perform face detection on all posters to get the number of faces. (python code)
Parse the aggregated data, clean it, and reformat it into a CSV file. (python code)
The final CSV file can be found here.
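The last step, turning the aggregated JSON into a CSV file, can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code from the repository: the field list, the helper names `clean_record` and `json_to_csv`, and the sample record are all made up for the example.

```python
import csv
import io
import json

# Hypothetical subset of columns; the real scraped JSON may use other keys.
FIELDS = ["movie_title", "director_name", "duration", "imdb_score"]


def clean_record(raw):
    """Return a flat row: strings are trimmed, missing fields become ''."""
    row = {}
    for field in FIELDS:
        value = raw.get(field)
        if isinstance(value, str):
            value = value.strip()
        row[field] = "" if value is None else value
    return row


def json_to_csv(json_text):
    """Convert a JSON array of movie records into CSV text."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for raw in records:
        writer.writerow(clean_record(raw))
    return buffer.getvalue()


# Made-up record, only to show the shape of the input and output.
sample = '[{"movie_title": " Example Film ", "director_name": "Jane Doe", "duration": 120}]'
csv_text = json_to_csv(sample)
print(csv_text)
```

Missing values are written as empty cells rather than zeros, so that a later analysis can tell "not scraped" apart from a true zero.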

Many important pieces of movie information were considered and scraped from the IMDB website, for example, movie title, director name, cast list, and genres.

The scraping process took 2 hours to finish. Scraping the movie posters took a little longer than scraping pure text data. In the end, I was able to obtain all needed variables for 5043 movies and 4096 posters. Overall, they span 100 years and 66 countries. There are 2399 unique director names and 30K actors/actresses.

The image below shows all 28 variables that I scraped. Roughly speaking, half of the variables are directly related to the movies themselves, such as title, year, and duration. The other half are related to the people involved in the production of the movies, e.g., director names, director facebook popularity, movie rating from critics, etc.

3 Face detection from movie posters

I am especially interested in knowing the answer to this question: Does the number of human faces in a movie poster correlate with the movie rating?

A movie poster is an important way to make the public aware of a movie before its release. It is quite common to see faces in movie posters. It should be pointed out that most movies have more than one poster, so some may argue it is unreliable to detect faces from only one poster. That is indeed true. However, just as a great book usually has a single cover, I believe a great movie needs to have a "main" poster, the one that the director likes most or that viewers remember longest. I have no way to tell which posters are the "main" posters, so I assume the poster that I scraped from the IMDB main page of a movie is the "main" poster.

Below are the movie posters of 8 great movies (IMDB rating scores above 7.5). Each has only one human face.

Below are the movie posters of 8 movies that are not so great in terms of IMDB rating score (below 5). They tend to have many faces.

It should be pointed out that it is unfair to rate a movie solely on the number of human faces in its poster, because there are great movies whose posters have many faces. For example, the poster of the movie "(500) Days of Summer" has 43 faces, all from the same actress.

But based on my findings, it is uncommon for a movie to have a large number of faces (> 10) in its poster and also be a great movie.

Interestingly, my face detection algorithm failed on many posters, such as:

Overall, nearly 95% of the 4096 posters have fewer than 5 faces. In addition:

Great movies tend to have fewer faces in their posters
If a poster has one or no human faces, we cannot tell whether the movie is great from the poster alone
If a poster has more than 5 faces, the likelihood of the movie being great is low
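One way to check observations like these is to group movies by face count and compare the share of great movies in each group. The sketch below is only an illustration: the bucket boundaries, the 7.5 cutoff for "great" (borrowed from the poster examples above), the function names, and the toy data are assumptions, not the analysis actually run for this post.

```python
def share_below(face_counts, threshold):
    """Fraction of posters whose detected face count is below `threshold`."""
    return sum(1 for n in face_counts if n < threshold) / len(face_counts)


def bucket(faces):
    """Assign a face count to one of three illustrative buckets."""
    if faces <= 1:
        return "0-1"
    if faces <= 5:
        return "2-5"
    return ">5"


def great_rate_by_bucket(movies, great_cutoff=7.5):
    """Share of movies scoring above `great_cutoff` within each bucket.

    `movies` is an iterable of (facenumber_in_poster, imdb_score) pairs.
    """
    totals, greats = {}, {}
    for faces, score in movies:
        key = bucket(faces)
        totals[key] = totals.get(key, 0) + 1
        if score > great_cutoff:
            greats[key] = greats.get(key, 0) + 1
    return {key: greats.get(key, 0) / totals[key] for key in totals}


# Toy (faces, score) pairs, invented purely for illustration.
toy = [(1, 8.1), (0, 6.0), (1, 7.9), (3, 7.6), (4, 5.5), (7, 4.8), (9, 6.2)]
rates = great_rate_by_bucket(toy)
few_faces_share = share_below([faces for faces, _ in toy], 5)
```

On real data, the rate in the ">5" bucket would be expected to sit well below the others if the third observation holds.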

4 IMDB rating score

Out of the 28 variables, I am especially interested in knowing how the IMDB rating score correlates with the other variables. From the 3D gross-country-rating plot below, we can see that the United States produced the largest number of movies over the past 100 years (1905-2015), dwarfing every other country. The points at the top corner of the plot denote the movies with the highest gross in movie history. Many countries produced great movies, but there were still quite a few bad ones.

Movies with a rating higher than 8.0 are listed in the IMDB top 250, and they are truly great movies from many perspectives. Movies with a rating from 7.0 to 8.0 are probably still good movies. Viewers can gain something from them. Movies with a rating from 1 to 5 are sometimes considered ones that "suck" in one way or another. One should avoid those movies unless one has to. Life is short.
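The rating bands described above could be encoded as a small helper, for example to colour points in a plot. The band labels and the function name are hypothetical, and the post does not name the range between 5 and 7, so the sketch simply marks it "unlabeled".

```python
def rating_band(score):
    """Map an IMDB score to the informal bands used in this post."""
    if not 1.0 <= score <= 10.0:
        raise ValueError("IMDB scores are expected to lie between 1 and 10")
    if score > 8.0:
        return "great"      # above 8.0
    if score >= 7.0:
        return "good"       # from 7.0 to 8.0
    if score <= 5.0:
        return "avoid"      # from 1 to 5
    return "unlabeled"      # between 5 and 7: not named in the post


bands = [rating_band(s) for s in (8.5, 7.4, 6.1, 4.2)]
```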

4.1 IMDB score VS country

The USA and UK are the two countries that produced the most movies in the past century, including a large number of bad movies. The median IMDB scores for both the USA and UK are, however, not the highest among all countries. Some developing countries, such as Libya, Iran, Brazil, and Afghanistan, produced a small number of movies with high median IMDB scores.

4.2 IMDB score VS movie year

It seems that the number of movies produced annually has increased greatly since 1960. This is understandable, since the development of the film industry goes hand in hand with the development of science and technology. But we should be aware that, along with the boom of the movie industry since 2000, there are many movies with low IMDB scores.

4.3 IMDB score VS movie facebook popularity

Social networks are a good way to estimate the popularity of certain phenomena, so it is interesting to know how the IMDB score correlates with a movie's popularity on social networks. From the scatter plot below, we can see that, overall, the movies with very high facebook likes tend to be the ones with IMDB scores around 8.0. As we know, movies with IMDB scores higher than 8.0 are considered the greatest movies in the IMDB top 250 list. It is interesting to see that those greatest movies do not have the highest facebook popularity.

I highlighted several movies to illustrate this finding. The movies "Mad Max" and "Batman vs Superman" both have very high facebook likes, but their IMDB scores are slightly above 8.0. The movie "The Godfather" is deemed one of the greatest movies, but its facebook popularity is hugely dwarfed by that of "Interstellar".

4.4 IMDB score VS director facebook popularity

It is plausible that the greatness of a movie is highly affected by its director. How do IMDB scores compare with director facebook popularity? From the plot below, it can be seen that directors of movies rated higher than 6.0 tend to have more facebook popularity than directors of movies rated lower than 6.0. I also listed the four directors with the highest facebook popularity (Christopher Nolan, David Fincher, Martin Scorsese, and Quentin Tarantino), along with their four representative movies.

4.5 IMDB score VS top 3 actors/actresses facebook popularity

Great actors/actresses make a movie great. They are the souls of movies. What does their facebook popularity look like?

For a given movie, I scraped all the available cast members on the IMDB movie page. After retrieving the number of facebook likes for all cast members, I ranked the numbers in descending order and picked the top 3 actors/actresses. This is based on a simple assumption: a leading actor/actress tends to have more facebook popularity than a supporting actor/actress, and no matter how great a movie is, there will be no more than 2 leading actors/actresses. For notation purposes, I named the facebook popularity of the top 3 actors/actresses "actor_1_facebook_likes", "actor_2_facebook_likes", and "actor_3_facebook_likes". Note also that the variable "cast_total_facebook_likes" is calculated by summing the facebook popularity of all the available cast members.

The assumption indeed matches the plotted graph below. The first actor/actress has the highest facebook popularity, while the second and third have much lower popularity. But the graph also shows that high facebook popularity of the leading actor/actress does not mean that a movie has a high rating.
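The ranking step described above can be sketched as follows. This is an illustrative reconstruction rather than the repository code: the function name, the input shape (a name-to-likes dictionary), the zero padding for casts with fewer than three members, and the sample numbers are all assumptions.

```python
def cast_like_features(cast_likes):
    """Derive the four cast popularity variables from per-member likes.

    `cast_likes` maps a cast member's name to a facebook like count.
    """
    ranked = sorted(cast_likes.values(), reverse=True)
    # Pad with zeros so casts with fewer than three members still work.
    top = (ranked + [0, 0, 0])[:3]
    return {
        "actor_1_facebook_likes": top[0],
        "actor_2_facebook_likes": top[1],
        "actor_3_facebook_likes": top[2],
        "cast_total_facebook_likes": sum(ranked),
    }


# Invented like counts for a four-person cast.
features = cast_like_features({"A": 1000, "B": 25000, "C": 300, "D": 12000})
```

By construction the total can never be smaller than "actor_1_facebook_likes", which makes a handy sanity check on the scraped data.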

5 Movie rating prediction

The prediction of movie ratings in this article is based on the following assumptions:

The IMDB score reflects the greatness of movies. The higher, the better.
Watching good movies is preferable to bad ones for many people.

With those 28 variables available for all scraped movies, can we predict movie rating? Before we begin, it is necessary to investigate the correlations among those variables.

5.1 Correlation analysis

Choosing 15 continuous variables, I plotted the correlation matrix below. Note that "imdb_score" in the matrix denotes the IMDB rating score of a movie. The matrix reveals that:

The "cast_total_facebook_likes" has a strong positive correlation with "actor_1_facebook_likes", and a smaller positive correlation with both "actor_2_facebook_likes" and "actor_3_facebook_likes"
The "movie_facebook_likes" has a strong correlation with "num_critic_for_reviews", suggesting that the popularity of a movie on social networks can be largely affected by critics
The "movie_facebook_likes" has a relatively large correlation with "num_voted_users"
The movie "gross" has a strong positive correlation with "num_voted_users"

Surprisingly, there are some pairwise correlations that are perhaps counter-intuitive:

The "imdb_score" has a very small but positive correlation with "director_facebook_likes", meaning that a popular director does not necessarily make a great movie.
The "imdb_score" has a very small but positive correlation with "actor_1_facebook_likes", meaning that a leading actor who is popular on social networks does not guarantee a highly rated movie. The same holds for supporting actors.
The "imdb_score" has a small but positive correlation with "duration". Long movies tend to have high ratings.
The "imdb_score" has a small but negative correlation with "facenumber_in_poster". It is perhaps not a good idea to have many faces in a movie poster if a movie wants to be great.
The "imdb_score" has almost no correlation with "budget". Throwing money at a movie will not necessarily make it great.
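Each cell of a correlation matrix like the one above is a pairwise correlation coefficient. Assuming the Pearson coefficient was used (the post does not say which), a single cell can be computed with a few lines of standard-library Python. The function name and the toy values below are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values: more faces in the poster paired with lower scores.
faces = [0, 1, 2, 6, 9]
scores = [7.8, 8.0, 6.9, 5.5, 5.1]
r = pearson(faces, scores)  # negative for this toy sample
```

A value near zero, as reported for "budget", only rules out a linear relationship; it does not show that the variable is useless to a non-linear model.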

5.2 Dimensionality reduction

The three-dimensional PCA plot shown below reveals more information than the correlation matrix. For the 15 continuous variables, we can see their relationship with the three principal components in space. The colorful points denote all the movies. We can see that some variable vectors cluster and point in similar directions, meaning that there is multicollinearity between some pairs of the 15 variables. This may lead to problems when we fit a linear regression model to predict movie rating.

5.3 Multiple linear regression

Although I initially scraped 28 variables from the IMDB website, many of them are not suitable for predicting movie rating. I will therefore select only several critical variables.

Both the correlation matrix and the 3D PCA plot show that multicollinearity exists among the 15 continuous variables. When fitting a multiple linear regression model to predict movie rating, we need to remove some variables to reduce multicollinearity. Therefore, I remove the following variables: "gross", "cast_total_facebook_likes", "num_critic_for_reviews", "num_voted_users", and "movie_facebook_likes". Some of these, such as "num_voted_users" and "movie_facebook_likes", are also unsuitable for prediction because their values are unavailable before a movie is released.

The plot of the fitted multiple linear regression is illustrated below. From the "Normal Q-Q" plot, we find that the normality assumption of regression is somewhat violated.

Thus, I apply the Box-Cox transformation and refit the model. Although the model became harder to interpret, the assumptions of multiple linear regression, namely no multicollinearity, normality, constant variability, and independence, are well satisfied.

From the detailed information of the fitted model, we find that the model is significant, since the p-value (2.2e-16) is very small. The "title_year" and "facenumber_in_poster" variables have negative weights. The "actor_3_facebook_likes" variable was not included in the model at all, meaning that the social network popularity of the third actor in the cast is not significant for predicting the movie rating. This model has a multiple R-squared of 0.201, meaning that around 20% of the variability can be explained by this model.
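For reference, the Box-Cox transformation maps a positive response y to (y^λ - 1) / λ, or to log(y) when λ = 0. The sketch below shows the transform and its inverse in plain Python; the function names are illustrative, and λ = 2 is an arbitrary value for the example, since the post does not report the λ that was actually chosen.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a single positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam


def inverse_box_cox(z, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1) ** (1 / lam)


# Transform a few toy ratings with an assumed lambda of 2.
lam = 2.0
ratings = [3.0, 6.5, 8.1]
transformed = [box_cox(y, lam) for y in ratings]
restored = [inverse_box_cox(z, lam) for z in transformed]
```

Predictions from a model fitted on the transformed response have to be passed through the inverse transform before they can be read as IMDB scores, which is part of why the refitted model is harder to interpret.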
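The multiple R-squared quoted above is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that calculation, with an illustrative function name and made-up numbers, looks like this:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy ratings and predictions, chosen so the arithmetic is easy to follow:
# SS_residual = 0.25 + 0 + 0.25 = 0.5, SS_total = 1 + 0 + 1 = 2.
actual = [6.0, 7.0, 8.0]
predicted = [6.5, 7.0, 7.5]
score = r_squared(actual, predicted)  # 1 - 0.5 / 2 = 0.75
```

An R-squared of 0.201 therefore means the residual sum of squares is still about 80% of the total, so most of the variation in ratings is left unexplained by the linear model.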

5.4 Random Forest regression

A Random Forest model was fitted to predict movie rating using the following variables, with "imdb_score" as the response:

imdb_score
director_facebook_likes
duration
actor_1_facebook_likes
actor_2_facebook_likes
actor_3_facebook_likes
facenumber_in_poster
budget

The movie dataset was divided into two parts: 80% of the movies were treated as the training set, and the remaining 20% formed the testing set. Up to 4000 trees were generated to fit the random forest. The number of variables tried at each split of a decision tree is 2. The mean of squared residuals is 0.89023, and the percentage of variance explained is 27.21%, better than that of the multiple linear regression.

From the fitted random forest model, the variable importance can be seen in the graph below. It is interesting to see that duration is the most important variable, followed by the budget and the director facebook popularity. Unlike in the multiple linear regression model above, "actor_3_facebook_likes" is considered an important variable, even slightly more important than "actor_1_facebook_likes".
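The 80/20 split and the "variance explained" figure can be illustrated with a short standard-library sketch. The helper names, the seed, and the stand-in rows are assumptions, and the formula 1 - MSE / Var(y) is the usual definition of this statistic rather than something confirmed by the post.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of `rows` and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def variance_explained(mse, target_variance):
    """Fraction of the target's variance accounted for by the model."""
    return 1 - mse / target_variance


# Stand-in rows; in practice each row would be one movie's variables.
rows = list(range(1000))
train, test = train_test_split(rows)

# Back-of-the-envelope check: an MSE of 0.89023 and 27.21% explained
# would imply a rating variance of roughly 0.89023 / 0.7279, about 1.22.
implied_variance = 0.89023 / (1 - 0.2721)
check = variance_explained(0.89023, implied_variance)
```

Fixing the seed keeps the split reproducible, so that the linear and random forest models can be compared on the same held-out movies.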

6 Insights

Since the fitted Random Forest model explains more variability than the multiple linear regression, I will use the results from the Random Forest to summarize the insights found so far:

The most important factor affecting movie rating is the duration. The longer the movie is, the higher the rating tends to be.
Budget is important, although there is no strong correlation between budget and movie rating.
The facebook popularity of the director is an important factor affecting a movie's rating.
The facebook popularity of the top 3 actors/actresses is important.
The number of faces in the movie poster has a non-negligible effect on the movie rating.

About Author

Chuan Sun

Chuan is interested in uncovering the relationship of things. He likes to seek order from chaos. Previously, he worked on an unannounced project at Amazon Seattle as a software engineer. The project is related to machine learning and...

Chandima January 13, 2018

Hi Chuan Sun, This is an awesome article. I'm doing the same kind of research for my masters. If you can share the dataset you used, it will be a great help for my research. Could you please help me? Thanks,

“Predicting the Oscar Best Picture Winner: Part 1” – Sam Veverka October 18, 2017

[…] Movies make a great research topic in the age of the internet. It turns out a lot of people love movies, so there are multitudes of sites that track the earnings, critical reception, technical details, etc. of films. However, it can still be cumbersome to gather a good data set, as the information is usually not readily accessible. Sites like IMDB secure their official api behind a paywall. One can get around official APIs by using scrapers. I used several scrapers for this project. For much of the movie details, I used a lightly adapted Python script written by Chuan Sun detailed at: https://nycdatascience.edu/blog/student-works/machine-learning/movie-rating-prediction/. […]

Fernando GP April 17, 2017

Amazing analysis, Chuan! I am intrigued by one factor you mention in your analysis, and it is the fact that the movies that have a longer duration seem to have a higher IMDB score. Nonetheless, I presume there must be a threshold where the score drops, isn't it? Did you plot this somewhere? I was particularly intrigued by this assumption. Anyhow, great job of feature engineering and Data Analysis. Best, Fernando

nidhi teli March 24, 2017

Hello sir, I am a final BE student from india and i am currently working on final year project on movie recommendation. We have aimed to recommend movies to users based upon the genres of the movie, but the problem is that we are not getting standard dataset for it. We need dataset having movie information along with the feature values which is rating for movies based on its genre. For ex. X is the movie with comedy rating of 4.5 , action rating 3, drama rating 2.5, Horror rating 2,etc . And since you have experience i thought of asking for the same here . So if possible can you suggest me some datasets with such data? Nidhi

Jack G January 14, 2017

Was this dataset legally obtained? The IMDB terms of service prohibit scraping.

Samuel Rosenblatt January 11, 2017

Dear Chuan Sun and others, I have noticed that between 2 and 20 percent of the observations in this dataset are incorrect due to an assumption made while scraping, which may be fairly easy to fix. From this https://github.com/sundeepblue/movie_rating_prediction I learned that when getting the IMDB URL to scrape, the code simply searches IMDB for the name of the desired movie and takes the URL of the first result. However, this first result is not always the correct page. At least 2% and as many as 20% of observations had their data scraped from pages for the TV, video game, or fan-made versions of the desired movie, or from a different page entirely with a similar name, and thus do not represent the desired observation. For example, "Shaun the Sheep" http://www.imdb.com/title/tt0983983/?ref_=fn_tt_tt_1 . Further examples are fairly easy to find if you sort the dataset by duration or rating. I think there may be an easy fix for many of these if we could do some sort of try-catch with the IMDB URL to make sure we do in fact get a movie. However, this would require somewhat significant restructuring of the scraping methods. Is anyone interested in helping me do this?

SONG JIANG December 12, 2016

1. In the code of parsed_scraped_data.py:

    def parse_facebook_likes_number(num_likes_string):
        # eg: "8.5K" --> "85000"
        if not num_likes_string:
            return 0
        size = len(num_likes_string)
        if num_likes_string[-1] == 'K' and num_likes_string[ : size-1].isdigit():
            return int(float(num_likes_string[ : size - 1]) * 1000)
        elif num_likes_string.isdigit():
            return int(num_likes_string)
        else:
            return 0

If the fb_likes_number is "3.5K", the result will be 0, because "3.5".isdigit() is False. This is why there are many 0s in director_fb_likes, actor 1, 2, 3 fb_likes, and movie_fb_likes. It is also why cast_total_fb_likes is sometimes smaller than the fb likes of the top 1 actor/actress.

2. I think choosing the three actors or actresses who have the most fb_likes is not reasonable; we should choose the top 3 main actors/actresses. Someone may have the highest fb likes but be only slightly involved in the film. If we choose the top 3 main actors/actresses, then the code in imdb_spider.py should be modified a little:

    # combine the two lists
    cast_name_link_pairs = pairs_for_odd_rows + pairs_for_even_rows

In your code, this will upset the order of the main actors/actresses.

SONG JIANG December 12, 2016

So, I think this may make actor 1's facebook likes look less important in the final conclusion. The cast total facebook likes are calculated by adding the fb likes of all the actors and actresses on the film's main page. But actually, the sum of only the top 3's fb likes is bigger than that number, so this one is also incorrect.

SONG JIANG December 12, 2016

But really appreciate!

SONG JIANG December 12, 2016

Found many faults. For example, in Avatar, actors 1, 2, and 3 are incorrect, so their facebook likes are incorrect. And the director's facebook likes should not be 0.

IMDB Dataset – R and Beyond December 7, 2016

[…] This data set was posted on Kaggle. The entire process of data acquisition and cleaning can be found here. […]

Chuan Sun November 23, 2016

Hi Quinton, Thank you very much for your suggestion! You are right. When I parsed the currency, I didn't take the Korean currency into consideration due to limited timespan. I will post your valuable comment into the Kaggle dataset page such that other users will be aware of it. Thanks!

Quinton Huffman November 23, 2016

Hi Chuan Sun, We actually used your IMDB dataset for an Advanced Data Mining class at Rockhurst University in Kansas City, MO. We love the data set and we really appreciate the time it took to create it. However, we believe we found a small flaw in the data. Not all of the IMDB movie budget numbers are in US dollars. For example, the South Korean movie "The Host" has its budget numbers in S. Korean Won (Korean currency), but there is no data in the dataset that tells you the currency. The existence of foreign currencies skews the budget data for foreign films, particularly for currencies with extreme exchange rates when compared to USD. For instance, many could assume the data set shows "The Host" cost $12 billion to make when it truthfully cost only 12 billion Won, but the dataset doesn't make the distinction. It is not just an issue with Korean movies; we found Turkish and Japanese movies with the same issue. Quinton

Hassan November 17, 2016

Hello Chuan, This is amazing stuff you have here! It is really helping me understand how Machine Learning works practically. There is one thing, however, that I am unsure about. In this article you are using Random Forest Regression and Multiple Linear Regression. However, according to the code on github (https://github.com/sundeepblue/movie_rating_prediction/blob/master/movie_rating_prediction.r) you are using Ridge Regression and Lasso Regression. Basically, my question is: can you please help me understand which learning algorithm(s) you are using in the code? Thank you! Hassan

Ariella Katz October 17, 2016

Thanks for the reply. Interesting to think about. Best, Ariella

Chuan Sun October 15, 2016

Hi Ariella, Thank you for your interest in the project! I totally agree with you that using solely IMDB's facebook likes seems not enough. But there were several practical considerations: 1. It is not easy to collect the verified movie pages on Facebook for all 5000 movies. In particular, many unofficial pages were created by fans. 2. The like counts of some extremely popular movies like "The Godfather" could largely dwarf those of not so popular but still great movies. Getting unbiased facebook likes for movies is very challenging. 3. If we follow this direction, then how about all the directors/actors/actresses on Facebook? Using only IMDB's facebook likes is simple and easy to implement, but still meaningful. It gives us a baseline to begin with. If in the future we find it is really necessary to dig deeper into the precise relationship between social networks and movie popularity, we can use a more systematic approach, such as graph analysis, i.e., building a big graph consisting of nodes (directors, actors/actresses, and movies) and edges (if actor A shows up in movie B, then we have an edge from A to B). Many insights could be distilled from there. Thanks!

Ariella Katz October 14, 2016

*edit - Meant to type The Godfather (1972), not '74.

Ariella Katz October 14, 2016

Hi Chuan, Excellent work. One issue I found, however, is with the Facebook analysis - using IMDB's Facebook likes is flawed, since they only count likes that have been "thumbed up" through the link on their website. This excludes any Facebook likes made outside of the IMDB links. For example, let's look at the number of likes for The Godfather (1974) - on IMDB it is showing 44k likes, but on Facebook's verified movie page, it is over 9.2 million likes. https://www.facebook.com/thegodfather/ Just something to think about :) Best Regards, Ariella Katz Student

Karthik September 16, 2016

Thanks Chuan !

Chuan Sun September 15, 2016

Thank you! The dataset is under "Open Database License". Visit here for both the dataset and the licence information: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

Karthik September 15, 2016

Hi Chuan Sun, A brilliant piece of analysis and a lot of creative feature engineering. The dataset provided is very rich in information. Currently, I am writing a book on Machine Learning using R. Does this dataset have any copyright or licensing terms if I would like to use the dataset and cite your name in the reference? Please let me know. You can contact me on my email.

Chuan Sun August 23, 2016

I mostly used "plotly" to generate the graph.

Flexic August 23, 2016

What tool is used to draw the figures in your post? Thanks
