Data Study on The Meaning Behind a Movie Recommendation

Posted on Aug 13, 2018

Background Data

Over the last two decades, data shows that the accessibility of media has increased dramatically. One of the main reasons for this surge in available content is the evolution of online streaming platforms. Services like Netflix, Hulu, Amazon Prime, and others give millions of consumers access to a seemingly countless number of movies and television shows.

With so many choices now available, it is more important than ever to have effective ways to filter all the possible titles and find options that best match your interests. Many streaming services have implemented functionality to suggest titles of interest to their users. Without these guiding tools, deciding what to watch next can seem like a monstrous undertaking.


The Goal

In my experience, the “you may also like” style suggestions based on algorithms work well. However, I find that when I’m searching for something to watch I want to know more than just whether or not I will like it. What kind of movie is it? What parts of the movie were done better than others? In short, I want to know what aspects of the movie suggest that I will enjoy it. This information is not provided by the suggestion functionality of the streaming platforms I have encountered. So I thought of a different way to approach the problem.

Movie critics work hard to produce reviews that help the general audience decide whether a movie is worth watching. A critic's review can provide a much more detailed look into a movie’s worth than an algorithm that simply suggests “watch this.” Still, there is the issue of individual critics' personal taste and preference to consider.

Although general trends usually do arise, movie reviews can be highly subjective. How do I know I can trust a particular critic? Do they value the same qualities in a movie that I do? These uncertainties can compromise the validity of a review on a person-to-person basis. A critic’s reviews would carry more weight if I knew that our interests were similar. With this in mind, I decided to create an application that would match a user with the most similarly minded critics. The user could then follow their matched critics to explore a range of movies they could potentially enjoy.


The Data

I chose to extract movie review data from the movie and TV show review site Rotten Tomatoes (RT). RT is a great resource for anyone looking for a new TV show to watch or trying to choose between movies that are currently playing. The site provides ratings by both professional critics and average viewers to give a general idea of the quality of a film or show. RT also gives a certification of “Fresh” or “Rotten” so users can quickly get an idea of whether the title is worth checking out. I used the Scrapy package in Python to write a script that would extract certain values from the RT site.

From RT, I scraped certified critics’ reviews of the top 100 movies per year from 2000 to 2018. From the top 100 movies list, I only inspected those movies that were reviewed by 100 or more unique critics. The 100 critic count mark seemed to be a threshold for widely recognizable movies. On the critic reviews page for each movie, if a critic gave a rating for the movie (some reviews only included a description) I scraped the critic’s name, the organization they belonged to, and the rating they gave. I also scraped the average rating for each of the movies.
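The filtering rule above can be sketched in a few lines of Python. This is a minimal illustration and not the original scraping code: the record layout (`movie`, `critic`, `rating` keys) and the function name are assumptions, as is the choice to count every unique critic toward the threshold while keeping only the reviews that carry a rating.

```python
from collections import defaultdict


def filter_widely_reviewed(reviews, min_critics=100):
    """Keep rated reviews of movies reviewed by at least `min_critics` unique critics.

    `reviews` is a list of dicts with hypothetical keys "movie", "critic", "rating".
    """
    critics_by_movie = defaultdict(set)
    for review in reviews:
        critics_by_movie[review["movie"]].add(review["critic"])

    widely_reviewed = {
        movie for movie, critics in critics_by_movie.items()
        if len(critics) >= min_critics
    }
    # Reviews without a rating (description only) are dropped.
    return [
        review for review in reviews
        if review["movie"] in widely_reviewed and review.get("rating")
    ]


# Tiny made-up example with a threshold of 2 instead of 100.
sample_reviews = [
    {"movie": "Movie A", "critic": "Critic 1", "rating": "3/5"},
    {"movie": "Movie A", "critic": "Critic 2", "rating": None},
    {"movie": "Movie B", "critic": "Critic 1", "rating": "A +"},
]
kept_reviews = filter_widely_reviewed(sample_reviews, min_critics=2)
```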


Data Cleaning

I used Python to clean the data that I scraped from RT. The main issue that I encountered was that there were many different rating formats used among the different critics. For example, here are some of the formats used:

  • 3/5
  • 3/4
  • 6/10
  • 60/100
  • 3 out of 5
  • A +
  • A plus
  • *** 1/2
  • 3 stars
  • Recommended vs. Not Recommended

In order to normalize the critics’ ratings to make them comparable, I decided to convert them all to fractional values between 0 and 1. The different rating formats made this conversion difficult. I wrote if-statements containing regular expressions for each format to match and extract the rating information. Then I was able to reassign each value as a fractional representation of the original rating (e.g. *** 1/2 = 3.5/5 = 0.7).

Because there were many different format types, I only retained the ratings that had the most common formats. I also dropped the ratings that I did not think would add value to the matching algorithm, such as recommended vs. not recommended (too coarse a rating). After the conversion, many ratings had more than five decimal places. For simplicity, I truncated them all to three decimal places.
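To make the conversion concrete, here is a minimal sketch of how a few of the formats listed above could be normalized with regular expressions. It is not the original cleaning code: the function names, the two patterns, and the assumption that star ratings are out of five are all illustrative choices, and any format the patterns do not recognize (letter grades, "3 stars" with no stated scale, recommended vs. not recommended) is simply returned as `None`.

```python
import math
import re
from fractions import Fraction

# "3/5", "60/100", "3 out of 5" (assumed patterns, not the author's originals)
FRACTION_PATTERN = re.compile(
    r"^(\d+(?:\.\d+)?)\s*(?:/|out of)\s*(\d+(?:\.\d+)?)$", re.IGNORECASE
)
# "***" or "*** 1/2"
STAR_PATTERN = re.compile(r"^(\*+)\s*(1/2)?$")


def truncate(value, places=3):
    """Truncate (not round) an exact fraction to `places` decimal places."""
    factor = 10 ** places
    return math.floor(value * factor) / factor


def normalize_rating(raw, star_scale=5):
    """Return the rating as a float between 0 and 1, or None if unsupported."""
    text = raw.strip()

    match = FRACTION_PATTERN.match(text)
    if match:
        score, scale = Fraction(match.group(1)), Fraction(match.group(2))
        if scale == 0 or score > scale:
            return None
        return truncate(score / scale)

    match = STAR_PATTERN.match(text)
    if match:
        # Assumption: star ratings are on a five-star scale.
        score = Fraction(len(match.group(1)))
        if match.group(2):
            score += Fraction(1, 2)
        if score > star_scale:
            return None
        return truncate(score / star_scale)

    return None


print(normalize_rating("*** 1/2"))      # 0.7
print(normalize_rating("3 out of 5"))   # 0.6
print(normalize_rating("Recommended"))  # None
```

Working with exact fractions until the final step keeps the truncation from being thrown off by floating-point error.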


The App

The main focus of this project was to use web scraping to collect unorganized data and then analyze that data. Instead of exploring the RT data to find trends among movies, such as comparing reviews among different genres or movie release dates, I decided to use the data to help people find movies they might like. This was my thought process for the application's functionality:

  1. Have a user pick five movies they have seen
  2. Have the user rate each movie according to how much they enjoyed it
  3. Compare the user's ratings to the critics' ratings and find the critics with the most similar ratings
  4. Show the top critics ordered by closest match and show other movies those critics rated

In order to transform this process into reality, I used the Shiny package in R to code an application that could ask the user for input and return a table of top matched critics for them to follow.



The user selects five movies they have seen.



The user rates the movies they selected based on how much they enjoyed them.


The matching algorithm is pretty straightforward, although I utilized two different methods.

The first method I used was to take the absolute value of the difference between the user scores and the critic scores and then add them up. I called this match score the "Absolute Distance Score" (the lower the score, the better the match). This approach was simple, but I found that it created some issues. For example, a critic that rated all five movies exactly one point off from the user ratings would be given the same match score as a critic that rated four movies exactly the same as the user but one movie five points off.


In this example, the first critic rated each movie almost exactly the same as the user, and the second critic rated four movies exactly the same and rated one very differently. I decided that in this type of situation the critics should not be given the same match score so I implemented an additional method. I added a "Squared Distance Score" which squared the difference between the user and critic's ratings and summed them up.

This weights distance quadratically, so larger differences increase the match score disproportionately. I included both score types in the table that shows the top matches, though I sorted the table by lowest squared distance score. I hoped that allowing the user to see both scores would provide more insight into the meaning behind the match score.
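The two scores can be illustrated with a short Python sketch (the app itself was written in R with Shiny, so this is a restatement of the idea and not the original code). The ratings below are made up, and the 0 to 10 scale is only an assumption chosen to reproduce the "one point off" versus "five points off" scenario.

```python
def absolute_distance(user, critic):
    """Sum of absolute differences between paired ratings."""
    return sum(abs(u - c) for u, c in zip(user, critic))


def squared_distance(user, critic):
    """Sum of squared differences between paired ratings."""
    return sum((u - c) ** 2 for u, c in zip(user, critic))


user = [8, 6, 7, 9, 5]
critic_a = [7, 7, 6, 8, 6]   # one point off on every movie
critic_b = [8, 6, 7, 9, 10]  # identical on four movies, five points off on one

print(absolute_distance(user, critic_a), absolute_distance(user, critic_b))  # 5 5
print(squared_distance(user, critic_a), squared_distance(user, critic_b))    # 5 25
```

Both critics tie on the absolute score, while the squared score separates them and ranks the consistently close critic as the better match.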


An important complication arose in the case where a critic did not review all five of the user's chosen movies. This could lead to misleading match scores in the match table. If one critic reviewed all five of the user's movies, there is a greater chance that their match score will be higher than that of a critic who reviewed only three movies (five differences will be summed for the first critic and only three differences will be summed for the second).

In the match table, two critics could have the same score but perhaps only because data was missing for one or more of the movies. Because of this, I added a review count to the match table. I ordered the table by highest review count and selected the top five match scores for critics who rated five, four, or three of the user's movies. I did not include critics who rated two or fewer of the user's movies. However, I made it clear in the text above the tables that differences in match scores between review count groups should be taken with a grain of salt. Overall, the five review count group will give the most robust matches.
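The grouping logic can be sketched as follows. This is a hypothetical Python restatement and not the Shiny code: the dictionary layout, the function name, and the parameter names `min_reviews` and `top_n` are assumptions, and the sample ratings are invented.

```python
def match_critics(user_ratings, critic_ratings, min_reviews=3, top_n=5):
    """Build a match table ordered by review count, then by squared distance.

    `user_ratings` maps movie -> rating; `critic_ratings` maps
    critic -> {movie: rating}. Critics sharing fewer than `min_reviews`
    movies with the user are skipped.
    """
    rows = []
    for critic, ratings in critic_ratings.items():
        shared = [movie for movie in user_ratings if movie in ratings]
        if len(shared) < min_reviews:
            continue
        diffs = [user_ratings[movie] - ratings[movie] for movie in shared]
        rows.append({
            "critic": critic,
            "review_count": len(shared),
            "absolute": sum(abs(d) for d in diffs),
            "squared": sum(d * d for d in diffs),
        })

    # Highest review count first; lowest squared distance first within a group.
    rows.sort(key=lambda row: (-row["review_count"], row["squared"]))

    # Keep only the best `top_n` critics in each review count group.
    table, kept = [], {}
    for row in rows:
        count = kept.get(row["review_count"], 0)
        if count < top_n:
            table.append(row)
            kept[row["review_count"]] = count + 1
    return table


user_ratings = {"A": 0.75, "B": 0.5, "C": 1.0, "D": 0.25, "E": 0.5}
critic_ratings = {
    "Critic 1": {"A": 0.75, "B": 0.5, "C": 1.0, "D": 0.25, "E": 0.75},
    "Critic 2": {"A": 0.75, "B": 0.5, "C": 1.0},
    "Critic 3": {"A": 0.75, "B": 0.5},
    "Critic 4": {"A": 0.75, "B": 0.5, "C": 1.0, "D": 0.25, "E": 0.5},
}
table = match_critics(user_ratings, critic_ratings)
for row in table:
    print(row["critic"], row["review_count"], row["squared"])
```

In this example Critic 2 has a perfect score on only three movies, so it is listed below both five-review critics, and Critic 3 is left out entirely.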



This table shows the user's top matched critics based on the movies and ratings chosen.


These tables show other movies that the currently selected critic has reviewed, specifically those movies that the critic rated high or low.



Below the match table, I included two tables that show high and low rated movies for the currently selected critic. My goal here was to recommend movies to the user based on the opinion of their matched critics. Here the user can browse the list of highly rated movies to see which titles they might enjoy. Conversely, the low rated movie table will allow the user to know which movies to avoid.

The last tables in the application are something I was very excited to implement. When searching for a movie to watch, I generally only stumble upon movies that are universally loved or universally disliked because those are the movies that receive the most publicity. Obviously there are many other movies out there that I might really enjoy, but it's hard to find the diamonds in the rough. The final tables in my application show the user which movies their critics rated much differently from the general public.

The left table shows movies that the critic rated highly but had a low average rating; these are the movies no one would ever think to recommend to you, but are ones you might really enjoy based on your matched critic. On the other hand, the right table shows movies that the critic rated low but had a high average rating; these are the movies that everyone keeps telling you to see, but you might be better off avoiding them. I think this is a very useful tool in finding movies that cater to your interests and can open up a whole new world of movies for you to see.
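One way to build these two tables is to rank movies by the gap between the critic's rating and the average rating. The sketch below is an assumption about how this could be done and not the app's actual logic: the function name, the `min_gap` cutoff of 0.25, and the sample data are all hypothetical, with ratings on the normalized 0 to 1 scale.

```python
def critic_vs_average(critic_ratings, average_ratings, min_gap=0.25, top_n=10):
    """Return (rated_high_avg_low, rated_low_avg_high) lists of movie titles.

    Both arguments map movie title -> rating between 0 and 1. A movie
    qualifies when the critic differs from the average by at least `min_gap`.
    """
    gaps = [
        (title, rating - average_ratings[title])
        for title, rating in critic_ratings.items()
        if title in average_ratings
    ]
    # Critic rated well above the average: largest positive gaps first.
    high_vs_low = sorted(
        (g for g in gaps if g[1] >= min_gap), key=lambda g: -g[1]
    )[:top_n]
    # Critic rated well below the average: largest negative gaps first.
    low_vs_high = sorted(
        (g for g in gaps if g[1] <= -min_gap), key=lambda g: g[1]
    )[:top_n]
    return [t for t, _ in high_vs_low], [t for t, _ in low_vs_high]


critic = {"Movie A": 1.0, "Movie B": 0.25, "Movie C": 0.75, "Movie D": 0.875}
average = {"Movie A": 0.5, "Movie B": 0.75, "Movie C": 0.75, "Movie D": 0.625}
left_table, right_table = critic_vs_average(critic, average)
print(left_table)   # ['Movie A', 'Movie D']
print(right_table)  # ['Movie B']
```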


These tables show the movies that the currently selected critic rated much differently from the general public.


Going Forward

I am really happy with how this application turned out, but there are still a few kinks I would like to fix:

  1. In the match table, I include a column with a list of all the organisations that the critics have been associated with. I populate this column by aggregating all the unique organisations for a critic from the original data set and concatenating them with a "|" as a separator. While the matching calculation does not take very long, creating this organisation column more than doubles the time required to build the table. I would like to add a few lines to the Python cleaning code to replace each organisation value in the original data set with all the critic's organisations so that this calculation does not need to be done each time a user submits a new request.
  2. Due to my filtering criteria during scraping, I only included movies that were reviewed by 100 or more critics. Additionally, the list of movies only includes the top 100 movies from each year between 2000 and 2018. I have not explored how loosening these criteria would affect the data set size or how it would affect the match calculation time, but I would like to add more options in the future for users to choose from (and also to potentially make the algorithm more accurate by adding more uncommon movies).
  3. Also due to the criteria above, the movies in the data set are necessarily popular. This gave rise to a complication in the "Critic Rating High / Average Rating Low" table. Because all of the movies are popular, there are few to no movies with average ratings below 6/10 or so. This somewhat defeats the purpose of the table, which is supposed to introduce the user to unknown or generally disliked movies. On RT, each critic has their own page with every review they have submitted along with RT's T-Meter rating (a percentage of critics who gave the movie a "Fresh" rating). In the future, I would like to scrape each critic's page and use this second data set to populate the two critic vs. average tables. This should give the user a more exhaustive list of movies they should consider seeing or avoiding.
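The precomputation described in the first item could look something like the sketch below. It is only an illustration of the idea, since this change has not been made yet: the row layout (`critic` and `organization` keys) and the function name are assumptions.

```python
def attach_all_organizations(rows, separator="|"):
    """Replace each row's organization with all of that critic's organizations.

    `rows` is a list of dicts with hypothetical keys "critic" and
    "organization". Organizations are de-duplicated and joined in
    first-seen order, so the app no longer has to aggregate them per request.
    """
    orgs_by_critic = {}
    for row in rows:
        # A dict is used as an ordered set of organization names.
        orgs_by_critic.setdefault(row["critic"], {})[row["organization"]] = None

    joined = {
        critic: separator.join(orgs) for critic, orgs in orgs_by_critic.items()
    }
    return [{**row, "organization": joined[row["critic"]]} for row in rows]


sample_rows = [
    {"critic": "Critic 1", "movie": "Movie A", "organization": "Outlet X"},
    {"critic": "Critic 1", "movie": "Movie B", "organization": "Outlet Y"},
    {"critic": "Critic 1", "movie": "Movie C", "organization": "Outlet X"},
    {"critic": "Critic 2", "movie": "Movie A", "organization": "Outlet Z"},
]
cleaned_rows = attach_all_organizations(sample_rows)
print(cleaned_rows[0]["organization"])  # Outlet X|Outlet Y
```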

Hey, Thanks!

Thank you for taking the time to read about my project! I was really excited to create this application because I think it serves a real purpose and can be helpful for a large group of people. Feel free to check out my application, and please let me know if it helped you find a movie you enjoyed! Here is a link to my GitHub repository if you would like to explore my code.

About Author

Alex Baransky

Alex graduated from Columbia University with training in natural and technical sciences. He enjoys finding ways to utilize data science to answer questions efficiently and to improve the interpretability of results. Alex takes pride in his ability to...