Data Study on The Meaning Behind a Movie Recommendation
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Over the last two decades, data shows the accessibility of media has increased dramatically. One of the main reasons for this surge in available content is the evolution of online streaming platforms. Services like Netflix, Hulu, Amazon Prime, and others enable millions of consumers access to a seemingly countless number of movies and television shows.
With so many choices now available, it is more important than ever to have effective ways to filter all the possible titles and find options that best match your interests. Many streaming services have implemented functionality to suggest possible titles of interests for their users. Without these guiding tools, deciding what to watch next can seem like a monstrous undertaking.
In my experience, the “you may also like” style suggestions based on algorithms work well. However, I find that when I’m searching for something to watch I want to know more than just whether or not I will like it. What kind of movie is it? What parts of the movie were done better than others? In short, I want to know what aspects of the movie suggest that I will enjoy it. This information is not provided by the suggestion functionality of the streaming platforms I have encountered. So I thought of a different way to approach the problem.
Movie critics work hard to produce reviews that help the general audience decide whether a movie is worth watching. A critic's review can provide a much more detailed look into a movie’s worth than an algorithm that simply suggests “watch this.” Still, there is the issue of individual critics' personal taste and preference to consider.
Although general trends usually do arise, movie reviews can be highly subjective. How do I know I can trust a particular critic? Do they value the same qualities in a movie that I do? These uncertainties can compromise the validity of a review on a person to person basis. A critic’s reviews would carry more weight if I knew that our interests were similar. With this in mind, I decided to create an application that would match a user with the most similarly-minded critics. The user could then follow their matched critics to explore a range of movies they could potentially enjoy.
I chose to extract movie review data from the movie and TV show review site RottenTomatoes (RT). RT is a great resource for anyone looking for a new TV show to watch or to choose between movies that are currently playing. The site provides ratings by both professional critics and average viewers to give a general idea of the quality of a film or show. RT also gives a certification of “Fresh” or “Rotten” so users can quickly get an idea if the title is worth checking out. I used the scrapy package in Python to code a script that would extract certain values from the RT site.
From RT, I scraped certified critics’ reviews of the top 100 movies per year from 2000 to 2018. From the top 100 movies list, I only inspected those movies that were reviewed by 100 or more unique critics. The 100 critic count mark seemed to be a threshold for widely recognizable movies. On the critic reviews page for each movie, if a critic gave a rating for the movie (some reviews only included a description) I scraped the critic’s name, the organization they belonged to, and the rating they gave. I also scraped the average rating for each of the movies.
I used Python to clean the data that I scraped from RT. The main issue that I encountered was that there were many different rating formats used among the different critics. For example, here are some of the formats used:
- 3 out of 5
- A +
- A plus
- *** 1/2
- 3 stars
- Recommended vs. Not Recommended
In order to normalize the critics’ ratings to make them comparable, I decided to convert them all to fractional values between 0 and 1. The different rating formats made this conversion difficult. I wrote if-statements containing regular expressions for each format to match and extract the rating information. Then I was able to reassign each value as a fractional representation of the original rating (e.g. *** 1/2 = 3.5/5 = 0.7).
Because there were many different format types, I only retained the ratings that had the most common formats. I also dropped the ratings that I did not think would add value to the matching algorithm, such as recommended vs. not recommended (too coarse of a rating). After the conversion, many ratings had more than five decimal places. For simplicity, I truncated them all to three decimal places.
The main focus of this project was to use web scraping to collect unorganized data and then analyze that data. Instead of exploring the RT data to find trends among movies, such as comparing reviews among different genres or movie release dates, I decided to use the data to help people find movies they might like. This was my thought process for the application's functionality:
- Have a user pick five movies they have seen
- Have the user rate each movie according to how much they enjoyed it
- Compare the user's ratings to the critics' ratings and find the critics with the most similar ratings
- Show the top critics ordered by closest match and show other movies those critics rated
In order to transform this process into reality, I used the Shiny package in R to code an application that could ask the user for input and return a table of top matched critics for them to follow.
The matching algorithm is pretty straightforward, although I utilized two different methods.
The first method I used was to take the absolute value of the difference between the user scores and the critic scores and then add them up. I called this match score the "Absolute Distance Score" (the lower the score, the better the match). This approach was simple, but I found that it created some issues. For example, a critic that rated all five movies exactly one point off from the user ratings would be given the same match score as a critic that rated four movies exactly the same as the user but one movie five points off.
In this example, the first critic rated each movie almost exactly the same as the user, and the second critic rated four movies exactly the same and rated one very differently. I decided that in this type of situation the critics should not be given the same match score so I implemented an additional method. I added a "Squared Distance Score" which squared the difference between the user and critic's ratings and summed them up.
This weights distance exponentially so larger differences will increase the match score more. I included both score types in the table that shows the top matches, though I sorted the table by lowest squared distance score. I hoped that allowing the user to see both scores would provide more insight into the meaning behind the match score.
An important complication arose in the case where a critic did not review all five of the user's chosen movies. This could lead to misleading match scores in the match table. If one critic reviewed all five of the user's movies, there is a greater chance that their match score will be higher than a that of a critic that reviewed only three movies (five differences will be summed for the first critic and only three differences will be summed for the second).
In the match table, two critics could have the same score but perhaps only because data was missing for one or more of the movies. Because of this, I added a review count to the match table. I ordered the table by highest review count and selected the top five match scores for critics who rated five, four, or three of the user's movies. I did not include critics that rated two or less of the user's movies. However, I made it clear in the text above the tables that differences in match scores between review count groups should be taken with a grain of salt. Overall, the five review count group will give the most robust matches.
Below the match table, I included two tables that show high and low rated movies for the currently selected critic. My goal here was to recommend movies to the user based on the opinion of their matched critics. Here the user can browse the list of highly rated movies to see which titles they might enjoy. Conversely, the low rated movie table will allow the user to know which movies to avoid.
The last tables in the application is something I was very excited to implement. When searching for a movie to watch, I generally only stumble upon movies that are universally loved or universally disliked because those are the movies that receive the most publicity. Obviously there are many other movies out there that I might really enjoy, but it's hard to find the diamonds in the rough. The final tables in my application show the user which movies their critics rated much differently from the general public.
The left table shows movies that the critic rated highly but had a low average rating; these are the movies no one would ever think to recommend to you, but are ones you might really enjoy based on your matched critic. On the other hand, the right table shows movies that the critic rated low but had a high average rating; these are the movies that everyone keeps telling you to see, but you might be better off avoiding them. I think this is a very useful tool in finding movies that cater to your interests and can open up a whole new world of movies for you to see.
I am really happy with how this application turned out, but there are still a few kinks I would like to fix:
- In the match table, I include a column with a list of all the organisations that the critics have been associated with. I populate this column by aggregating all the unique organisations for a critic from the original data set and concatenating them with a "|" as a separator. While the matching calculation does not take very long, this organisation column creation requires more than doubles the required time for the table to be created. I would like to add a few lines in the python cleaning code to replace each organisation value in the original data set with all the critic's organisations so that this calculation does not need to be done each time a user submits a new request.
- Due to my filtering criteria during scraping, I only included movies that were reviewed by 100 or more critics. Additionally, the list of movies only includes the top 100 movies from each year between 2000 and 2018. I have not explored how loosening these criteria would affect the data set size or how it would affect the match calculation time, but I would like to add more options in the future for users to choose from (and also to potentially make the algorithm more accurate by adding more uncommon movies).
- Also due to the criteria above, the movies in the data set are necessarily popular. This gave rise to a complication in the "Critic Rating High / Average Rating Low" table. Because all of the movies are popular, there are few to no movies with average ratings below 6/10 or so. This somewhat defeats the purpose of the table, which is supposed to introduce the user to unknown or generally disliked movies. On RT, each critic has their own page with every review they have submitted along with RT's T-Meter rating (a percentage of critics who gave the movie a "Fresh" rating). In the future, I would like to scrape each critic's page and use this second data set to populate the two critic vs. average tables. This should give the user a more exhaustive list of movies they should consider seeing or avoiding.
Thank you for taking the time to read about my project! I was really excited to create this application because I think it serves a real purpose and can be helpful for a large group of people. Feel free to check out my application, and please let me know if it helped you find a movie you enjoyed! Here is a link to my GitHub repository if you would like to explore my code.