NYC Data Science Academy| Blog
Bootcamps
Lifetime Job Support Available Financing Available
Bootcamps
Data Science with Machine Learning Flagship ๐Ÿ† Data Analytics Bootcamp Artificial Intelligence Bootcamp New Release ๐ŸŽ‰
Free Lesson
Intro to Data Science New Release ๐ŸŽ‰
Find Inspiration
Find Alumni with Similar Background
Job Outlook
Occupational Outlook Graduate Outcomes Must See ๐Ÿ”ฅ
Alumni
Success Stories Testimonials Alumni Directory Alumni Exclusive Study Program
Courses
View Bundled Courses
Financing Available
Bootcamp Prep Popular ๐Ÿ”ฅ Data Science Mastery Data Science Launchpad with Python View AI Courses Generative AI for Everyone New ๐ŸŽ‰ Generative AI for Finance New ๐ŸŽ‰ Generative AI for Marketing New ๐ŸŽ‰
Bundle Up
Learn More and Save More
Combination of data science courses.
View Data Science Courses
Beginner
Introductory Python
Intermediate
Data Science Python: Data Analysis and Visualization Popular ๐Ÿ”ฅ Data Science R: Data Analysis and Visualization
Advanced
Data Science Python: Machine Learning Popular ๐Ÿ”ฅ Data Science R: Machine Learning Designing and Implementing Production MLOps New ๐ŸŽ‰ Natural Language Processing for Production (NLP) New ๐ŸŽ‰
Find Inspiration
Get Course Recommendation Must Try ๐Ÿ’Ž An Ultimate Guide to Become a Data Scientist
For Companies
For Companies
Corporate Offerings Hiring Partners Candidate Portfolio Hire Our Graduates
Students Work
Students Work
All Posts Capstone Data Visualization Machine Learning Python Projects R Projects
Tutorials
About
About
About Us Accreditation Contact Us Join Us FAQ Webinars Subscription An Ultimate Guide to
Become a Data Scientist
    Login
NYC Data Science Acedemy
Bootcamps
Courses
Students Work
About
Bootcamps
Bootcamps
Data Science with Machine Learning Flagship
Data Analytics Bootcamp
Artificial Intelligence Bootcamp New Release ๐ŸŽ‰
Free Lessons
Intro to Data Science New Release ๐ŸŽ‰
Find Inspiration
Find Alumni with Similar Background
Job Outlook
Occupational Outlook
Graduate Outcomes Must See ๐Ÿ”ฅ
Alumni
Success Stories
Testimonials
Alumni Directory
Alumni Exclusive Study Program
Courses
Bundles
financing available
View All Bundles
Bootcamp Prep
Data Science Mastery
Data Science Launchpad with Python NEW!
View AI Courses
Generative AI for Everyone
Generative AI for Finance
Generative AI for Marketing
View Data Science Courses
View All Professional Development Courses
Beginner
Introductory Python
Intermediate
Python: Data Analysis and Visualization
R: Data Analysis and Visualization
Advanced
Python: Machine Learning
R: Machine Learning
Designing and Implementing Production MLOps
Natural Language Processing for Production (NLP)
For Companies
Corporate Offerings
Hiring Partners
Candidate Portfolio
Hire Our Graduates
Students Work
All Posts
Capstone
Data Visualization
Machine Learning
Python Projects
R Projects
About
Accreditation
About Us
Contact Us
Join Us
FAQ
Webinars
Subscription
An Ultimate Guide to Become a Data Scientist
Tutorials
Data Analytics
  • Learn Pandas
  • Learn NumPy
  • Learn SciPy
  • Learn Matplotlib
Machine Learning
  • Boosting
  • Random Forest
  • Linear Regression
  • Decision Tree
  • PCA
Interview by Companies
  • JPMC
  • Google
  • Facebook
Artificial Intelligence
  • Learn Generative AI
  • Learn ChatGPT-3.5
  • Learn ChatGPT-4
  • Learn Google Bard
Coding
  • Learn Python
  • Learn SQL
  • Learn MySQL
  • Learn NoSQL
  • Learn PySpark
  • Learn PyTorch
Interview Questions
  • Python Hard
  • R Easy
  • R Hard
  • SQL Easy
  • SQL Hard
  • Python Easy
Data Science Blog > Python > Data Study on The Meaning Behind a Movie Recommendation

Data Study on The Meaning Behind a Movie Recommendation

Alex Baransky
Posted on Aug 13, 2018
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Background Data

Over the last two decades, data shows the accessibility of media has increased dramatically. One of the main reasons for this surge in available content is the evolution of online streaming platforms. Services like Netflix, Hulu, Amazon Prime, and others enable millions of consumers access to a seemingly countless number of movies and television shows.

With so many choices now available, it is more important than ever to have effective ways to filter all the possible titles and find options that best match your interests. Many streaming services have implemented functionality to suggest possible titles of interests for their users. Without these guiding tools, deciding what to watch next can seem like a monstrous undertaking.

 

The Goal

In my experience, the โ€œyou may also likeโ€ style suggestions based on algorithms work well. However, I find that when Iโ€™m searching for something to watch I want to know more than just whether or not I will like it. What kind of movie is it? What parts of the movie were done better than others? In short, I want to know what aspects of the movie suggest that I will enjoy it. This information is not provided by the suggestion functionality of the streaming platforms I have encountered. So I thought of a different way to approach the problem.

Movie critics work hard to produce reviews that help the general audience decide whether a movie is worth watching. A critic's review can provide a much more detailed look into a movieโ€™s worth than an algorithm that simply suggests โ€œwatch this.โ€ Still, there is the issue of individual critics' personal taste and preference to consider.

Although general trends usually do arise, movie reviews can be highly subjective. How do I know I can trust a particular critic? Do they value the same qualities in a movie that I do? These uncertainties can compromise the validity of a review on a person to person basis. A criticโ€™s reviews would carry more weight if I knew that our interests were similar. With this in mind, I decided to create an application that would match a user with the most similarly-minded critics. The user could then follow their matched critics to explore a range of movies they could potentially enjoy.

 

The Data

I chose to extract movie review data from the movie and TV show review site RottenTomatoes (RT). RT is a great resource for anyone looking for a new TV show to watch or to choose between movies that are currently playing. The site provides ratings by both professional critics and average viewers to give a general idea of the quality of a film or show. RT also gives a certification of โ€œFreshโ€ or โ€œRottenโ€ so users can quickly get an idea if the title is worth checking out. I used the scrapy package in Python to code a script that would extract certain values from the RT site.

From RT, I scraped certified criticsโ€™ reviews of the top 100 movies per year from 2000 to 2018. From the top 100 movies list, I only inspected those movies that were reviewed by 100 or more unique critics. The 100 critic count mark seemed to be a threshold for widely recognizable movies. On the critic reviews page for each movie, if a critic gave a rating for the movie (some reviews only included a description) I scraped the criticโ€™s name, the organization they belonged to, and the rating they gave. I also scraped the average rating for each of the movies.

 

Data Cleaning

I used Python to clean the data that I scraped from RT. The main issue that I encountered was that there were many different rating formats used among the different critics. For example, here are some of the formats used:

  • 3/5
  • 3/4
  • 6/10
  • 60/100
  • 3 out of 5
  • A +
  • A plus
  • *** 1/2
  • 3 stars
  • Recommended vs. Not Recommended

In order to normalize the criticsโ€™ ratings to make them comparable, I decided to convert them all to fractional values between 0 and 1. The different rating formats made this conversion difficult. I wrote if-statements containing regular expressions for each format to match and extract the rating information. Then I was able to reassign each value as a fractional representation of the original rating (e.g. *** 1/2 = 3.5/5 = 0.7).

Because there were many different format types, I only retained the ratings that had the most common formats. I also dropped the ratings that I did not think would add value to the matching algorithm, such as recommended vs. not recommended (too coarse of a rating). After the conversion, many ratings had more than five decimal places. For simplicity, I truncated them all to three decimal places.

 

The App

The main focus of this project was to use web scraping to collect unorganized data and then analyze that data. Instead of exploring the RT data to find trends among movies, such as comparing reviews among different genres or movie release dates, I decided to use the data to help people find movies they might like. This was my thought process for the application's functionality:

  1. Have a user pick five movies they have seen
  2. Have the user rate each movie according to how much they enjoyed it
  3. Compare the user's ratings to the critics' ratings and find the critics with the most similar ratings
  4. Show the top critics ordered by closest match and show other movies those critics rated

In order to transform this process into reality, I used the Shiny package in R to code an application that could ask the user for input and return a table of top matched critics for them to follow.

 

Data Study on The Meaning Behind a Movie Recommendation

The user selects five movies they have seen.

 

Data Study on The Meaning Behind a Movie Recommendation

The user rates the movies they selected based on how much they enjoyed them.

Methods

The matching algorithm is pretty straightforward, although I utilized two different methods.

The first method I used was to take the absolute value of the difference between the user scores and the critic scores and then add them up. I called this match score the "Absolute Distance Score" (the lower the score, the better the match). This approach was simple, but I found that it created some issues. For example, a critic that rated all five movies exactly one point off from the user ratings would be given the same match score as a critic that rated four movies exactly the same as the user but one movie five points off.

Critics

In this example, the first critic rated each movie almost exactly the same as the user, and the second critic rated four movies exactly the same and rated one very differently. I decided that in this type of situation the critics should not be given the same match score so I implemented an additional method. I added a "Squared Distance Score" which squared the difference between the user and critic's ratings and summed them up.

This weights distance exponentially so larger differences will increase the match score more. I included both score types in the table that shows the top matches, though I sorted the table by lowest squared distance score. I hoped that allowing the user to see both scores would provide more insight into the meaning behind the match score.

Complication

An important complication arose in the case where a critic did not review all five of the user's chosen movies. This could lead to misleading match scores in the match table. If one critic reviewed all five of the user's movies, there is a greater chance that their match score will be higher than a that of a critic that reviewed only three movies (five differences will be summed for the first critic and only three differences will be summed for the second).

In the match table, two critics could have the same score but perhaps only because data was missing for one or more of the movies. Because of this, I added a review count to the match table. I ordered the table by highest review count and selected the top five match scores for critics who rated five, four, or three of the user's movies. I did not include critics that rated two or less of the user's movies. However, I made it clear in the text above the tables that differences in match scores between review count groups should be taken with a grain of salt. Overall, the five review count group will give the most robust matches.

 

Data Study on The Meaning Behind a Movie Recommendation

This table shows the user's top matched critics based on the movies and ratings chosen.

 

These tables show other movies that the currently selected critic has reviewed, specifically those movies that the critic rated high or low.

 

Ratings

Below the match table, I included two tables that show high and low rated movies for the currently selected critic. My goal here was to recommend movies to the user based on the opinion of their matched critics. Here the user can browse the list of highly rated movies to see which titles they might enjoy. Conversely, the low rated movie table will allow the user to know which movies to avoid.

The last tables in the application is something I was very excited to implement. When searching for a movie to watch, I generally only stumble upon movies that are universally loved or universally disliked because those are the movies that receive the most publicity. Obviously there are many other movies out there that I might really enjoy, but it's hard to find the diamonds in the rough. The final tables in my application show the user which movies their critics rated much differently from the general public.

The left table shows movies that the critic rated highly but had a low average rating; these are the movies no one would ever think to recommend to you, but are ones you might really enjoy based on your matched critic. On the other hand, the right table shows movies that the critic rated low but had a high average rating; these are the movies that everyone keeps telling you to see, but you might be better off avoiding them. I think this is a very useful tool in finding movies that cater to your interests and can open up a whole new world of movies for you to see.

 

These tables show the movies that the currently selected critic rated much differently from the general public.

 

Going Forward

I am really happy with how this application turned out, but there are still a few kinks I would like to fix:

  1. In the match table, I include a column with a list of all the organisations that the critics have been associated with. I populate this column by aggregating all the unique organisations for a critic from the original data set and concatenating them with a "|" as a separator. While the matching calculation does not take very long, this organisation column creation requires more than doubles the required time for the table to be created. I would like to add a few lines in the python cleaning code to replace each organisation value in the original data set with all the critic's organisations so that this calculation does not need to be done each time a user submits a new request.
  2. Due to my filtering criteria during scraping, I only included movies that were reviewed by 100 or more critics. Additionally, the list of movies only includes the top 100 movies from each year between 2000 and 2018.  I have not explored how loosening these criteria would affect the data set size or how it would affect the match calculation time, but I would like to add more options in the future for users to choose from (and also to potentially make the algorithm more accurate by adding more uncommon movies).
  3. Also due to the criteria above, the movies in the data set are necessarily popular. This gave rise to a complication in the "Critic Rating High / Average Rating Low" table. Because all of the movies are popular, there are few to no movies with average ratings below 6/10 or so. This somewhat defeats the purpose of the table, which is supposed to introduce the user to unknown or generally disliked movies. On RT, each critic has their own page with every review they have submitted along with RT's T-Meter rating (a percentage of critics who gave the movie a "Fresh" rating). In the future, I would like to scrape each critic's page and use this second data set to populate the two critic vs. average tables. This should give the user a more exhaustive list of movies they should consider seeing or avoiding.

Hey, Thanks!

Thank you for taking the time to read about my project! I was really excited to create this application because I think it serves a real purpose and can be helpful for a large group of people. Feel free to check out my application, and please let me know if it helped you find a movie you enjoyed! Here is a link to my GitHub repository if you would like to explore my code.

About Author

Alex Baransky

Alex graduated from Columbia University with training in natural and technical sciences. He enjoys finding ways to utilize data science to answer questions efficiently and to improve the interpretability of results. Alex takes pride in his ability to...
View all posts by Alex Baransky >

Related Articles

Capstone
Catching Fraud in the Healthcare System
Capstone
The Convenience Factor: How Grocery Stores Impact Property Values
Capstone
Acquisition Due Dilligence Automation for Smaller Firms
Machine Learning
Pandemic Effects on the Ames Housing Market and Lifestyle
Machine Learning
The Ames Data Set: Sales Price Tackled With Diverse Models

Leave a Comment

Cancel reply

You must be logged in to post a comment.

No comments found.

View Posts by Categories

All Posts 2399 posts
AI 7 posts
AI Agent 2 posts
AI-based hotel recommendation 1 posts
AIForGood 1 posts
Alumni 60 posts
Animated Maps 1 posts
APIs 41 posts
Artificial Intelligence 2 posts
Artificial Intelligence 2 posts
AWS 13 posts
Banking 1 posts
Big Data 50 posts
Branch Analysis 1 posts
Capstone 206 posts
Career Education 7 posts
CLIP 1 posts
Community 72 posts
Congestion Zone 1 posts
Content Recommendation 1 posts
Cosine SImilarity 1 posts
Data Analysis 5 posts
Data Engineering 1 posts
Data Engineering 3 posts
Data Science 7 posts
Data Science News and Sharing 73 posts
Data Visualization 324 posts
Events 5 posts
Featured 37 posts
Function calling 1 posts
FutureTech 1 posts
Generative AI 5 posts
Hadoop 13 posts
Image Classification 1 posts
Innovation 2 posts
Kmeans Cluster 1 posts
LLM 6 posts
Machine Learning 364 posts
Marketing 1 posts
Meetup 144 posts
MLOPs 1 posts
Model Deployment 1 posts
Nagamas69 1 posts
NLP 1 posts
OpenAI 5 posts
OpenNYC Data 1 posts
pySpark 1 posts
Python 16 posts
Python 458 posts
Python data analysis 4 posts
Python Shiny 2 posts
R 404 posts
R Data Analysis 1 posts
R Shiny 560 posts
R Visualization 445 posts
RAG 1 posts
RoBERTa 1 posts
semantic rearch 2 posts
Spark 17 posts
SQL 1 posts
Streamlit 2 posts
Student Works 1687 posts
Tableau 12 posts
TensorFlow 3 posts
Traffic 1 posts
User Preference Modeling 1 posts
Vector database 2 posts
Web Scraping 483 posts
wukong138 1 posts

Our Recent Popular Posts

AI 4 AI: ChatGPT Unifies My Blog Posts
by Vinod Chugani
Dec 18, 2022
Meet Your Machine Learning Mentors: Kyle Gallatin
by Vivian Zhang
Nov 4, 2020
NICU Admissions and CCHD: Predicting Based on Data Analysis
by Paul Lee, Aron Berke, Bee Kim, Bettina Meier and Ira Villar
Jan 7, 2020

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day ChatGPT citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay football gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income industry Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI

NYC Data Science Academy

NYC Data Science Academy teaches data science, trains companies and their employees to better profit from data, excels at big data project consulting, and connects trained Data Scientists to our industry.

NYC Data Science Academy is licensed by New York State Education Department.

Get detailed curriculum information about our
amazing bootcamp!

Please enter a valid email address
Sign up completed. Thank you!

Offerings

  • HOME
  • DATA SCIENCE BOOTCAMP
  • ONLINE DATA SCIENCE BOOTCAMP
  • Professional Development Courses
  • CORPORATE OFFERINGS
  • HIRING PARTNERS
  • About

  • About Us
  • Alumni
  • Blog
  • FAQ
  • Contact Us
  • Refund Policy
  • Join Us
  • SOCIAL MEDIA

    ยฉ 2025 NYC Data Science Academy
    All rights reserved. | Site Map
    Privacy Policy | Terms of Service
    Bootcamp Application