NYC Data Science Academy| Blog

Metarecommendr: A recommendation system for video games, movies and TV shows

Stefan Heinz, Yvonne Lau and Daniel Epstein
Posted on Apr 5, 2017

Metarecommendr is a recommendation system for video games, TV shows and movies created by Yvonne Lau, Stefan Heinz, and Daniel Epstein. It uses word-embedding neural networks, sentiment analysis and collaborative filtering to deliver suggestions that match your preferences. It was our capstone project, delivered at the end of the NYC Data Science Academy Data Science Bootcamp program.

You can take a look at our app here. Please keep in mind that, for the time being, only a scaled-down version of our models is running online due to memory restrictions. Only the "Content-based" mode is functional at this time. The code is online on GitHub.

Introduction

Finding something to play or watch can be difficult today. So many games, movies, and TV shows come out every week that it is hard to keep up. It can take hours to look through blogs, videos, and reviews to determine whether a new release is something you will like. Finding an older game that you are sure to like is even harder. Websites like metacritic.com attempt to simplify this process by aggregating reviews. However, they still have some major flaws, including:

  • Product suggestions are generally obvious and tied to the title of a product (e.g., if you like Super Mario 64, you will get inundated with other Mario games)
  • The user interface is too crowded with ancillary and unnecessary information
  • The text of reviews does not always match up with the scores associated with them

Hence, for our capstone project, we decided to address these issues by creating an application to improve your search for your next game (and even let you find movies and TV shows if you wish!). Metarecommendr is a web application that combines a sleek and intuitive user interface with the power of content filtering and collaborative filtering in order to deliver the best recommendations for you.

Project Workflow

Metarecommendr was designed and built in the span of 2 weeks. The project workflow is summarized below:

Figure: Project workflow

Data Collection

To collect the data and reviews for our items (games, movies and TV shows), we used the Python web scraping framework Scrapy. In total we implemented 12 spiders: for each of the three categories, one for the item list, one for the summary and details of each item, and one each for the critic and user reviews. While some spiders finished quickly, the longest one, which scraped game reviews, took 10 days to finish.

Because we expected a fairly large amount of data, we decided to scrape directly into a database instead of text files. A preliminary version of our database used SQLite, a self-contained SQL database engine that can be set up within minutes. After the scraping was finished, we exported the data to a MySQL database running on the Amazon Web Services (AWS) Relational Database Service (RDS). To avoid inserting 584 MB of scraped data from a local machine into a remote database, we uploaded the data to AWS Simple Storage Service (S3) and implemented an AWS Data Pipeline to stream it directly from S3 to RDS via an AWS Elastic Compute Cloud (EC2) instance. This reduced the migration time by a factor of 7. Our final app could then read the data directly from the MySQL database.
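As a rough illustration of scraping straight into SQLite, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names and the save_reviews helper are assumptions made for this example; the post does not describe the actual schema.

```python
import sqlite3

# Hypothetical schema: the original post does not describe its tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reviews (
    item_title TEXT NOT NULL,
    category   TEXT NOT NULL,   -- 'game', 'movie' or 'tv'
    reviewer   TEXT NOT NULL,
    score      REAL,
    body       TEXT,
    UNIQUE (item_title, category, reviewer)
)
"""

def open_store(path=":memory:"):
    """Open (or create) the SQLite database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def save_reviews(conn, rows):
    """Insert scraped rows; duplicates from re-crawled pages are skipped."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO reviews VALUES (?, ?, ?, ?, ?)", rows
        )

conn = open_store()
save_reviews(conn, [
    ("Example Game", "game", "critic_a", 85.0, "Great fun."),
    ("Example Game", "game", "critic_a", 85.0, "Great fun."),  # duplicate row
    ("Example Show", "tv", "user_b", 6.0, "It was fine."),
])
stored = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
print(stored)
```

In a Scrapy project, a function like save_reviews would typically be called from an item pipeline's process_item method, so each scraped item lands in the database as soon as it is parsed.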

Exploratory Data Analysis

One of the reasons we opted to implement both content-based and collaborative recommendations was the distribution of ratings found in our dataset. There were roughly a million reviews in total, half from critics and half from users. We found that for both critic and user review scores, the distribution of ratings was negatively skewed. Hence, relying solely on ratings (for collaborative filtering) would not offer enough granularity to produce sensible recommendations, as most products are perceived positively.

Figure: EDA
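To make "negatively skewed" concrete, the sketch below computes the population skewness (third central moment divided by the variance raised to the power 1.5) on a made-up list of scores. The numbers are illustrative only and are not taken from our dataset.

```python
def skewness(values):
    """Population (Fisher-Pearson) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical: no skew to measure
    return m3 / m2 ** 1.5

# Made-up scores: mostly high, with one low outlier forming a left tail.
scores = [90, 85, 80, 88, 75, 92, 30, 82]
print(skewness(scores) < 0)
```

A negative value means the long tail is on the low side: most scores cluster near the top of the scale, which is why ratings alone separate products poorly.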

In terms of observations scraped from metacritic.com we ended up with:

Item          Games    Movies   TV Shows   Reviews
Observations  20,416   5,470    1,978      998,582

Interestingly, in our early exploration of the dataset, we found that the number of reviews was not necessarily indicative of the quality of a product. Infestation: Survivor Stories (The War Z) is among the most reviewed items, and yet it has very poor average critic and user scores. This makes some intuitive sense. Games that skew either very positive or very negative create more discussion. Extremely bad games can be fun to talk about with others, similar to how bad movies can live on as cult favorites. Mediocre games, where there isn't much to say, tend to generate less discussion and therefore fewer reviews.

Figure: Reviews

Recommendation Systems

There are mainly two types of recommendation algorithms: content-filtering and collaborative filtering.

  • Content-filtering: makes recommendations based on a product's metadata. A classic example is how Pandora works.
  • Collaborative filtering: takes into account users' behaviors and interactions with items. It can be further subdivided into two kinds:
    • User-based: recommendations are items liked by users who are similar to you. A classic example is how Spotify works.
    • Item-based: recommendations are made according to an item-item similarity metric based on ratings from users. An example is how Amazon works.

a) Content Filtering

Figure: Content filtering

Since a big portion of the dataset was composed of text data from reviews, the chosen approach for feature engineering on content-based recommendations was Doc2Vec. This is an unsupervised algorithm that generates vectors for documents. It is an extension of the Word2Vec algorithm, where a document (instead of a word) is turned into a vector representation. A Python implementation can be found in the Gensim library.

Doc2Vec is able to learn semantic similarities among words, which makes it more sophisticated than tf-idf. An example output of our model trained on critic reviews shows the words it learned to be most similar to "Excellent". Pretty good job!

Figure: Doc2Vec

For Metarecommendr, two Doc2Vec models were trained separately, one on item summaries and one on critic reviews. We opted not to use user reviews since they did not contain enough descriptive words to yield meaningful recommendations. On the user interface, a user selects a product they like. Products are then recommended according to a cosine similarity metric: the closer the value is to 1, the more similar two vectors (products) are.
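The ranking step can be sketched in a few lines of plain Python. The 3-dimensional vectors and item names below are made up for illustration; in the real app the vectors would come from the trained Doc2Vec models, and the most_similar helper is a hypothetical name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)

def most_similar(liked, vectors, top_n=2):
    """Rank every other item by cosine similarity to the liked item."""
    scores = [
        (title, cosine_similarity(vectors[liked], vec))
        for title, vec in vectors.items()
        if title != liked
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Made-up document vectors standing in for Doc2Vec output.
item_vectors = {
    "Game A": [0.9, 0.1, 0.0],
    "Game B": [0.8, 0.2, 0.1],
    "Movie C": [0.0, 0.1, 0.9],
    "Show D": [0.1, 0.9, 0.2],
}
recommendations = most_similar("Game A", item_vectors)
print([title for title, _ in recommendations])
```

Because cosine similarity ignores vector length, two items compare as similar when their vectors point in the same direction, regardless of how long the underlying texts were.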

b) Collaborative Filtering

i) SVD - Singular Value Decomposition

Figure: Collaborative filtering with SVD

A major challenge in implementing collaborative filtering on this particular dataset was the high dimensionality and sparsity of the user-item matrix. There were around 27,500 products and 63,000 users in total, with an average of fewer than 3 reviews per user. To reduce the dimensionality of the user-item matrix, truncated Singular Value Decomposition (SVD) was implemented.

Consider a user-item matrix $A$ where $a_{ij}$ represents the rating from user $i$ for product $j$. SVD states that every matrix $A_{n \times p}$ can be factorized as:

$$A_{n \times p} = U_{n \times n} \, S_{n \times p} \, V_{p \times p}^{T}$$

where $U_{n \times n}$ and $V_{p \times p}$ are orthogonal matrices and $S_{n \times p}$ is an $n \times p$ diagonal matrix with the singular values of $A$ along the diagonal. As $S$ is a diagonal matrix, we can obtain a more compact representation through SVD. Truncated SVD takes this approach one step further by using only the $k$ largest singular values in $S$ instead of all of them. Under this approach, we compute a rank-$k$ approximation $A'$ to $A$ that minimizes the Frobenius norm error:

$$A' = U_k S_k V_k^{T} = \underset{\operatorname{rank}(B) \le k}{\arg\min} \; \lVert A - B \rVert_F$$

where $U_k$ and $V_k$ hold the first $k$ columns of $U$ and $V$, and $S_k$ is the leading $k \times k$ block of $S$.

For Metarecommendr, the dataset was split into training and test sets, and k was chosen to be 13 according to Cattell's scree plot.

Figure: Scree plot

Once we obtain the rank-k matrix A', we can make recommendations according to its entries. In the context of our dataset, A' is a matrix of predicted user ratings, where a'ij is the predicted rating from user i for item j. Compared to a baseline where every rating is simply predicted to be the average user rating (RMSE = 7.50), truncated SVD reduces the error on predicted user ratings by 19% (RMSE = 6.07).

To sum up, for collaborative filtering with SVD, a user enters and rates a few items. A user-item matrix is then generated and decomposed by SVD. For a given user i, this approach gives us predicted ratings for the other items, and we recommend the items with the highest predicted ratings.
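The procedure can be sketched with NumPy on a toy matrix. This is an illustration under stated assumptions, not our production code: the 4x4 ratings are made up, missing ratings are encoded as 0 (the post does not say how missing entries were handled), and k=2 is used instead of 13 because the toy matrix is tiny.

```python
import numpy as np

def truncated_svd_predict(ratings, k):
    """Return the rank-k approximation U_k S_k V_k^T of a ratings matrix."""
    u, s, vt = np.linalg.svd(ratings, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

def recommend(ratings, predicted, user, top_n=1):
    """Indices of the user's unrated items, best predicted rating first."""
    unrated = np.flatnonzero(ratings[user] == 0)  # assumption: 0 = not rated
    order = np.argsort(predicted[user, unrated])[::-1]
    return unrated[order][:top_n].tolist()

# Toy user-item matrix: rows are users, columns are items.
ratings = np.array([
    [9.0, 8.0, 0.0, 1.0],
    [8.0, 9.0, 7.0, 0.0],
    [1.0, 0.0, 2.0, 9.0],
    [0.0, 1.0, 1.0, 8.0],
])
predicted = truncated_svd_predict(ratings, k=2)
picks = recommend(ratings, predicted, user=0)
print(picks)
```

The Frobenius error of the rank-k matrix equals the square root of the sum of the squared discarded singular values, which is why a scree plot of the singular values is a natural way to pick k. The 19% improvement quoted above is simply (7.50 - 6.07) / 7.50.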

ii) Pearson's Correlation

To better understand the relationship between item review scores, we compared items against each other using a modified Pearson's correlation formula. To help scale down this correlation matrix, item pairs with fewer than 3 overlapping reviews were disregarded and given a score of 0, or no correlation.

Figure: Pearson's correlation

This item-item matrix approach also allowed us to make cross-category recommendations, since the algorithm was no longer bound to an item's metadata (as it is in content filtering). On the user interface, a user has the option to select a product they like, and they receive the products with the highest correlation to it.
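A minimal sketch of the item-item score follows. It uses the standard Pearson formula plus the overlap rule described above; the exact modification we applied is not reproduced here, and the reviewer ids and scores are made up.

```python
import math

MIN_OVERLAP = 3  # from the post: fewer than 3 shared reviewers -> score 0

def item_correlation(ratings_a, ratings_b, min_overlap=MIN_OVERLAP):
    """Pearson correlation between two items over their shared reviewers.

    ratings_a and ratings_b map a reviewer id to that reviewer's score.
    """
    shared = sorted(set(ratings_a) & set(ratings_b))
    if len(shared) < min_overlap:
        return 0.0
    xs = [ratings_a[r] for r in shared]
    ys = [ratings_b[r] for r in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant scores: correlation undefined, treat as none
    return cov / math.sqrt(var_x * var_y)

# Made-up scores; the game and the movie share three reviewers.
game = {"u1": 9, "u2": 7, "u3": 3, "u4": 8}
movie = {"u1": 8, "u2": 6, "u3": 2, "u5": 5}
show = {"u1": 2, "u4": 9}

print(round(item_correlation(game, movie), 3))
print(item_correlation(game, show))
```

Since the score depends only on who rated both items and how, a game can correlate with a movie or TV show without any shared metadata.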

c) Sentiment Analysis

As mentioned in the introduction, a major problem with Metacritic's dataset was the fact that review scores did not necessarily match the sentiment of the review text. To address this issue, we performed sentiment analysis on the critic reviews. Positive and negative were defined as follows: reviews with scores of 55 and below were classified as negative, and those with scores of 85 and above were classified as positive. Reviews with scores in between these values were not used for sentiment analysis.

Sentiment analysis used the Doc2Vec vectors as features. We tried several machine learning models, including logistic regression, Naive Bayes, SVM, and different types of neural networks. The performance of each model is listed below:
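The labeling rule can be sketched as below. The thresholds come from the description above; the function names, the 0/1 label encoding and the sample reviews are assumptions made for illustration.

```python
NEGATIVE_MAX = 55  # scores of 55 and below are negative
POSITIVE_MIN = 85  # scores of 85 and above are positive

def sentiment_label(score):
    """Return 0 (negative), 1 (positive) or None for the unused middle band."""
    if score <= NEGATIVE_MAX:
        return 0
    if score >= POSITIVE_MIN:
        return 1
    return None

def build_training_set(reviews):
    """Keep only clearly positive or negative reviews as (text, label) pairs."""
    labeled = []
    for text, score in reviews:
        label = sentiment_label(score)
        if label is not None:
            labeled.append((text, label))
    return labeled

# Made-up critic reviews: (text, score out of 100).
sample = [
    ("A masterpiece.", 95),
    ("Decent but forgettable.", 70),
    ("Broken at launch.", 40),
]
print(build_training_set(sample))
```

Dropping the middle band trades training data for cleaner labels: a review scored 70 can read as either mildly positive or mildly negative.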

  • Logistic regression: 75% accuracy
  • SVM: ~65% accuracy
  • Naive Bayes: ~65% accuracy
  • Long short-term memory (LSTM) recurrent neural networks (RNN) [a known method for NLP, good for assessing sequential data]: ~75% accuracy
  • Convolutional neural networks (CNN) [commonly used in image processing, but also in NLP tasks]: ~88% accuracy

In the end, the best model was a CNN with an added RNN component, with the following architecture: 2 convolution and pooling layers, 2 recurrent LSTM layers, and 3 dense, fully connected layers. This model led us to an accuracy above 90%.

On Metarecommendr, this sentiment analysis is showcased interactively: a user types in a review and the text is evaluated by our model. Users receive feedback on whether the given score aligns with or diverges from the text. We hope to continue with this aspect of the project to improve accuracy and use it as another pre-processing step for our recommendation system.

Flask App

Since the models were built in Python, a natural choice was to use the Flask framework to implement our web application. The frontend is an interactive application built on top of Bootstrap, AngularJS and Angular Material. On the backend, the app pulls data directly from the aforementioned MySQL database on AWS. Models were exported to Pickle and H5 files, which were stored on AWS S3. When a user visits our application, these files are loaded from S3.

Future Improvements

There are a few improvements that could be made to Metarecommendr, including:

  • Creating a hybrid recommendation system that blends both content and collaborative filtering
  • Adding more filters on the user interface to create an even more customizable user experience
  • Expanding the sentiment analysis model for a more refined rating prediction using NLP (e.g., a 1-10 score)

About Authors

Stefan Heinz

Stefan received his Bachelor's degree in Logistics from Heilbronn University in Germany, including a one year stopover in Hong Kong. He then went on to graduate cum laude from Maastricht University's School of Business and Economics in the...

Yvonne Lau

Yvonne Lau is a recent Yale University graduate with a B.A. degree in Economics and Mathematics. Hailing from Rio de Janeiro, Brazil, she became interested in data science after serving as a Data Analyst for a nonprofit organization,...

Daniel Epstein

Daniel Epstein is a neuroscience PhD candidate at the University of Utah, expecting to graduate in summer 2017. While performing analyses on behavioral and neuroimaging data, he became interested in utilizing data science to understand human behavior and...


Nique Devereaux September 16, 2017
FYI see below for what happens when I try to access your app. Application error An error occurred in the application and your page could not be served. If you are the application owner, check your logs for details.

NYC Data Science Academy

NYC Data Science Academy teaches data science, trains companies and their employees to better profit from data, excels at big data project consulting, and connects trained Data Scientists to our industry.

NYC Data Science Academy is licensed by New York State Education Department.
