Metarecommendr is a recommendation system for video games, TV shows, and movies created by Yvonne Lau, Stefan Heinz, and Daniel Epstein. It uses word-embedding neural networks, sentiment analysis, and collaborative filtering to deliver suggestions that match your preferences. It was our capstone project, delivered at the end of the NYCDSA Data Science Bootcamp program.
You can take a look at our app here. Please keep in mind that for the time being only a scaled-down version of our models is running online due to memory restrictions. Only "Content-based" is functional at this time. The code is online on GitHub.
Finding a piece of media today can be difficult. So many games, movies, and TV shows come out every week that it is hard to keep up. It can take hours to look through blogs, videos, and reviews to determine whether a new piece of media is something you will like. Finding a game from the past that you are sure you will like is even harder. Websites like metacritic.com attempt to simplify this process by aggregating reviews. However, there are still some major flaws, including:
- Product suggestions are generally obvious and tied to the title of a product (e.g. if you like Super Mario 64, then you will get inundated with other Mario games)
- User interface is too crowded with ancillary and unnecessary information
- The text of reviews does not always match up with the scores associated with them
Hence, for our capstone project, we decided to address these issues by creating an application to improve your search for your next game (and even let you find movies and TV shows if you wish!). Metarecommendr is a web application that combines a sleek and intuitive user interface with the powers of content-filtering and collaborative-filtering in order to deliver the best recommendation for you.
Metarecommendr was designed and built in the span of 2 weeks. The project workflow is summarized below:
To collect all the data and reviews about our items (games, movies, and TV shows), we used the Python web scraping framework Scrapy. In total we implemented 12 spiders: one for each item list, one for the summary and details of each specific item, and one each for the critic and user reviews of each item. While some spiders finished quickly, the longest one, which scraped game reviews, took 10 days to finish.
Because we expected a rather large amount of data, we decided to scrape directly into a database instead of using text files. A preliminary version of our database used SQLite, a self-contained SQL database engine, which was set up within minutes. After the scraping was finished, we exported the data to a MySQL database running on the Amazon Web Services (AWS) Relational Database Service (RDS). To avoid inserting 584 MB of scraped data from a local machine into a remote database, we uploaded all our data to AWS Simple Storage Service (S3) and implemented an AWS Data Pipeline to stream directly from S3 to RDS via an AWS Elastic Compute Cloud (EC2) instance. This reduced the migration time by a factor of 7. Our final app was then ready to read the data directly from the MySQL database.
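Scraping straight into SQLite can be sketched as a Scrapy-style item pipeline. This is a minimal illustration rather than the pipeline we actually ran: the class name, the table name `reviews`, and its columns are all hypothetical, and the only interface it relies on is the `open_spider`, `process_item`, and `close_spider` hooks that Scrapy calls on a pipeline object.

```python
import sqlite3


class SQLitePipeline:
    """Scrapy-style item pipeline that writes each scraped review to SQLite.

    The table and column names are hypothetical placeholders.
    """

    def __init__(self, db_path="reviews.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS reviews ("
            "item_title TEXT, reviewer TEXT, score INTEGER, body TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item: insert it with a parameterized query.
        self.conn.execute(
            "INSERT INTO reviews (item_title, reviewer, score, body) "
            "VALUES (?, ?, ?, ?)",
            (item["item_title"], item["reviewer"], item["score"], item["body"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

In a Scrapy project, a class like this would be registered under `ITEM_PIPELINES` in the settings file. Committing after every item keeps the sketch simple; batching commits would likely be faster on a crawl that runs for days.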
Exploratory Data Analysis
One of the reasons we opted to implement both content-based and collaborative recommendations was the distribution of ratings found in our dataset. In total there were roughly a million reviews, half from critics and half from users. For both critic and user review scores, the distribution of ratings was negatively skewed. Hence, relying solely on ratings (for collaborative filtering) would not offer enough granularity to produce sensible recommendations, as most products are perceived positively.
In terms of observations scraped from metacritic.com we ended up with:
Interestingly, in our early exploration of the dataset, we found that the number of reviews was not necessarily indicative of the quality of a product. Infestation: Survivor Stories (The War Z) is among the most reviewed items and yet it has very poor average critic and user scores. This makes some intuitive sense. Games that skew either very positive or very negative create more discussion. Extremely bad games can be fun to talk about with others, similar to how bad movies can live on as cult favorites. Mediocre games, where there isn’t much to say, tend to generate less discussion and therefore fewer reviews.
There are two main types of recommendation algorithms: content filtering and collaborative filtering.
- Content filtering: makes recommendations based on a product’s metadata. A classic example is how Pandora works.
- Collaborative filtering: takes into account users’ behaviors and interactions with items. It can be further subdivided into two kinds:
  - User-based: recommendations are items liked by users who are similar to you. A classic example is how Spotify works.
  - Item-based: recommendations are made according to an item-item similarity metric that is based on ratings from users. An example is how Amazon works.
a) Content Filtering
Since a big portion of the dataset was composed of text data from reviews, the chosen approach for feature engineering on content-based recommendations was Doc2Vec. This is an unsupervised algorithm that generates vectors for documents. It is an extension of the Word2Vec algorithm, where a document (instead of a word) is turned into a vector representation. A Python implementation can be found in the Gensim library.
Doc2Vec is able to learn semantic similarities among words, making it more sophisticated than tf-idf. An example output of our model trained on critic reviews shows that it learned quite well which words are similar to “Excellent”. Pretty good job!
For Metarecommendr, two Doc2Vec models were trained separately on summaries and critic reviews. We opted not to use user reviews since they did not contain enough descriptive words to yield a meaningful recommendation. On the user interface, a user selects a product they like. Products are then recommended according to a cosine similarity metric. The closer the value is to 1, the more similar two vectors (products) are.
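The ranking step can be illustrated with a small NumPy sketch. It assumes each product already has a document vector (in our case these would come from the trained Doc2Vec models); the function name `most_similar`, the titles, and the 3-dimensional vectors below are hypothetical stand-ins.

```python
import numpy as np


def most_similar(query_title, titles, vectors, top_n=3):
    """Rank titles by cosine similarity to the query title's vector."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalize each row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[titles.index(query_title)]
    scores = unit @ query
    # Sort from most to least similar and skip the query item itself.
    order = np.argsort(-scores)
    ranked = [(titles[i], float(scores[i])) for i in order if titles[i] != query_title]
    return ranked[:top_n]


# Hypothetical 3-dimensional document vectors; real Doc2Vec vectors are longer.
titles = ["Game A", "Game B", "Movie C"]
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(most_similar("Game A", titles, vectors, top_n=2))
```

Here "Game B" ranks first because its vector points in almost the same direction as "Game A", while "Movie C" is orthogonal to it and scores 0.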
b) Collaborative Filtering
i) SVD - Singular Value Decomposition
A major challenge in implementing collaborative filtering on this particular dataset was the high dimensionality and sparsity of the user-item matrix. There were around 27,500 products and 63,000 users, with an average of fewer than 3 reviews per user. To reduce the dimensionality of the user-item matrix, we implemented truncated Singular Value Decomposition (SVD).
Consider a user-item matrix $A$ where $a_{ij}$ represents the rating from user $i$ for product $j$. SVD states that every $n \times p$ matrix $A$ can be factorized as:
$$A = U S V^T$$
where $U$ ($n \times n$) and $V$ ($p \times p$) are orthogonal matrices and $S$ is an $n \times p$ diagonal matrix with the singular values of $A$ along the diagonal. As $S$ is a diagonal matrix, we can obtain a more compact representation through SVD. Truncated SVD takes this approach one step further by keeping only the $k$ largest singular values in $S$ instead of all of them. Under this approach, we compute the rank-$k$ approximation $A'$ of $A$ that minimizes the Frobenius norm of the error:
$$A' = U_k S_k V_k^T = \arg\min_{\operatorname{rank}(B) \le k} \|A - B\|_F$$
For Metarecommendr, the dataset was split into training and test sets, and $k$ was chosen to be 13 according to Cattell’s scree test.
Once we obtain the rank-$k$ matrix $A'$, we can make recommendations according to the entries in the matrix. In the context of our dataset, $A'$ corresponds to a matrix of predicted user ratings where $a'_{ij}$ is the predicted rating from user $i$ for item $j$. Compared to a baseline where every user rating is simply predicted to be the average user rating (RMSE = 7.50), truncated SVD reduces the error on predicted user ratings by 19% (RMSE = 6.07).
To sum up, for collaborative filtering with SVD, a user inputs and ranks a few items. A user-item matrix is then generated and decomposed by SVD. For a given user $i$, this approach gives us a predicted rating for each item, and we recommend the items with the highest predicted ratings.
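The procedure above can be sketched with NumPy on a toy matrix. This is an illustration under simplifying assumptions, not our production code: the 4x4 ratings are made up, the matrix is dense (the real one is extremely sparse), and the error is measured in-sample rather than on a held-out test set. The helper names `truncated_svd_predict` and `rmse` are hypothetical.

```python
import numpy as np


def truncated_svd_predict(ratings, k):
    """Return the rank-k approximation of a user-item rating matrix."""
    # full_matrices=False gives the compact factorization A = U S V^T.
    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


def rmse(predicted, actual):
    """Root mean squared error between two matrices of the same shape."""
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Hypothetical dense toy matrix (rows: users, columns: items).
A = np.array([
    [90.0, 80.0, 20.0, 10.0],
    [85.0, 75.0, 25.0, 15.0],
    [20.0, 10.0, 90.0, 80.0],
    [15.0, 25.0, 85.0, 95.0],
])
A_k = truncated_svd_predict(A, k=2)
baseline = np.full_like(A, A.mean())  # predict the global average everywhere
print("baseline RMSE:", rmse(baseline, A))
print("rank-2 RMSE:  ", rmse(A_k, A))
# Recommend to user 0 the item with the highest predicted rating.
print("top item for user 0:", int(np.argmax(A_k[0])))
```

On a sparse matrix the missing entries have to be filled or handled before decomposition; that step is left out here because the text does not say how it was done.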
ii) Pearson's Correlation
To better understand the relationship between item review scores, we compared items against each other using a modified Pearson’s correlation formula. To help scale down this correlation matrix, item pairs with fewer than 3 overlapping reviews were disregarded and given a score of 0, meaning no correlation.
This item-item matrix approach also allowed us to make cross-category recommendations, since the algorithm was no longer bound to an item’s metadata (as it is in content filtering). On the user interface, a user has the option to select a product they like, and they receive the products with the highest correlation to it.
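A thresholded item-item correlation can be sketched as follows. The sketch uses the standard Pearson formula over users who rated both items, because the exact modification we applied is not spelled out here; the function name `item_correlation` and the sample scores are hypothetical.

```python
from math import sqrt


def item_correlation(ratings_a, ratings_b, min_overlap=3):
    """Pearson correlation between two items over the users who rated both.

    ratings_a and ratings_b map user id -> score. Pairs with fewer than
    min_overlap shared reviewers get 0.
    """
    shared = sorted(set(ratings_a) & set(ratings_b))
    if len(shared) < min_overlap:
        return 0.0
    xs = [ratings_a[u] for u in shared]
    ys = [ratings_b[u] for u in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one item has constant scores.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical scores keyed by user id.
game = {"u1": 90, "u2": 60, "u3": 30, "u4": 75}
movie = {"u1": 80, "u2": 50, "u3": 20}
show = {"u1": 40, "u9": 95}
print(item_correlation(game, movie))  # three shared users, perfectly linear
print(item_correlation(game, show))   # one shared user, so 0.0
```

Because the inputs are keyed by user rather than by category, the same function compares a game with a movie as easily as two games, which is what makes cross-category recommendations possible.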
c) Sentiment Analysis
As mentioned in the introduction, a major problem with Metacritic’s dataset was that the score of a review did not necessarily match the sentiment of its text. To address this issue, we performed sentiment analysis on the critic reviews. Positive and negative were defined as follows: reviews with scores of 55 and below were classified as negative, and those with scores of 85 and above were classified as positive. Reviews with scores between these values were not used for sentiment analysis.
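The labeling rule can be written as a small function. This is a minimal sketch assuming critic scores on a 0-100 scale; the function name `sentiment_label` and the sample reviews are hypothetical.

```python
def sentiment_label(score, negative_max=55, positive_min=85):
    """Map a 0-100 critic score to a training label, or None if excluded."""
    if score <= negative_max:
        return "negative"
    if score >= positive_min:
        return "positive"
    return None  # mid-range scores are left out of the training set


# Hypothetical (score, text) pairs.
reviews = [(95, "A masterpiece."), (70, "Fine, mostly."), (40, "A buggy mess.")]
labeled = [
    (text, sentiment_label(score))
    for score, text in reviews
    if sentiment_label(score) is not None
]
print(labeled)
```

Dropping the middle band leaves the classifier with clearly positive and clearly negative examples only, at the cost of discarding every review scored between 56 and 84.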
The sentiment analysis used Doc2Vec vectors as features. We tried several machine learning models, including logistic regression, Naive Bayes, SVM, and different types of neural networks. The performance of each model is listed below:
- Logistic regression: 75% accuracy
- SVM: ~65% accuracy
- Naive Bayes: ~65% accuracy
- Long short-term memory (LSTM) recurrent neural networks (RNN) [a known method for NLP, good for assessing sequential data]: ~75% accuracy
- Convolutional neural networks (CNN) [commonly used in image processing, but also in NLP tasks]: ~88% accuracy
In the end, the best model was a CNN with an added RNN component, with the following architecture: 2 convolution and pooling layers, 2 recurrent LSTM layers, and 3 dense, fully connected layers. This model reached an accuracy above 90%.
On Metarecommendr, this sentiment analysis is showcased interactively: a user types in a review and the text is evaluated by our model. Users then receive feedback on whether the score they gave aligns with or diverges from the text. We hope to continue with this aspect of the project to improve accuracy and use it as another pre-processing step for our recommendation system.
Since the models were built in Python, a natural choice was to use the Flask framework to implement our web application. The frontend is an interactive application built on top of Bootstrap, AngularJS, and Angular Material. On the backend, the app pulls data directly from the aforementioned MySQL database on AWS. Models were exported to Pickle and H5 files, which are stored on AWS S3. When a user visits our application, these files are loaded from S3.
There are a few improvements that could be made to metarecommendr, including:
- Creating a hybrid recommendation system that blends both content and collaborative filtering.
- Adding more filters on the user interface to create an even more customizable user experience.
- Expanding the sentiment analysis model for a more refined rating prediction using NLP (e.g. a 1-10 score).