Building a Video Game Recommendation System
This post walks through how I built a video game recommendation system for my NYC Data Science Academy capstone project. The recommender app is located here.
Building a recommendation system involved many steps, from collecting and preparing the data to building a model and deploying an interactive app. I collected the data from Metacritic, using Scrapy to gather game data and user reviews for every video game on the site. The data was then processed, adding unique identifiers for every game and user, and explored visually using Python. Next, I built a model from the reviews of each game using Doc2Vec, an unsupervised algorithm that generates a vector for each game based on its reviews. These vectors can then be compared to one another to find similar games. The final product is an app that recommends video games based on various inputs from a user. It is a single-page web application built with Flask and jQuery, in cooperation with Andrew Clarry, a friend of mine who is more knowledgeable in web development.
Here is a relatively brief overview of the process to build this recommendation system.
Scraping the Data
This project primarily uses Python, and Scrapy was chosen for the web-scraping task. Luckily, Metacritic is not a JavaScript-heavy website, so the necessary information could be obtained by crawling through the HTML.
Data was collected on every game, including the title, developer, publisher, release date, and platform (Xbox, PlayStation, etc.). Data was also scraped for every review of every game, mainly the username, review text, and rating. The resulting dataset included about 17,000 games and over 1,000,000 reviews.
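To make the shape of the scraped data concrete, here is a minimal sketch of the two record types described above, written as plain Python dataclasses (which recent Scrapy versions also accept as items). The class and field names are my own assumptions for illustration; they are not taken from the project's spider.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class GameRecord:
    """One scraped game page (field names are illustrative)."""
    title: str
    platform: str
    developer: Optional[str] = None
    publisher: Optional[str] = None
    release_date: Optional[str] = None  # kept as raw text until cleaned


@dataclass
class ReviewRecord:
    """One scraped user review, linked to its game by title and platform."""
    game_title: str
    platform: str
    username: str
    text: str
    rating: Optional[int] = None


# Made-up example values, only to show how the records fit together.
game = GameRecord(title="Example Game", platform="PC", developer="Example Studio")
review = ReviewRecord(
    game_title=game.title,
    platform=game.platform,
    username="player_one",
    text="Great atmosphere and tight controls.",
    rating=9,
)

# asdict() gives plain dictionaries that are easy to write out as JSON or CSV rows.
print(asdict(game))
print(asdict(review))
```

Optional fields default to None, which keeps values that are missing on the site distinguishable from empty strings later on.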
Data Processing
Once the data was collected, it needed to be processed for longer-term storage and prepared for the model. First, I combed through the data to understand what it looked like and to get a feel for how many values were missing. Some missing values were easy to fix, but most were caused by the data not existing on Metacritic in the first place. The most important information, including video game identifiers and review texts, was almost entirely accounted for.
In preparation for building the model, the reviews were processed to remove unnecessary punctuation and unhelpful, frequently occurring words. Each review was then 'tagged' with the ID of the game it was written about, so that the model could compare by game.
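As a rough sketch of this step, the snippet below assigns integer identifiers to games and users and counts missing values per field, using only the standard library. The helper names (`assign_ids`, `count_missing`), the dictionary keys, and the sample rows are hypothetical; the project may well have done this with a dataframe library instead.

```python
from collections import Counter


def assign_ids(values):
    """Map each distinct value to an integer ID, in first-seen order."""
    ids = {}
    for value in values:
        if value not in ids:
            ids[value] = len(ids)
    return ids


def count_missing(rows, fields):
    """Count rows where a field is absent, None, or an empty string."""
    missing = Counter()
    for row in rows:
        for field in fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
    return dict(missing)


# Invented sample rows standing in for scraped reviews.
reviews = [
    {"game": "Example Game", "platform": "PC", "username": "player_one", "text": "Loved it", "rating": 9},
    {"game": "Example Game", "platform": "PC", "username": "player_two", "text": "", "rating": 6},
    {"game": "Other Game", "platform": "Switch", "username": "player_one", "text": "Too short", "rating": None},
]

# A game is identified by title plus platform, since one title can ship on several platforms.
game_ids = assign_ids((r["game"], r["platform"]) for r in reviews)
user_ids = assign_ids(r["username"] for r in reviews)

for r in reviews:
    r["game_id"] = game_ids[(r["game"], r["platform"])]
    r["user_id"] = user_ids[r["username"]]

missing = count_missing(reviews, ["text", "rating"])
print(game_ids)
print(missing)
```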
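The cleaning and tagging described here could look roughly like the following. This is a sketch under assumptions: the stop-word list is a tiny placeholder, the game IDs and review texts are invented, and the (tokens, tags) pairs stand in for whatever tagged-document type the modelling library expects (in gensim, that would be `TaggedDocument`).

```python
import re

# A tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "of", "to", "i", "in", "was"}


def tokenize(text):
    """Lowercase, split on anything that is not a letter or digit, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tag_reviews(reviews):
    """Pair each review's tokens with its game ID as (tokens, [game_id])."""
    tagged = []
    for game_id, text in reviews:
        tokens = tokenize(text)
        if tokens:  # skip reviews that are empty after cleaning
            tagged.append((tokens, [game_id]))
    return tagged


# Invented (game_id, review_text) pairs.
reviews = [
    ("game_1", "This is a tense, atmospheric horror game!"),
    ("game_1", "The puzzles... brilliant."),
    ("game_2", "It is."),
]

tagged = tag_reviews(reviews)
for tokens, tags in tagged:
    print(tags, tokens)
```

Because every review of a game carries the same tag, all of that game's reviews contribute to a single vector once the model is trained.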
Building the Model
Doc2Vec is a pretty straightforward model to train. The important part is to tag each review with its game ID; the model then builds a vector for each tag, based on the review text for that tag. Below is the code for building the model. The 'build_vocab' function builds the vocabulary from the tagged reviews, which is also what makes it possible to search for game recommendations by keyword, so long as the keyword is within the text corpus. Then the model is trained on the reviews:

As you can see below, the trained model can be used for simple querying to return the most relevant games.
A search based on a game I already like:

A search based on a keyword that I'm interested in:

Deploying the Model in a User-Friendly App
The queries above are very simple and already return relevant results, but a better way to expose this recommender is through a user-friendly app that allows for some filtering.
I teamed up with a friend of mine, Andrew Clarry, to design an app for this purpose. The final result lets a user refine results by the platform they would like to play on and the genres they are specifically interested in, and takes a game title or keyword to get a result from the model. While platform and genre are optional inputs, the game title or keyword is required to get recommendations. The result is a pretty nifty, lightweight game recommendation app that, somewhat scarily, predicts games I have on my Steam wish list! A more advanced implementation in the future could account for past user interests and filter out redundant recommendations (such as games previously played). You can check out the results of this project here.
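The filtering logic described above can be sketched in a few lines. This is only an illustration under assumptions: the `recommend` function, the catalogue layout, and the three-dimensional vectors are made up, and the real app gets its query vector from the trained model (via a game title or keyword) rather than from a hand-written list.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(query_vec, games, platform=None, genres=None, topn=3):
    """Rank games by similarity to query_vec, applying the optional filters."""
    wanted = set(genres or [])
    results = []
    for game in games:
        if platform and game["platform"] != platform:
            continue  # wrong platform
        if wanted and not wanted & set(game["genres"]):
            continue  # shares none of the requested genres
        results.append((game["title"], cosine(query_vec, game["vector"])))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results[:topn]


# Hypothetical catalogue with made-up vectors.
games = [
    {"title": "Haunted Manor", "platform": "PC", "genres": ["horror"], "vector": [0.9, 0.1, 0.0]},
    {"title": "Night Shift", "platform": "PS4", "genres": ["horror", "adventure"], "vector": [0.8, 0.2, 0.1]},
    {"title": "Turbo Circuit", "platform": "PC", "genres": ["racing"], "vector": [0.0, 0.1, 0.9]},
]

query = [1.0, 0.0, 0.0]  # stands in for the vector of a game title or keyword
print(recommend(query, games))
print(recommend(query, games, platform="PC", genres=["horror"]))
```

Filtering before ranking keeps the optional inputs truly optional: with no platform or genre given, every game is scored, and the required title or keyword only determines the query vector.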