Building a Video Game Recommendation System

Brenna Botzheim
Posted on Aug 20, 2020

This post walks through how I created a video game recommendation system for my NYC Data Science Academy Capstone Project. The recommender app is located here.

Building a recommendation system involved many steps, from collecting and preparing the data to building a model and deploying an interactive app. I collected the data from Metacritic, using Scrapy to gather user reviews and game data for every video game on the site. That data was then processed, adding unique identifiers for every game and user, and explored visually using Python. Then I built a model from the reviews of each game using Doc2Vec, an unsupervised algorithm that generated a vector for each game based on its reviews. These vectors could then be compared to one another to find similar games. The final product of this project is an app that recommends video games based on various inputs from a user. The app is a single-page web application built with Flask and jQuery, in cooperation with Andrew Clarry, a friend of mine who is more knowledgeable in the field of web development.

Here is a relatively brief overview of the process to build this recommendation system.

Scraping the Data

This project primarily uses Python, and Scrapy was chosen for the web-scraping task. Luckily, Metacritic is not a JavaScript-heavy website, so the necessary information could easily be obtained by crawling through the HTML.

Data was collected on every game, including the title, developer, publisher, release date, and platform (such as Xbox, PlayStation, etc.). Data was also scraped for every review of every game, mainly the username, review text, and rating. The resulting data included about 17,000 games and over 1,000,000 reviews.
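To illustrate what pulling review fields out of static HTML involves, here is a minimal sketch. It uses Python's standard-library `html.parser` rather than Scrapy so that it runs without extra dependencies, and the markup and class names (`review`, `name`, `grade`, `review_body`) are hypothetical stand-ins, not Metacritic's actual page structure.

```python
from html.parser import HTMLParser

# Hypothetical markup: the tag and class names are illustrative only.
SAMPLE_HTML = """
<ul>
  <li class="review">
    <span class="name">player_one</span>
    <span class="grade">9</span>
    <span class="review_body">Great combat and a huge open world.</span>
  </li>
  <li class="review">
    <span class="name">casual_fan</span>
    <span class="grade">6</span>
    <span class="review_body">Fun at first, but the story drags.</span>
  </li>
</ul>
"""

# Map each (hypothetical) CSS class to the field name we want to store.
FIELDS = {"name": "username", "grade": "rating", "review_body": "text"}


class ReviewParser(HTMLParser):
    """Collect one dict per <li class="review"> element."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._current_field = None  # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "review":
            self.reviews.append({})  # start a new review record
        elif tag == "span" and css_class in FIELDS and self.reviews:
            self._current_field = FIELDS[css_class]

    def handle_data(self, data):
        # Only keep text that sits inside one of the tracked spans.
        if self._current_field:
            self.reviews[-1][self._current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._current_field = None


parser = ReviewParser()
parser.feed(SAMPLE_HTML)
reviews = parser.reviews
```

In a Scrapy spider, the same extraction would typically be a handful of CSS or XPath selectors inside the spider's `parse` method, with Scrapy handling the crawling from page to page.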

Data Processing

Once the data was collected, it needed to be processed for longer-term storage and prepared for the model. First, the data was combed through to understand what it looked like and to get a feel for how many missing values existed. Some missing values were easily fixable, but most were caused by the data not existing on Metacritic to begin with. The most important information, including video game identifiers and review texts, was almost entirely accounted for.
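A rough sketch of these two processing steps, counting missing values per field and assigning unique identifiers, might look like the following. The sample rows are made up, and identifying a game by its title and platform together is an assumption made for illustration; the post does not say how the identifiers were defined.

```python
from collections import Counter

# Made-up rows standing in for scraped game records.
games = [
    {"title": "Game A", "platform": "PC", "developer": "Studio X", "release_date": "2018-03-01"},
    {"title": "Game A", "platform": "Switch", "developer": "Studio X", "release_date": None},
    {"title": "Game B", "platform": "PC", "developer": "", "release_date": "2019-11-15"},
]


def count_missing(rows):
    """Count, per field, how many rows have no usable value."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or str(value).strip() == "":
                missing[field] += 1
    return dict(missing)


def assign_ids(rows, key_fields, id_field):
    """Give each distinct combination of key_fields a sequential integer ID."""
    ids = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        # setdefault returns the existing ID, or stores the next free one.
        row[id_field] = ids.setdefault(key, len(ids))
    return ids


missing_counts = count_missing(games)
# Assumption: a game is identified by title plus platform.
game_ids = assign_ids(games, key_fields=("title", "platform"), id_field="game_id")
```

The same `assign_ids` helper could be reused for users by keying on the username field of the review records.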

In preparation for building the model, the reviews were processed to remove unnecessary punctuation and unhelpful, frequently occurring words. Each review was then 'tagged' with the ID of the game it was written for, so that the model could compare reviews by game.
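The cleaning and tagging described above can be sketched as follows. The stop-word list is a tiny illustrative one, the sample reviews are made up, and `TaggedReview` is a hypothetical stand-in that mirrors the (words, tags) shape Doc2Vec implementations such as gensim expect.

```python
import re
from collections import namedtuple

# Hypothetical container mirroring the (words, tags) pair used by Doc2Vec.
TaggedReview = namedtuple("TaggedReview", ["words", "tags"])

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "this", "of", "to", "i", "but"}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [word for word in text.split() if word not in STOP_WORDS]


def tag_reviews(reviews):
    """Pair each cleaned review with the ID of the game it belongs to."""
    return [TaggedReview(preprocess(r["text"]), [r["game_id"]]) for r in reviews]


# Made-up sample reviews.
reviews = [
    {"game_id": "game_0", "text": "This is a great RPG, and the story is amazing!"},
    {"game_id": "game_1", "text": "Fast-paced shooter; the multiplayer is a blast."},
]
tagged = tag_reviews(reviews)
```

Because every review of a game carries the same tag, the model ends up learning one vector per game rather than one per review.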

Building the Model

Doc2Vec is a fairly straightforward model to train. The important part is to tag each review with the game ID; the model then builds a vector for each tag, based on the review text carrying that tag. Building the model takes two steps. First, the 'build_vocab' method builds the vocabulary from the review text, which is what lets a user search for game recommendations by keyword, so long as the keyword appears in the text corpus. Then the model is trained on the reviews.

The trained model can now be used for simple queries that return the most relevant games.
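Under the hood, a query like this ranks games by how close their vectors are to a query vector. Cosine similarity is a common measure for this, and it is what gensim's `most_similar` uses. The sketch below shows the idea in plain Python; the three-dimensional vectors are made up for illustration, whereas real Doc2Vec vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=2, exclude=None):
    """Return the topn (name, score) pairs ranked by similarity to query."""
    scores = [
        (name, cosine_similarity(query, vec))
        for name, vec in vectors.items()
        if name != exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: game_0 and game_1 point in similar directions.
game_vectors = {
    "game_0": [0.9, 0.1, 0.0],
    "game_1": [0.8, 0.3, 0.1],
    "game_2": [0.0, 0.2, 0.9],
}

# Rank the other games against game_0, leaving game_0 itself out.
results = most_similar(game_vectors["game_0"], game_vectors, topn=2, exclude="game_0")
```

A keyword search works the same way, except that the query vector is the keyword's word vector instead of a game's vector.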

A search based on a game I already like:

A search based on a keyword that I'm interested in:

Deploying the Model in a User-Friendly App

The simple queries above already return very relevant results, but a better way to implement this recommender would be through a user-friendly app that allows for some filtering.

I teamed up with a friend of mine, Andrew Clarry, to design an app for this purpose. The final result allows a user to refine results by the platform they would like to play on and the genres they are specifically interested in, and to enter a game title or keyword to get a result from the model. While platform and genre are optional inputs, the game title or keyword is required to get recommendations. The result is a pretty nifty, lightweight game recommendation app that scarily predicts games that I have on my Steam wish list! A more advanced implementation in the future could account for past user interests and filter out redundant recommendations (such as games previously played). You can check out the results of this project here.
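The optional platform and genre filters described above could be applied to the model's ranked output along the following lines. This is only a sketch: the function name, the catalog layout, and the sample games are hypothetical and are not taken from the app's actual code.

```python
def filter_recommendations(ranked_ids, catalog, platform=None, genres=None, limit=5):
    """Keep the best-ranked games that match the optional filters.

    ranked_ids: game IDs ordered from most to least similar (model output).
    catalog:    dict mapping game ID to its metadata.
    """
    wanted_genres = set(genres or [])
    results = []
    for game_id in ranked_ids:
        game = catalog[game_id]
        if platform and game["platform"] != platform:
            continue  # wrong platform
        if wanted_genres and not wanted_genres & set(game["genres"]):
            continue  # shares none of the requested genres
        results.append(game["title"])
        if len(results) == limit:
            break
    return results


# Made-up catalog and ranking for illustration.
catalog = {
    "game_0": {"title": "Game A", "platform": "PC", "genres": ["RPG"]},
    "game_1": {"title": "Game B", "platform": "Switch", "genres": ["RPG", "Action"]},
    "game_2": {"title": "Game C", "platform": "PC", "genres": ["Shooter"]},
}
ranked = ["game_1", "game_0", "game_2"]

pc_only = filter_recommendations(ranked, catalog, platform="PC")
pc_rpgs = filter_recommendations(ranked, catalog, platform="PC", genres=["RPG"])
```

Filtering after ranking keeps the model itself unchanged: the required title or keyword produces the ranking, and the optional inputs only narrow it down.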

About Author

Brenna Botzheim

Brenna Botzheim is an associate EOV Analyst at StormGeo. Brenna holds a Bachelor's degree from San Francisco State University, where she studied sociology and mathematics. In her spare time, Brenna continues to develop her skills in statistical data...