Building a Video Game Recommendation System

Posted on Aug 20, 2020

This post walks through how I built a video game recommendation system for my NYC Data Science Academy capstone project. The recommender app is located here.

Building a recommendation system involved many steps, from collecting and preparing the data to building a model and deploying an interactive app. I collected the data from Metacritic, using Scrapy to gather user reviews and game data for every video game on the site. That data was then processed, adding unique identifiers for every game and user, and explored visually using Python. Next, I built a model from the reviews of each game using Doc2Vec, an unsupervised algorithm that generates a vector for each game based on its reviews. These vectors can then be compared to one another to find similar games. The final product is an app that recommends video games based on various inputs from a user. The app is a single-page web application using Flask and jQuery, and was built in cooperation with Andrew Clarry, a friend of mine who is more knowledgeable in web development.

Here is a relatively brief overview of the process to build this recommendation system.

Scraping the Data

This project primarily uses Python, so Scrapy was chosen for the web scraping. Luckily, Metacritic is not a JavaScript-heavy website, and the necessary information could be obtained by crawling through the HTML.

Data was collected on every game, including the title, developer, publisher, release date, and platform (Xbox, PlayStation, etc.). Data was also scraped for every review of every video game, mainly the username, review text, and rating. The resulting dataset included about 17,000 games and over 1,000,000 reviews.

Data Processing

Once the data was collected, it needed to be processed for longer-term storage and prepared for the model. First, I combed through the data to understand what it looked like and get a feel for how many values were missing. Some missing values were easy to fix, but most were missing because the data did not exist on Metacritic to begin with. The most important information, including video game identifiers and review texts, was almost entirely accounted for.
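The two processing steps described here, assigning unique identifiers and counting missing values, can be sketched in plain Python. The records and field names below are invented for illustration and are not the project's actual schema.

```python
from collections import Counter

# Toy scraped records; field names are assumptions for this example.
reviews = [
    {"game": "Game A", "username": "alice", "review": "Great racing game", "rating": 9},
    {"game": "Game B", "username": "bob", "review": None, "rating": 7},
    {"game": "Game A", "username": "bob", "review": "Fun but short", "rating": None},
]


def assign_ids(values):
    """Map each distinct value to an integer ID in first-seen order."""
    ids = {}
    for value in values:
        ids.setdefault(value, len(ids))
    return ids


game_ids = assign_ids(r["game"] for r in reviews)
user_ids = assign_ids(r["username"] for r in reviews)

# Attach the identifiers to every review record.
for r in reviews:
    r["game_id"] = game_ids[r["game"]]
    r["user_id"] = user_ids[r["username"]]

# Count how many records are missing each field.
missing = Counter(
    field for r in reviews for field, value in r.items() if value is None
)
print(game_ids, user_ids, dict(missing))
```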

In preparation for building the model, the reviews were processed to remove unnecessary punctuation and unhelpful, frequently occurring words. Each review was then 'tagged' with its game ID, so that the model could compare games rather than individual reviews.
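A minimal sketch of this cleaning and tagging step is shown below. The stopword list is a tiny made-up sample, and the function names are hypothetical; the actual project may well have used a library stopword list and different rules.

```python
import string

# Small illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "of", "to", "i"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text):
    """Lowercase, strip punctuation, and drop stopwords; returns a token list."""
    words = text.lower().translate(PUNCT_TABLE).split()
    return [w for w in words if w not in STOPWORDS]


def tag_review(text, game_id):
    """Pair the cleaned tokens with the ID of the game being reviewed."""
    return clean_review(text), [game_id]


tokens, tags = tag_review("This is a great, fast-paced racing game!", 42)
print(tokens, tags)
```

Note that deleting punctuation outright joins hyphenated words ("fast-paced" becomes "fastpaced"); replacing punctuation with spaces is an equally reasonable choice.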

Building the Model

Doc2Vec is a pretty straightforward model to train. The important part is to tag each review with the game ID. The model then builds a vector for each tag, based on the review text for that tag. Building the model takes two steps. First, the 'build_vocab' function builds the vocabulary from the review corpus, which is what allows searching for game recommendations by keyword, so long as the keyword appears in the corpus. Then the model is trained on the reviews.

The trained model can then be used for simple querying to return the most relevant games. I tried two kinds of search: one based on a game I already like, and one based on a keyword that I'm interested in.
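For intuition about what such a query computes: vectors of this kind are commonly compared with cosine similarity, and the closest vectors are returned first. The sketch below uses hand-picked three-dimensional vectors and made-up game names, not output from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, topn=2):
    """Rank named vectors by cosine similarity to the query, best first."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-picked toy vectors; real Doc2Vec vectors have many more dimensions.
game_vectors = {
    "racing_game": [0.9, 0.1, 0.0],
    "kart_game": [0.8, 0.3, 0.1],
    "fantasy_rpg": [0.0, 0.2, 0.9],
}

query = game_vectors["racing_game"]
candidates = {n: v for n, v in game_vectors.items() if n != "racing_game"}
ranked = most_similar(query, candidates)
print(ranked)
```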

Deploying the Model in a User-Friendly App

Even this very simple querying returns very relevant results, but a better way to deliver the recommender is through a user-friendly app that allows for some filtering.

I teamed up with a friend of mine, Andrew Clarry, to design an app for this purpose. The final result allows a user to refine results by the platform they would like to play on, genres they are specifically interested in, and a game title or keyword to get a result from the model. While platform and genre are optional inputs, the game title or keyword is required to get recommendations. The result is a pretty nifty, lightweight game recommendation app that predicts, with scary accuracy, games that I have on my Steam wish list! A more advanced implementation in the future could account for past user interests and filter out redundant recommendations (such as games previously played). You can check out the results of this project here.
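The filtering behaviour described above, with optional platform and genre filters applied to the model's ranked results, can be sketched as a plain Python function. The function name, catalog layout, and sample data are assumptions for illustration; this is not the app's actual code.

```python
def filter_recommendations(ranked_ids, catalog, platform=None, genres=None, topn=5):
    """Keep ranked games that match the optional platform and genre filters."""
    wanted_genres = set(genres or [])
    results = []
    for game_id in ranked_ids:
        game = catalog[game_id]
        # Skip games not available on the requested platform.
        if platform is not None and platform not in game["platforms"]:
            continue
        # Skip games that share no genre with the requested ones.
        if wanted_genres and not wanted_genres & set(game["genres"]):
            continue
        results.append(game["title"])
        if len(results) == topn:
            break
    return results


# Hypothetical catalog keyed by game ID.
catalog = {
    1: {"title": "Kart Game", "platforms": ["Switch"], "genres": ["Racing"]},
    2: {"title": "Street Racer", "platforms": ["PC", "Xbox"], "genres": ["Racing"]},
    3: {"title": "Fantasy Quest", "platforms": ["PC"], "genres": ["RPG"]},
}

# ranked_ids stands in for the model's output, most similar first.
ranked_ids = [1, 2, 3]
print(filter_recommendations(ranked_ids, catalog, platform="PC"))
print(filter_recommendations(ranked_ids, catalog, platform="PC", genres=["Racing"]))
```

Filtering after ranking keeps the model query independent of the user's filters, at the cost of having to request more candidates than are finally shown.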

About Author

Brenna Botzheim

Brenna Botzheim is an associate EOV Analyst at StormGeo. Brenna holds a Bachelors degree from San Francisco State University where she studied sociology and mathematics. In her spare time, Brenna continues to develop her skills in statistical data...