Data from the Steam Game Platform Using Web Scraping

Sean Justice

Posted on Dec 3, 2018

What is Steam?

Steam is a digital distribution platform for PC gaming that Valve Software created in 2003. It has grown over the years into the largest PC platform for selling games. According to the available data, it has 150 million registered users and 18.5 million concurrent users. If you are a PC gamer, odds are good that you look for games to play on Steam.

Data Science Objective

The objective is to scrape data associated with all of the games available for purchase through the Steam platform, and to use this data to gain insight into the growth of the platform and the popularity of different types of games.

Description of Data to be Scraped

The data collected for each game includes the title, developer, price, description of the game, total number of reviews, percentage of reviews that are positive, release date, and the category of the game as defined by users. Steam refers to these user-defined categories as tags, and they provide some interesting information about the game.

Data Challenges Encountered

While scraping, I encountered a few challenges. First, some games are not yet available for sale, and those games were skipped. Second, some games are on sale, so there are two prices associated with the game: the original price and the current sale price. Both of these prices were scraped. Finally, some games have fewer than ten reviews, so there is not yet a percent positive number for the game. For those games I set the percent positive to N/A.
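The three cases above can be sketched as a small cleaning function. This is only an illustration, not the original scraper code: the field names (`price`, `sale_price`, `total_reviews`, `percent_positive`) and the helper names are assumptions, and the raw values are assumed to arrive as strings like "$19.99".

```python
from typing import Optional


def parse_price(text: str) -> float:
    """Convert a price string such as '$1,299.99' to a float."""
    return float(text.replace("$", "").replace(",", "").strip())


def clean_game(raw: dict) -> Optional[dict]:
    """Apply the three rules described above to one scraped record.

    The field names used here are hypothetical, not Steam's or the author's.
    """
    # Rule 1: a game that is not yet for sale has no price, so skip it.
    if raw.get("price") is None:
        return None

    # Rule 2: keep both the original price and the sale price (if any).
    original_price = parse_price(raw["price"])
    sale_price = parse_price(raw["sale_price"]) if raw.get("sale_price") else None

    # Rule 3: with fewer than ten reviews there is no percent positive yet.
    total_reviews = raw.get("total_reviews", 0)
    percent_positive = raw.get("percent_positive") if total_reviews >= 10 else "N/A"

    return {
        "title": raw.get("title"),
        "original_price": original_price,
        "sale_price": sale_price,
        "total_reviews": total_reviews,
        "percent_positive": percent_positive,
    }
```

Returning None for skipped games keeps the filtering decision in one place, so the caller can simply drop empty results.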

Biggest Challenge

Because some games feature mature content, Steam employs an age check popup that uses JavaScript to hide the desired elements of the page until the age of the visitor is verified. Scrapy does not execute JavaScript, so it cannot access these elements directly, and I had to find a way to have Scrapy interact with the JavaScript elements. One solution was to use Selenium to scrape the site, but I didn’t encounter this page until after I had written a large amount of my scraper in Scrapy. Rewriting all of my code in Selenium was a last resort. Instead I learned how to implement scrapy-splash.

Scrapy-splash works by using Splash as a lightweight browser to render the requested pages, and the rendered pages are then returned to Scrapy for scraping. Scrapy-splash allows dynamic elements, such as the age check inputs, to be manipulated using JavaScript, so it’s possible to input an age and click the “View Page” button.

Unfortunately for me, scrapy-splash interacts with the page by running scripts using the Lua language. As I had never encountered Lua before, and I had limited experience with javascript, I had to get through a very steep learning curve. But once I figured out how to implement the scrapy-splash script, I was able to scrape the desired data from the games.
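To give a rough idea of what such a script looks like, here is a minimal sketch rather than the script I actually used. The Lua part relies on standard Splash calls (`splash:go`, `splash:wait`, `splash:runjs`, `splash:html`). The CSS selectors, the `pause` and `js_source` argument names, and the helper function are assumptions made for illustration, so check the real page markup before relying on them.

```python
import json

# Lua script run by Splash's "execute" endpoint. It loads the page, runs a
# JavaScript snippet passed in as an argument, waits, and returns the HTML.
AGE_CHECK_LUA = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.pause))
  assert(splash:runjs(args.js_source))
  assert(splash:wait(args.pause))
  return splash:html()
end
"""


def build_age_check_args(birth_year,
                         year_selector="#ageYear",
                         button_selector="#view_product_page_btn",
                         pause=1.0):
    """Build the args dict for a Splash 'execute' request.

    Both selectors are hypothetical placeholders for the age check inputs.
    """
    # json.dumps produces safely quoted JavaScript string literals.
    js_source = (
        "var y = document.querySelector(%s);"
        "if (y) { y.value = %s; }"
        "var b = document.querySelector(%s);"
        "if (b) { b.click(); }"
    ) % (json.dumps(year_selector), json.dumps(str(birth_year)),
         json.dumps(button_selector))
    return {"lua_source": AGE_CHECK_LUA, "js_source": js_source, "pause": pause}


# Inside a spider (requires scrapy-splash and a running Splash instance):
#   from scrapy_splash import SplashRequest
#   yield SplashRequest(url, callback=self.parse_game, endpoint="execute",
#                       args=build_age_check_args(1980))
```

Passing the JavaScript in as a request argument keeps the Lua script generic, so the same script can be reused if the selectors change.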

Data Analysis around Prices

The average price for a game available on Steam is $8.42, and the median price is $5.99. Interestingly, the game Crisis Action VR is the most expensive game on the platform at $199. The price for this game ends up being a bit of a mystery since the game itself doesn’t seem out of the ordinary, and the reviews for the game just add to the confusion about why the price is so high on a game that was priced lower on release.

The average price of a game released in any given year since the platform launched has fluctuated around $8. One very interesting thing is that there has been a dramatic increase in the number of games available since 2014. This appears to be a response to Steam introducing Early Access, which allows small developers to release games in an unfinished state so that they can raise money for development, as well as get early feedback on a game’s development.
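The price figures above come down to a few simple aggregations. A minimal sketch of how they could be computed from the scraped records follows; the `price` and `release_year` field names and the function name are assumptions for illustration, not the original analysis code.

```python
from collections import defaultdict
from statistics import mean, median


def price_summary(games):
    """Overall mean and median price, plus game count and mean price per year.

    Each game is a dict with hypothetical 'price' and 'release_year' keys.
    """
    prices = [g["price"] for g in games]

    # Group prices by release year to see how the catalog grew over time.
    by_year = defaultdict(list)
    for g in games:
        by_year[g["release_year"]].append(g["price"])

    per_year = {
        year: {"count": len(p), "mean_price": round(mean(p), 2)}
        for year, p in sorted(by_year.items())
    }
    return {
        "mean": round(mean(prices), 2),
        "median": median(prices),
        "per_year": per_year,
    }
```

The per-year count is what exposes the jump in releases after 2014, while the per-year mean shows whether prices moved along with it.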

Category Analysis

Since 2014, each game available for purchase through Steam has a number of categories that can be used to describe the game. These categories are called tags, and the users of Steam can pick them based on their experience with the game. Steam allows a total of 20 tags to be associated with each game, and there are a total of 354 distinct tags. Overall, the Indie tag has the largest number of games associated with it.

Action, Casual, and Adventure are the next three top categories. That Indie is the top category is not very surprising, since Steam gives independent game developers a way to reach a sizeable audience that would otherwise be difficult to reach.

The overlap in tags is also interesting, with Indie and Action being the most common pair of tags to appear together on the service.
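Counting tags and their overlap can be sketched as follows. This is an illustration under the assumption that each game's tags are available as a list of strings; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def tag_stats(tag_lists):
    """Count how many games carry each tag and each unordered pair of tags."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in tag_lists:
        # Deduplicate and sort so that each pair always has the same order.
        unique = sorted(set(tags))
        tag_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return tag_counts, pair_counts
```

Calling `pair_counts.most_common(1)` on the result gives the pair of tags that appears together on the most games.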

2014	2015	2016	2017	2018
Indie	Indie	Indie	Indie	Indie
Adventure	Action	Action	Action	Casual
Casual	Adventure	Casual	Casual	Action
Action	Casual	Adventure	Adventure	Adventure
Singleplayer	Singleplayer	Simulation	Simulation	Simulation
RPG	Strategy	Strategy	Strategy	Strategy
Strategy	RPG	Singleplayer	VR	Early Access
Simulation	Simulation	VR	Early Access	Singleplayer
Puzzle	2D	RPG	Singleplayer	RPG
Free to Play	Great Soundtrack	Early Access	RPG	VR

By analyzing the top ten tags for each year since 2014, it’s possible to see certain types of games increase and sometimes decrease in popularity over time. For example, VR entered the top ten in 2016 and rose to 7th in 2017, but its popularity appears to have waned in 2018, when it fell to 10th. Early Access games have climbed the rankings every year since 2016, and that increase looks likely to continue.
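A ranking like the one in the table can be built by counting tags per year and then looking up a tag's position in each year's list. The sketch below assumes games are grouped by release year and uses hypothetical field names (`release_year`, `tags`); it is not the original analysis code.

```python
from collections import Counter, defaultdict


def top_tags_by_year(games, n=10):
    """Return the n most common tags for each release year."""
    counts = defaultdict(Counter)
    for g in games:
        # Sort the deduplicated tags so that ties resolve deterministically.
        counts[g["release_year"]].update(sorted(set(g["tags"])))
    return {
        year: [tag for tag, _ in c.most_common(n)]
        for year, c in sorted(counts.items())
    }


def rank_of(tag, ranking):
    """1-based rank of a tag in each year's list, or None if it is absent."""
    return {
        year: tags.index(tag) + 1 if tag in tags else None
        for year, tags in ranking.items()
    }
```

Tracking the rank of a single tag such as VR across years makes its rise and fall easier to read than scanning the full table.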

Word Cloud Analysis

Each game on Steam comes with a description in the form of a short blurb provided by the game’s developer as a sort of sales pitch to potential buyers. By using these descriptions, it’s possible to do a word cloud analysis to get an idea of how games on Steam are marketed and to see which terms developers use most often. Some terms are what one would expect to find associated with games, like adventure and puzzle, but one term that is surprising to see so prominently is VR.

This is surprising since even at its highest popularity point in 2017, VR was only the 7th most popular tag. My guess as to why it’s so prominent is that there was a lot of buzz around VR games a few years ago, and that developers who support VR want to bring it to the attention of gamers because VR requires additional hardware to work properly.
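Underneath a word cloud is a plain word frequency count. A minimal sketch of that counting step is shown here; the stopword list is a small illustrative sample and the function name is an assumption, not the code used for the analysis.

```python
import re
from collections import Counter

# A deliberately short, illustrative stopword list; a real one would be longer.
STOPWORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "you", "your",
    "with", "for", "on", "this", "that", "as",
}


def word_frequencies(descriptions, stopwords=STOPWORDS):
    """Count lowercase words across all descriptions, skipping stopwords."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(w for w in words if w not in stopwords and len(w) > 1)
    return counts
```

The resulting counts can then be handed to a plotting library. For example, the `wordcloud` package can draw from a frequency mapping via `WordCloud().generate_from_frequencies(counts)`.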

Ideas for Future Development

The Steam platform provides a large amount of data about games, and it’s possible to see many trends using that data. One thing that I would like to do for future development is to scrape the reviews for each game along with the data about the games. One thing that would make this difficult is that the reviews are only available using an infinitely scrolling page, and some games have over a million reviews. Scrapy-splash will need to be implemented to access the reviews, and it might be worthwhile to only consider the most recent 500 reviews to keep the number to be scraped down to a manageable amount.


About Author

Sean Justice

Data scientist with a background in computer engineering and over a decade of experience working in a team environment to solve challenging problems. Interested in deep learning, especially computer vision and natural language processing.


