Webscraping the Steam Game Platform

Posted on Dec 3, 2018

What is Steam?

Steam is a digital distribution platform for PC gaming that was created by Valve Software in 2003. It has since grown into the largest platform for selling PC games, with 150 million registered users and 18.5 million concurrent users. If you are a PC gamer, odds are good that you look for games to play on Steam.

Objective

To scrape data associated with all of the games available for purchase through the Steam platform and to use this data to gain insights on the growth of the platform as well as the popularity of types of games.

Description of Data to be Scraped

The data collected for each game includes the title, developer, price, description of the game, total number of reviews, percentage of reviews that are positive, release date, and the category of the game as defined by users. Steam refers to these user-defined categories as tags, and they provide some interesting information about the game.
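The fields listed above could be gathered into one record per game. The following is only a sketch: the class name and field names are my own assumptions rather than names from the original scraper, and a plain dataclass is used so the example runs without Scrapy installed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class SteamGame:
    """One scraped store page. All names here are illustrative assumptions."""

    title: str
    developer: str
    description: str
    release_date: str                             # kept as text, as shown on the page
    original_price: Optional[float] = None        # None when no price was found
    sale_price: Optional[float] = None            # None when the game is not on sale
    total_reviews: int = 0
    percent_positive: Union[int, str] = "N/A"     # "N/A" until enough reviews exist
    tags: List[str] = field(default_factory=list)  # user-defined tags


# Usage with placeholder values (not scraped data):
example = SteamGame(
    title="Example Game",
    developer="Example Studio",
    description="A placeholder description.",
    release_date="Dec 3, 2018",
)
```

Keeping both price fields optional leaves room for games that are on sale as well as games with a single price.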

Challenges Encountered

While scraping, I encountered a few challenges. First, some games are not yet available for sale, and those games were skipped. Second, some games are on sale, so there are two prices associated with the game: the original price and the current sale price. Both of these prices were scraped. Finally, some games have fewer than ten reviews, so there is not yet a percent positive number for the game. For those games I just set the percent positive to N/A.
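The two-price and too-few-reviews cases can be handled with a couple of small helpers. This is a sketch under assumptions: the function names and the shape of the input text are hypothetical, and only the ten-review threshold and the N/A fallback come from the description above.

```python
import re
from typing import Optional, Tuple, Union

MIN_REVIEWS_FOR_SCORE = 10  # threshold described in the post


def parse_price(text: Optional[str]) -> Optional[float]:
    """Turn text such as '$5.99' into 5.99; return None if no number is present."""
    if not text:
        return None
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def extract_prices(
    regular_text: Optional[str],
    original_text: Optional[str],
    sale_text: Optional[str],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (original_price, sale_price); sale_price is None when not discounted."""
    if sale_text:
        return parse_price(original_text), parse_price(sale_text)
    return parse_price(regular_text), None


def percent_positive(total_reviews: int, percent_text: Optional[str]) -> Union[int, str]:
    """Return the percent positive as an int, or 'N/A' when it is unavailable."""
    if total_reviews < MIN_REVIEWS_FOR_SCORE or not percent_text:
        return "N/A"
    match = re.search(r"(\d+)\s*%", percent_text)
    return int(match.group(1)) if match else "N/A"
```

Each helper returns a neutral value instead of raising, so one odd page does not stop the whole crawl.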

Biggest Challenge

Because some games feature mature content, Steam employs an age check popup that uses JavaScript to hide the desired elements of the page until the age of the visitor is verified. Scrapy is not able to access these elements directly, so I had to find a way to have Scrapy interact with the JavaScript elements. One solution was to use Selenium to scrape the site, but I didn’t encounter this page until after I had written a large part of my scraper in Scrapy. Rewriting all of my code in Selenium was a last resort. Instead I learned how to implement scrapy-splash.

Scrapy-splash works by sending each request through Splash, a lightweight scriptable browser that renders the page, and then passing the rendered response back to Scrapy for scraping. Scrapy-splash allows dynamic elements, such as the age check inputs, to be manipulated using JavaScript, so it’s possible to input an age and click the “View Page” button. Unfortunately for me, Splash interacts with the page by running scripts written in Lua. As I had never encountered Lua before, and I had limited experience with JavaScript, I had to get through a very steep learning curve. But once I figured out how to implement the scrapy-splash script, I was able to scrape the desired data from the games.
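To give a rough idea of what such a script looks like, here is a minimal sketch, not the script actually used for this project. The CSS selectors, the birth year, and the helper name build_splash_args are all assumptions made for illustration; the real selectors have to be read from the age check page itself.

```python
# Lua script for Splash's "execute" endpoint. Selectors arrive through `args`
# because the real ones are not given in this post (they are assumptions).
AGE_CHECK_LUA = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))

  -- Small JavaScript helper: set a form field if it exists on the page.
  local set_value = splash:jsfunc([[
    function (selector, value) {
      var el = document.querySelector(selector);
      if (el) { el.value = value; }
      return el !== null;
    }
  ]])

  -- Only interact with the age check when its year field is present.
  if set_value(args.year_selector, args.birth_year) then
    local button = splash:select(args.button_selector)
    if button then
      button:mouse_click()
      assert(splash:wait(args.wait))
    end
  end

  return {html = splash:html()}
end
"""


def build_splash_args(
    birth_year="1980",                          # any adult birth year (assumption)
    year_selector="#ageYear",                   # hypothetical selector
    button_selector="#view_product_page_btn",   # hypothetical selector
    wait=1.0,
):
    """Build the args dict to pass to scrapy-splash along with the Lua script."""
    return {
        "lua_source": AGE_CHECK_LUA,
        "birth_year": birth_year,
        "year_selector": year_selector,
        "button_selector": button_selector,
        "wait": wait,
    }


# Inside a spider, the request would then look roughly like this:
#   from scrapy_splash import SplashRequest
#   yield SplashRequest(url, callback=self.parse_game,
#                       endpoint="execute", args=build_splash_args())
```

Passing the selectors in as arguments keeps the Lua script generic, so a change to the page markup only requires editing the Python side.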

Price Analysis

 

The average price for a game available on Steam is $8.42, and the median price is $5.99. Interestingly, Crisis Action VR is the most expensive game on the platform at $199. The price is a bit of a mystery: the game itself doesn’t seem out of the ordinary, it was priced lower on release, and the reviews only add to the confusion about why the price is now so high.

Grouped by release year, the average price of a game has fluctuated around $8 since the platform became available. One very interesting thing is that there has been a dramatic increase in the number of games available since 2014. This appears to be a response to Steam introducing Early Access games, which allow small developers to release games in an unfinished state so that they can raise money for development as well as get early feedback on the game.
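The summary numbers in this section boil down to a mean, a median, and a group-by on release year. The sketch below shows one way to compute them with the standard library; the function names and the (year, price) input shape are assumptions, and the sample values are toy numbers, not the scraped data.

```python
from collections import defaultdict
from statistics import mean, median


def overall_price_stats(prices):
    """Return (mean, median) of the prices, ignoring missing values."""
    known = [p for p in prices if p is not None]
    return mean(known), median(known)


def yearly_price_summary(games):
    """Map release year -> (number of priced games, average price).

    `games` is an iterable of (release_year, price) pairs; games without a
    price are left out of both the count and the average.
    """
    by_year = defaultdict(list)
    for year, price in games:
        if price is not None:
            by_year[year].append(price)
    return {
        year: (len(prices), mean(prices))
        for year, prices in sorted(by_year.items())
    }


# Toy input for illustration only.
sample = [(2013, 10.0), (2014, 6.0), (2014, 10.0), (2015, None)]
summary = yearly_price_summary(sample)
```

Plotting the count and the average from this summary against the year gives the two trends discussed above.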

Category Analysis

Since 2014 each game available for purchase through Steam has a number of categories that can be used to describe the game. These categories are called tags, and the users of Steam can pick them based on their experience with the game. Steam allows a total of 20 tags to be associated with each game, and there are a total of 354 distinct tags. Overall the Indie tag has the largest number of games associated with it. Action, Casual, and Adventure are the next three top categories. That Indie is the top category is not very surprising, since Steam gives independent game developers a way to reach a sizeable audience that would otherwise be difficult to reach.

The overlap in tags is also interesting, with Indie and Action being the most common combination on the service.
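Tag overlap can be measured by counting how often each pair of tags appears on the same game. This is a sketch of that idea, assuming each game's tags are available as a list of strings; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def tag_pair_counts(tag_lists):
    """Count how many games carry each unordered pair of tags."""
    pairs = Counter()
    for tags in tag_lists:
        # Sorting makes ("Action", "Indie") and ("Indie", "Action") the same key;
        # set() guards against a tag being listed twice on one game.
        for pair in combinations(sorted(set(tags)), 2):
            pairs[pair] += 1
    return pairs


# Toy input for illustration only.
counts = tag_pair_counts([
    ["Indie", "Action"],
    ["Action", "Indie", "Casual"],
    ["Casual"],
])
most_common_pair = counts.most_common(1)[0]
```

With up to 20 tags per game, each game contributes at most 190 pairs, so this stays cheap even for the full catalog.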

 

2014          2015              2016          2017          2018
Indie         Indie             Indie         Indie         Indie
Adventure     Action            Action        Action        Casual
Casual        Adventure         Casual        Casual        Action
Action        Casual            Adventure     Adventure     Adventure
Singleplayer  Singleplayer      Simulation    Simulation    Simulation
RPG           Strategy          Strategy      Strategy      Strategy
Strategy      RPG               Singleplayer  VR            Early Access
Simulation    Simulation        VR            Early Access  Singleplayer
Puzzle        2D                RPG           Singleplayer  RPG
Free to Play  Great Soundtrack  Early Access  RPG           VR

By analyzing the top ten tags for each year since 2014, it’s possible to see certain types of games rise and sometimes fall in popularity over time. For example, VR rose in popularity in 2016 and 2017, but that popularity appears to have waned in 2018. Early Access games have risen in the rankings every year since 2016, and that rise appears to be continuing this year as well.
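A ranking like this can be produced by counting tags separately for each release year. The sketch below assumes each game is available as a (release_year, tags) pair; the function name and input shape are illustrative, not taken from the original analysis code.

```python
from collections import Counter, defaultdict


def top_tags_by_year(games, n=10):
    """Map release year -> list of the n most common tags, most common first."""
    counts = defaultdict(Counter)
    for year, tags in games:
        counts[year].update(set(tags))  # count each tag once per game
    return {
        year: [tag for tag, _ in counter.most_common(n)]
        for year, counter in sorted(counts.items())
    }


# Toy input for illustration only.
ranking = top_tags_by_year(
    [(2017, ["Indie", "VR"]), (2017, ["Indie"]), (2018, ["Casual"])],
    n=1,
)
```

Comparing the position of a tag such as VR across the yearly lists is what reveals its rise and fall.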

 

Word Cloud Analysis

Each game on Steam comes with a description in the form of a short blurb provided by the game’s developer as a sort of sales pitch to potential buyers. By using these descriptions, it’s possible to do a word cloud analysis to get an idea of how games on Steam are marketed and to see which terms developers use most often. Some terms are what one would expect to find associated with games, like adventure and puzzle, but one term that is surprisingly prominent is VR. This is surprising since even at its highest popularity point in 2017, it was only the 7th most popular tag. My guess as to why it’s so prominent is that there was a lot of buzz around VR games a few years ago, and that developers who support VR want to bring it to the attention of gamers because VR requires additional hardware to work properly.
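Behind a word cloud is a simple word-frequency count over the descriptions. The following is a minimal sketch of that counting step, assuming a small hand-picked stopword list; the function name and the stopword list are my own choices, and a real analysis would likely use a fuller list.

```python
import re
from collections import Counter

# Deliberately short stopword list, for illustration only.
STOPWORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "you", "your",
    "with", "for", "on", "as", "this", "that",
}


def word_frequencies(descriptions, stopwords=STOPWORDS):
    """Count lowercase words across all descriptions, skipping stopwords."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(w for w in words if w not in stopwords and len(w) > 1)
    return counts


# Toy input for illustration only.
freqs = word_frequencies(["A VR puzzle adventure.", "Explore the world in VR!"])
```

The resulting counts can be handed to a plotting library to draw the cloud, with word size proportional to frequency.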

Ideas for Future Development

The Steam platform provides a large amount of data about games, and it’s possible to see many trends using that data. For future development, I would like to scrape the reviews for each game along with the data about the games. The difficulty is that the reviews are only available through an infinitely scrolling page, and some games have over a million reviews. Scrapy-splash will be needed to access the reviews, and it might be worthwhile to consider only the most recent 500 reviews to keep the number to be scraped manageable.
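The 500-review cap could be applied lazily, so that scrolling stops as soon as enough reviews have been collected. This is a sketch of the capping logic only: it assumes some generator (not shown, and hypothetical) yields batches of reviews from the scrolling page, newest first.

```python
from itertools import islice


def most_recent_reviews(review_batches, limit=500):
    """Collect at most `limit` reviews from an iterable of review batches.

    Batches are consumed lazily, so no further batch is requested once the
    limit has been reached.
    """
    flattened = (review for batch in review_batches for review in batch)
    return list(islice(flattened, limit))


# Toy stand-in for the scrolling page: ten batches of 100 numbered reviews.
fetched = []


def fake_batches():
    for i in range(10):
        fetched.append(i)  # records which batches were actually requested
        yield [i * 100 + j for j in range(100)]


reviews = most_recent_reviews(fake_batches(), limit=250)
```

In this toy run only the first three batches are requested, which is the behavior that keeps a game with a million reviews from dominating the crawl.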

About Author

Sean Justice

Fellow at the NYC Data Science Academy. Previous experience was as a Physical Design Engineer working on chip designs.