Data from the Steam Game Platform Using Web Scraping
What is Steam?
Steam is a digital distribution platform for PC gaming that was created by Valve Software in 2003. It has grown over the years and is now the largest platform for selling PC games. According to available figures, it has 150 million registered users and 18.5 million concurrent users. If you are a PC gamer, odds are good that you use Steam to find games to play.
Data Science Objective
To scrape data associated with all of the games available for purchase through the Steam platform and to use this data to gain insights on the growth of the platform as well as the popularity of types of games.
Description of Data to be Scraped
The data collected for each game includes the title, developer, price, description, total number of reviews, percentage of reviews that are positive, release date, and the categories of the game as defined by users. Steam refers to these user-defined categories as tags, and they provide some interesting information about the game.
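As a rough illustration of the fields listed above, one scraped game could be held in a record like the following. The class name, field names, and types are assumptions made for this sketch; they are not taken from the original scraper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GameRecord:
    """One scraped game. All names here are illustrative assumptions."""
    title: str
    developer: str
    price: Optional[float]            # current price in USD, None if unknown
    description: str
    total_reviews: int
    percent_positive: Optional[int]   # None when Steam shows no percentage yet
    release_date: str                 # kept as the raw string from the page
    tags: List[str] = field(default_factory=list)  # user-defined tags


# Hypothetical example values, not real scraped data.
example = GameRecord(
    title="Example Game",
    developer="Example Studio",
    price=5.99,
    description="A short blurb from the store page.",
    total_reviews=120,
    percent_positive=87,
    release_date="Mar 3, 2017",
    tags=["Indie", "Action"],
)
```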
Data Challenges Encountered
While scraping, I encountered a few challenges. First, some games are not yet available for sale, and those games were skipped. Second, some games are on sale, so there are two prices associated with the game: the original price and the current sale price. Both of these prices were scraped. Finally, some games have fewer than ten reviews, so there is not yet a percent positive number for them. For those games I set the percent positive to N/A.
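The two-price and low-review cases could be handled along these lines. This is a minimal sketch: the function names, the assumption that prices arrive as a text string containing dollar amounts, and the regular expression are all mine, not the original scraper's.

```python
import re
from typing import Optional, Tuple

# Matches dollar amounts such as "$5.99" or "$199" (assumed input format).
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def parse_prices(price_text: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (original_price, current_price) from a scraped price string.

    One amount means the game is not on sale, so both values are the same.
    Two amounts are read as the original price followed by the sale price.
    """
    amounts = [float(m) for m in PRICE_RE.findall(price_text)]
    if not amounts:
        return None, None
    if len(amounts) == 1:
        return amounts[0], amounts[0]
    return amounts[0], amounts[1]


def percent_positive_or_na(total_reviews: int, scraped_percent: Optional[int]) -> str:
    """Use "N/A" when a game has fewer than ten reviews or no percentage."""
    if total_reviews < 10 or scraped_percent is None:
        return "N/A"
    return f"{scraped_percent}%"
```

Returning both prices even for games that are not on sale keeps every record the same shape, which simplifies later analysis.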
Biggest Challenge
Because some games feature mature content, Steam employs an age check popup that uses JavaScript to hide the desired elements of the page until the age of the visitor is verified. Scrapy is not able to access these elements directly, so I had to find a way to have Scrapy interact with the JavaScript elements. One solution was to use Selenium to scrape the site, but I didn't encounter this page until after I had written a large amount of my scraper in Scrapy. Rewriting all of my code in Selenium was a last resort. Instead I learned how to implement scrapy-splash.
Scrapy-splash works by routing page requests through Splash, a lightweight scriptable browser that renders each page and returns the rendered result to Scrapy for scraping. Scrapy-splash allows dynamic elements, such as the age check inputs, to be manipulated using JavaScript, so it's possible to input an age and click the "View Page" button.
Unfortunately for me, scrapy-splash interacts with the page by running scripts written in Lua. As I had never encountered Lua before, and I had limited experience with JavaScript, I had to get through a very steep learning curve. But once I figured out how to implement the scrapy-splash script, I was able to scrape the desired data from the games.
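A Lua script for the age check might look roughly like the sketch below. It is not the script used for this project and has not been run against Steam: the element selectors (`#ageYear`, `#view_product_page_btn`), the birth year, and the helper name `build_splash_args` are all assumptions for illustration. The Splash calls themselves (`splash:go`, `splash:wait`, `splash:runjs`, `splash:html`) are part of the Splash scripting API.

```python
# Lua source to be run by Splash's "execute" endpoint.
# The selectors inside the JavaScript snippets are hypothetical.
AGE_CHECK_LUA = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  -- Fill in a birth year if the age-check form is present (assumed selector).
  splash:runjs([[
    var year = document.querySelector('#ageYear');
    if (year) { year.value = '1990'; }
  ]])
  -- Click the "View Page" button if it exists (assumed selector).
  splash:runjs([[
    var button = document.querySelector('#view_product_page_btn');
    if (button) { button.click(); }
  ]])
  assert(splash:wait(args.wait))
  return {html = splash:html()}
end
"""


def build_splash_args(wait_seconds: float = 1.0) -> dict:
    """Build the args dict passed to Splash along with the request."""
    return {"lua_source": AGE_CHECK_LUA, "wait": wait_seconds}


# Inside a spider, the args would typically be passed like this:
#   from scrapy_splash import SplashRequest
#   yield SplashRequest(url, callback=self.parse_game,
#                       endpoint="execute", args=build_splash_args())
```

Keeping the Lua source in a module-level string makes it easy to reuse the same script for every game page that triggers the age check.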
Data Analysis around Prices
The average price for a game available on Steam is $8.42, and the median price is $5.99. Interestingly, the game Crisis Action VR is the most expensive game on the platform at $199. The price for this game is a bit of a mystery since the game itself doesn't seem out of the ordinary, and the reviews just add to the confusion about why the price is so high on a game that was priced lower on release.
Grouped by release year, the average price of a game has fluctuated around $8 since the platform became available. One very interesting thing is that there has been a dramatic increase in the number of games available since 2014. This appears to be a response to Steam introducing Early Access games, which allow small developers to release games in an unfinished state so that they can raise money for development as well as get early feedback on a game's development.
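The price statistics described above can be computed with the standard library alone. The sketch below uses a handful of made-up records; the dictionary keys and the sample values are assumptions and do not come from the scraped dataset.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up sample records standing in for the scraped games.
games = [
    {"title": "A", "year": 2013, "price": 9.99},
    {"title": "B", "year": 2014, "price": 4.99},
    {"title": "C", "year": 2014, "price": 14.99},
    {"title": "D", "year": 2015, "price": 0.99},
    {"title": "E", "year": 2015, "price": 7.99},
    {"title": "F", "year": 2015, "price": 5.99},
]

# Overall mean and median price.
prices = [g["price"] for g in games]
overall_mean = round(mean(prices), 2)
overall_median = round(median(prices), 2)

# Number of games and mean price per release year.
prices_by_year = defaultdict(list)
for g in games:
    prices_by_year[g["year"]].append(g["price"])

yearly = {
    year: {"count": len(p), "mean_price": round(mean(p), 2)}
    for year, p in sorted(prices_by_year.items())
}
```

Comparing the mean with the median is a quick check for skew: a few very expensive titles pull the mean above the median, as the $8.42 versus $5.99 figures suggest.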
Category Analysis
Since 2014, each game available for purchase through Steam has a number of categories that can be used to describe the game. These categories are called tags, and the users of Steam can pick them based on their experience with the game. Steam allows a total of 20 tags to be associated with each game, and there are a total of 354 distinct tags. Overall, the Indie tag has the largest number of games associated with it.
Action, Casual, and Adventure are the next three top categories. That Indie is the highest category is not very surprising since Steam provides a means for independent game developers to reach a sizeable audience that would be difficult otherwise.
The overlap in tags is also interesting, with Indie and Action being the most common pairing on the service.
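Tag popularity and tag overlap of this kind can be counted as sketched below. The sample games and their tags are invented for the example, and the variable names are assumptions.

```python
from collections import Counter
from itertools import combinations

# Invented sample data: game title -> user-defined tags.
game_tags = {
    "Game A": ["Indie", "Action", "Adventure"],
    "Game B": ["Indie", "Casual"],
    "Game C": ["Action", "Indie"],
    "Game D": ["Action", "Indie", "Casual"],
}

# Number of games carrying each tag (set() guards against repeated tags).
tag_counts = Counter(tag for tags in game_tags.values() for tag in set(tags))

# Number of games carrying each pair of tags.
# Sorting first means ("Action", "Indie") and ("Indie", "Action") share a key.
pair_counts = Counter()
for tags in game_tags.values():
    for pair in combinations(sorted(set(tags)), 2):
        pair_counts[pair] += 1

top_tag = tag_counts.most_common(1)[0]
top_pair = pair_counts.most_common(1)[0]
```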
| 2014 | 2015 | 2016 | 2017 | 2018 |
| --- | --- | --- | --- | --- |
| Indie | Indie | Indie | Indie | Indie |
| Adventure | Action | Action | Action | Casual |
| Casual | Adventure | Casual | Casual | Action |
| Action | Casual | Adventure | Adventure | Adventure |
| Singleplayer | Singleplayer | Simulation | Simulation | Simulation |
| RPG | Strategy | Strategy | Strategy | Strategy |
| Strategy | RPG | Singleplayer | VR | Early Access |
| Simulation | Simulation | VR | Early Access | Singleplayer |
| Puzzle | 2D | RPG | Singleplayer | RPG |
| Free to Play | Great Soundtrack | Early Access | RPG | VR |
By analyzing the top ten tags for each year since 2014, it's possible to see certain types of games increase and sometimes decrease in popularity over time. For example, VR rose in popularity in 2016 and 2017, but that popularity appears to have waned in 2018. Early Access games have increased in popularity since 2016, and that increase is likely to continue this year as well.
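A per-year ranking like the one in the table can be built by keeping one counter per release year. This is a sketch on invented data; the function names and the `(year, tags)` input shape are assumptions.

```python
from collections import Counter, defaultdict


def top_tags_by_year(games, n=10):
    """Return {year: [top n tags]} from an iterable of (year, tags) pairs."""
    counts = defaultdict(Counter)
    for year, tags in games:
        counts[year].update(set(tags))  # count each tag once per game
    return {
        year: [tag for tag, _ in counter.most_common(n)]
        for year, counter in sorted(counts.items())
    }


def rank_of(tag, ranking):
    """1-based position of a tag in a ranking, or None if it is absent."""
    return ranking.index(tag) + 1 if tag in ranking else None


# Invented sample data, chosen so that no two tags tie within a year.
sample = [
    (2016, ["Indie", "Action", "VR"]),
    (2016, ["Indie", "Action"]),
    (2016, ["Indie"]),
    (2017, ["Indie", "Early Access", "VR"]),
    (2017, ["Indie", "Early Access"]),
    (2017, ["Indie"]),
]

rankings = top_tags_by_year(sample)
vr_rank_2016 = rank_of("VR", rankings[2016])
```

Tracking `rank_of` for a single tag across years gives the kind of rise-and-fall trend described for VR and Early Access.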
Word Cloud Analysis
Each game on Steam comes with a description in the form of a short blurb provided by the game's developer as a sort of sales pitch to potential buyers. By using these descriptions, it's possible to do a word cloud analysis to get an idea of how games on Steam are marketed and to see which terms developers use most often. Some terms are what one would expect to find associated with games, like adventure and puzzle, but one term that is surprising to see so prominently is VR.
This is surprising since even at its highest popularity point in 2017, VR was only the 7th most popular tag. My guess as to why it's so prominent is that there was a lot of buzz around VR games a few years ago, and that developers who use VR wish to bring it to the attention of gamers because it requires additional hardware to work properly.
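A word cloud is driven by word frequencies, so the underlying counts can be inspected directly. The sketch below shows one way to compute them; the stopword list, the tokenizing rule, and the sample descriptions are all assumptions, and the resulting counts could then be handed to a word cloud renderer.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "with",
             "is", "your", "you", "for", "on"}


def word_frequencies(descriptions):
    """Count lowercase alphabetic words across descriptions, minus stopwords."""
    counts = Counter()
    for text in descriptions:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS and len(word) > 1:
                counts[word] += 1
    return counts


# Invented descriptions standing in for the scraped blurbs.
descriptions = [
    "A VR puzzle adventure in a haunted house.",
    "Explore the galaxy in this VR adventure.",
    "Solve every puzzle with your friends in VR.",
]

freqs = word_frequencies(descriptions)
```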
Ideas for Future Development
The Steam platform provides a large amount of data about games, and it's possible to see many trends using that data. For future development, I would like to scrape the reviews for each game along with the data about the games. What makes this difficult is that the reviews are only available on an infinitely scrolling page, and some games have over a million reviews. Scrapy-splash will need to be implemented to access the reviews, and it might be worthwhile to consider only the most recent 500 reviews to keep the number to be scraped manageable.
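Capping the scrape at the most recent 500 reviews could be done by treating the infinite scroll as a sequence of pages and stopping early. The sketch below assumes a hypothetical `fetch_page` callable that returns reviews newest first and an empty list when no more are available; it uses a fake fetcher in place of any real Steam request.

```python
from itertools import islice
from typing import Callable, Iterator, List


def iter_reviews(fetch_page: Callable[[int], List[dict]]) -> Iterator[dict]:
    """Yield reviews page by page until a page comes back empty."""
    page = 0
    while True:
        batch = fetch_page(page)
        if not batch:
            return
        yield from batch
        page += 1


def most_recent_reviews(fetch_page: Callable[[int], List[dict]],
                        limit: int = 500) -> List[dict]:
    """Collect at most `limit` reviews, fetching no more pages than needed."""
    return list(islice(iter_reviews(fetch_page), limit))


def fake_fetch_page(page: int, page_size: int = 20, total: int = 1234) -> List[dict]:
    """Stand-in for a real page fetch; returns dicts with increasing ids."""
    start = page * page_size
    return [{"id": i} for i in range(start, min(start + page_size, total))]


reviews = most_recent_reviews(fake_fetch_page, limit=500)
```

Because the generator is lazy, pages beyond the one containing the 500th review are never requested.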
The skills I demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.