Data from the Steam Game Platform Using Web Scraping

Posted on Dec 3, 2018

What is Steam?

Steam is a digital distribution platform for PC gaming that Valve Software launched in 2003. It has grown steadily over the years and is now the largest platform for selling PC games, with 150 million registered users and 18.5 million concurrent users. If you are a PC gamer, odds are good that you look for games to play on Steam.

Data Science Objective

The objective is to scrape the data associated with every game available for purchase through the Steam platform, and to use this data to gain insight into the growth of the platform and the popularity of different types of games.

Description of Data to be Scraped

The data collected for each game includes the title, developer, price, description, total number of reviews, percentage of reviews that are positive, release date, and the categories of the game as defined by users. Steam refers to these user-defined categories as tags, and they provide some interesting information about the game.

Data Challenges Encountered

While scraping, I encountered a few challenges. First, some games are not yet available for sale, and those games were skipped. Second, some games are on sale, so they have two prices: the original price and the current sale price. Both of these prices were scraped. Finally, some games have fewer than ten reviews, so Steam does not yet show a percent positive number for them. For those games I set the percent positive to N/A.
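The three rules above can be sketched as a small cleaning step. This is a minimal illustration and not the scraper's actual code: the field names (price, sale_price, total_reviews, percent_positive) and both helper functions are hypothetical.

```python
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[float]:
    """Convert a price string such as "$19.99" to a float, or None if unparseable."""
    cleaned = (text or "").strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_game(raw: dict) -> Optional[dict]:
    """Apply the three rules to one scraped record (field names are assumed)."""
    original = parse_price(raw.get("price"))
    # Rule 1: a game with no price is not yet for sale, so skip it.
    if original is None:
        return None
    # Rule 2: keep both prices; without a discount the sale price is the original.
    sale = parse_price(raw.get("sale_price"))
    if sale is None:
        sale = original
    # Rule 3: with fewer than ten reviews there is no percent positive yet.
    reviews = int(raw.get("total_reviews") or 0)
    percent = raw.get("percent_positive") if reviews >= 10 else None
    return {
        "title": raw.get("title"),
        "price": original,
        "sale_price": sale,
        "total_reviews": reviews,
        "percent_positive": percent if percent is not None else "N/A",
    }
```

Returning None for unreleased games lets the caller drop them with a simple filter, and storing the sale price even when there is no discount keeps every record the same shape.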

Biggest Challenge

Because some games feature mature content, Steam employs an age check popup that uses JavaScript to hide the desired elements of the page until the age of the visitor is verified. Scrapy is not able to access these elements directly, so I had to find a way to have Scrapy interact with the JavaScript elements. One solution was to use Selenium to scrape the site, but I didn’t encounter this page until after I had written a large part of my scraper in Scrapy, so rewriting all of my code in Selenium was a last resort. Instead, I learned how to use scrapy-splash.

Scrapy-splash works by using Splash as a lightweight browser to render the requested pages, and the rendered responses are then passed back to Scrapy for scraping. Because Splash executes JavaScript, dynamic elements such as the age check inputs can be manipulated, so it’s possible to input an age and click the “View Page” button.

Unfortunately for me, scrapy-splash interacts with the page by running scripts written in Lua. As I had never encountered Lua before and had limited experience with JavaScript, the learning curve was very steep. But once I figured out how to implement the scrapy-splash script, I was able to scrape the desired data from the games.
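To give a sense of what such a script looks like, here is a rough sketch and not the script used for this project. The Lua source is held in a Python string and sent to Splash's execute endpoint. The CSS selectors, the birth year, and the helper name build_splash_kwargs are all assumptions for illustration; the real selectors have to be read from the age check page itself.

```python
# Lua script run by Splash. It loads the page, and if an age check is present
# it sets the birth year and clicks the button before returning the HTML.
AGE_CHECK_LUA = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  -- Only act if the (assumed) year dropdown exists on the page.
  local year = splash:select(args.year_selector)
  if year then
    splash:runjs(string.format(
      "document.querySelector('%s').value = '%s';",
      args.year_selector, args.birth_year))
    local button = splash:select(args.button_selector)
    if button then
      button:mouse_click()
      splash:wait(args.wait)
    end
  end
  return {html = splash:html()}
end
"""


def build_splash_kwargs(birth_year: int = 1980, wait: float = 1.0) -> dict:
    """Keyword arguments for scrapy_splash.SplashRequest (selectors are placeholders)."""
    return {
        "endpoint": "execute",
        "args": {
            "lua_source": AGE_CHECK_LUA,
            "birth_year": birth_year,
            "wait": wait,
            "year_selector": "#ageYear",          # assumed id of the year dropdown
            "button_selector": "a.view-page-btn",  # hypothetical "View Page" selector
        },
    }


# Inside a spider this would be used roughly as:
#   from scrapy_splash import SplashRequest
#   yield SplashRequest(url, self.parse_game, **build_splash_kwargs())
```

Keeping the selectors in args rather than hard-coding them in the Lua source means they can be updated from the Python side if Steam changes its markup.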

Data Analysis around Prices

 

The average price for a game available on Steam is $8.42, and the median price is $5.99. Interestingly, Crisis Action VR is the most expensive game on the platform at $199. The price is a bit of a mystery: the game itself doesn’t seem out of the ordinary, it was priced lower on release, and its reviews only add to the confusion about why the price is now so high.

The average price of games by release year has fluctuated around $8 since the platform became available. One very interesting thing is that there has been a dramatic increase in the number of games available since 2014. This appears to be a response to Steam introducing Early Access, which allows small developers to release games in an unfinished state so that they can raise money for development as well as get early feedback on the game.
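A per-year summary like the one described above can be computed with the standard library alone. The sketch below assumes each scraped record has hypothetical release_year and price fields; it illustrates the calculation and is not the analysis code behind the figures quoted here.

```python
from collections import defaultdict
from statistics import mean, median


def price_summary_by_year(games):
    """Return {year: {"count", "mean", "median"}} for an iterable of game records."""
    prices_by_year = defaultdict(list)
    for game in games:
        prices_by_year[game["release_year"]].append(game["price"])
    return {
        year: {
            "count": len(prices),              # number of games released that year
            "mean": round(mean(prices), 2),    # average price, sensitive to outliers
            "median": median(prices),          # typical price, robust to outliers
        }
        for year, prices in sorted(prices_by_year.items())
    }
```

Reporting the median next to the mean is useful because a single very expensive game, such as the $199 title mentioned earlier, pulls the mean up without moving the median much.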

Category Analysis

Since 2014, each game available for purchase through Steam has had a number of categories that describe the game. These categories are called tags, and the users of Steam pick them based on their experience with the game. Steam allows a total of 20 tags to be associated with each game, and there are 354 distinct tags in total. Overall, the Indie tag has the largest number of games associated with it.

Action, Casual, and Adventure are the next three top categories. That Indie is the top category is not very surprising, since Steam gives independent game developers a means to reach a sizeable audience that would otherwise be difficult to reach.

The overlap in tags is also interesting, with Indie and Action being the most common combination on the service.
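One way to measure this kind of overlap is to count how often each pair of tags appears on the same game. The sketch below assumes each game's tags are available as a list of strings; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def tag_pair_counts(tag_lists):
    """Count how many games carry each unordered pair of tags."""
    pairs = Counter()
    for tags in tag_lists:
        # set() drops duplicate tags on one game; sorted() makes
        # ("Action", "Indie") and ("Indie", "Action") the same key.
        pairs.update(combinations(sorted(set(tags)), 2))
    return pairs
```

Calling tag_pair_counts(...).most_common(10) on the full data set would then list the ten most frequent tag combinations.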

 

Rank | 2014 | 2015 | 2016 | 2017 | 2018
1 | Indie | Indie | Indie | Indie | Indie
2 | Adventure | Action | Action | Action | Casual
3 | Casual | Adventure | Casual | Casual | Action
4 | Action | Casual | Adventure | Adventure | Adventure
5 | Singleplayer | Singleplayer | Simulation | Simulation | Simulation
6 | RPG | Strategy | Strategy | Strategy | Strategy
7 | Strategy | RPG | Singleplayer | VR | Early Access
8 | Simulation | Simulation | VR | Early Access | Singleplayer
9 | Puzzle | 2D | RPG | Singleplayer | RPG
10 | Free to Play | Great Soundtrack | Early Access | RPG | VR

By analyzing the top ten tags for each year since 2014, it’s possible to see certain types of games rise and sometimes fall in popularity over time. For example, VR entered the top ten in 2016 and climbed to 7th place in 2017, but its popularity appears to have waned in 2018, when it dropped to 10th. Early Access games have risen in the rankings every year since 2016, and that trend looks likely to continue.
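A ranking like the one in the table can be produced by counting tags separately for each release year. This is a minimal sketch that assumes hypothetical release_year and tags fields on each record.

```python
from collections import Counter, defaultdict


def top_tags_by_year(games, n=10):
    """Return {year: [tag, ...]} with the n most common tags per release year."""
    counts = defaultdict(Counter)
    for game in games:
        # Count each tag at most once per game.
        counts[game["release_year"]].update(set(game["tags"]))
    return {
        year: [tag for tag, _ in counter.most_common(n)]
        for year, counter in sorted(counts.items())
    }
```

Comparing a tag's position across the yearly lists shows whether it is rising or falling, which is how movements like those of VR and Early Access become visible.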

 

Word Cloud Analysis

Each game on Steam comes with a description in the form of a short blurb provided by the game’s developer as a sort of sales pitch to potential buyers. By using these descriptions, it’s possible to do a word cloud analysis to get an idea of how games on Steam are marketed and to see which terms developers use most often. Some terms are what one would expect to find associated with games, like adventure and puzzle, but one term that is surprising to see so prominently is VR.

This is surprising because even at its peak in 2017, VR was only the 7th most popular tag. My guess as to why it’s so prominent is that there was a lot of buzz around VR games a few years ago, and that developers of VR games want to bring it to the attention of gamers because VR requires additional hardware to work properly.
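The counts behind a word cloud are just word frequencies with common filler words removed. The sketch below shows that step under stated assumptions: the stop word list is a tiny illustrative sample, and the function name is hypothetical. The resulting counts could then be passed to a plotting library, for example the wordcloud package's generate_from_frequencies method.

```python
import re
from collections import Counter

# A deliberately small stop word list for illustration; a real analysis
# would use a fuller list.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "for",
    "with", "you", "your", "on", "this", "that",
}


def word_frequencies(descriptions):
    """Count lowercase words across all descriptions, skipping stop words."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 1)
    return counts
```

Lowercasing before counting means that VR, Vr, and vr are all counted as the same term.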

Ideas for Future Development

The Steam platform provides a large amount of data about games, and it’s possible to see many trends using that data. For future development, I would like to scrape the reviews for each game along with the data about the games. The difficulty is that the reviews are only available on an infinitely scrolling page, and some games have over a million reviews. Scrapy-splash will be needed to access the reviews, and it might be worthwhile to consider only the most recent 500 reviews per game to keep the number to be scraped manageable.


About Author

Sean Justice

Data scientist with a background in computer engineering and over a decade of experience working in a team environment to solve challenging problems. Interested in deep learning, especially computer vision and natural language processing.