Steam Data Visualization using GGPlot2 and R

Posted on Oct 24, 2016

The author took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 26 to Dec 23, 2016. This post is based on his first class project (due in the 2nd week of the program, before statistics were taught).

1) Purpose

The purpose of my data visualization project was to visualize how review scores, discounts and campaign length affect customers' buying decisions. Specifically, I wanted to take data sets from Steam, Metacritic, IGN, and HowLongToBeat, use R to combine them into one data frame, and then use the ggplot2 R package to visualize the combined data.

2) Background

PC gaming enthusiasts buy powerful machines to run their games maxed out on beautiful high-resolution 144 Hz screens, and amass impressive libraries of games they can play on a whim. Of course, PC gamers often want the latest and greatest. But what does that mean for games that aren't Battlefield, Overwatch, or Starcraft II?

PC gaming has been at the forefront of digital sales for over a decade, due in large part to Steam. For those out of the PC gaming loop, Valve Software's Steam platform has been the go-to digital distribution platform for PC gaming since its release in 2003. Steam is available for Windows, OS X and Linux, with 11,358 games available and 125 million registered accounts. Steam has had as many as 12.5 million concurrent users as of November 2015.

The Steam Summer Sale is an annual event on Steam in which games are heavily discounted, generating buzz throughout the gaming community. Most games go on sale, and daily and semi-daily special deals and packs of games keep customers returning each day. An encore of some of the best deals closes out the sale.

2) Data sets used for Data Visualization

The following datasets were accessed for data visualization:

  • Steam sales data - Games sold and prices from before and after a major sales event on Steam. The data was collected and hosted by the third-party site SteamSpy. It excludes games that sold fewer than 5,000 copies.
  • Metacritic review scores - Average professional reviews across blogs, gaming websites and magazines. 1 is the lowest value, 100 the highest.
  • Steam review scores - Reviews of games posted to Steam gathered using Steam's API. 1 is the lowest value, 100 the highest.
  • IGN review scores - Reviews from the popular video game news site, IGN. The data was available from Kaggle. 1 is the lowest value, 10 the highest.
  • Game campaign lengths - Length of single player portions of games in hours gathered from HowLongToBeat.

3) Preparing the data sets using R

Each dataset required polishing to get values into a proper form, including removing non-PC gaming platforms and converting the data into the proper data class.

One example is the Steam sales data. Its numerical columns included symbols that needed to be removed (dollar signs, commas, parentheses, and percent symbols), and some columns held two data points in one. Retrieving the values required regular expressions. I also dropped and renamed columns for readability and to get rid of excess information.
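A minimal sketch of this kind of cleanup in base R. The column names, the sample rows, and the exact string formats are assumptions made for illustration; the real SteamSpy export may look different.

```r
# Hypothetical raw rows: names and formats are assumptions, not SteamSpy's schema.
raw <- data.frame(
  game = c("Game A", "Game B"),
  price = c("$19.99", "$4.99"),
  discount = c("-75%", "-50%"),
  owners_before = c("1,234,567 (0.98%)", "45,000 (0.04%)"),
  stringsAsFactors = FALSE
)

# Keep only digits and the decimal point, then convert to numeric.
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

clean <- data.frame(
  name = raw$game,
  price = to_number(raw$price),
  discount_pct = to_number(raw$discount),
  # Two values share one column: the count sits before the parenthesis,
  # the percentage sits inside it.
  owners_before = to_number(sub("\\s*\\(.*$", "", raw$owners_before)),
  owners_share_pct = to_number(sub("^.*\\((.*)\\).*$", "\\1", raw$owners_before)),
  stringsAsFactors = FALSE
)

str(clean)
```

Splitting the combined column with two separate `sub()` calls keeps each regular expression short and easy to check against a few sample values.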

4) Merging the data

One of the most time-consuming parts of this project was merging the data, because of the games' names. The names of games on Microsoft’s sites were abbreviated or shortened due to character limits on their own website, or were only available as “XYZ Edition”. Users often used abbreviations or modified names when submitting to UserVoice. Wikipedia articles sometimes used the international name of a game. Punctuation, trademark/copyright marks, and differences in spelling prevented one-to-one matching.

The solution was to make a new column of names with as many obstacles removed as possible, so that the datasets would have the fewest differences when merged. I began by removing words including “the,” “edition,” “special,” “full game,” “free to play” and many more. I then removed pluralization and possession from words and converted numbers to Roman numerals (“Assassin’s Creed 2” to “Assassin Creed II”). I also removed all non-alphanumeric characters as well as spacing and capitalization.
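The steps above can be sketched in base R roughly as follows. The function names, the short filler list, the crude plural rule, and the toy data frames are all assumptions for illustration, not the code used in the project.

```r
# Hypothetical helper: builds a merge key from one game name.
normalize_one <- function(name) {
  s <- tolower(name)
  s <- gsub("\u2019", "'", s, fixed = TRUE)      # curly to straight apostrophe
  s <- gsub("'s\\b", "", s, perl = TRUE)         # possession: "assassin's" -> "assassin"
  s <- gsub("[^a-z0-9 ]", " ", s)                # punctuation and marks -> space
  for (phrase in c("free to play", "full game")) {
    s <- gsub(paste0("\\b", phrase, "\\b"), " ", s, perl = TRUE)
  }
  words <- strsplit(s, " +")[[1]]
  words <- words[nzchar(words) & !(words %in% c("the", "edition", "special"))]
  # Crude plural removal: drop a trailing "s" after at least three letters.
  # This over-trims words like "chess"; a real list of exceptions would help.
  words <- sub("(?<=[a-z]{3})s$", "", words, perl = TRUE)
  # Standalone Arabic numbers -> Roman numerals (kept as-is if out of range).
  is_num <- grepl("^[0-9]+$", words)
  roman <- tolower(as.character(utils::as.roman(as.integer(words[is_num]))))
  words[is_num] <- ifelse(is.na(roman), words[is_num], roman)
  paste(words, collapse = "")                    # also removes spacing
}

normalize_name <- function(x) vapply(x, normalize_one, character(1), USE.NAMES = FALSE)

# Toy data: names and values are made up.
steam <- data.frame(name = c("Assassin\u2019s Creed 2", "The Witcher 3: Wild Hunt"),
                    owners = c(1000, 2000), stringsAsFactors = FALSE)
reviews <- data.frame(name = c("Assassin's Creed II", "Witcher III - Wild Hunt"),
                      score = c(1, 2), stringsAsFactors = FALSE)

steam$key <- normalize_name(steam$name)
reviews$key <- normalize_name(reviews$name)
merged <- merge(steam, reviews, by = "key", suffixes = c("_steam", "_review"))

normalize_name("Assassin's Creed 2")  # "assassincreedii"
```

Keeping the original name columns next to the key makes it easy to inspect mismatches by hand after the merge.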

5) Insights

Sales Versus the Increase of Owners


I felt that the increase of owners better represented the success of a sale than sales numbers alone because it allows comparisons of smaller games, such as indie games, to AAA games. For example, if an indie game had sold only 2,000 copies and increased to 2,400 copies sold, that is very impressive, but a 400-sale increase for a AAA game may be dismal. The scatterplot suggests a closer relationship between the increase of owners and the number of owners before the sale than a comparison using sales numbers shows.



I compared the number of owners before the sale to the percent discount off the price before the beginning of the sale. I would have liked to compare the change in sales over a similar period, but SteamSpy limits how far back historical data can be accessed, with the exception of the Summer Sale data, so I found this was the best way to represent the success of a game's sales. The mean increase is the percentage increase of owners compared to the number of owners at the start of the sale.
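As a rough illustration of the metric, the percentage increase can be computed per game and then averaged per discount level. The data frame below is made up, and its column names are assumptions; the optional plot only runs if ggplot2 is installed.

```r
# Made-up rows; column names are assumptions for illustration.
sales <- data.frame(
  game = c("Indie A", "Indie B", "AAA C", "AAA D"),
  discount_pct = c(75, 75, 50, 50),
  owners_before = c(2000, 10000, 2000000, 500000),
  owners_after = c(2400, 13000, 2000400, 550000)
)

# Percentage increase of owners relative to owners at the start of the sale.
sales$pct_increase <-
  (sales$owners_after - sales$owners_before) / sales$owners_before * 100

# Mean increase for each discount level.
mean_increase <- tapply(sales$pct_increase, sales$discount_pct, mean)

# Optional scatterplot sketch; call print(p) to draw it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    sales,
    ggplot2::aes(x = owners_before, y = pct_increase,
                 colour = factor(discount_pct))
  ) +
    ggplot2::geom_point() +
    ggplot2::scale_x_log10() +
    ggplot2::labs(x = "Owners before sale", y = "Increase in owners (%)",
                  colour = "Discount (%)")
}
```

In this toy data the same 400 new owners is a 20% increase for the indie game and a 0.02% increase for the AAA game, which is the contrast the metric is meant to expose.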

Campaign Length

Longer games, especially those longer than 20 hours, did not see more than a 50% increase in sales. Overall, it appears that games from 0 to 15 hours saw more dramatic sales increases. Perhaps gamers who buy discounted games are less the type to enjoy long PC games such as RPGs and are more into action or adventure games.

Metacritic Reviews

Metacritic is a site that compiles reviews of games, movies and TV shows and produces a score based on the overall average reception. Publishers put such a strong emphasis on a good Metacritic score that they use it as a guide for giving developers incentives to release high-quality products. Certain publishers such as Activision and EA pay developers bonuses if a game's Metacritic review score is greater than 90.

Many found it shocking when Obsidian, the developer, missed out on a bonus from publisher Bethesda for the much-loved Fallout: New Vegas because it failed to hit a Metacritic average of 85 (it was 84). How much does this affect a customer's buying decision?

It seems to affect it a lot, but only up to a certain limit. Games rated above 50 on Metacritic fared well and saw some dramatic increases in owners. However, games rated above 85 didn't necessarily see a massive increase in sales. This could be because these games retained their value during a sale, so they were not discounted very heavily and cost significantly more than most of the games on sale. Perhaps a sale also encouraged customers to buy lower-rated games they otherwise might not have had an interest in.

Games above 60 and 70 on Metacritic generally did much better than those below those scores. Above 85, the games didn't see a huge increase, but perhaps this is a weakness in this type of analysis. These games may have already had a huge number of owners and still saw good sales.

Steam Users' Reviews

The results for reviews on the Steam website also seemed to skew towards higher-reviewed games, possibly because the reviews are prominent on each game's page.

Time Since the Game's Release

I found the time since the release date for each game, grouped the games by year, and normalized them so that each year would be represented by the increase it saw and not skewed by the varying number of games from that year. Interestingly, as the increase in ownership rose from 0% to 50%, the age of the games generally increased.
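One plausible reading of this normalization, sketched in base R: count games per age group and increase bin, then convert each age group's counts to shares so that years with many releases do not dominate. The sample rows, the reference date, and the bin edges are assumptions for illustration.

```r
# Made-up rows; the reference date and bin edges are assumptions.
games <- data.frame(
  release_date = as.Date(c("2016-01-15", "2015-11-10", "2012-08-21",
                           "2012-03-06", "2012-05-15", "1998-11-19")),
  pct_increase = c(60, 55, 12, 30, 8, 40)
)
sale_start <- as.Date("2016-06-23")  # hypothetical reference date

# Age in whole years at the start of the sale.
games$age_years <- floor(as.numeric(sale_start - games$release_date) / 365.25)

# Bin the increase in owners.
games$increase_bin <- cut(
  games$pct_increase,
  breaks = c(0, 25, 50, Inf),
  labels = c("0-25%", "25-50%", ">50%"),
  include.lowest = TRUE
)

# Counts per age group and bin, then shares within each age group
# (each row sums to 1, regardless of how many games that year has).
counts <- table(games$age_years, games$increase_bin)
shares <- prop.table(counts, margin = 1)
shares
```

Normalizing within each row answers "of the games this old, what share landed in each bin", which is not skewed by how many games were released in a given year.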

Perhaps this is because classics like the original Half-Life or early Star Wars games, along with some intermediate-aged games, are ones that gamers wished to return to or never had the chance to experience. The majority of games that saw an increase in ownership greater than 50% were less than one year old. There are even ranges where games 4 years old or newer saw heavy increases in the relative number of new owners.

6) Wishful Thinking

There are other concepts I wish I could have studied. One of the major factors that influences PC gamers' willingness to purchase a game is how well it runs on their computers. Some companies have gotten particularly bad PR for releasing a badly optimized game on PC. Perhaps in a future study I could scrape benchmarks for games across a variety of hardware (with special attention to hardware represented in Steam's hardware survey). I would also like to see how hesitant gamers are to buy from a publisher after it releases a few games that had a negative reception.

Another distinction I'd like to look at is how much the number of graphics tuning options affects the sales of a game. Could quickly porting a console game with limited customizability to PC be enough to earn sales, or does it truly need to be feature-packed and highly customizable? What features, like ultra-widescreen resolutions and SLI, should be prioritized to appeal to the most gamers and extend sales the longest?

I would also like to have gotten more granular about the sale, looking at the timing of the sales of a game and the percent discount because these sometimes varied during the course of the sale.

7) References

Diver, Mike. "It's Not Enough to Make 'Good' Video Games Anymore." VICE. N.p., 18 Sept. 2014. Web. 16 Oct. 2016.

Galyonkin, Sergey. "SteamSpy API." SteamSpy. N.p., n.d. Web. 16 Oct. 2016.

Grinstein, Eric. "20 Years of Games: 18000+ Rows of Review Data from" Kaggle. Kaggle, 27 Sept. 2016. Web. 5 Oct. 2016.

"IGN." IGN. N.p., n.d. Web. 16 Oct. 2016.

Jackson, Mark. "UK ISPs Brace for Internet Traffic Surge During the Steam Summer Sale." ISPreview UK. ISPreview, 21 June 2014. Web. 16 Oct. 2016.

MacDonald, Keza. "Is Metacritic Ruining The Games Industry?" IGN. IGN, 16 July 2012. Web. 16 Oct. 2016.

Metzger, Florian. "Steam-data-stats." GitHub. Mas-ude, 05 Sept. 2016. Web. 16 Oct. 2016.

"Steam (software)." Wikipedia. Wikimedia Foundation, 12 Oct. 2016. Web. 16 Oct. 2016.

"Steam Web API Documentation." Steam Community. Valve Software, n.d. Web. 16 Oct. 2016.

Totilo, Stephen. "The Ideal Length Of A Role Playing Game Is..." Kotaku. N.p., 09 July 2010. Web. 16 Oct. 2016.
