Data Analysis on Video Games Ratings

Posted on Aug 3, 2020
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.


Data shows video games are a valued source of entertainment for an ever-growing fan base. Indie game creators are connecting their products with customers more easily through new, easy-to-use game portals, while industry-leading game companies are allocating budgets for new games that rival those of Hollywood movies. Gamers have a plethora of genres to choose from and many different styles of play to engage in.

With this in mind, I want to explore gamers' play time and game ratings.  The time a player spends in game is important to them as they want to feel fulfilled with how they spend their time, and ratings of course are important to developers as they help boost current and potential future sales.  My primary purpose is to explore the potential correlations between average play time for a video game, the game's genre, and the average rating that players give the game.  Hopefully with this insight, potential game developers would be able to plan new games that maximize player satisfaction.

Collectibles: Where to get the data

We're actually able to collect this data from the players themselves. There's a crowdsourced website, HowLongToBeat, where users can submit how long they've spent playing a game, as well as their rating of the game.

The fact that they submit both their play time and rating ensures that the ratings we're dealing with are coming from players who have played through the game.  That’s important for establishing a metric intended to influence the development of future games.

Player Play Style: How to Enjoy a Game

There are three main styles of play that users can submit their play time under:

  1. Main Story, which measures how long the player spent just playing through the game's main story, only the parts required to get to the end credits.
  2. Main + Extras, which accounts for playing through the main story, as well as engaging in nonessential tasks of the game like side quests, collecting superfluous items, and investigating game lore.
  3. Completionist, which means scouring every last inch of the game to ensure they experience absolutely everything the game has to offer.

Each of these three categories is further split into three paces: the average general play time, a Rushed play time where players tried to accomplish their goal as quickly as they could, and a Leisure play time where the player went through everything very casually.
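Three play styles at three paces give nine play time measurements per game. As a minimal sketch of how those could be laid out as columns (the snake_case names are hypothetical, chosen only for illustration and not taken from the site or the original scraper):

```python
from itertools import product

# Play styles and paces as described above. The column naming scheme is an
# assumption made for this sketch only.
PLAY_STYLES = ["main_story", "main_plus_extras", "completionist"]
PACES = ["average", "rushed", "leisure"]


def play_time_columns():
    """Return one column name per (play style, pace) combination."""
    return [f"{style}_{pace}" for style, pace in product(PLAY_STYLES, PACES)]


columns = play_time_columns()
print(len(columns))  # 9
print(columns)
```

A flat layout like this keeps each game to a single row, which makes it easy to plot any one play style and pace against the rating.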

Scraping the Data

HowLongToBeat has individual pages for every game it tracks tied to unique game IDs.  The play time, rating, and misc data (genre, developer, release date, etc.) was all easily obtained from these pages using Scrapy. However, this first required getting a list of the game IDs, something that is not easily obtainable on the site.

Consequently, I had to run through the entirety of the search page results using a blank string as my search term.  These results were live-loaded on top of the present page, and I found the frame they're loaded in refreshes itself shortly after initial load.  Using Selenium for the search results, I was able to avoid any issues this might cause and retrieve the game ID from each search result.
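Once the search result links have been collected (with Selenium in my case), pulling out the game IDs is plain string handling. The sketch below assumes the ID sits in an `id` query parameter, as in `game?id=1234`. That link shape, the function names, and the example.com addresses are assumptions for illustration and may not match the site's current markup.

```python
from urllib.parse import parse_qs, urlparse


def game_id_from_href(href):
    """Return the integer game ID from a link, or None if there isn't one.

    Assumes (hypothetically) that links look like 'game?id=1234'.
    """
    query = parse_qs(urlparse(href).query)
    values = query.get("id", [])
    if not values or not values[0].isdigit():
        return None
    return int(values[0])


def unique_game_ids(hrefs):
    """Collect IDs in first-seen order, skipping duplicates and non-game links."""
    seen = set()
    ids = []
    for href in hrefs:
        game_id = game_id_from_href(href)
        if game_id is not None and game_id not in seen:
            seen.add(game_id)
            ids.append(game_id)
    return ids


# Made-up links standing in for scraped search results.
links = ["game?id=101", "https://example.com/game?id=202", "about", "game?id=101"]
print(unique_game_ids(links))  # [101, 202]
```

Deduplicating here matters because a results frame that refreshes itself can hand back the same entry more than once.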

The game search results included a preview of play time, but none of the other data I needed, such as rating or genre. This meant I still had to visit the individual page for each game to get my final data, but the play time previews did let me start filtering before that. The displayed times were presented on a background whose color ranged between red and blue, depending on the number of players who had submitted play times.

Games with more play time submissions appeared on a bluer background, while games with fewer appeared redder. Using this, I had my Selenium script save the game IDs only for games with at least 20 submitted play times, which corresponded to a background blueness listed as 70%.

Quality vs Quantity Data

[Figure: main story play time in minutes versus average rating (0 to 100), all genres]

The graph above shows, across all game genres, the length of time to beat the main story in minutes versus the average rating of the game on a scale of 0 to 100. It seems to suggest that more time is generally a good thing, but we can see there are a number of outliers, as well as potentially diminishing returns.  That said, it can be hard to interpret much from such a mass of points.

Genres Generally Matter

[Figure: main story play time versus average rating, shown separately for Simulation and Sports games]

When we start to break games down by genre, we see very different plots. Here we see two versions of the same graph as before, but this time one covers only simulation games, such as a flight simulator, a game where you own and run a farm, or The Sims, while the other covers only sports games, such as soccer, basketball, or golf.

We see that for Simulation games, longer Main Stories definitely do seem to matter, rising from a clustering around 70 to ratings in the 80s and 90s as play time increases. On the other hand, for Sports games, the ball is all over the place.

We can theorize about this.  If the purpose of a simulation is to give the player a novel, enveloping experience then maybe they want to spend as much time as possible in that world.  If the player simply wants to play a sports game against a friend, though, it probably doesn't matter what the game's story looks like in any way.
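One way to put a number on these per-genre differences is a Pearson correlation between main story play time and rating, computed separately for each genre. This is a minimal sketch, not the analysis behind the graphs: the record keys (`genre`, `main_story_minutes`, `rating`) are hypothetical and the sample rows are made up. Pearson also only captures linear trends, so a curve with diminishing returns would be understated.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


def correlation_by_genre(games):
    """Group games by genre and correlate play time with rating in each group."""
    grouped = defaultdict(list)
    for game in games:
        grouped[game["genre"]].append((game["main_story_minutes"], game["rating"]))
    return {
        genre: pearson([p[0] for p in pairs], [p[1] for p in pairs])
        for genre, pairs in grouped.items()
    }


# Made-up rows for illustration only, not scraped data.
sample = [
    {"genre": "Simulation", "main_story_minutes": 300, "rating": 70},
    {"genre": "Simulation", "main_story_minutes": 600, "rating": 80},
    {"genre": "Simulation", "main_story_minutes": 900, "rating": 90},
    {"genre": "Sports", "main_story_minutes": 300, "rating": 80},
    {"genre": "Sports", "main_story_minutes": 600, "rating": 60},
    {"genre": "Sports", "main_story_minutes": 900, "rating": 80},
]
print(correlation_by_genre(sample))
```

A value near 1 would match the upward trend seen for Simulation games, while a value near 0 would match the scatter seen for Sports games.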

Visual Novel Style

For yet other genres, we can see there are clearly diminishing returns.  Here we see Visual Novel style games, games that are highly story driven with sometimes very minimal input necessary from the user, as well as side scrolling games, 2-dimensional adventures where the player is simply trying to get from the left side of the screen to the right while avoiding danger.

For each of these graphs, we see the average rating rise to a smattering around the 70s or 80s as play time initially increases, though the ratings never seem to get much higher than that. Knowing this, we could design a game around a certain minimum length of story without worrying about putting in as much as possible. That said, what exactly would we be focusing on increasing for either genre? For story driven visual novels, the users probably want to experience just that: the story. For side scrolling games, though, the play time might instead be a function of the number of levels included in the game. This probably warrants some further investigation.

What are you playing for?

For some game genres and play styles we see that average ratings can even go down as play time increases.  Why might this be?

Main Story vs Main Story + Extras

Here we see two play styles: Main Story only and Main + Extras at a Leisurely pace for the same genre, role playing games.  These are games where the player takes on the role of another person in various settings, such as fantasy, medieval or futuristic sci-fi. 

The average ratings for Main Story only are pretty huddled, perhaps trending upward.  However, we seem to have a somewhat parabolic relation between ratings and play time in the second graph, rising to the 80s and 90s at a point but then trending back down to the 70s. 

It appears that for role playing games there is a point, after the main story has been finished but before the player has completed the game 100%, where continuing starts to feel like a chore. A lot of role playing games are known for their "grind," repetitive actions the player must go through for not necessarily great rewards. Knowing this, we could focus on designing a great story without worrying too much about a lot of extra side content for the player to do.
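A rise-then-fall pattern like this is easier to check with binned averages than by eye. The sketch below groups games into play time bins and averages the ratings in each bin. The 600-minute bin width and the sample values are assumptions picked for illustration, not figures from the scraped data.

```python
from collections import defaultdict


def mean_rating_by_time_bin(games, bin_minutes=600):
    """Map the start of each play time bin (in minutes) to the mean rating in it.

    `games` is an iterable of (play_time_minutes, rating) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # bin start -> [rating sum, count]
    for minutes, rating in games:
        bin_start = int(minutes // bin_minutes) * bin_minutes
        totals[bin_start][0] += rating
        totals[bin_start][1] += 1
    return {start: total / count for start, (total, count) in sorted(totals.items())}


# Made-up (minutes, rating) pairs shaped like the pattern described above.
sample_games = [(100, 70), (500, 74), (700, 88), (1100, 90), (1300, 76), (1700, 72)]
print(mean_rating_by_time_bin(sample_games))
# {0: 72.0, 600: 89.0, 1200: 74.0}
```

If the middle bins average higher than both ends, that supports the somewhat parabolic shape. The bin width is a judgment call and is worth varying to make sure the shape isn't an artifact.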

Here again we see different play time and rating correlations for the same genre, roguelike, but different play styles.  For Main Story alone ratings are all over the place, but Completionist Leisure players appear to prefer more content.


Why might this be? Roguelike games tend to be relatively short but are massively replayable. Their main defining characteristic is giving the user a similar but slightly different experience each time, like a choose-your-own-adventure novel. Perhaps their fans are people who can pick up a game, play through a full round in one session, walk away, then come back and do it again whenever they have some free time. They like the leisurely attitude they can have with the game, and they return to it again and again to collect new and different items and achievements until they have completed it 100%.

If this is the case we could try a new marketing approach for a roguelike game. Instead of selling the game at a regular price for unlimited plays, maybe we could sell a number of play rounds for a significantly cheaper price. Some players would feel compelled to keep buying rounds until they have completed everything. In the end, those players would pay more than the standard retail price of a game. It could also appeal to players who might otherwise not commit to purchasing the game, as they could give it a try with a few rounds.

Playing with words

Speaking of marketing and game design, I also investigated the text used in game descriptions.  Again, I wanted to see if there was any correlation between how a game is described, the play time and average ratings.

There didn't seem to be any obvious patterns or takeaways as we can see in this graph of description text polarity versus game rating.  Perhaps that's due to short or incomplete descriptions, but I did find a couple things which seemed interesting.

Here we see all games of the party genre, multiplayer games where a player and their friends compete in a number of short rounds or mini games against each other, plotted with the subjectivity of their description vs their game rating.  Subjectivity refers to emotional sentiment expressed as opposed to rational fact stated.  "The car is blue" would be close to a score of 0 while "This car has an amazing color" would be closer to 1.

Although there aren't a lot of data points, it does look like being more objective is better.  Perhaps for party style games people care less for games trying to tell them how or why they should feel about something and more about just getting into the game and playing.  This might warrant further investigation.

Are we having fun yet?

In this graph I've plotted the main story play time versus average rating across all genres where the polarity of the description was below -0.7, the harshest, most emotionally negative descriptions. We see that the average rating seems to trend from the lower to the upper 70s as play time increases.

Why might people better enjoy longer games that are described with such negative words?  Here perhaps play time is an indication of game difficulty. If that's the case then when the player picks up a particularly brutally described game maybe they're hoping to really put their skills to the test.  Anything less is just a waste of time.

Game End Statistics

What can we conclude from all of this? It seems that for some genres and play styles there is definitely a correlation between play time and rating, but what that correlation looks like depends on the genre and play style. More importantly, the reasoning behind the correlation almost assuredly depends on the genre and play style too. We can delve further into that to help inform which games a development company could work on and how it should develop them beyond just main story length.

What further analysis can be done?

I would like to explore games spanning multiple genres more deeply. Many of the games I scraped fell under multiple genres. For the purposes of this analysis, I included every game under each of the genres it was listed as. A future question I would like to address is whether certain genre combinations fit any particular trends better.
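Counting a game under each of its genres amounts to turning one record with a list of genres into one row per genre. Here is a minimal sketch of that step. The record keys and sample games are hypothetical.

```python
def explode_genres(games):
    """Return one row per (game, genre) pair.

    Each input record has a 'genres' list. Each output row carries a single
    'genre' value plus all the other fields of the original record.
    """
    rows = []
    for game in games:
        for genre in game["genres"]:
            row = {key: value for key, value in game.items() if key != "genres"}
            row["genre"] = genre
            rows.append(row)
    return rows


# Made-up records for illustration only.
scraped = [
    {"title": "Game A", "genres": ["Simulation", "Virtual Reality"], "rating": 80},
    {"title": "Game B", "genres": ["Sports"], "rating": 65},
]
for row in explode_genres(scraped):
    print(row)
```

One consequence is that multi-genre games contribute a point to several genre plots at once. Studying genre combinations would instead mean grouping on the whole genre list.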

For example, people might like simulation and virtual reality games in general, as an escape from the real world, suggesting developing more in-depth stories. On the other hand, we might find that virtual reality / horror games are better enjoyed in brief spurts for an intense rush like a roller coaster ride.  This could lead to many short, episodic releases instead of one lengthy game.

I would also like to see how game ratings from HowLongToBeat compare to ratings from other sources.  HowLongToBeat only includes ratings from people submitting their play times and styles.  I'm curious how these ratings would compare with score aggregator sites that don't have any play time requirement.  One challenge there though would be how to match game data between sites.  I'd probably have to rely on title text matching, which can sometimes be difficult.
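For the title matching, Python's standard `difflib` module gives a reasonable starting point. The sketch below normalizes titles and then picks the closest candidate above a similarity cutoff. The 0.85 cutoff, the normalization rules, and the sample titles are all assumptions that would need tuning against real data.

```python
import difflib
import re


def normalize_title(title):
    """Lowercase and replace runs of anything but a-z and 0-9 with one space.

    Note this also drops accented and other non-ASCII letters.
    """
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def match_title(title, candidates, cutoff=0.85):
    """Return the candidate whose normalized title is closest, or None."""
    normalized = {}
    for candidate in candidates:
        # Keep the first candidate seen for each normalized form.
        normalized.setdefault(normalize_title(candidate), candidate)
    hits = difflib.get_close_matches(
        normalize_title(title), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None


# Made-up titles for illustration only.
other_site = ["Space Farm - Deluxe Edition", "Dungeon Loop"]
print(match_title("Space Farm: Deluxe Edition", other_site))
# Space Farm - Deluxe Edition
```

Sequels and remasters with nearly identical names are the likely failure case, so matching on release year or platform as well would help where both sites provide them.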

About Author

Douglas Hilton

Douglas graduated from Cornell University with a triple major in Physics, Math, and Philosophy. Post graduation he worked his way up to Senior Lead Software Developer at a financial services company. Currently he is studying Data Science with...