Scraping Popular Board Games
Contributed by Hayes Cozart. He is currently in the NYC Data Science Academy 12 week full time Data Science Bootcamp program taking place from April 11th to July 1st, 2016. This post is based on his third class project - Web Scraping (due on the 6th week of the program).
May 29, 2016
Why Board Games?
Board games have been one of my hobbies for a long time. I enjoy them for the challenges they present and the strategies involved in facing off against either other players or the game itself. I believe now is the perfect time to look at board games, since the industry is in what is being called the golden age of board games. Many new people are joining the hobby and many new types of games are entering the market. This is due in part to platforms like Kickstarter, which allow designers to develop the games they want. The issue right now is not finding a good game but finding the exact kind of game you want to play. I also wanted to look at the difficulty or complexity of games, since I enjoy challenging ones. I expected that harder or more complex games would be more popular.
What did I scrape?
The first page I scraped, which can be seen below, was the overall ranking page for every board game. Each page lists 100 board games before you need to go to the next page. The scrapy program was written to go through every page and collect the rank, the name of the game, the ratings, the number of voters, the price, and the link to the board game's details.
Because this is all done through Selenium, running the program for every board game on the site would take roughly 40 hours. For this reason, I pulled only the top 1500 and bottom 1500 ranked board games. I am already working on a better-running Scrapy program and will update this post with that code when it is finished.
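The page selection described above can be sketched as follows. With 100 games per page, the top 1500 and bottom 1500 games correspond to the first 15 and last 15 ranking pages. This is only an illustration: the function name `pages_to_scrape` and the total of 800 pages are assumptions, not part of the actual scraper or a figure from the site.

```python
def pages_to_scrape(total_pages, games_per_page=100, games_per_end=1500):
    """Return the 1-based page numbers covering the top and bottom games.

    Assumes every page holds `games_per_page` games; the last page of a
    real site may be shorter, which this sketch ignores.
    """
    pages_per_end = -(-games_per_end // games_per_page)  # ceiling division
    if total_pages <= 2 * pages_per_end:
        # The two ends overlap, so every page is needed.
        return list(range(1, total_pages + 1))
    top = list(range(1, pages_per_end + 1))
    bottom = list(range(total_pages - pages_per_end + 1, total_pages + 1))
    return top + bottom


# Hypothetical total of 800 ranking pages, for illustration only.
pages = pages_to_scrape(800)
print(len(pages), pages[:3], pages[-3:])
```

Restricting the run to these pages is what keeps the Selenium-based scrape far shorter than the roughly 40 hours a full pass would take.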
As I said at the beginning of this post, we are in a golden age of board games. I wanted to test that hypothesis, so the first thing I looked at was the year each game was published, comparing the top ranked board games to the bottom ranked board games.
This density graph shows that the highest density of top ranked games is in recent years, so the claim that we are in a golden age of board games does appear to hold. It also shows that the bottom ranked board games have their highest density around the early 2000s but drop off in more recent years. As a side note, the site includes some really old games in its list, such as Go or Chess. These games make the graph difficult to read, so I filtered the data to everything after 1950. That period is closer to what people have in mind when they think of board games.
The next analysis I did was on the suggested age of the board game. This was to answer the question of a game's complexity. I hypothesized that the harder the game, the higher the suggested age. Though this is probably true, more interesting insights were found.
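The year filter mentioned above might look something like the following sketch. The record layout (dictionaries with a `year` key) and the sample entries are made up for illustration and are not the scraped data.

```python
def filter_by_year(games, min_year=1950):
    """Keep games published after `min_year`; drop games with no year."""
    return [
        g for g in games
        if g.get("year") is not None and g["year"] > min_year
    ]


# Made-up sample records, for illustration only.
sample_games = [
    {"name": "Ancient game", "year": -2200},
    {"name": "Game A", "year": 1949},
    {"name": "Game B", "year": 1995},
    {"name": "Game C", "year": 2015},
    {"name": "Game D", "year": None},
]

recent = filter_by_year(sample_games)
print([g["name"] for g in recent])
```

Very old entries would otherwise stretch the x-axis of the density plot over thousands of years and squash the modern games into a sliver.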
This graph shows that as the suggested age goes up the proportion of top board games increases. This is true up until about age 14 and up. This could mean that as games require you to be older to play, or are more complex, they are generally more popular. However, an interesting observation occurs when you look purely at the numbers.
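A proportion like the one plotted here can be computed by counting, within each suggested age, how many games come from the top ranked group. The sketch below assumes each record carries a `group` field set to `"top"` or `"bottom"` and a `min_age` field; both names and the sample records are hypothetical.

```python
from collections import Counter


def top_proportion_by(games, key):
    """Return {value of `key`: share of games with that value in the top group}."""
    totals = Counter()
    tops = Counter()
    for g in games:
        totals[g[key]] += 1
        if g["group"] == "top":
            tops[g[key]] += 1
    return {k: tops[k] / totals[k] for k in sorted(totals)}


# Made-up sample records, for illustration only.
sample_games = [
    {"min_age": 8, "group": "top"},
    {"min_age": 8, "group": "bottom"},
    {"min_age": 8, "group": "bottom"},
    {"min_age": 12, "group": "top"},
    {"min_age": 12, "group": "top"},
    {"min_age": 12, "group": "bottom"},
]

print(top_proportion_by(sample_games, "min_age"))
```

Because the grouping key is a parameter, the same helper could be reused for other count-like fields, such as the number of mechanics.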
This graph shows that the most common ages listed for board games are 8 and up, 10 and up, and 12 and up. This most likely has to do with marketing games to different audiences and may say more about a game's theme or category than its complexity. I intend to look into this further by comparing the community voted age with the listed game age, since that may capture the difference between how hard a game is to understand and how the game is marketed.
Next, to further capture whether a more complex game is more popular than other types of games, I looked at a game's difficulty to learn. This is a community voted field listed on the site.
This density graph shows that the bottom ranked games are all centered around low difficulty. This supports the hypothesis that the more complex a game is, the more popular it is. However, it looks like the sweet spot for difficulty is around 2.5, and the density of top games drops off towards the higher end of difficulty. This indicates that for a game to be popular, players want it to be complex but not too difficult to learn.
Next I looked at the amount of time it takes to play the game. I thought that this could be a barrier to entry for some people, but also that complex games might take longer to play than simple games.
This graph illustrates that there does not seem to be much difference overall between the time it takes to play top rated board games versus bottom ranked board games. It does appear that bottom ranked board games are much more dense when the game takes a very short amount of time to play, while the top ranked are more dense at the 40 to 60 minute mark. Otherwise, both show very similar trends.
Subsequently I looked at the number of mechanics a game has, as the more mechanics there are, the more complex the game will be.
This graph shows that most games that are made do not have many mechanics. However, the chart below shows the more mechanics games include, the more popular they become. A game that has zero mechanics is most likely a game so simple that there is not a way to describe it, e.g. Tic-Tac-Toe.
This graph shows that as the number of mechanics increases, proportionally more games are top ranked. This supports the idea that the more complex a game is, the more popular it is, up to the point where it becomes too complex.
Next, the number of themes or categories a game has was considered. This is, in part, a supposition that in this golden age of board games there have been more games with different or complex themes entering the market. I was interested in capturing that information to discover if this was true.
It does not look like theme, or at least the number of themes a game has, provides much information on the differences between top and bottom ranked games. Generally, if a game has more themes it is more popular, but this category needs to be broken out further to get usable information.
Lastly, I considered whether price might be a barrier to entry or reduce the popularity of a game.
Price turned out to be a very interesting variable and should be broken out further to get a better idea of what is happening. The two immediate observations are that bottom ranked games are very dense at the zero end of the price scale, while top ranked games are more evenly distributed across the price ranges. The top ranked games, however, drop off at around the $50 mark. The zero priced games are all the games where no price was listed, which does not necessarily mean that they were free. They include games like Old Maid or Go Fish, which did not have a price since all you need is a deck of cards to play. War games like Warhammer, where you buy your armies separately, did not have a set price range. In future work, these zero priced games will need to be analyzed more thoroughly to determine the differences among them.
The main conclusions, after reviewing all of this data, are that board games have been growing more popular in recent years, that the difficulty of a game does seem to be associated with its popularity, and that the more mechanics a game has, the more popular it tends to be. Finally, there are so many ways to look at this data that these analyses have barely scratched the surface.
On that note, my next steps are to rewrite my scraping program and use it to scrape data for all the board games on the site. Then I would like to make a Shiny app that lets someone looking for a game to buy get a list of games matching what they want, ranked in the site's order. Finally, this data is a prime candidate for machine learning, since there are many different ways to look at it, including trying to find out what combination of variables makes a top ranked board game.
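One way to start the closer look at zero priced games is to separate them from games with a listed price, so that a missing price is not read as "free". The sketch below assumes a numeric `price` field where 0 means no price was listed; the field name and the sample records are hypothetical.

```python
def split_by_price(games):
    """Split games into (priced, unpriced), treating price 0 as 'not listed'."""
    priced = [g for g in games if g["price"] > 0]
    unpriced = [g for g in games if g["price"] == 0]
    return priced, unpriced


# Made-up sample records, for illustration only.
sample_games = [
    {"name": "Game A", "price": 0},
    {"name": "Game B", "price": 24.99},
    {"name": "Game C", "price": 0},
    {"name": "Game D", "price": 49.5},
]

priced, unpriced = split_by_price(sample_games)
print(len(priced), len(unpriced))
```

Plotting the price density on the priced subset alone, and examining the unpriced subset separately, would keep the spike at zero from dominating the comparison.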