NYC Data Science Academy| Blog

How to Make the Best Board Game

William Best
Posted on Feb 20, 2017

Introduction - Data Set Overview

Source

For this project, I built a data set based on the ranked games on BoardGameGeek. BoardGameGeek is a popular board game forum where users can rate and discuss board games. The site also covers the categories, mechanics, and designers of games, as well as their expansions; however, this analysis does not look deeply into expansions. Each board game has a page similar to the one shown in figure 1.

Fig 1: Game Page Example

This game page exposes information for ratings, statistics, general game information, and more. From pages like this it is possible to collect a wealth of information about many different games, and patterns may start to emerge.

Collection Methodology

In order to collect the desired information, I wrote a web scraper in Python, with navigation handled by Selenium and data extraction handled by Beautiful Soup. Selenium proved necessary, as opposed to other libraries such as Scrapy, because the page source on BoardGameGeek appears to be generated dynamically by JavaScript when requested. This means that simple URL request methods won't work. Since Selenium opens an actual browser to do its work, all of the JavaScript is handled by the browser (in my case, Firefox). Once the browser has finished loading the page, Selenium can access any element on it, including the newly generated page source. That HTML is then passed to Beautiful Soup, which can extract text from the source much faster than Selenium can.
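The two-stage approach described above might look roughly like the sketch below. This is not the original scraper: the function names and the sample HTML are made up for illustration, and the parsing step only pulls the page title rather than any real BoardGameGeek element.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Load a page in a real browser so its JavaScript runs, then return the HTML."""
    # Imported here so the parsing helper below works without a browser driver.
    from selenium import webdriver

    driver = webdriver.Firefox()  # the post mentions Firefox as the browser used
    try:
        driver.get(url)
        # A real scraper would likely wait for a specific element to appear
        # before reading the source; that detail is omitted here.
        return driver.page_source
    finally:
        driver.quit()


def extract_title(html):
    """Hand the rendered HTML to Beautiful Soup and pull out the <title> text."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else ""


# Hypothetical stand-in for what fetch_rendered_html() would return.
sample_html = "<html><head><title> Example Game </title></head><body></body></html>"
print(extract_title(sample_html))
```

Keeping the browser step and the parsing step in separate functions means the slow Selenium call happens once per page, while all field extraction runs on the returned string.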

To answer the question of what makes a game popular, it seemed necessary to look at both the top and bottom games. For this, I used the site's search function with no parameters, and then sorted the resulting list by rank, both increasing (figure 2) and decreasing. Note that a game ranked #1 is considered better than a game ranked #100.

Fig 2: Game Search Page, Rank Increasing


Collecting games by increasing rank gave me access to 50 pages, each holding 100 games, except for the final page, which held only 99, for a total of 4999 ranked games. However, when searching by decreasing rank, the site returned many games with a rank of N/A. On BoardGameGeek, ranking is based on the site's own metric, Geek Rating, which depends on a number of factors, including number of votes. If too few people have voted, the Geek Rating cannot be calculated and the game cannot be ranked. For this reason, I was limited to the top 4999 games. Ultimately this was not an issue: given the sluggishness of Selenium and my time constraints, I would not have been able to scrape data for many more games.

Data Collected

As mentioned previously, the data set contained data for 4999 ranked games. For each game, the scraper attempted to collect:

  • Game Id
  • Name
  • Game Page
  • Year Published
  • BoardGameGeek Ranking
  • Number of Votes
  • Geek Rating
  • Average User Rating
  • User Rating Standard Deviation
  • Number of Comments
  • Number of Fans
  • Weight
  • Designers
  • Game Mechanics
  • Game Categories
  • Minimum Players
  • Maximum Players
  • Best Players
  • Minimum Age
  • Minimum Playtime
  • Maximum Playtime
  • Number of Expansions
  • Number of Plays
  • Number Owned
  • Number Previously Owned

Not all games had information for all of these features. All of them did have at least game id, name, page, geek rating, number of votes, average user rating, user rating standard deviation, number of comments, number of fans, number of expansions, number of plays, number owned, and number previously owned.

The remaining features were present for most games, and the values that were typically missing were the "maximum" ones. However, due to a bug in the code, the "best players" feature was missing for all games. If data was missing for any feature, it was replaced by an empty string. As can be seen in figure 3, most of the missing data is in max playtime. This means that, for those games, either no maximum playtime was listed or the scraper failed to collect it (as it did with best players). Minimum age is barely above 5% missing.

Fig 3: Proportion of Missing Values (Best Players Omitted)
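The proportions plotted in figure 3 could be computed along these lines. This is a minimal sketch that assumes each game is stored as a dictionary with missing values recorded as empty strings, as described above; the field names and toy records are illustrative, not the scraped data.

```python
def missing_proportions(games, features):
    """Return {feature: share of games whose value is an empty string}."""
    total = len(games)
    if total == 0:
        return {feature: 0.0 for feature in features}
    return {
        feature: sum(1 for game in games if game.get(feature, "") == "") / total
        for feature in features
    }


# Toy records; the field names and values are made up for illustration.
games = [
    {"name": "Game A", "min_age": "10", "max_playtime": "90"},
    {"name": "Game B", "min_age": "", "max_playtime": ""},
    {"name": "Game C", "min_age": "8", "max_playtime": ""},
    {"name": "Game D", "min_age": "12", "max_playtime": "45"},
]

proportions = missing_proportions(games, ["min_age", "max_playtime"])
print(proportions)
```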


Analysis

Numerical Features

To see what did and didn't make a good game, I began by looking at how different game features correlated with game rank. Figure 4 shows the correlations for game features such as playtime and minimum age. Generally, these are values suggested by the game producer. Average user rating was included in this plot to see how these features relate to it as well.

Figure 4: Game Feature Correlation

As can be seen in figure 4, features such as minimum age and minimum playtime have almost no correlation with geek rating, average rating, or rank. Number of expansions is somewhat positively correlated, but this is most likely a case of reverse causation: the more popular a game is, the more expansions it will receive.
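For readers who want to reproduce this kind of comparison, here is a small sketch of a pairwise correlation. It assumes the plots show Pearson correlation, which the post does not state explicitly, and the column values are toy numbers rather than real BoardGameGeek data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy columns, not real BoardGameGeek values.
columns = {
    "rank": [1, 2, 3, 4, 5],
    "geek_rating": [8.5, 8.1, 7.9, 7.4, 7.0],
    "min_age": [12, 8, 14, 10, 12],
}

for name in ("geek_rating", "min_age"):
    print(name, round(pearson(columns["rank"], columns[name]), 2))
```

Because a lower rank number is better, a feature that goes with better-ranked games shows up as a negative correlation with rank.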

Next, I looked at the correlation between user input and game rank. User input covers features that are more in line with user opinions, both explicit, through their ratings, and implicit, through how they interact and share ideas on the game page. For figure 5, I included average user rating in the correlation, since this plot looks at the different ways users voiced their thoughts on the game rather than at what came from the producers.


Figure 5: User Input Correlation

Again, it can be seen that geek rating and rank are tightly correlated (which is expected, since geek rating is how rank is decided). However, most other attributes are not closely correlated with them. The closest are number of votes and number of comments. One interesting thing to note here is that average rating is not a huge contributor towards geek rating and rank. While it may be nice to have a high user rating, it may be better to have a more active player community that posts a lot of comments.

Looking at the plots for some of these features, such as number of fans in figure 6, we see that there is a long tail up to high values.

Figure 6: Histogram of Number of Fans


However, if we look at log10(# of Fans) instead, we see a much better and clearer graph. Looking at figure 7, the histogram of log10(# of Fans) even takes a more normal structure.

Figure 7: Histogram of log10(# of Fans)
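The transformation behind figure 7 can be sketched as follows. The fan counts are invented, and skipping zero counts (for which log10 is undefined) is an assumption of this sketch; the post does not say how such games were handled.

```python
import math
import statistics


def log10_counts(counts):
    """Return log10 of each positive count; zero counts are skipped (an assumption)."""
    return [math.log10(c) for c in counts if c > 0]


# Made-up fan counts with a long right tail.
fans = [3, 5, 8, 12, 20, 35, 60, 150, 900, 12000]
log_fans = log10_counts(fans)

# In a long-tailed sample the mean sits far above the median;
# after the transform the two are much closer together.
print(statistics.mean(fans), statistics.median(fans))
print(round(statistics.mean(log_fans), 2), round(statistics.median(log_fans), 2))
```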


Similar graphs and results are produced when log10 is applied to comments and votes as well. These new values can then be run again through correlation to produce the plot in figure 8. Average rating follows a roughly normal distribution as well, and does not require a transformation.

Figure 8: Log10 Correlation Plot of User Input


Comparing this back to figure 5, we see that when log10 is applied, number of fans, number of votes and number of comments become much more important. By applying log10 to these values, it also helps clarify the trends. Compare figures 9 and 10, where original values are used in figure 9 and log10 values are used in figure 10. When looking at the graphs below, keep in mind that a lower rank is better.

Fig 9: Comparing Votes, Comments, and Rank


Fig 10: Comparing log10(Comments), log10(Votes), Rank


Figures 9 and 10 clearly show that games with higher ranks get both more votes and more comments. This keeps in line with what was shown in the log10 correlation plot in figure 8. These plots suggest that log10 of comments, votes, and fans are good indicators of how high a game will rank on BoardGameGeek. This still more or less stands to reason. A highly ranked game has a lot of fans, and gets lots of comments and votes. But even considering that, they still provide extra, useful information. The reason for this is that rating, and especially rank, are not normally distributed. But the log10 of fans, comments, and votes are very close to normal distributions. We can then use these, along with non-numerical features to start to look at what makes games popular.

Non-Numerical Features

Also included in the game data that was scraped is information like game categories, mechanics, and designers. This information provides the best visibility into what does and doesn't make for popular games as these are descriptors of the actual attributes of the games. Categories covers things like "Politics," "Economy," "WW II," and "Horror." Mechanics is more about "Dice Rolling," "Variable Player Powers," and "Trading." Not every game has listings for these, but more than 95% of the games collected do, which should be more than sufficient to look for trends in both categories and mechanics.

Since we have already seen that transforming certain features by log10 yields approximately normally distributed values (figure 7), we can run t-tests on those values for categories and mechanics, and see which ones are more likely to yield more fans, comments, or votes. As before, games that have more fans, comments, or votes tend to have a higher rank.
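Before any test can be run, the games have to be split into those that list a given category or mechanic and those that do not. A minimal sketch of that split is shown below; the record layout (a list of category strings per game) and the helper name are assumptions, since the post does not describe how the scraped data was stored.

```python
def split_by_label(games, field, label, value_key):
    """Split one numeric feature into games that list `label` and games that do not."""
    with_label, without_label = [], []
    for game in games:
        target = with_label if label in game[field] else without_label
        target.append(game[value_key])
    return with_label, without_label


# Toy records; each game carries a list of categories (storage format assumed).
games = [
    {"name": "Game A", "categories": ["Economic", "City Building"], "log_fans": 2.9},
    {"name": "Game B", "categories": ["Wargame", "World War II"], "log_fans": 1.7},
    {"name": "Game C", "categories": ["Economic"], "log_fans": 2.4},
    {"name": "Game D", "categories": [], "log_fans": 1.2},
]

economic, other = split_by_label(games, "categories", "Economic", "log_fans")
print(economic, other)
```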

Four t-tests were run for every category in categories, one each for average rating, fans, votes, and comments. In each case, the null hypothesis stated that games in the given category had the same average value for the tested feature as games outside it. For example, if the "Sci-Fi" category was being tested, the null hypothesis would state that the average log10(number of fans) for all games with "Sci-Fi" listed in their categories is equal to the average log10(number of fans) for all games without "Sci-Fi" listed in their categories. In all cases, the alternative hypothesis was that the average for games in the category was larger than the average for the other games. This would be like saying that "Sci-Fi" games have more fans on average than other games.
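A one-sided test of this kind can be sketched with the standard library alone. The version below is an illustration under stated assumptions, not the test used in the post: it uses Welch's unequal-variance statistic and approximates the p-value with a standard normal distribution, which is only reasonable when both groups are large. The sample values are invented and kept short for readability. With SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` would give the t-based p-value directly.

```python
import math
from statistics import NormalDist, mean, variance


def one_sided_welch(sample, rest):
    """Test H1: mean(sample) > mean(rest). Returns (statistic, approximate p-value)."""
    standard_error = math.sqrt(
        variance(sample) / len(sample) + variance(rest) / len(rest)
    )
    statistic = (mean(sample) - mean(rest)) / standard_error
    # Upper-tail probability under a standard normal, used here as a
    # large-sample stand-in for the t distribution.
    p_value = 1.0 - NormalDist().cdf(statistic)
    return statistic, p_value


# Made-up log10(number of fans) values for two groups of games.
sci_fi = [2.4, 2.9, 3.1, 2.7, 2.6, 3.0, 2.8, 2.5]
not_sci_fi = [1.9, 2.2, 2.0, 2.4, 1.8, 2.1, 2.3, 2.0]

statistic, p_value = one_sided_welch(sci_fi, not_sci_fi)
print(f"t = {statistic:.2f}, p = {p_value:.4g}")
```

A small p-value here supports the alternative that the first group has the larger mean; swapping the two arguments tests the opposite direction.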

Test Results: Categories

The table below shows, for each test, the ten categories with the lowest p-values. The p-value is the probability of seeing a difference at least as large as the one observed if the null hypothesis were true. Typically, a p-value below .05 is required in order to reject the null hypothesis at the 95% confidence level. For the categories below, we reject the null hypothesis at that level and conclude that the category yields a higher average rating, number of fans, number of votes, or number of comments than other categories. When looking at the top (or bottom) ten, order isn't hugely important; what matters is being in the top ten. The differences in placement within the list are not statistically significant.

Average Rating     | Number of Fans    | Number of Votes    | Number of Comments
Wargame            | Fighting          | City Building      | Economic
World War II       | Fantasy           | Economic           | City Building
Miniatures         | Science Fiction   | Fighting           | Exploration
Economic           | Miniatures        | Exploration        | Adventure
Napoleonic         | Adventure         | Adventure          | Political
American Civil War | Horror            | Medieval           | Medieval
Modern Warfare     | Exploration       | Fantasy            | Civilization
Civilization       | Zombies           | Territory Building | Fighting
Civil War          | Space Exploration | Humor              | Territory Building
Fighting           | Novel-Based       | Negotiation        | Negotiation

The table above starts to show some interesting things. War games tend to yield higher average ratings. Thematic games, on the other hand, tend to attract more fans. Finally, users are more likely to vote and comment on games that have some sort of constructive aspect to them. Since votes and comments tend to correlate highly with a higher rank (and average rating, interestingly, less so), games with constructive themes will likely rank higher. Especially interesting is that war games are not likely to yield more votes or comments, and so may not achieve a higher rank.

Test Results: Mechanics

Just as with the categories table above, t-tests produced the following table for game mechanics.

Average Rating                | Number of Fans                | Number of Votes             | Number of Comments
Hand Management               | Variable Player Powers        | Set Collection              | Set Collection
Variable Player Powers        | Dice Rolling                  | Hand Management             | Hand Management
Set Collection                | Modular Board                 | Card Drafting               | Area Control/Area Influence
Player Elimination            | Grid Movement                 | Variable Player Powers      | Auction/Bidding
Card Drafting                 | Co-operative Play             | Area Control/Area Influence | Card Drafting
Grid Movement                 | Action Point Allowance System | Player Elimination          | Variable Player Powers
Simultaneous Action Selection | Deck/Pool Building            | Auction/Bidding             | Player Elimination
Worker Placement              | Card Drafting                 | Grid Movement               | Modular Board
Co-Operative Play             | Area Movement                 | Modular Board               | Tile Placement
Variable Phase Order          | Worker Placement              | Worker Placement            | Grid Movement

Again, we see that number of votes and number of comments share very similar results. Some results show up in all sets. However, unlike with categories, game mechanics are much more spread out and don't really follow a trend for the top values.

That being said, the bottom values do follow an interesting trend. Over all tests, performance-based games did the worst. The people who visit BoardGameGeek do not like to act or sing in their games.

Test Results: Designers

Unfortunately, there are not enough data points for each designer to perform accurate t-tests, so I was unable to learn anything about which designers are and are not popular.

Conclusion

Based on the analysis performed, two of the major indicators of a game's rank on BoardGameGeek are the number of votes it has gotten and the number of comments it has gotten. Likely this is because people who enjoy a game want to get online and talk about it, and express how much they enjoyed it. If they don't like a game, they are probably more likely to simply never play it again.

Using the knowledge that votes and comments are indicators, and the fact that they are more or less normally distributed when transformed, we can use the number of votes and comments to see which categories and mechanics are doing the best. Games with some sort of constructive aspect (building cities, businesses, etc.) tend to yield more votes and comments. War games, while making for games with a higher average rating, do not. While it may seem like it would be best to go for games that yield higher average ratings, average rating isn't as highly correlated with rank on BoardGameGeek. While the top game mechanics are varied, one can still learn something by looking at the bottom mechanics. Performance-based games should be avoided. They don't increase the likelihood of higher ratings, fans, votes, or comments.

Further Work

Going forward, I'd like to look at more games. I was only able to scrape about 5000 games, and I think I could learn more from more games. I'd also like to look more into the connection between votes and comments. In all tests I performed, they were very tightly correlated, and I think it would be interesting to find out more. Finally, I want to run more varied tests and see how much that changes my results, so I can get a higher confidence in and better understanding of what makes for good categories and mechanics in games.

About Author

William Best

Over the years I have held several different programming roles, and the projects that interested me the most were the data-intensive ones. I received a BS and BE from NYU and Stevens respectively, and did my MEng at...