User Analysis on BeerAdvocate.com

Posted on Jul 15, 2019

User data is one of the fastest-growing commodities in today's marketplace. Every time you click, share, post or tag, that action is recorded and analyzed, bought and sold. Companies pay huge sums for this data and the insight it provides.

BeerAdvocate.com is a popular online community that catalogs beer information and user reviews and ratings. In 2017, one admin boasted that the site listed nearly 300,000 individual beers. Many of these beers have anywhere from dozens to thousands of reviews. To tap into this wealth of information, I built a Scrapy spider and analyzed the results.

Web Scraping

The spider worked by first exploring each style of beer, from American IPA to Wheat Beer, and then iteratively opening each beer in that style. I limited the spider to opening only beers with more than 100 ratings, both to increase efficiency and to focus on the bulk of user activity.
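The crawl order and the 100-rating cutoff can be sketched in plain Python. This is only an illustration of the filtering logic, not the spider itself: the function name, the shape of the input, and the sample beers are all assumptions made for the example.

```python
MIN_RATINGS = 100  # cutoff described in the post


def beers_to_scrape(styles):
    """Yield (style, beer_name) for beers with more than MIN_RATINGS ratings.

    `styles` is a hypothetical stand-in for the site's listing pages:
    a dict mapping a style name to a list of (beer_name, rating_count) pairs.
    """
    for style, beers in styles.items():          # first walk each style
        for name, rating_count in beers:         # then each beer in that style
            if rating_count > MIN_RATINGS:       # skip rarely rated beers
                yield style, name


# Invented sample data, for illustration only.
sample_styles = {
    "American IPA": [("Beer A", 2500), ("Beer B", 40)],
    "Wheat Beer": [("Beer C", 101), ("Beer D", 100)],
}
selected = list(beers_to_scrape(sample_styles))
```

Note that a beer with exactly 100 ratings is skipped here, since the cutoff is "more than 100".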

A little over three hours later, my spider had scraped data from nearly 10,000 individual beers and 1.7 million reviews. The data was stored in relationally linked CSV files and then processed.
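As a rough sketch of what "relationally linked" CSV files can look like, the example below writes a beers table and a reviews table that share a beer_id key, then joins them back together. The column names and rows are assumptions for illustration; the post does not state the actual schema. In-memory buffers are used in place of real files.

```python
import csv
import io

# In-memory buffers stand in for two CSV files (hypothetical layout).
beers_file = io.StringIO()
reviews_file = io.StringIO()

beer_writer = csv.DictWriter(beers_file, fieldnames=["beer_id", "name", "style"])
beer_writer.writeheader()
beer_writer.writerow({"beer_id": 1, "name": "Beer A", "style": "American IPA"})

# Each review row points back to its beer through beer_id.
review_writer = csv.DictWriter(
    reviews_file, fieldnames=["beer_id", "username", "rating"]
)
review_writer.writeheader()
review_writer.writerow({"beer_id": 1, "username": "user_1", "rating": 4.25})
review_writer.writerow({"beer_id": 1, "username": "user_2", "rating": 3.75})

# Read both tables back and join each review to its beer via the shared key.
beers_file.seek(0)
reviews_file.seek(0)
beers_by_id = {row["beer_id"]: row for row in csv.DictReader(beers_file)}
joined = [
    (beers_by_id[row["beer_id"]]["name"], row["username"], float(row["rating"]))
    for row in csv.DictReader(reviews_file)
]
```

Keeping beer details in one file and reviews in another avoids repeating the beer name and style on every review row.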

Numerical Analysis

Examining the data, one insight that stood out was the distribution of user activity. Since BeerAdvocate is primarily a review website, I looked into how many reviews each reviewer has written.

What I found was that more than 75% of the userbase has written fewer than 10 reviews each, whereas just 10 reviewers are responsible for more than 65% of all reviews.

The following graphs show that while the vast majority of the userbase is relatively inactive (fewer than 10 reviews to their name), a very small minority of users contributes the majority of the content.

Unique User Review Distribution

User Review Distribution
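The two summary numbers quoted above (the share of users with fewer than 10 reviews, and the share of reviews written by the most active reviewers) could be computed along these lines. This is a minimal sketch assuming the reviews are available as a list of author names, one entry per review; the function name and toy data are invented for the example.

```python
from collections import Counter


def activity_summary(review_authors, low_activity_cutoff=10, top_n=10):
    """Summarize how unevenly reviews are spread across users.

    `review_authors` holds one username per review (assumed input shape).
    """
    counts = Counter(review_authors)              # reviews per user
    total_reviews = sum(counts.values())
    low_activity_users = sum(
        1 for c in counts.values() if c < low_activity_cutoff
    )
    top_reviews = sum(c for _, c in counts.most_common(top_n))
    return {
        "low_activity_user_share": low_activity_users / len(counts),
        "top_reviewer_review_share": top_reviews / total_reviews,
    }


# Toy data: one heavy reviewer and three light ones.
toy_authors = ["alice"] * 12 + ["bob"] * 3 + ["carol"] * 1 + ["dave"] * 4
summary = activity_summary(toy_authors, low_activity_cutoff=10, top_n=1)
```

In the toy data, three of the four users fall under the cutoff, while the single most active user wrote 12 of the 20 reviews.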

Taking a closer look at the monthly activity of the top 10 most active users, we see two things: 1) their activity is rather stable, and 2) collectively they contribute nearly 70% of the content.

Super User Lifetime

Lifetime activity of the top 10 super users as a fraction of the whole.
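A per-month version of that fraction might be computed as follows. This is a sketch under the assumption that each review is available as a (username, "YYYY-MM") pair; the function name and sample rows are hypothetical.

```python
from collections import Counter


def monthly_top_user_share(reviews, top_n=10):
    """Return {month: fraction of that month's reviews written by top users}.

    `reviews` is an iterable of (username, 'YYYY-MM') pairs (assumed shape).
    Top users are the `top_n` users with the most reviews overall.
    """
    reviews = list(reviews)
    per_user = Counter(user for user, _ in reviews)
    top_users = {user for user, _ in per_user.most_common(top_n)}

    totals = Counter()     # all reviews per month
    from_top = Counter()   # reviews per month written by top users
    for user, month in reviews:
        totals[month] += 1
        if user in top_users:
            from_top[month] += 1
    return {month: from_top[month] / totals[month] for month in sorted(totals)}


# Invented sample rows.
toy_reviews = [
    ("alice", "2019-01"), ("alice", "2019-01"), ("bob", "2019-01"),
    ("alice", "2019-02"), ("carol", "2019-02"),
]
shares = monthly_top_user_share(toy_reviews, top_n=1)
```

A flat line across months in such a result would correspond to the stable activity described above.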

It's one thing to identify the extremely active user base; it's another to identify the monetary value of that user base. So I looked to see whether there was any correlation between the number of reviews on a given product and that product's rating.

BA Score by Number of Reviews

We see that the vast majority of beers are rated fewer than 1,000 times, and for these the BA Score (BeerAdvocate's average rating) is extremely variable. However, for products with 2,000 to 4,000 reviews, the score stabilizes around 4.4.
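One way to check this kind of pattern numerically is to group beers into buckets by review count and compare the mean and spread of the scores in each bucket. The sketch below assumes each beer is a (review_count, score) pair; the bucket size and the toy values are made up for illustration and are not taken from the scraped data.

```python
from statistics import mean, pstdev


def score_by_review_bucket(beers, bucket_size=1000):
    """Group beers by review count and summarize scores per bucket.

    `beers` is an iterable of (review_count, score) pairs (assumed shape).
    Returns {bucket_start: (mean_score, population_std_dev)}.
    """
    buckets = {}
    for review_count, score in beers:
        start = (review_count // bucket_size) * bucket_size
        buckets.setdefault(start, []).append(score)
    return {
        start: (mean(scores), pstdev(scores))
        for start, scores in sorted(buckets.items())
    }


# Invented values: scattered scores at low counts, tight scores at high counts.
toy_beers = [(150, 3.0), (800, 4.5), (2500, 4.4), (3200, 4.4)]
buckets = score_by_review_bucket(toy_beers, bucket_size=2000)
```

A shrinking standard deviation in the higher buckets would match the stabilization visible in the plot.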

BeerAdvocate maintains a Top 250 list of its most popular beers by BA Score. The lowest-rated beer on that list is Imperial Eclipse Stout, with a score of 4.46.

So in this data, the more times a beer is rated, the higher its BA Score tends to be, and the greater the likelihood that the beer lands on the Top 250. This in turn leads to higher visibility on the website.

Beer Advisor (A Recommender System)

Another way social media sites use user data is in building recommender systems. Similar to how Netflix suggests content to watch or Amazon suggests items to purchase, we can use BeerAdvocate user data to recommend beers to try.

I built such a recommender system using a User-Item Collaborative Filter. This filter takes your preferences, finds other users with similar preferences, and uses their history to suggest products.

Collaborative Filtering is a common technique due to its simplicity and computational ease; however, it tends to promote popularity bias and cannot handle obscure taste profiles. For example, if you only review extremely obscure items, it won't be able to recommend anything to you.
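To make the idea concrete, here is a minimal user-based collaborative filter in plain Python. It is a sketch, not the code behind Beer Advisor: cosine similarity, the neighbor count k, and the similarity-weighted average are all assumptions chosen for the example, and the users, beers, and ratings are invented.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two {beer: rating} dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0  # no beers in common, so no evidence of similar taste
    dot = sum(a[beer] * b[beer] for beer in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(target, ratings, k=2, n=3):
    """Suggest up to n beers the target has not rated, using k similar users."""
    target_ratings = ratings[target]
    neighbours = sorted(
        (
            (cosine_similarity(target_ratings, other), user)
            for user, other in ratings.items()
            if user != target
        ),
        reverse=True,
    )[:k]

    scores, weights = {}, {}
    for similarity, user in neighbours:
        if similarity <= 0:
            continue  # ignore users with no overlap
        for beer, rating in ratings[user].items():
            if beer in target_ratings:
                continue  # only suggest beers the target has not rated
            scores[beer] = scores.get(beer, 0.0) + similarity * rating
            weights[beer] = weights.get(beer, 0.0) + similarity

    # Similarity-weighted average rating for each candidate beer.
    predicted = {beer: scores[beer] / weights[beer] for beer in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:n]


# Invented ratings for illustration.
toy_ratings = {
    "ann": {"IPA One": 5, "Stout Two": 4},
    "ben": {"IPA One": 5, "Stout Two": 4, "Porter Three": 5},
    "cat": {"IPA One": 1, "Lager Four": 5},
    "dan": {"Obscure Five": 5},
}
for_ann = recommend("ann", toy_ratings, k=1)
for_dan = recommend("dan", toy_ratings)
```

Here "ann" is matched with "ben", whose ratings overlap closely with hers, and is offered the one beer he rated that she has not. "dan" has rated only a beer nobody else has tried, so he shares nothing with any other user and gets an empty list, which is the obscure-taste limitation described above.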

This recommender system can be used in all manner of scenarios, from targeted advertising to intelligent sales promotion.

Future Work

There is a wealth of textual data that I omitted from this analysis. Much of it could be used to vastly improve my recommender system. Additionally, review content could be examined using Natural Language Processing to provide insight into user favorability, polarity and product clustering.

Thank You

To examine the code that went into this project, and to check out my other work, please see my GitHub.
