User Data Analysis on BeerAdvocate.com
User data is one of the fastest-growing commodities in today's marketplace. Every time you click, share, post or tag, that action is recorded and analyzed, bought and sold. Companies pay huge sums for this data and the insight it provides.
BeerAdvocate.com is a popular online community that catalogs beer information and user reviews and ratings. In 2017, one admin boasted that the site lists nearly 300,000 individual beers. Many of these beers have anywhere from dozens to thousands of reviews. To tap into this wealth of information, I built a Scrapy spider and analyzed the results.
Web Scraping Data
The spider worked by first exploring each style of beer, from American IPA to Wheat Beer, and then iteratively opening each beer in that style. I limited the spider to beers with more than 100 ratings, both to increase efficiency and to focus on the bulk of user activity.
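The traversal described above can be sketched without any scraping library. This is only an illustration of the style-then-beer loop and the 100-rating cutoff, not the actual spider: the `crawl` function, the two callables it takes, and the in-memory `FAKE_STYLES` and `FAKE_REVIEWS` data are all hypothetical stand-ins for real page requests.

```python
from typing import Callable, Iterable, Iterator

MIN_RATINGS = 100  # threshold from the text: only open beers with more than 100 ratings


def crawl(
    style_urls: Iterable[str],
    list_beers: Callable[[str], Iterable[dict]],
    fetch_reviews: Callable[[str], Iterable[dict]],
    min_ratings: int = MIN_RATINGS,
) -> Iterator[dict]:
    """Walk every style page, then open each sufficiently rated beer in it."""
    for style_url in style_urls:
        for beer in list_beers(style_url):
            # Skip rarely rated beers before paying for another request.
            if beer["ratings"] <= min_ratings:
                continue
            for review in fetch_reviews(beer["url"]):
                yield {"beer": beer["name"], **review}


# Hypothetical in-memory stand-in for the site, so the sketch runs offline.
FAKE_STYLES = {
    "/style/american-ipa": [
        {"name": "Hop Example", "url": "/beer/1", "ratings": 2500},
        {"name": "Obscure Ale", "url": "/beer/2", "ratings": 12},
    ],
    "/style/wheat-beer": [
        {"name": "Wheat Example", "url": "/beer/3", "ratings": 101},
    ],
}
FAKE_REVIEWS = {
    "/beer/1": [{"user": "alice", "score": 4.5}, {"user": "bob", "score": 4.0}],
    "/beer/2": [{"user": "carol", "score": 3.0}],
    "/beer/3": [{"user": "alice", "score": 3.75}],
}

rows = list(crawl(FAKE_STYLES, FAKE_STYLES.__getitem__, FAKE_REVIEWS.__getitem__))
```

Filtering on the rating count before opening a beer page is what saves requests: the beer with 12 ratings is never fetched.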
After just over 3 hours, my spider had scraped data from nearly 10,000 individual beers and 1.7 million reviews. The data was stored in relationally linked CSV files and then processed.
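As a rough idea of what "relationally linked" CSV files can look like, here is a minimal sketch with one table of beers and one of reviews, joined on a shared key. The column names (`beer_id`, `review_id`, and so on) and the sample rows are assumptions for illustration, not the project's actual schema.

```python
import csv
import io

# Hypothetical rows; the real files would hold the scraped data.
beers = [
    {"beer_id": 1, "name": "Hop Example", "style": "American IPA"},
    {"beer_id": 2, "name": "Wheat Example", "style": "Wheat Beer"},
]
reviews = [
    {"review_id": 10, "beer_id": 1, "user": "alice", "score": 4.5},
    {"review_id": 11, "beer_id": 1, "user": "bob", "score": 4.0},
    {"review_id": 12, "beer_id": 2, "user": "alice", "score": 3.75},
]


def to_csv(rows):
    """Serialize a list of dicts to CSV text (a file on disk in practice)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def join_reviews(beers_csv, reviews_csv):
    """Re-attach beer names to reviews using the shared beer_id key."""
    names = {
        row["beer_id"]: row["name"]
        for row in csv.DictReader(io.StringIO(beers_csv))
    }
    return [
        {**row, "beer_name": names[row["beer_id"]]}
        for row in csv.DictReader(io.StringIO(reviews_csv))
    ]


joined = join_reviews(to_csv(beers), to_csv(reviews))
```

Keeping beer attributes in their own file avoids repeating them on every one of the review rows; note that values read back from CSV are strings and need converting before numeric work.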
Numerical Data Analysis
Examining the data, one insight that stood out was the distribution of user activity. Since BeerAdvocate is primarily a review website, I looked into how many reviews each reviewer has written.
What I found was that more than 75% of the user base has written fewer than 10 reviews each, whereas just 10 reviewers are responsible for more than 65% of all reviews.
Plotting this distribution shows that while the vast majority of the user base is relatively inactive (fewer than 10 reviews to their name), a very small minority of users contributes the majority of the content.
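Both figures above come down to counting reviews per user. A minimal sketch of that calculation, assuming the input is simply one username per review (the function name `activity_summary` and the toy data are hypothetical):

```python
from collections import Counter


def activity_summary(review_authors, low_activity_cutoff=10, top_n=10):
    """Summarize how review volume is spread across reviewers.

    review_authors holds one username per review.
    """
    per_user = Counter(review_authors)
    total_reviews = sum(per_user.values())
    low_activity_users = sum(
        1 for n in per_user.values() if n < low_activity_cutoff
    )
    top_reviews = sum(n for _, n in per_user.most_common(top_n))
    return {
        "share_of_users_below_cutoff": low_activity_users / len(per_user),
        "share_of_reviews_from_top_n": top_reviews / total_reviews,
    }


# Toy data: one prolific reviewer and nine occasional ones.
authors = ["power_user"] * 91 + [f"casual_{i}" for i in range(9)]
summary = activity_summary(authors, low_activity_cutoff=10, top_n=1)
```

On the toy data, nine of the ten users fall below the cutoff while the single top reviewer accounts for 91 of the 100 reviews, the same kind of skew described above.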
Taking a closer look at the monthly activity of the top 10 most active users, we see two things: 1) their activity is rather stable, and 2) collectively they contribute nearly 70% of the content.
It's one thing to identify the extremely active user base; it's another to identify the monetary value of that user base. So I looked at whether there was any correlation between the number of reviews on a given product and that product's rating.
We see that the vast majority of beers are rated fewer than 1,000 times, and for these the BA Score (BeerAdvocate's average rating) is extremely variable. However, for products with 2,000 to 4,000 reviews, the score stabilizes around 4.4.
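One way to check for this kind of stabilization is to bucket beers by how often they were rated and compare the spread of scores in each bucket. The sketch below assumes the data is available as (rating count, score) pairs; the function name, the bucket size, and the made-up sample values are illustrative only and are not results from the scrape.

```python
from statistics import mean, pstdev


def score_spread_by_bucket(beers, bucket_size=1000):
    """Group beers by rating count; report mean and spread of scores per bucket."""
    buckets = {}
    for n_ratings, score in beers:
        start = (n_ratings // bucket_size) * bucket_size
        buckets.setdefault(start, []).append(score)
    return {
        start: {
            "beers": len(scores),
            "mean": mean(scores),
            "spread": pstdev(scores),  # population standard deviation
        }
        for start, scores in sorted(buckets.items())
    }


# Made-up (rating count, score) pairs, for illustration only.
toy_beers = [
    (150, 3.0), (400, 4.6), (900, 3.8),
    (2200, 4.3), (2800, 4.5), (3500, 4.4),
]
spread = score_spread_by_bucket(toy_beers)
```

A shrinking `spread` in the higher buckets would correspond to the stabilization described above, though buckets holding only a few beers say little on their own.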
BeerAdvocate maintains a Top 250 list of its most popular beers by BA Score. The lowest rated beer on that list is Imperial Eclipse Stout, with a score of 4.46.
So we see that beers rated more often tend to have higher BA Scores, which raises the likelihood that they make the Top 250. This in turn leads to higher visibility on the website.
Beer Advisor (A Recommender System)
Another way social media sites use user data is in building recommender systems. Similar to how Netflix suggests content to watch or Amazon suggests items to purchase, we can use BeerAdvocate user data to recommend beers to try.
I built such a recommender system using a User-Item Collaborative Filter. This filter takes your preferences, finds other users with similar preferences, and uses their history to suggest products.
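To make the idea concrete, here is a minimal sketch of user-based collaborative filtering. It assumes cosine similarity between users and a simple similarity-weighted sum for ranking; the similarity measure actually used in the project is not stated, and the function names and toy ratings are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dicts (beer -> score)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[beer] * b[beer] for beer in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(target, ratings, k=2, n=3):
    """Suggest up to n beers the target has not rated, using the k most similar users."""
    mine = ratings[target]
    neighbours = sorted(
        (
            (cosine_similarity(mine, theirs), user)
            for user, theirs in ratings.items()
            if user != target
        ),
        reverse=True,
    )[:k]
    scores = {}
    for sim, user in neighbours:
        for beer, score in ratings[user].items():
            if beer not in mine:
                # Weight each neighbour's rating by how similar they are to the target.
                scores[beer] = scores.get(beer, 0.0) + sim * score
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy ratings, for illustration only.
toy_ratings = {
    "you": {"IPA A": 5.0, "Stout B": 2.0},
    "alice": {"IPA A": 5.0, "Stout B": 1.5, "IPA C": 4.5},
    "bob": {"IPA A": 1.0, "Stout B": 5.0, "Porter D": 4.8},
}
suggestions = recommend("you", toy_ratings)
```

Here "alice" rates the shared beers much like "you" do, so her unrated-by-you pick ranks first. Because scores are summed across neighbours, beers that many similar users have rated accumulate weight, which is one way a filter like this ends up favouring popular items.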
Collaborative filtering is a common technique due to its simplicity and computational ease. However, it tends to promote popularity bias and cannot handle obscure taste profiles; i.e., if you only review extremely obscure items, it won't be able to find similar users to draw recommendations from.
This recommender system can be used in many scenarios, from targeted advertising to intelligent sales promotion.
Future Work
There is a wealth of textual data that I omitted from the analysis. Much of that information could be used to vastly improve my recommender system. Additionally, review content can be examined using Natural Language Processing to provide insight into user favorability, polarity and product clustering.
Thank You
To examine the code that went into this project and to check out my other work, please see my GitHub.