User Analysis on

Charles Cohen
Posted on Jul 15, 2019

User data is one of the largest growing commodities in today's marketplace. Everytime you click, share, post or tag, that action is recorded and analyzed, bought and sold. Companies pay huge sums for this data and the insight it provides. is a popular online community that catalogs beer information and user reviews/ratings. In 2017, one admin boasted the site lists nearly 300,000 individual beers. Many of these beers have between dozens and thousands of individuals reviewing them. In order to tap into this wealth of information, I built a scrapy spider and analyzed the results.


The spider worked by first exploring each style of beer, from American IPA to Wheat Beer, and then iteratively opening each beer in that style. I limited my spider to only opening beers with more than 100 ratings in an effort to increase efficiency, and focus on the bulk of user activity.

Over 3 hours later, my spider scraped data from nearly 10,000 individual beers and 1.7 million reviews. The data was stored in relationally linked csv files and processed.

Numerical Analysis

Examining the data, one insight that stood out was the distribution of user activity. Since BeerAdvocate is primarily a review website, I looked into how many reviews each reviewer has written.

What I found was that more than 75% of the userbase has written less than 10 reviews each, whereas there are 10 reviewers that are responsible for more than 65% of the number of reviews.

In the following graphs we can infer that while the vast majority of the userbase is relatively inactive (less than 10 reviews to their name), there is a severe minority of users that contribute the majority of the content.

Unique User Review Distribution

User Review Distribution

Taking a closer look at the top 10 most active users monthly activity we see two things. 1) They're activity is rather stable and 2) collectively they contribute to nearly 70% of the content.

Super User Lifetime

Lifetime activity of the top 10 super users as a fraction of the whole.

It's one thing to identify the extremely active user base, it's another to identify the monetary value of said user base. So I looked to see if there was any correlation to the number of reviews on a given product and that's product rating.

BA Score by Number of Reviews

BA Score by Number of Reviews

We see that the vast majority of beers are rated under 1000 times, and for these the BA Score (Beer Advocates average rating) is extremely variable. However as we view products with 2000 - 4000 reviews, the Score stabilizes around 4.4.

BeerAdvocate maintains a Top 250 list of it's most popular beers by BA Score. The lowest rated beer on that list is Imperial Eclipse Stout with a score of 4.46.

So we see that the more times a beer is rated, the higher its BA Score, the greater the likelihood that beer gets on the Top 250. This in turn leads to higher visibility on the website.

Beer Advisor (A Recommender System)

Another way social media sites use user data is in building recommender systems. Similar to how Netflix suggests content to watch or Amazon suggest items to purchase, we can user BeerAdvocate user data to recommend beers to try.

I built such a recommender system employing a User-Item Collaborative Filter. This filter takes your user preferences and finds others with similar preference to you, and uses their history to suggest products.

Collaborative Filtering is a common technique due to it's simplicity and computational ease, however it tends to promote popularity bias and cannot handle obscure taste profiles. I.E. if you only reviewer extremely obscure items, it won't be able to recommend to you.

This recommender system can be used in all manner of scenarios from providing targeted advertising to intelligent sales promotion.

Future Work

There is a wealth of textual data that I omitted from analysis. Much of that info can be used to vastly improve my recommender system. Additionally, review content can be examined using Natural Language Processsing to provide insight in user favorability, polarity and product clustering.

Thank You

To examine the code that went into this project and to check out my other work please go see my github.

About Author

Charles Cohen

Charles Cohen

Charles Cohen is currently teaching at the NYC Data Science Academy. Charles studied Physical Sciences at the City College of New York and subsequently worked in research and non-profit environments. Charles is a self-motivated learner who eagerly adapts...
View all posts by Charles Cohen >

Related Articles

Leave a Comment

CBD For Dogs December 14, 2020
CBD For Dogs [...]just beneath, are a lot of completely not related sites to ours, nevertheless, they're certainly really worth going over[...]
Google September 30, 2020
Google Very handful of web-sites that take place to become in depth beneath, from our point of view are undoubtedly properly really worth checking out.
Google August 31, 2020
Google The time to read or pay a visit to the subject material or internet sites we have linked to below.
Backlink August 28, 2020
Backlink [...]here are some hyperlinks to web pages that we link to because we believe they're really worth visiting[...]
OnHax Me August 19, 2020
OnHax Me [...]Every when in a although we decide on blogs that we study. Listed beneath are the newest internet sites that we select [...]
Avatar August 5, 2020 [...]here are some hyperlinks to web sites that we link to mainly because we assume they are worth visiting[...]
Avatar July 30, 2020 [...]Every the moment inside a even though we opt for blogs that we study. Listed beneath are the newest web pages that we pick out [...]
cbd for pain July 9, 2020
cbd for pain [...]that could be the end of this report. Right here you’ll discover some web pages that we consider you’ll enjoy, just click the links over[...]

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 airbnb Alex Baransky alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus API Application artist aws beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep Bundles California Cancer Research capstone Career Career Day citibike clustering Coding Course Demo Course Report D3.js data Data Analyst data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization Deep Learning Demo Day Discount dplyr employer networking feature engineering Finance Financial Data Science Flask gbm Get Hired ggplot2 googleVis Hadoop higgs boson Hiring hiring partner events Hiring Partners Industry Experts Instructor Blog Instructor Interview Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter lasso regression Lead Data Scienctist Lead Data Scientist leaflet linear regression Logistic Regression machine learning Maps matplotlib Medical Research Meet the team meetup Networking neural network Neural networks New Courses nlp NYC NYC Data Science nyc data science academy NYC Open Data NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time Portfolio Development prediction Prework Programming PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn Selenium sentiment analysis Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau team TensorFlow Testimonial tf-idf Top Data Science Bootcamp twitter visualization web scraping Weekend Course What to expect word cloud word2vec XGBoost yelp