User Data Analysis on BeerAdvocate

Posted on Jul 15, 2019

User data is one of the fastest-growing commodities in today's marketplace. Every time you click, share, post or tag, that action is recorded and analyzed, bought and sold. Companies pay huge sums for this data and the insight it provides. BeerAdvocate is a popular online community that catalogs beer information and user reviews and ratings. In 2017, one admin boasted that the site lists nearly 300,000 individual beers. Many of these beers have anywhere from dozens to thousands of individual reviews. To tap into this wealth of information, I built a Scrapy spider and analyzed the results.

Webscraping Data

The spider worked by first exploring each style of beer, from American IPA to Wheat Beer, and then iteratively opening each beer in that style. I limited my spider to opening only beers with more than 100 ratings, both to increase efficiency and to focus on the bulk of user activity.
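The crawl order and the rating threshold described above can be sketched without Scrapy itself. This is a minimal illustration, not the author's spider: `list_beers` is a hypothetical stand-in for fetching and parsing one style listing page, and `FAKE_SITE` is made-up data used only to make the example runnable.

```python
from typing import Callable, Dict, Iterator, List

MIN_RATINGS = 100  # threshold from the post: only open beers with more than 100 ratings


def crawl(styles: List[str],
          list_beers: Callable[[str], List[Dict]]) -> Iterator[Dict]:
    """Visit every style, then every beer in that style above the threshold.

    `list_beers` is a hypothetical helper that returns one dict per beer
    on a style page, with 'name' and 'ratings' keys.
    """
    for style in styles:
        for beer in list_beers(style):
            if beer["ratings"] > MIN_RATINGS:
                # In a real spider this is where the beer page would be opened.
                yield {"style": style, **beer}


# Made-up listing data standing in for the live site.
FAKE_SITE = {
    "American IPA": [{"name": "Beer A", "ratings": 2500},
                     {"name": "Beer B", "ratings": 40}],
    "Wheat Beer": [{"name": "Beer C", "ratings": 101},
                   {"name": "Beer D", "ratings": 100}],
}

scraped = list(crawl(list(FAKE_SITE), FAKE_SITE.__getitem__))
```

Filtering on the listing page, before a beer's own page is requested, is what saves the time: beers below the threshold never cost a request.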

Over 3 hours later, my spider had scraped data from nearly 10,000 individual beers and 1.7 million reviews. The data was stored in relationally linked CSV files and processed.
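As a rough sketch of what "relationally linked" files might look like, the example below keeps beers and reviews in separate CSV tables joined on a shared key. The column names (`beer_id`, `user`, `rating`) and all rows are assumptions for illustration; the post does not give the actual schema.

```python
import csv
import io

# Made-up rows; the real column names in the author's files are not given.
BEERS_CSV = """beer_id,name,style
1,Beer A,American IPA
2,Beer C,Wheat Beer
"""
REVIEWS_CSV = """review_id,beer_id,user,rating
10,1,alice,4.5
11,1,bob,4.0
12,2,alice,3.5
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Index beers by their key so each review can look up its beer directly.
beers = {row["beer_id"]: row for row in load_rows(BEERS_CSV)}

# Join every review to its beer through the shared beer_id column.
joined = [
    {
        "user": review["user"],
        "rating": float(review["rating"]),
        "beer": beers[review["beer_id"]]["name"],
        "style": beers[review["beer_id"]]["style"],
    }
    for review in load_rows(REVIEWS_CSV)
]
```

Storing beer details once and referencing them by id avoids repeating the name and style on each of the 1.7 million review rows.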

Numerical Data Analysis

Examining the data, one insight that stood out was the distribution of user activity. Since BeerAdvocate is primarily a review website, I looked into how many reviews each reviewer has written.

What I found was that more than 75% of the userbase has written fewer than 10 reviews each, whereas just 10 reviewers are responsible for more than 65% of all reviews.
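Both figures come from counting reviews per reviewer. A minimal sketch of that calculation, assuming the only input is a list with one author name per review (the function name, parameters and toy data are illustrative, not taken from the project):

```python
from collections import Counter


def activity_summary(review_authors, light_cutoff=10, top_n=10):
    """Summarize how unevenly reviews are spread across reviewers."""
    counts = Counter(review_authors)          # reviews written per user
    total_reviews = sum(counts.values())
    # Users below the cutoff count as relatively inactive.
    light_users = sum(1 for n in counts.values() if n < light_cutoff)
    # Reviews written by the top_n most active users.
    top_reviews = sum(n for _, n in counts.most_common(top_n))
    return {
        "light_user_share": light_users / len(counts),
        "top_review_share": top_reviews / total_reviews,
    }


# Toy data: one heavy reviewer and three occasional ones.
authors = ["power"] * 12 + ["casual1"] * 2 + ["casual2"] + ["casual3"]
summary = activity_summary(authors, light_cutoff=10, top_n=1)
```

On this toy data, three of the four users fall under the cutoff while the single heavy reviewer wrote 12 of the 16 reviews, the same shape of imbalance described above.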

From the following graphs we can infer that while the vast majority of the userbase is relatively inactive (fewer than 10 reviews to their name), a small minority of users contributes the majority of the content.


User Review Distribution

Taking a closer look at the monthly activity of the top 10 most active users, we see two things: 1) their activity is rather stable, and 2) collectively they contribute nearly 70% of the content.


Lifetime activity of the top 10 super users as a fraction of the whole.

It's one thing to identify the extremely active user base; it's another to identify the monetary value of that user base. So I looked for any correlation between the number of reviews on a given product and that product's rating.


BA Score by Number of Reviews

We see that the vast majority of beers are rated fewer than 1,000 times, and for these the BA Score (BeerAdvocate's average rating) is extremely variable. However, for products with 2,000 to 4,000 reviews, the score stabilizes around 4.4.
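One way to check this kind of pattern numerically is to group beers into bands by review count and compare the mean and spread of the scores in each band. The sketch below assumes each beer is a `(number_of_reviews, score)` pair; the band width and the toy numbers are made up for illustration.

```python
from statistics import mean, pstdev


def score_by_review_band(beers, band_width=1000):
    """Return {band_start: (mean score, population std dev)} per review-count band."""
    bands = {}
    for n_reviews, score in beers:
        band_start = n_reviews // band_width * band_width
        bands.setdefault(band_start, []).append(score)
    return {
        start: (round(mean(scores), 2), round(pstdev(scores), 2))
        for start, scores in sorted(bands.items())
    }


# Toy data: widely varying scores at low counts, tighter scores at high counts.
beers = [(150, 3.0), (400, 4.0), (900, 5.0), (2200, 4.3), (2800, 4.5)]
bands = score_by_review_band(beers)
```

A shrinking standard deviation in the higher bands is what "stabilizes" would look like in numbers.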

BeerAdvocate maintains a Top 250 list of its most popular beers by BA Score. The lowest rated beer on that list is Imperial Eclipse Stout with a score of 4.46.

So we see that the more times a beer is rated, the higher its BA Score tends to be, and the greater the likelihood that the beer makes the Top 250. This in turn leads to higher visibility on the website.

Beer Advisor (A Recommender System)

Another way social media sites use user data is in building recommender systems. Similar to how Netflix suggests content to watch or Amazon suggests items to purchase, we can use BeerAdvocate user data to recommend beers to try.

I built such a recommender system using a User-Item Collaborative Filter. This filter takes your preferences, finds other users with similar preferences, and uses their history to suggest products.
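The idea can be sketched in a few lines. This is not the author's implementation: it assumes ratings are held in a nested dict of `user -> {beer: rating}`, uses cosine similarity as one common choice for "similar preferences", and scores unseen beers by a similarity-weighted average over the `k` nearest users. All names and data are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two users' rating dicts (0.0 if no overlap)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(target, ratings, k=2, top_n=3):
    """Suggest beers the target has not rated, using the k most similar users."""
    mine = ratings[target]
    neighbours = sorted(
        ((cosine(mine, theirs), user)
         for user, theirs in ratings.items() if user != target),
        reverse=True,
    )[:k]
    scores, weights = {}, {}
    for sim, user in neighbours:
        if sim <= 0:
            continue  # users with nothing in common carry no signal
        for beer, rating in ratings[user].items():
            if beer not in mine:
                scores[beer] = scores.get(beer, 0.0) + sim * rating
                weights[beer] = weights.get(beer, 0.0) + sim
    # Rank by similarity-weighted average rating.
    ranked = sorted(scores, key=lambda b: scores[b] / weights[b], reverse=True)
    return ranked[:top_n]


# Made-up ratings for illustration.
ratings = {
    "ann": {"IPA One": 5, "Stout Two": 4},
    "ben": {"IPA One": 5, "Stout Two": 4, "Porter Three": 5, "Lager Four": 2},
    "cat": {"IPA One": 4, "Stout Two": 5, "Lager Four": 3},
    "dan": {"Sour Five": 5},
}
```

Here `recommend("ann", ratings)` ranks the beers that ann's nearest neighbours rated and she has not, while `recommend("dan", ratings)` returns an empty list because dan shares no rated beers with anyone.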

Collaborative filtering is a common technique due to its simplicity and computational ease; however, it tends to promote popularity bias and cannot handle obscure taste profiles. That is, if you only review extremely obscure items, it won't be able to recommend anything to you.

This recommender system can be used in all manner of scenarios from providing targeted advertising to intelligent sales promotion.

Future Work

There is a wealth of textual data that I omitted from this analysis. Much of that information could be used to vastly improve my recommender system. Additionally, review content can be examined using Natural Language Processing to provide insight into user favorability, polarity and product clustering.

Thank You

To examine the code that went into this project and to check out my other work, please see my GitHub.

About Author

Charles Cohen

Charles Cohen is currently teaching at the NYC Data Science Academy. Charles studied Physical Sciences at the City College of New York and subsequently worked in research and non-profit environments. Charles is a self-motivated learner who eagerly adapts...