Scraping data for different Beers

Posted on Feb 19, 2017

Beer is a staple of American culture. In recent years especially, the American craft beer industry has exploded with microbreweries and a wide range of styles. I wanted to answer several questions:

  1. How has the craft beer explosion affected the perception of American light lagers like Bud and Coors over time?
  2. The age-old question, which is better: Bud or Coors?
  3. Which American style is the highest rated and the most coveted?
  4. Is alcohol content correlated with rating?

Data & Scraping

I scraped my data from a beer review website, where there were two different resources that I wanted to scrape. First, for each cheap lager, I wanted a large sample of user reviews with their look, smell, feel, taste and overall ratings. For this, I used Selenium, as it could easily navigate from page to page and pull out those attributes along with the date the review was posted. Unfortunately, all of these values came back in the same string, so I had to do some extra data cleaning in my global.R file.
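The cleaning itself lived in R, in global.R. To illustrate the idea only, here is a minimal Python sketch of splitting one combined string into separate numeric ratings. The string format shown and the name parse_review are assumptions made for this example, not the site's actual markup or my original code.

```python
import re

# Hypothetical format: the real scraped string may look different.
ATTRIBUTES = ("look", "smell", "taste", "feel", "overall")
PATTERN = re.compile(r"(look|smell|taste|feel|overall):\s*(\d+(?:\.\d+)?)")

def parse_review(raw):
    """Split one combined review string into a dict of numeric ratings."""
    found = {name: float(value) for name, value in PATTERN.findall(raw.lower())}
    # Attributes missing from the string stay as None rather than being guessed.
    return {name: found.get(name) for name in ATTRIBUTES}

example = "look: 4 | smell: 3.5 | taste: 3.75 | feel: 4 | overall: 3.75"
print(parse_review(example))
```

Keeping missing attributes as None makes incomplete reviews easy to spot and drop later, instead of silently treating them as zeros.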


Name, Brewery, Number of Ratings, and Alcohol Content

Second, I scraped the name, brewery, number of ratings, average rating and alcohol content of individual beers for a number of choice American Ale styles. As craft ales are by far the most popular in America, I wanted a good read on the highest rated and most reviewed styles. Scrapy worked better for parsing the table, so I used a spider for each individual style. It was easy to add 50 to the URL each time to go page to page, grabbing the data.
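The paging trick described above can be sketched in a few lines of Python. The base URL and the query parameter name "start" are placeholders assumed for illustration; only the step of 50 comes from the description above.

```python
from urllib.parse import urlencode

PAGE_SIZE = 50  # the offset grows by 50 per page, as described above

def page_urls(base_url, n_pages, param="start"):
    """Build one URL per results page by stepping the offset by PAGE_SIZE."""
    return [
        f"{base_url}?{urlencode({param: page * PAGE_SIZE})}"
        for page in range(n_pages)
    ]

# Hypothetical base URL, for illustration only
for url in page_urls("https://example.com/beer/style/1/", 3):
    print(url)
```

A list like this could be handed to a Scrapy spider as its start_urls, with one such list per style.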


Cheap Lagers

I quickly realized I would probably need two Shiny apps. I didn’t want a single app to become too crowded and take away from either of my investigations, so I decided to focus on them separately.

The “College Lagers” Shiny app opens with a simple boxplot of the selected attribute, grouped by beer. Across the top, you can see the maximum and minimum mean values among the beers currently displayed. Selecting or deselecting beers updates these values.



It turns out Coors is rated better than Budweiser on average. Who would’ve thought! Coors Light also seems to do a bit better than Bud Light, which is important to consider given that hardly anyone drinks plain Coors, the “banquet beer”. Much to my dismay, Miller High Life is rated higher than all of them. In my personal opinion, the “champagne of beers” does not live up to the name.


Ratings Over Time

Clicking the next tab, we can see the ratings over time represented by boxplots with a geom_smooth trend line. I was surprised to find that the mean rating for these beers increased over time. Given the shift towards craft beer, I had assumed the perception of these beers would decline. Of course, the site's expanding user base may account for part of this. Additionally, I noticed some very obviously sarcastic reviews for some of these beers, which may have skewed the data. Still, it is perplexing that the overall perception of these non-craft lagers improved as the craft beer market expanded. I would need to gather more data to explain why.


The final tab simply contains my table of data, in case you want to search for something specific. I never ended up using rDev, which is a measure of how far a review is from a beer's average rating. Moving forward, I would like to work this into my analysis.
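As a rough idea of how rDev could be used, here is a small Python sketch. It assumes rDev is the percent deviation of a single review's score from the beer's average rating. That definition is my assumption for illustration, and the site may compute it differently.

```python
def r_dev(review_score, beer_average):
    """Percent deviation of one review from the beer's average (assumed definition)."""
    if beer_average == 0:
        raise ValueError("beer_average must be non-zero")
    return (review_score - beer_average) / beer_average * 100

# Made-up numbers: a 2.0 review of a beer that averages 4.0
print(r_dev(2.0, 4.0))  # -50.0
```

Reviews with very large positive or negative values would be natural candidates to inspect for sarcasm or outliers.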

American Ales

My next app presents the data for each beer, organized by style. To get an idea of what the data looks like, we can view the “Table” tab at the bottom of the dashboard.


The “Bros” rating is the rating given to the beer by the founders of the website. Given it’s missing for a number of beers, I decided not to make use of it. Everything else should be self-explanatory. “Reviews” is the number of reviews that beer has, “ABV” is alcohol by volume, and “Avg” is the average rating of that beer by users.

Histogram of Number of Reviews

Going back to the first tab, we can see a histogram for the number of reviews each style of beer has. As I expected, the IPA is by far the most reviewed in America. By using the sliders on the left, we can filter this and all other graphs by number of reviews, ABV, or average rating to get subsets of the data. If the slider for the number of reviews goes over 4500, the histogram gets colored in by brewery.


Scatterplot of Alcohol and Average Rating

Moving down, we can see a scatterplot of alcohol and average rating with a linear regression line. Although this was more EDA than anything, I wanted to play with my data and see if I could find a good linear model. Using the sliders on the left, we can subset the data and get updated linear regressions.


Unfortunately, there seemed to be no strong linear relationship. My highest R² value only got to about 0.3, meaning a straight line explained at most about 30% of the variance in average rating, which is a poor fit. However, we can see some sort of shape to the data. As the ABV climbs further above 5, the number of poorly reviewed beers decreases dramatically. While there are plenty of good beers with low alcohol content, there are no bad beers with high alcohol content. Moving forward, I would like to apply more of my newfound machine learning skills to this data.
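For readers who want to see where an R² value comes from, here is a minimal Python sketch of a one-variable least-squares fit. The app itself was built in R, so this is not my original code, and the ABV and rating numbers below are made up purely for illustration.

```python
def linear_fit_r2(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Made-up ABV (%) and average ratings, for illustration only
abv = [4.2, 5.0, 6.5, 7.0, 9.5, 11.0]
avg = [3.1, 3.9, 3.6, 4.1, 4.0, 4.3]
intercept, slope, r2 = linear_fit_r2(abv, avg)
print(f"rating = {intercept:.2f} + {slope:.2f} * ABV, R^2 = {r2:.2f}")
```

Rerunning a fit like this on each subset chosen with the sliders is essentially what the updated regression lines in the app show.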

The last tab simply contains a boxplot of average rating by style. Imperial stouts have the highest average rating of the American ales. However, by playing with the sliders, we can see other styles take the lead for various reasons. Note that it's helpful to keep the minimum number of reviews at 10 or more, since there are many unreviewed beers, or beers with only 1 review, skewing the average.
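The effect of that review threshold can be sketched in Python as follows. The field names (style, reviews, avg) mirror the table columns described earlier, but the function name and the sample records are hypothetical.

```python
from collections import defaultdict

def mean_rating_by_style(beers, min_reviews=10):
    """Average the per-beer ratings within each style, skipping thinly reviewed beers."""
    ratings = defaultdict(list)
    for beer in beers:
        if beer["reviews"] >= min_reviews and beer["avg"] is not None:
            ratings[beer["style"]].append(beer["avg"])
    return {style: sum(vals) / len(vals) for style, vals in ratings.items()}

# Made-up records, for illustration only
beers = [
    {"style": "IPA", "reviews": 120, "avg": 4.0},
    {"style": "IPA", "reviews": 45, "avg": 4.5},
    {"style": "IPA", "reviews": 1, "avg": 1.0},    # dropped: only one review
    {"style": "Stout", "reviews": 0, "avg": None}, # dropped: unreviewed
    {"style": "Stout", "reviews": 30, "avg": 4.5},
]
print(mean_rating_by_style(beers))  # {'IPA': 4.25, 'Stout': 4.5}
```

Without the threshold, the single one-star review would pull the IPA average down to about 3.17, which is the kind of skew the slider is meant to avoid.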


Going Forward

Overall, the scraping and these apps were enjoyable to create. Going forward, I would like to incorporate more machine learning and predictive modeling to define a “good” beer. By scraping user reviews for every beer, I could also incorporate more factors into my data, as I did with the cheap lagers app. With regard to those beers, I would like to obtain more information about the market and other factors that may have led to periods of increased ratings. As usual, the simple strategy is the best: obtain more data and do more analysis. Regardless, I can say that I will continue to drink and enjoy American beers, albeit some more than others.




About Author

Kyle Gallatin

Kyle Gallatin graduated from Quinnipiac University with a biology degree in 2015. He then continued on for his Master's in Molecular and Cellular Biology, which he received in 2016. Cultivating high level skills in data science through his analytical work...