American Beers: Scraping data for different American Ales and cheap lagers

Kyle Gallatin
Posted on Feb 19, 2017

Beer is a staple in American culture. Especially in recent years, the American craft beer industry has exploded with microbreweries and a number of different styles. I wanted to know the answer to multiple questions:

  1. How has the craft beer explosion affected the perception of American light lagers like Bud and Coors over time?
  2. The age-old question, which is better: Bud or Coors?
  3. Which American style is the highest rated and the most coveted?
  4. Is alcohol content correlated with rating?

Data & Scraping

I scraped my data from a beer review website. There were two different resources that I wanted to scrape. First, for each cheap lager I wanted a large sample of user reviews covering the look, smell, feel, taste and overall ratings. For this, I used Selenium, as it could easily navigate from page to page and pull out those attributes along with the date the review was posted. Unfortunately, all of these values arrived in the same string, so I had to do some compensatory data cleaning in my global.R file.
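To make the cleaning step concrete, here is a minimal sketch of pulling the five ratings and the date out of one combined string. The actual cleaning was done in R in global.R; this is a Python illustration only, and the layout of the `RAW` string, the attribute labels, and the date format are all assumptions rather than the site's real markup.

```python
import re
from datetime import datetime

# Hypothetical raw string; the real scraped text may be laid out differently.
RAW = ("look: 3 | smell: 2.75 | taste: 3 | feel: 3.25 | overall: 3 "
       "Pours a pale straw color. Feb 10, 2017")

# One "name: number" pair per rated attribute (assumed labels).
ATTR_PATTERN = re.compile(r"(look|smell|taste|feel|overall):\s*(\d+(?:\.\d+)?)")
# Assumed date format, e.g. "Feb 10, 2017".
DATE_PATTERN = re.compile(r"[A-Z][a-z]{2} \d{1,2}, \d{4}")


def parse_review(raw):
    """Split one scraped review string into numeric ratings and a date."""
    ratings = {name: float(value) for name, value in ATTR_PATTERN.findall(raw)}
    match = DATE_PATTERN.search(raw)
    posted = datetime.strptime(match.group(), "%b %d, %Y").date() if match else None
    return {**ratings, "date": posted}


parsed = parse_review(RAW)
print(parsed)
```

Returning a flat dictionary per review keeps each attribute in its own column once the rows are collected into a table.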

Screen Shot 2017-02-19 at 2.50.38 PM

Second, I scraped the name, brewery, number of ratings, average rating and alcohol content of individual beers for a number of choice American ale styles. As craft ales are by far the most popular in America, I wanted a good read on the highest rated and most reviewed styles. Scrapy worked better for parsing the table, so I used a spider for each individual style. Moving from page to page was simple: I added 50 to the number in the URL each time and grabbed the data.
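The "add 50 to the URL" paging can be sketched as a small generator that builds one URL per results page. The base URL and the `start` query parameter below are hypothetical placeholders, not the site's actual scheme; in a Scrapy spider, a list like this could serve as the start URLs.

```python
from urllib.parse import urlencode


def style_page_urls(base_url, n_pages, page_size=50):
    """Yield one URL per results page, advancing the offset by page_size."""
    for page in range(n_pages):
        # "start" is an assumed parameter name for the row offset.
        yield f"{base_url}?{urlencode({'start': page * page_size})}"


# Hypothetical style listing URL, used only for illustration.
urls = list(style_page_urls("https://example.com/beer/styles/116/", 3))
for url in urls:
    print(url)
```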

Screen Shot 2017-02-19 at 2.51.29 PM

Cheap Lagers

I quickly realized I would probably need two Shiny apps. I didn’t want either one to become too crowded and take away from my investigations, so I decided to focus on them separately.

The “College Lagers” Shiny app opens with a simple boxplot of the selected attribute's ratings for each beer. Across the top, you can see the maximum and minimum mean values for the beers currently displayed. By selecting or unselecting beers, you can alter these values.

Screen Shot 2017-02-19 at 2.52.30 PM

As can be seen, it turns out Coors is rated better than Budweiser on average. Who would’ve thought! Coors Light also seems to do a bit better than Bud Light, which is important to consider given that almost no one drinks just Coors, the “banquet beer”. Much to my dismay, Miller High Life is rated higher than all of them. The “champagne of beers” does not live up to the name, in my personal opinion.

Screen Shot 2017-02-19 at 2.54.41 PM

Clicking the next tab, we can see the ratings over time represented by boxplots with a geom_smooth trend line. I was surprised to find that the mean rating for these beers increased over time. Given the shift towards craft beer, I had assumed the perception of these beers would decline. Of course, the site's expanding user base may be a factor. Additionally, I noticed some very obviously sarcastic reviews for some of these beers, which may have skewed the data. Still, it is perplexing that the overall perception of these non-craft lagers improved as the craft beer market expanded. I would have to gather more data to understand why.

Screen Shot 2017-02-19 at 2.56.04 PM

Screen Shot 2017-02-19 at 2.56.56 PM

The final tab simply contains my table of data, in case you want to search for something specific. I never ended up using rDev, which is a measure of how far a review is from the average for a beer. Moving forward, I would like to work this into some of my analysis.
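As a rough illustration of how a deviation measure like rDev could be computed, the sketch below expresses one review's score as a percentage deviation from the beer's average. The percentage form is an assumption; the only thing stated above is that rDev measures how far a review is from the beer's average, and the function name `r_dev` is made up for this example.

```python
def r_dev(review_score, beer_average):
    """Percentage deviation of one review from the beer's average score.

    Assumed formula: (score - average) / average * 100.
    """
    if beer_average == 0:
        raise ValueError("beer_average must be non-zero")
    return (review_score - beer_average) / beer_average * 100


# A review above the average gives a positive value, one below a negative value.
print(round(r_dev(3.0, 2.5), 1))
print(round(r_dev(2.0, 2.5), 1))
```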

American Ales

My next app contains the information for each beer, organized by style. To get an idea of what the data looks like, we can view the “Table” tab at the bottom of the dashboard.

Screen Shot 2017-02-19 at 2.58.27 PM

The “Bros” rating is the rating given to the beer by the founders of the website. Given it’s missing for a number of beers, I decided not to make use of it. Everything else should be self-explanatory. “Reviews” is the number of reviews that beer has, “ABV” is alcohol by volume, and “Avg” is the average rating of that beer by users.

Going back to the first tab, we can see a histogram for the number of reviews each style of beer has. As I expected, the IPA is by far the most reviewed in America. By using the sliders on the left, we can filter this and all other graphs by number of reviews, ABV, or average rating to get subsets of the data. If the slider for the number of reviews goes over 4500, the histogram gets colored in by brewery.

Screen Shot 2017-02-19 at 2.59.29 PM

Moving down, we can see a scatterplot of alcohol and average rating with a linear regression line. Although this was more EDA than anything, I wanted to play with my data and see if I could find a good linear model. Using the sliders on the left, we can subset the data and get updated linear regressions.

Screen Shot 2017-02-19 at 3.00.31 PM

Screen Shot 2017-02-19 at 3.01.11 PM

Unfortunately, there seemed to be no strong linear correlation. My highest R² value only reached about 0.3, indicating a poor fit. However, we can see some sort of shape to the data. As the ABV rises further above 5, the number of poorly reviewed beers decreases dramatically. While there are plenty of good beers with low alcohol content, there are no bad beers with high alcohol content. Moving forward, I would like to apply more of my newfound machine learning skills to this data.
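For readers who want to see where an R² figure comes from, here is a minimal sketch of an ordinary least-squares fit of average rating on ABV, with R² computed as one minus the ratio of residual to total sum of squares. The app itself was built in R; this Python version is only an illustration, and the toy numbers are invented for the example rather than taken from the scraped data.

```python
def linear_fit_r2(xs, ys):
    """Fit y = intercept + slope * x by least squares and return (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Toy ABV / average-rating pairs, made up purely for illustration.
abv = [4.2, 5.0, 6.5, 7.0, 9.5, 11.0]
avg = [2.9, 3.8, 3.6, 4.0, 4.1, 4.3]
slope, intercept, r2 = linear_fit_r2(abv, avg)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

Re-running the fit on a filtered subset of the rows mirrors what the sliders do in the app.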

The last tab simply contains a boxplot of average rating by style. As can be seen, imperial stouts have the highest average rating of the American ales. However, by playing with the sliders, we can see other styles take the lead for various reasons. Note that it's helpful to keep the minimum number of reviews at 10 or so, since many beers with no reviews or only one review skew the average.
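The effect of the review-count slider can be sketched as a filter followed by a per-style mean. The rows and field names below are hypothetical stand-ins for the scraped table, chosen only to show how a single one-review beer can drag a style's average down.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration; not real scraped values.
beers = [
    {"style": "American IPA", "reviews": 120, "avg": 4.0},
    {"style": "American IPA", "reviews": 50, "avg": 4.2},
    {"style": "American IPA", "reviews": 1, "avg": 1.0},
    {"style": "American Imperial Stout", "reviews": 300, "avg": 4.4},
]


def mean_rating_by_style(rows, min_reviews=10):
    """Average the per-beer ratings within each style, skipping thinly reviewed beers."""
    by_style = defaultdict(list)
    for row in rows:
        if row["reviews"] >= min_reviews:
            by_style[row["style"]].append(row["avg"])
    return {style: mean(scores) for style, scores in by_style.items()}


filtered = mean_rating_by_style(beers)
unfiltered = mean_rating_by_style(beers, min_reviews=0)
print(filtered)
print(unfiltered)
```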

Screen Shot 2017-02-19 at 3.02.15 PM

Going Forward

Overall, the scraping and these apps were enjoyable to create. Going forward, I would like to incorporate more machine learning and predictive modeling to define a “good” beer. By scraping user reviews for every beer, I could also incorporate more factors into my data, as I did with the cheap lagers app. For those beers, I would like to obtain more information about the market and other factors that may have led to periods of increased ratings. As usual, the simple strategy is the best: obtain more data and do more analysis. Regardless, I can say that I will continue to drink and enjoy American beers, albeit some more than others.




About Author

Kyle Gallatin

Kyle Gallatin graduated from Quinnipiac University with a biology degree in 2015. Following, he continued on for his Master's in Molecular and Cellular Biology, received in 2016. Cultivating high level skills in data science through his analytical work...