Scraping data for different Beers

Kyle Gallatin

Posted on Feb 19, 2017


Data shows beer is a staple in American culture. Especially in recent years, the American craft beer industry has exploded with microbreweries and a number of different styles. I wanted to know the answer to multiple questions:

How has the craft beer explosion affected the perception of American light lagers like Bud and Coors over time?
The age-old question, which is better: Bud or Coors?
Which American style is the highest rated and the most coveted?
Is alcohol content correlated with rating?

Data & Scraping

I scraped my data from beeradvocate.com. There were two different resources that I wanted to scrape. First, for each cheap lager I wanted a large sample of user reviews covering the look, smell, feel, taste and overall ratings. For this, I used Selenium, as it could easily navigate from page to page and pull out those attributes along with the date the review was posted. Unfortunately, all of these parameters came back in the same string, so I had to do some compensatory data cleaning in my global.R file.
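To illustrate the kind of cleaning involved, here is a minimal Python sketch of splitting one combined rating string into separate numeric fields. The actual cleaning was done in R in global.R; the raw string format shown here and the name `parse_ratings` are assumptions made for illustration, not the scraped format itself.

```python
import re

# Assumed (hypothetical) raw format of one scraped rating string:
#   "look: 3.5 | smell: 3 | taste: 3.25 | feel: 3 | overall: 3.5"
RATING_PATTERN = re.compile(r"(look|smell|taste|feel|overall):\s*(\d+(?:\.\d+)?)")


def parse_ratings(raw):
    """Split one combined rating string into a dict of attribute -> float."""
    return {name: float(value) for name, value in RATING_PATTERN.findall(raw.lower())}


example = "look: 3.5 | smell: 3 | taste: 3.25 | feel: 3 | overall: 3.5"
ratings = parse_ratings(example)
print(ratings)
```

A string that contains none of the five attributes simply yields an empty dict, which makes malformed reviews easy to drop afterwards.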

Name, Brewery, Number of Ratings, and Alcohol Content

Second, I scraped the name, brewery, number of ratings, average rating and alcohol content of individual beers for a number of choice American ale styles. As craft ales are by far the most popular in America, I wanted a good read on the highest rated and most reviewed styles. Scrapy worked better for parsing the table, so I used a spider for each individual style. Paging was easy: adding 50 to the number in the URL each time moved to the next page of data.
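The paging idea can be sketched in a few lines of Python. This is only an illustration: the base URL, the `start` query parameter, and the function name `page_urls` are hypothetical stand-ins, and the real spiders used Scrapy rather than a plain list of URLs.

```python
def page_urls(base_url, total_rows, page_size=50):
    """Build one URL per results page by stepping the offset by page_size."""
    # "start" is a hypothetical parameter name used only for this sketch.
    return [f"{base_url}?start={offset}" for offset in range(0, total_rows, page_size)]


# Placeholder URL, not the real site path.
urls = page_urls("https://example.com/beer/styles/some-style/", total_rows=120)
for url in urls:
    print(url)
```

With 120 rows and 50 rows per page, this produces three URLs with offsets 0, 50 and 100.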

Cheap Lagers

I quickly realized I would probably need two Shiny apps. I didn’t want either app to become too crowded and take away from my two investigations, so I focused on them separately.

The “College Lagers” Shiny app opens with a simple boxplot of the averages for the selected attribute, grouped by beer. Across the top, you can see the maximum and minimum mean values among the beers currently displayed. Selecting or deselecting beers changes these values.
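The maximum and minimum shown across the top amount to a group-by followed by a min and max. A minimal Python sketch of that calculation follows; the app itself is written in R, and the beer names and scores below are made-up placeholders rather than scraped values.

```python
from statistics import mean

# Made-up placeholder reviews, for illustration only.
reviews = [
    {"beer": "Beer A", "overall": 2.5},
    {"beer": "Beer A", "overall": 3.0},
    {"beer": "Beer B", "overall": 3.0},
    {"beer": "Beer B", "overall": 3.5},
    {"beer": "Beer C", "overall": 3.5},
    {"beer": "Beer C", "overall": 4.0},
]


def attribute_means(rows, attribute, selected):
    """Mean of one attribute per beer, restricted to the selected beers."""
    grouped = {}
    for row in rows:
        if row["beer"] in selected:
            grouped.setdefault(row["beer"], []).append(row[attribute])
    return {beer: mean(values) for beer, values in grouped.items()}


means = attribute_means(reviews, "overall", selected={"Beer A", "Beer B"})
lowest, highest = min(means.values()), max(means.values())
print(means, lowest, highest)
```

Deselecting a beer removes it from `selected`, so the minimum and maximum are recomputed over the remaining beers only.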

Findings

As can be seen, it turns out Coors is rated better than Budweiser on average. Who would’ve thought! Coors Light also seems to do a bit better than Bud Light, which is important to consider given that no one really drinks just Coors, the “banquet beer”. Much to my dismay, Miller High Life is rated higher than all of them. The “champagne of beers” does not live up to the name, in my personal opinion.

Ratings Over Time

Clicking the next tab, we can see the ratings over time represented by boxplots with a geom_smooth trend line. I was surprised to find that the general mean rating for these beers increased over time. Given the shift towards craft beer, I had assumed the perception of these beers would decline. Of course, the expanding user base has to be taken into account. Additionally, I noticed some very obviously sarcastic reviews for some of these beers, which may have skewed the data. Still, it is perplexing that the overall perception of these non-craft lagers improved as the craft beer market expanded. I would have to gather more data to understand why.

The final tab simply contains my table of data, in case you want to search for something specific. I never ended up using the rDev, which is a measure of how far a review is from the average rating of its beer. Moving forward, I would like to work this into some of my analytics.
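For a sense of how rDev could feed into later analysis, here is a small Python sketch. It assumes rDev is the percentage deviation of a single review score from the beer's average score; that exact formula is an assumption made for illustration, and `rdev` is a hypothetical helper name.

```python
def rdev(review_score, beer_average):
    """Percentage deviation of one review from the beer's average (assumed formula)."""
    if beer_average == 0:
        raise ValueError("beer_average must be non-zero")
    return (review_score - beer_average) / beer_average * 100


# A review of 4.0 on a beer averaging 3.2 sits about 25% above the average.
print(round(rdev(4.0, 3.2), 2))
```

Reviews with very large absolute rDev values would be natural candidates to inspect when looking for sarcastic or outlier reviews.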

American Ales

My next app contains all the information by style for a beer. To get an idea of what it looks like, we can view the “Table” tab at the bottom of the dashboard.

The “Bros” rating is the rating given to the beer by the founders of the website. Given it’s missing for a number of beers, I decided not to make use of it. Everything else should be self-explanatory. “Reviews” is the number of reviews that beer has, “ABV” is alcohol by volume, and “Avg” is the average rating of that beer by users.

Histogram of Number of Reviews

Going back to the first tab, we can see a histogram for the number of reviews each style of beer has. As I expected, the IPA is by far the most reviewed in America. By using the sliders on the left, we can filter this and all other graphs by number of reviews, ABV, or average rating to get subsets of the data. If the slider for the number of reviews goes over 4500, the histogram gets colored in by brewery.
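The slider filtering is a set of range checks applied to every row. A minimal Python sketch under that assumption is below; the app itself is built in Shiny, and the rows and the name `filter_beers` are placeholders for illustration.

```python
def in_range(value, bounds):
    """True when value lies inside the inclusive (low, high) slider range."""
    low, high = bounds
    return low <= value <= high


def filter_beers(beers, reviews=(0, float("inf")), abv=(0, float("inf")), avg=(0, 5)):
    """Keep only the beers that fall inside all three slider ranges."""
    return [
        beer
        for beer in beers
        if in_range(beer["reviews"], reviews)
        and in_range(beer["abv"], abv)
        and in_range(beer["avg"], avg)
    ]


# Made-up placeholder rows, for illustration only.
beers = [
    {"name": "Beer A", "reviews": 5000, "abv": 6.5, "avg": 4.1},
    {"name": "Beer B", "reviews": 12, "abv": 4.2, "avg": 3.0},
    {"name": "Beer C", "reviews": 1, "abv": 9.0, "avg": 4.8},
]

subset = filter_beers(beers, reviews=(10, float("inf")), abv=(4.0, 7.0))
print([beer["name"] for beer in subset])
```

Every plot in the app would then be drawn from the filtered subset rather than the full table.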

Scatterplot of Alcohol and Average Rating

Moving down, we can see a scatterplot of alcohol and average rating with a linear regression line. Although this was more EDA than anything, I wanted to play with my data and see if I could find a good linear model. Using the sliders on the left, we can subset the data and get updated linear regressions.

Unfortunately, there seemed to be no strong linear relationship. My highest R² value only reached about 0.3, meaning a straight line explained at most roughly 30% of the variance in average rating, which makes for a poor model. However, we can see some sort of shape to the data. As the ABV rises further above 5%, the number of poorly reviewed beers decreases dramatically. While there are plenty of good beers with low alcohol content, there are no bad beers with high alcohol content. Moving forward, I would like to apply more of my newfound machine learning skills to this data.
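For readers who want to see where an R² value comes from, here is a minimal pure-Python sketch of an ordinary least squares fit of average rating on ABV. The original regression was run in R; the function name `linear_fit_r2` and the sample numbers are placeholders for illustration only.

```python
def linear_fit_r2(xs, ys):
    """Ordinary least squares fit of y on x; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: what the fitted line fails to explain.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up placeholder values: ABV (%) and average rating.
abv = [4.0, 5.0, 6.0, 7.0, 8.0]
avg_rating = [3.0, 3.9, 3.4, 4.1, 3.8]
slope, intercept, r_squared = linear_fit_r2(abv, avg_rating)
print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```

R² is one minus the ratio of unexplained to total variance, so a value near 0.3 means most of the spread in ratings is left unexplained by ABV alone.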

The last tab simply contains a boxplot of average rating by style. As can be seen, imperial stouts have the highest average rating of the American ales. However, by playing with the sliders, we can see other styles take the lead for various reasons. Note that it helps to keep the minimum number of reviews at 10 or so, since many beers with no reviews or only one review skew the average.
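The effect of barely reviewed beers on a style's average can be shown with a small Python sketch. The styles, scores, and the name `style_averages` are made-up placeholders; the point is only how a minimum-review threshold changes the per-style mean.

```python
from statistics import mean


def style_averages(beers, min_reviews=0):
    """Mean of the beers' average ratings per style, ignoring thinly reviewed beers."""
    grouped = {}
    for beer in beers:
        if beer["reviews"] >= min_reviews:
            grouped.setdefault(beer["style"], []).append(beer["avg"])
    return {style: mean(scores) for style, scores in grouped.items()}


# Made-up placeholder rows, for illustration only.
beers = [
    {"style": "Style X", "avg": 4.0, "reviews": 200},
    {"style": "Style X", "avg": 1.0, "reviews": 1},
    {"style": "Style Y", "avg": 3.5, "reviews": 50},
]

unfiltered = style_averages(beers)
filtered = style_averages(beers, min_reviews=10)
print(unfiltered, filtered)
```

In this toy data, a single one-review beer pulls Style X below Style Y until the threshold removes it, which mirrors the reason for keeping the review slider at 10 or above.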

Going Forward

Overall, the scraping and these apps were enjoyable to create. Going forward, I would like to incorporate more machine learning and predictive modeling to define a “good” beer. By scraping user reviews for every beer, I could also incorporate more factors into my data, as I did with the cheap lagers app. With regard to those beers, I would like to obtain more information about the market and other factors that may have led to periods of increased ratings. As usual, the simple strategy is the best: obtain more data and do more analysis. Regardless, I can say that I will continue to drink and enjoy American beers, albeit some more than others.

About Author

Kyle Gallatin

Kyle Gallatin graduated from Quinnipiac University with a biology degree in 2015. He then continued on for his Master's in Molecular and Cellular Biology, received in 2016. Cultivating high level skills in data science through his analytical work...
