U.S. Wine Production Statistics, Wine? Wine Not!

Posted on Mar 9, 2020

View Edwin's Github or LinkedIn | View the live application


By its very nature, wine is a beverage that suits a wide variety of occasions: it can be served before a meal as an appetite stimulator, as an ideal accompaniment to a fancy three-course meal, or simply for casually socializing with friends. The color spectrum ranges from deep, intense reds to glistening whites, with the pinkish blush characteristic of rosé varieties found somewhere in the middle. Unlike many other alcoholic beverages, wine is a popular choice for almost every occasion. Wine's steadily increasing global popularity reflects its versatility among a wide range of consumers.

However, what's most fascinating is how the United States has tightened its grip on the global wine market over the last couple of years. According to Statista, the U.S. produced over 800 million gallons of wine in 2016, nearly 12% of global wine production volume at the time. Wine production in the U.S. is dominated by the sunny state of California, which accounted for approximately 90% of all U.S. wine production in 2017.

Global Wine Market

These statistics were surprising to say the least, given my previously biased view of the global wine market being completely flooded by the perennial top producers like Spain, Portugal, France, Italy, and Argentina. Surely, these countries with some of the richest, deepest wine-making traditions would give the U.S. no shot at being a competitive player in this market. Boy, was I wrong.

As the U.S. solidifies its place in the global wine industry, I thought it would be both interesting and valuable to build a user-friendly app that could assist anyone from industry professionals to individuals with little to no wine knowledge seeking to get into the wine industry. All in all, this app is designed to help users recognize the key factors that differentiate the best U.S. wines based on Wine Enthusiast Magazine's rating system and reviews.


Technical Details

Although the original Kaggle dataset included wines from all over the world, not just the U.S., I limited the scope to wines produced on U.S. soil by domestic wineries and vineyards. Using this smaller subset of data, I quickly realized that there were multiple insights and relationships that could be drawn between the eight variables presented in the data (i.e. state, region, winery, vineyard, variety, price per bottle, review score, and rating classification).

To facilitate a friendly user experience, I built a live dashboard application using R Shiny that you can follow along with here. The sidebar is loaded with several tabs, each containing a unique interactive tool that allows users to extract useful insights through a visually pleasing display. When users first access the online application, the first tab they see is named "Get Started", where they can view the U.S. data in a neatly organized table.

Data Analysis

The top section with the red, blue, and green boxes summarizes the number of distinct values for each variable in the dataset. To clarify, across all the U.S. wines that were rated by Wine Enthusiast Magazine, there were a total of 25 states, 269 regions, 246 varieties, 4,528 wineries and 17,382 vineyards. Users are encouraged to sort the data by any of the eight key variables (columns), including the wines' variety, review score, price, and rating as well as the state, region, winery, and vineyard in which the wine was produced.
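For readers who want to reproduce these summary counts outside the app, here is a minimal Python sketch. It is not the app's R Shiny code; the sample rows, the winery names, and the `count_distinct` helper are all hypothetical and only illustrate counting distinct values per column.

```python
# Hypothetical sample rows; the real dataset has many more wines and columns.
wines = [
    {"state": "California", "region": "Napa Valley", "variety": "Cabernet Sauvignon",
     "winery": "Winery A", "vineyard": "Estate"},
    {"state": "California", "region": "Napa Valley", "variety": "Chardonnay",
     "winery": "Winery A", "vineyard": "Reserve"},
    {"state": "Oregon", "region": "Willamette Valley", "variety": "Pinot Noir",
     "winery": "Winery B", "vineyard": "Estate"},
    {"state": "Oregon", "region": "Willamette Valley", "variety": "Pinot Noir",
     "winery": "Winery C", "vineyard": ""},  # missing vineyard, skipped when counting
]


def count_distinct(rows, columns):
    """Count the distinct non-empty values in each column."""
    return {col: len({row[col] for row in rows if row[col]}) for col in columns}


summary = count_distinct(wines, ["state", "region", "variety", "winery", "vineyard"])
print(summary)
```

Empty values are skipped here on the assumption that a blank vineyard should not count as its own category; the post does not say how the app treats missing values.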



There is also a built-in search bar where a user may search for any particular keyword or phrase. Finally, users can filter the dataset by any of the available states, regions, varieties and ratings. The advantages of this filtering feature are discussed in further detail below.


Analytical Tools


As my first order of business, I analyzed the ten most frequently used significant words in each wine's critic review, filtered by state, region, winery, and vineyard. In essence, this tool can be used to provide valuable information regarding the geographical characteristics of a given wine. For example, the top ten characteristics of wines from the U.S. as a whole were "Cherry", "Dry", "Tannins", "Acidity", "Oak", "Black", "Ripe", "Sweet", "Rich", and "Red", ordered by their share of top ten mentions.



In this particular example, the word "Cherry" is the number one characteristic of wines produced throughout the entire United States, indicating that critics describe cherry notes in U.S. wines more often than any other characteristic. Other prevalent characteristics include "Dry", "Acidity", "Rich", and "Red".
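The word-frequency idea behind this tool can be sketched in a few lines of Python. This is only an illustration, not the app's R Shiny code: the stop-word list, the sample reviews, and the `top_terms` helper are assumptions, since the post does not say which words the app treats as insignificant.

```python
import re
from collections import Counter

# Assumed stop-word list; the app's actual list of insignificant words is not given.
STOP_WORDS = {"a", "an", "and", "the", "with", "of", "is", "this", "in", "it",
              "to", "on", "wine", "flavors"}


def top_terms(reviews, n=10):
    """Return the n most frequent significant words with their share (in %)
    of all mentions among those n words."""
    counts = Counter()
    for text in reviews:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    top = counts.most_common(n)
    total = sum(count for _, count in top)
    if total == 0:
        return []
    return [(word, count / total * 100) for word, count in top]


# Hypothetical review snippets.
reviews = [
    "Ripe cherry and oak with firm tannins.",
    "Dry, with cherry and bright acidity.",
    "Rich black cherry on a dry finish.",
]
for word, share in top_terms(reviews, n=3):
    print(f"{word}: {share:.1f}%")
```

Shares are computed relative to the top words only, which matches the "percentage of top ten mentions" ordering described above.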

The purpose of this particular tool is to give the user a baseline understanding of the top characteristics of the wines for a given state, region, winery, vineyard, or any combination of the four inputs. The inputs are configured such that any time one or more inputs are selected (i.e. not "All"), the remaining unselected inputs are automatically filtered to only show options available based on the inputs already selected.

Inputs and Outputs

For instance, if "Oregon" is selected as the state, then the other three inputs will only show the regions, wineries and vineyards from the state of Oregon. Likewise, if a specific vineyard such as "Martha's Vineyard" is selected, then the other three inputs are filtered to only show the states, regions, and wineries that Martha's Vineyard is associated with.
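The dependent-filter behavior described above can be sketched as follows. This is a minimal Python illustration, not the app's R Shiny code; the sample rows, the generic winery names, and the `available_options` helper are hypothetical.

```python
FIELDS = ("state", "region", "winery", "vineyard")

# Hypothetical sample rows.
rows = [
    {"state": "Oregon", "region": "Willamette Valley", "winery": "Winery B", "vineyard": "Estate"},
    {"state": "Oregon", "region": "Willamette Valley", "winery": "Winery C", "vineyard": "Hilltop Block"},
    {"state": "California", "region": "Napa Valley", "winery": "Winery A", "vineyard": "Martha's Vineyard"},
    {"state": "California", "region": "Sonoma Coast", "winery": "Winery D", "vineyard": "Estate"},
]


def available_options(rows, selected):
    """For each input, list the values still compatible with the OTHER inputs.

    `selected` maps a field name to the chosen value, or to "All" when unset.
    """
    options = {}
    for field in FIELDS:
        # Ignore the field's own selection so the user can still change it.
        others = {f: v for f, v in selected.items() if f != field and v != "All"}
        options[field] = sorted(
            {r[field] for r in rows if all(r[f] == v for f, v in others.items())}
        )
    return options


print(available_options(rows, {"state": "Oregon"}))
print(available_options(rows, {"vineyard": "Martha's Vineyard"}))
```

Each input's options are computed from the other three selections only. That is one reasonable design choice, assumed here so that a selected input still lists its alternatives; the app may handle this differently.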

This powerful feature gives the user three distinct advantages. It lets users:


  1. explore the data confidently, whether their knowledge of the U.S. wine industry is limited or extensive, without second-guessing whether the available options are valid
  2. quickly understand the attributes and characteristics that dominate a particular state or region
  3. compare the dominant attributes and characteristics of wines produced by wineries and vineyards all across the U.S.



The next analytical tool allows the user to explore the top varieties of wine by geographic location (state, region) and review score (80 to 100 point scale). This tool displays a pie chart with the top five varieties of wine and their respective percentages based on the selected filters. For example, the top five varieties of wine produced in New York with a score of 90 or higher were:

  1. Riesling (61.4%)
  2. Cabernet Franc (11.9%)
  3. Pinot Noir (9.9%)
  4. Rosé (8.9%)
  5. Chardonnay (7.9%)




It's clear from this graph that more than half of the top wines produced in New York were Rieslings (61.4%). The other four varieties each hold a relatively similar portion of the pie, between 7 and 12 percent. This type of analysis is most beneficial for individuals or entities looking to start a winery and/or vineyard in their local state or region. Essentially, one can identify the varieties that dominate a specific state or region and, in turn, use that information to focus a new winery's production on the less saturated top varieties.

Users are also encouraged to specify a range of review scores based on their preferences. The default range is set at 80 to 100 points, which encompasses all of the wines in the data set since Wine Enthusiast Magazine only publishes wine reviews for wines that scored 80 or higher.
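The top-five calculation can be approximated with the Python sketch below. It is not the app's R Shiny code, and the sample rows and the `top_shares` helper are hypothetical. The percentages in the example above sum to roughly 100, which suggests the shares are computed among the top five only; the sketch assumes that.

```python
from collections import Counter


def top_shares(rows, key, state=None, region=None, min_score=80, max_score=100, n=5):
    """Return the n most common values of `key` among the filtered rows,
    with each value's percentage share of those n."""
    counts = Counter(
        r[key]
        for r in rows
        if (state is None or r["state"] == state)
        and (region is None or r["region"] == region)
        and min_score <= r["score"] <= max_score
    )
    top = counts.most_common(n)
    total = sum(count for _, count in top)
    if total == 0:
        return []
    return [(name, round(count / total * 100, 1)) for name, count in top]


# Hypothetical sample rows.
sample = [
    {"state": "New York", "region": "Finger Lakes", "variety": "Riesling", "score": 91},
    {"state": "New York", "region": "Finger Lakes", "variety": "Riesling", "score": 92},
    {"state": "New York", "region": "Finger Lakes", "variety": "Riesling", "score": 90},
    {"state": "New York", "region": "Finger Lakes", "variety": "Chardonnay", "score": 90},
    {"state": "New York", "region": "Finger Lakes", "variety": "Riesling", "score": 85},
    {"state": "Oregon", "region": "Willamette Valley", "variety": "Pinot Noir", "score": 93},
]
print(top_shares(sample, "variety", state="New York", min_score=90))
```

Passing `key="winery"` instead of `key="variety"` gives a rough analogue of the winery view, although that tool counts distinct wines per winery, which would need a de-duplication step first.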



The next analytical tool allows the user to explore the top wineries filtered by geographic location (state, region) and review score (80 to 100 point scale). This tool displays a pie chart with the top five wineries along with the number of distinct wines from each winery that land within the specified score range, represented as a percentage. For example, the top five wineries in Oregon, specifically the Willamette Valley region, with a score of 90 or lower were:

  1. Willamette Valley Vineyards (28%)
  2. David Hill (24%)
  3. Amalie Robert (17%)
  4. Left Coast Cellars (17%)
  5. Chehalem (14%)



Willamette Valley, Oregon

Fittingly, the winery with the most wines produced in Willamette Valley, Oregon with a score of 90 or less is Willamette Valley Vineyards. Moreover, this chart shows that Willamette Valley Vineyards produced twice as many wines within the specified score range as Chehalem (28% versus 14%). This type of analysis is most beneficial for individuals or entities looking to start a winery and/or vineyard in their local state or region.

Essentially, one can identify the wineries that dominate a specific state, region, or score range. This information can help young wineries identify the top established wineries in their respective locations and study their business models to understand what makes them successful.


Vineyard Designations

The next analytical tool allows the user to explore the top vineyard designations based on geographic location (state, region) and review score (80 to 100 point scale). This tool displays a pie chart with the top vineyards along with the number of distinct wines from each vineyard that land within the specified score range, represented as a percentage. For example, the top five vineyards in the Russian River Valley region with wines that scored between 85 and 95 points were:

  1. Reserve (30%)
  2. Dutton Ranch (22%)
  3. Bacigalupi Vineyard (19%)
  4. Estate (16%)
  5. Saralee's Vineyard (13%)



This chart shows that 30% of the top wines for this particular region and score range carry a "Reserve" designation. The next top vineyards were Dutton Ranch and Bacigalupi Vineyard, accounting for 22% and 19% of the top wines, respectively. Rounding out the top five were wines with an "Estate" designation (16%) and Saralee's Vineyard (13%).

Vineyard analysis

Similar to the wineries tool, this vineyard analysis tool is most beneficial for individuals or entities looking to start a vineyard in their local state or region. Essentially, one can identify the top vineyards that dominate a specific state, region, or score range. This information can help young vineyards identify the top established vineyards in their respective locations and study their business models to understand what makes them successful.


Review Scores

The next analytical tool allows the user to explore the frequency of wine review scores filtered by geographic location (state, region) and variety. This tool displays a histogram with wine review scores along the x-axis and the frequency count of wines with each score along the y-axis. Based on the example plot below, one can easily tell that the most common score among cabernet sauvignons produced in the Red Mountain region of Washington was 91 out of 100 points.

More specifically, exactly 49 wines scored a 91. The next highest frequency observed for this specific example was 43 wines that scored 90 out of 100 points.
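The counts behind a histogram like this one are simple to compute. The Python sketch below is illustrative only and is not the app's R Shiny code; the scores are made up and the helper names are hypothetical.

```python
from collections import Counter


def score_frequencies(scores):
    """Return (score, count) pairs sorted by score, i.e. the histogram bars."""
    return sorted(Counter(scores).items())


def most_common_score(scores):
    """Return the score with the tallest bar (ties go to the lower score)."""
    counts = Counter(scores)
    return min(counts, key=lambda s: (-counts[s], s))


# Hypothetical review scores for one region and variety.
scores = [88, 90, 90, 91, 91, 91, 92]
for score, count in score_frequencies(scores):
    print(score, "#" * count)
print("most common score:", most_common_score(scores))
```

Note that the tallest bar marks the most common score, which is not necessarily a majority of the wines.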





The next analytical tool allows the user to explore the probability density of wine prices based on geographic location (state, region), variety, and rating classification. This tool displays a probability density function (PDF) with wine prices per bottle along the x-axis as well as the probability density of wine at each price point along the y-axis.

The probability density can be interpreted as the relative likelihood that the price of a randomly selected wine falls near a given value. Based on the example plot below, a randomly selected wine from this distribution is most likely to be priced between $48 and $50.
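To make the density curve concrete, here is a minimal Python sketch of a kernel density estimate. The post does not say which estimator the app uses, so the Gaussian kernel and Silverman's rule-of-thumb bandwidth are assumptions, as are the sample prices and function names.

```python
import math
import statistics


def silverman_bandwidth(prices):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5).
    Needs at least two prices that are not all identical."""
    n = len(prices)
    sd = statistics.stdev(prices)
    q1, _, q3 = statistics.quantiles(prices, n=4)
    spread = min(sd, (q3 - q1) / 1.34) or sd  # fall back to sd if the IQR is zero
    return 0.9 * spread * n ** -0.2


def kde(prices, x, bandwidth):
    """Gaussian kernel density estimate evaluated at price x."""
    norm = 1.0 / (len(prices) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in prices)


# Hypothetical bottle prices in dollars.
prices = [20, 25, 30, 45, 48, 49, 50, 50, 52, 75]
h = silverman_bandwidth(prices)
# Price (in whole dollars) where the estimated density is highest.
peak_price = max(range(0, 101), key=lambda x: kde(prices, x, h))
print("bandwidth:", round(h, 2), "peak near $", peak_price)
```

The density value at a price is not itself a probability; probabilities come from the area under the curve over a price range, which is why the curve integrates to 1.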




Prices by Rating

The final analytical tool I created allows the user to explore the price distribution by wine rating classification based on geographic location (state, region) and variety. This tool displays a box-and-whisker plot with the wine rating classifications along the x-axis and the distribution of wine prices for each rating along the y-axis. Let's use the state of Washington as an example.


Typically, wines with an "Acceptable" (lowest) rating have the lowest median prices while wines with a "Classic" (highest) rating have the highest median prices. This trend makes sense because ratings are a reflection of wine quality: in theory, the lowest-quality wines should have the lowest prices while the highest-quality wines should have the highest prices. For the most part, this generalization holds true since rating classifications are ordinal in nature as defined by Wine Enthusiast Magazine's point system.

However, there are certain cases like the one shown above, where higher-rated wines have a lower median price than lower-rated wines. In the case of Washington state, wines with a "Good" rating have a median price point of $20 while wines with an "Acceptable" rating have a slightly higher median price point of $22.
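The comparison of median prices across ratings can be sketched in Python as shown below. This is not the app's R Shiny code. The post gives the point ranges for "Acceptable" and "Good" only; the remaining bands are taken from Wine Enthusiast's published scale and should be treated as an assumption, as should the sample rows and helper names.

```python
import statistics

# (minimum score, label), highest band first. Only the Acceptable and Good
# ranges are stated in the post; the other bands are assumed.
RATING_BANDS = [
    (98, "Classic"), (94, "Superb"), (90, "Excellent"),
    (87, "Very Good"), (83, "Good"), (80, "Acceptable"),
]
ORDER = [label for _, label in reversed(RATING_BANDS)]  # lowest to highest


def classify(score):
    """Map a review score to its rating label (None below 80)."""
    for floor, label in RATING_BANDS:
        if score >= floor:
            return label
    return None


def median_price_by_rating(rows):
    """Group prices by rating label and return the median of each group."""
    groups = {}
    for r in rows:
        label = classify(r["score"])
        if label is not None and r.get("price") is not None:
            groups.setdefault(label, []).append(r["price"])
    return {label: statistics.median(p) for label, p in groups.items()}


def price_inversions(medians):
    """Return (higher_rating, lower_rating) pairs where the higher-rated
    band has the lower median price."""
    present = [label for label in ORDER if label in medians]
    return [
        (hi, lo)
        for i, lo in enumerate(present)
        for hi in present[i + 1:]
        if medians[hi] < medians[lo]
    ]


# Hypothetical rows: score and price per bottle in dollars.
rows = [
    {"score": 81, "price": 20}, {"score": 82, "price": 24},
    {"score": 84, "price": 18}, {"score": 85, "price": 20}, {"score": 86, "price": 25},
    {"score": 91, "price": 40},
]
medians = median_price_by_rating(rows)
print(medians)
print(price_inversions(medians))
```

Each pair returned by `price_inversions` flags a case where a higher rating comes with a lower median price.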

Bargain Hunters

Therefore, in practical terms, bargain hunters (people looking for the highest quality at the lowest price) in Washington with no specific varietal preference should look to purchase wines with a "Good" rating (83 - 86 points) over wines with an "Acceptable" rating (80 - 82 points) for the best bargain.

The reason is that this bargain hunter is likely to end up purchasing a higher-rated wine at a similar or lower price point than the lowest-rated wines. As seen in the example above, this type of analysis can be very useful for casual wine drinkers seeking the best "bang for your buck" wines.

In a similar fashion, this analysis tool has practical applications for industry professionals as well. For example, let's look at all of the wines in the U.S. classified as a meritage variety.


"Acceptable" wines have a relatively high median price of $48 while "Good" and "Very Good" wines have significantly lower median prices at $35 and $34, respectively. Even "Excellent" wines have a lower median price ($45) than "Acceptable" wines. Therefore, an industry professional that works for a winery that produces meritage wines with "Good" or "Very Good" ratings could use this analysis as a foundation to justify a price increase in order to maximize profits.


Although many more factors would need to be considered along with this analysis to make a compelling argument, the foundational reasoning behind this move is that if the particular winery described above is selling their "Good" and "Very Good" rated wines close to or below the median price points for those ratings, then they are likely to generate more profit by increasing their prices to be slightly below the median price point for "Excellent" wines.

Ideally, these newly increased prices would not deter casual consumers (bargain hunters) with a preference for meritage wines because the "Good" and "Very Good" rated wines would still be cheaper than "Acceptable" and "Excellent" rated wines. As a result, casual consumers looking for the cheapest wines with the best rating would still be inclined to purchase "Good" and "Very Good" wines (hopefully from your winery) over "Acceptable" wines due to the lower median price points while your winery generates more profit due to the carefully calculated price increase.



Lastly, although these analytical tools can provide very helpful insights when used in isolation, more profound insights can be drawn when they are used in conjunction with one another. I hope you enjoyed using this application as much as I enjoyed creating it. Thank you for your time, and check out my GitHub if you'd like to see the code behind this project.

About Author

Edwin Back

Graduated from the University of Michigan in 2015 with a BSE in Environmental Engineering. Data Analyst with a robust understanding of probability and statistics backed by 2 years of professional work involving business data analytics (sales/marketing/real estate), environmental...
