U.S. Wine Production Statistics, Wine? Wine Not!
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
By its very nature, wine is a beverage that suits a wide variety of occasions: it can be served before a meal as an appetite stimulant, as an accompaniment to a fancy three-course meal, or simply while casually socializing with friends. The color spectrum ranges from deep, intense reds to glistening whites, with the pinkish blush of rosé varieties somewhere in the middle. Unlike many other alcoholic beverages, wine is a popular choice for almost every occasion, and its steadily increasing global popularity reflects its versatility among a wide range of consumers.
However, what's most fascinating is how the United States has tightened its grip on the global wine market over the last few years. According to Statista, the U.S. produced over 800 million gallons of wine in 2016, nearly 12% of global wine production volume at the time. U.S. wine production is dominated by the sunny state of California, which accounted for approximately 90% of all U.S. wine production in 2017.
Global Wine Market
These statistics were surprising to say the least, given my previously biased view of the global wine market being completely flooded by the perennial top producers like Spain, Portugal, France, Italy, and Argentina. Surely, these countries with some of the richest, deepest wine-making traditions would give the U.S. no shot at being a competitive player in this market. Boy, was I wrong.
As the U.S. solidifies its place in the global wine industry, I thought it would be both interesting and valuable to build a user-friendly app that could assist anyone from industry professionals to individuals with little to no wine knowledge who are looking to get into the wine industry. All in all, this app is designed to help users recognize the key factors that differentiate the best U.S. wines based on Wine Enthusiast Magazine's rating system and reviews.
Although the original Kaggle dataset included wines from all over the world, not just the U.S., I limited the scope to wines produced on U.S. soil by domestic wineries and vineyards. Using this smaller subset of data, I quickly realized that there were multiple insights and relationships that could be drawn between the eight variables presented in the data (i.e. state, region, winery, vineyard, variety, price per bottle, review score, and rating classification).
To facilitate a friendly user experience, I built a live dashboard application using R Shiny that you can follow along with here. The sidebar contains several tabs, each with a unique interactive tool that lets users extract useful insights through a visually pleasing display. When users first access the online application, the first tab they see is named "Get Started", where they can view the U.S. data in a neatly organized table.
The top section with the red, blue, and green boxes summarizes the number of distinct values for each key variable in the dataset. Specifically, the U.S. wines rated by Wine Enthusiast Magazine span 25 states, 269 regions, 246 varieties, 4,528 wineries, and 17,382 vineyards. Users are encouraged to sort the data by any of the eight key variables (columns): the wine's variety, review score, price, and rating, as well as the state, region, winery, and vineyard in which the wine was produced.
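The U.S. subset and the summary counts described above can be sketched in a few lines. This is a minimal illustration in Python rather than the app's R code; the rows and field names (`country`, `state`, `region`, `variety`) are hypothetical stand-ins, and the real Kaggle columns may be named differently.

```python
# Hypothetical rows and field names, for illustration only.
reviews = [
    {"country": "US", "state": "California", "region": "Napa Valley", "variety": "Merlot"},
    {"country": "US", "state": "Oregon", "region": "Willamette Valley", "variety": "Pinot Noir"},
    {"country": "US", "state": "Oregon", "region": "Willamette Valley", "variety": "Chardonnay"},
    {"country": "France", "state": "Burgundy", "region": "Chablis", "variety": "Chardonnay"},
]


def distinct_counts(rows, fields):
    """Count the distinct values found in each field."""
    return {field: len({row[field] for row in rows}) for field in fields}


# Keep only wines produced in the U.S., then summarize the subset.
us_wines = [row for row in reviews if row["country"] == "US"]
summary = distinct_counts(us_wines, ["state", "region", "variety"])
print(summary)
```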
There is also a built-in search bar where a user may search for any particular key word or phrase. Finally, users have the ability to filter the data set based on any available states, regions, varieties and ratings. The advantages of this filtering feature are discussed in further detail below.
As my first order of business, I analyzed the ten most frequently used significant words in each wine's critic reviews, filtered by state, region, winery, and vineyard. In essence, this tool provides valuable information about the characteristics of wines from a given location or producer. For example, the top ten characteristics of wines from the U.S. as a whole were "Cherry", "Dry", "Tannins", "Acidity", "Oak", "Black", "Ripe", "Sweet", "Rich", and "Red", ordered by the percentage of top ten mentions.
In this particular example, "Cherry" is the number one characteristic of wines produced throughout the United States, indicating that cherry is the flavor note critics mention most often when describing U.S. wines. Other prevalent characteristics include "Dry", "Acidity", "Rich", and "Red".
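A word-frequency ranking of this kind can be sketched as follows. This is a Python illustration, not the app's R code: the stop-word list and the reviews are invented, and computing each word's share relative to the top-n total is an assumption based on the phrase "percentage of top ten mentions".

```python
import re
from collections import Counter

# Hypothetical stop-word list; the app's actual list of ignored words is not given.
STOP_WORDS = {"a", "and", "the", "of", "with", "this", "is", "in", "it", "to", "wine"}


def top_words(reviews, n=10):
    """Return the n most frequent significant words with their share of the top-n mentions."""
    counts = Counter()
    for text in reviews:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    top = counts.most_common(n)
    total = sum(count for _, count in top)
    return [(word, round(100 * count / total, 1)) for word, count in top]


# Made-up reviews, for illustration only.
sample_reviews = [
    "Ripe cherry and oak with firm tannins.",
    "Dry, with bright acidity and cherry fruit.",
    "Rich black cherry and sweet oak.",
]
print(top_words(sample_reviews, n=3))
```

Filtering by state, region, winery, or vineyard would simply narrow the list of reviews before it is passed to `top_words`.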
The purpose of this particular tool is to give the user a baseline understanding of the top characteristics of the wines for a given state, region, winery, vineyard, or any combination of the four inputs. The inputs are configured such that any time one or more inputs are selected (i.e. not "All"), the remaining unselected inputs are automatically filtered to only show options available based on the inputs already selected.
Inputs and Outputs
For instance, if "Oregon" is selected as the state, then the other three inputs will only show the regions, wineries and vineyards from the state of Oregon. Likewise, if a specific vineyard such as "Martha's Vineyard" is selected, then the other three inputs are filtered to only show the states, regions, and wineries that Martha's Vineyard is associated with.
This powerful feature gives users three distinct advantages. It allows them to:
- confidently explore the data, whether their knowledge of the U.S. wine industry is limited or extensive, without second-guessing whether the available options are valid
- quickly understand the attributes and characteristics that dominate a particular state or region
- compare the dominant attributes and characteristics of wines produced by wineries and vineyards all across the U.S.
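The dependent-input behavior described above, where each dropdown only offers values compatible with the other selections, can be sketched like this. It is a Python illustration of the idea rather than the app's Shiny code; the rows are invented (including the placeholder winery names), and `available_options` is a hypothetical helper.

```python
# Invented rows for illustration; winery names here are placeholders.
WINES = [
    {"state": "Oregon", "region": "Willamette Valley", "winery": "Winery A", "vineyard": "Estate"},
    {"state": "Oregon", "region": "Willamette Valley", "winery": "Winery B", "vineyard": "Reserve"},
    {"state": "California", "region": "Russian River Valley", "winery": "Winery C", "vineyard": "Reserve"},
    {"state": "New York", "region": "Finger Lakes", "winery": "Winery D", "vineyard": "Estate"},
]
FIELDS = ("state", "region", "winery", "vineyard")


def available_options(rows, selected):
    """For each input, list the values still available given the other selections."""
    options = {}
    for field in FIELDS:
        # Apply every active selection except the one on this field itself,
        # so a chosen input can still be switched to another valid value.
        active = {f: v for f, v in selected.items() if f != field and v != "All"}
        matches = [row for row in rows if all(row[f] == v for f, v in active.items())]
        options[field] = sorted({row[field] for row in matches})
    return options


print(available_options(WINES, {"state": "Oregon"}))
print(available_options(WINES, {"vineyard": "Reserve"}))
```

Selecting a state narrows the regions, wineries, and vineyards, and selecting a vineyard narrows the other three inputs in the same way.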
The next analytical tool allows the user to explore the top varieties of wine by geographic location (state, region) and review score (80 to 100 point scale). This tool displays a pie chart with the top five varieties of wine and their respective percentages based on the selected filters. For example, the top five varieties of wine produced in New York with a score of 90 or higher were:
- Riesling (61.4%)
- Cabernet Franc (11.9%)
- Pinot Noir (9.9%)
- Rosé (8.9%)
- Chardonnay (7.9%)
It's clear from this graph that Rieslings dominate, accounting for more than half (61.4%) of the top wines produced in New York. The other four varieties share relatively similar portions of the pie, hovering between 7 and 12 percent. This type of analysis is most beneficial for individuals or entities looking to start a winery and/or vineyard in their local state or region. Essentially, one can identify the varieties that dominate a specific state or region and, in turn, use that information to focus their own production on the other, less saturated top varieties.
Users are also encouraged to specify a range of review scores based on their preferences. The default range is set at 80 to 100 points, which encompasses all of the wines in the data set since Wine Enthusiast Magazine only publishes wine reviews for wines that scored 80 or higher.
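The pie-chart numbers can be reproduced with a filter followed by a count. The sketch below is in Python rather than the app's R, uses made-up rows, and assumes that each slice is a share of the top-n total (the percentages listed in the example sum to roughly 100%, which suggests this, but the article does not say so explicitly). `top_shares` is a hypothetical helper.

```python
from collections import Counter


def top_shares(rows, key, state="All", region="All", min_score=80, max_score=100, n=5):
    """Return the n most common values of `key` and each one's share of the top-n total."""
    selected = [
        row for row in rows
        if state in ("All", row["state"])
        and region in ("All", row["region"])
        and min_score <= row["score"] <= max_score
    ]
    top = Counter(row[key] for row in selected).most_common(n)
    total = sum(count for _, count in top)
    return {name: round(100 * count / total, 1) for name, count in top}


# Made-up rows, for illustration only.
rows = [
    {"state": "New York", "region": "Finger Lakes", "variety": "Riesling", "score": 91},
    {"state": "New York", "region": "Finger Lakes", "variety": "Riesling", "score": 90},
    {"state": "New York", "region": "Finger Lakes", "variety": "Riesling", "score": 93},
    {"state": "New York", "region": "Finger Lakes", "variety": "Riesling", "score": 88},
    {"state": "New York", "region": "Long Island", "variety": "Cabernet Franc", "score": 90},
    {"state": "Oregon", "region": "Willamette Valley", "variety": "Pinot Noir", "score": 92},
]
print(top_shares(rows, "variety", state="New York", min_score=90))
```

Passing a different `key`, such as a winery or vineyard field, would give the same kind of breakdown for those groupings.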
The next analytical tool allows the user to explore the top wineries filtered by geographic location (state, region) and review score (80 to 100 point scale). This tool displays a pie chart with the top five wineries along with the number of distinct wines each winery produced that land within the specified score range, represented as a percentage. For example, the top five wineries in Oregon, specifically the Willamette Valley region, with a score of 90 or lower were:
- Willamette Valley Vineyards (28%)
- David Hill (24%)
- Amalie Robert (17%)
- Left Coast Cellars (17%)
- Chehalem (14%)
Willamette Valley, Oregon
Fittingly, the winery with the most wines produced in Willamette Valley, Oregon with a score of 90 or less is Willamette Valley Vineyards. Moreover, this chart shows that Willamette Valley Vineyards produced twice as many wines within the specified score range as Chehalem (28% versus 14%). This type of analysis is most beneficial for individuals or entities looking to start a winery and/or vineyard in their local state or region.
Essentially, one is able to identify the wineries that dominate a specific state, region, or score range. This information can be extremely critical for young wineries to identify the top established wineries in their respective locations and to study their business models to further understand what makes them successful.
The next analytical tool allows the user to explore the top vineyard designations based on geographic location (state, region) and review score (80 to 100 point scale). This tool displays a pie chart with the top vineyards along with the number of distinct wines each vineyard produced that land within the specified score range, represented as a percentage. For example, the top five vineyards in the Russian River Valley region with wines that scored between 85 and 95 points were:
- Reserve (30%)
- Dutton Ranch (22%)
- Bacigalupi Vineyard (19%)
- Estate (16%)
- Saralee's Vineyard (13%)
This chart shows that 30% of the top wines for this particular region and score range carried a "Reserve" designation. The next top vineyards were Dutton Ranch and Bacigalupi Vineyard, accounting for 22% and 19% of the top wines, respectively. Rounding out the top five were the "Estate" designation (16%) and Saralee's Vineyard (13%).
Similar to the wineries tool, this vineyard analysis tool is most beneficial for individuals or entities looking to start a vineyard in their local state or region. Essentially, one is able to identify the top vineyards that dominate a specific state, region, or score range. This information can be extremely critical for young vineyards to identify the top established vineyards in their respective locations and to study their business models to further understand what makes them successful.
The next analytical tool allows the user to explore the frequency of wine review scores filtered by geographic location (state, region) and variety. This tool displays a histogram with wine review scores along the x-axis and the frequency count of wines with each score along the y-axis. Based on the example plot below, one can easily tell that the most common score among cabernet sauvignons produced in the Red Mountain region of Washington was 91 out of 100 points.
More specifically, exactly 49 wines scored a 91. The next highest frequency observed for this specific example was 43 wines that scored 90 out of 100 points.
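The counts behind a histogram like this are just a tally per score. The following Python sketch (not the app's R code) uses made-up scores and shows how to find the most common score, which is the tallest bar and not necessarily a majority of the wines.

```python
from collections import Counter


def score_frequencies(scores):
    """Return {score: number of wines}, ordered by score."""
    return dict(sorted(Counter(scores).items()))


# Made-up review scores, for illustration only.
scores = [88, 90, 90, 91, 91, 91, 92]
frequencies = score_frequencies(scores)

# The most common score is the one with the highest count.
most_common_score = max(frequencies, key=frequencies.get)
share_of_wines = frequencies[most_common_score] / len(scores)

# A text version of the histogram: one row per score.
for score, count in frequencies.items():
    print(f"{score} | {'#' * count} ({count})")
print(most_common_score, round(share_of_wines, 2))
```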
The next analytical tool allows the user to explore the probability density of wine prices based on geographic location (state, region), variety, and rating classification. This tool displays a probability density function (PDF) with wine prices per bottle along the x-axis as well as the probability density of wine at each price point along the y-axis.
The probability density can be interpreted as the relative likelihood that a randomly selected wine is priced near a given value; the probability of landing within a price range is the area under the curve over that range. Based on the example plot below, a randomly selected wine from this distribution is most likely to be priced between $48 and $50.
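To make the idea of a density curve concrete, here is a minimal Gaussian kernel density estimate in Python. The article does not say how the app estimates the curve, so the kernel, the bandwidth of 4.0, and the prices are all assumptions chosen for illustration.

```python
import math


def gaussian_kde(prices, bandwidth):
    """Return a function that estimates the probability density at a given price."""
    scale = 1.0 / (len(prices) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump centered on each observed price.
        return scale * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in prices)

    return density


# Made-up bottle prices in dollars, for illustration only.
prices = [30, 42, 45, 48, 49, 50, 52, 60, 75]
density = gaussian_kde(prices, bandwidth=4.0)

# Price on a $1 grid where the estimated density is highest.
peak_price = max(range(20, 91), key=density)

# Areas under the curve, approximated with $1-wide steps.
total_area = sum(density(x) for x in range(-50, 201))
approx_prob_45_to_52 = sum(density(x) for x in range(45, 52))
print(peak_price, round(total_area, 3), round(approx_prob_45_to_52, 3))
```

The density value at a single price is not itself a probability; only the area over a price range is, and the area under the whole curve is approximately 1.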
Prices by Rating
The final analytical tool I created allows the user to explore the price distribution by wine rating classifications based on geographic location (state, region) and variety. This tool displays a box plot (whisker plot) with the wine rating classifications along the x-axis and the distribution of wine prices for each rating along the y-axis. Let's use the state of Washington as an example.
Typically, wines with an acceptable (lowest) rating have the lowest median prices while wines with a classic (highest) rating have the highest median prices. Logically, this trend makes sense because ratings are a reflection of wine quality. In the same sense, theoretically, the lowest quality wines should have the lowest prices while the highest quality wines should have the highest prices. For the most part, this generalization holds true since rating classifications are ordinal in nature as defined by Wine Enthusiast Magazine's point system.
However, there are certain cases like the one shown above, where higher-rated wines have a lower median price than lower-rated wines. In the case of Washington state, wines with a "Good" rating have a median price point of $20 while wines with an "Acceptable" rating have a slightly higher median price point of $22.
Therefore, in practical terms, bargain hunters (people looking for the highest quality at the lowest price) in Washington with no specific varietal preference should look to purchase wines with a "Good" rating (83 to 86 points) over wines with an "Acceptable" rating (80 to 82 points) for the best bargain.
The reason is that this bargain hunter is likely to end up with a higher-rated wine at a similar or lower price than the lowest-rated wines. As seen in the example above, this type of analysis can be very useful for casual wine drinkers seeking the best "bang for your buck" wines.
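The comparison above boils down to grouping wines by rating, taking the median price of each group, and looking for cases where a higher rating is cheaper. The Python sketch below (not the app's R code) uses made-up score and price pairs chosen to mirror the Washington medians. Only the "Acceptable" and "Good" point ranges are stated in this article; the remaining bands are an assumption based on Wine Enthusiast's published scale.

```python
from statistics import median

# (minimum score, label), highest band first. Only the 80-82 and 83-86 bands
# are quoted in the article; the other cut-offs are assumed.
RATING_BANDS = [
    (98, "Classic"), (94, "Superb"), (90, "Excellent"),
    (87, "Very Good"), (83, "Good"), (80, "Acceptable"),
]


def rating_for(score):
    """Map a review score to its rating classification."""
    for floor, label in RATING_BANDS:
        if score >= floor:
            return label
    raise ValueError(f"score {score} is below the published range")


def median_price_by_rating(wines):
    """Group (score, price) pairs by rating and return each group's median price."""
    groups = {}
    for score, price in wines:
        groups.setdefault(rating_for(score), []).append(price)
    return {label: median(group) for label, group in groups.items()}


def inversions(medians):
    """List (lower rating, higher rating) pairs where the higher rating is cheaper."""
    ordered = [label for _, label in reversed(RATING_BANDS) if label in medians]
    return [
        (low, high)
        for i, low in enumerate(ordered)
        for high in ordered[i + 1:]
        if medians[high] < medians[low]
    ]


# Made-up (score, price) pairs, for illustration only.
wines = [(80, 20), (81, 22), (82, 24), (84, 18), (85, 20), (86, 25), (88, 30), (89, 28)]
medians = median_price_by_rating(wines)
print(medians)
print(inversions(medians))
```

Each pair returned by `inversions` marks a potential bargain: the higher-rated group has the lower median price.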
In a similar fashion, this analysis tool has practical applications for industry professionals as well. For example, let's look at all of the wines in the U.S. classified as a meritage variety.
"Acceptable" wines have a relatively high median price of $48 while "Good" and "Very Good" wines have significantly lower median prices at $35 and $34, respectively. Even "Excellent" wines have a lower median price ($45) than "Acceptable" wines. Therefore, an industry professional that works for a winery that produces meritage wines with "Good" or "Very Good" ratings could use this analysis as a foundation to justify a price increase in order to maximize profits.
Although many more factors would need to be considered along with this analysis to make a compelling argument, the foundational reasoning behind this move is that if the particular winery described above is selling their "Good" and "Very Good" rated wines close to or below the median price points for those ratings, then they are likely to generate more profit by increasing their prices to be slightly below the median price point for "Excellent" wines.
Ideally, these newly increased prices would not deter casual consumers (bargain hunters) with a preference for meritage wines because the "Good" and "Very Good" rated wines would still be cheaper than "Acceptable" and "Excellent" rated wines. As a result, casual consumers looking for the cheapest wines with the best rating would still be inclined to purchase "Good" and "Very Good" wines (hopefully from your winery) over "Acceptable" wines due to the lower median price points while your winery generates more profit due to the carefully calculated price increase.
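The arithmetic behind this pricing argument can be worked through with the meritage medians quoted above. In the Python sketch below, the 5% margin standing in for "slightly below" and the $35 starting price are assumptions made for illustration, and `price_headroom` is a hypothetical helper, not part of the app.

```python
# Median prices for meritage wines as quoted in the article.
MERITAGE_MEDIANS = {"Acceptable": 48, "Good": 35, "Very Good": 34, "Excellent": 45}


def price_headroom(current_price, ceiling_rating="Excellent", margin=0.05):
    """Return a target price a fixed margin below the ceiling rating's median,
    along with the percentage increase it implies over the current price."""
    target = MERITAGE_MEDIANS[ceiling_rating] * (1 - margin)
    increase_pct = 100 * (target - current_price) / current_price
    return round(target, 2), round(increase_pct, 1)


# A "Good" meritage currently sold at that rating's median price (assumed).
target_price, increase_pct = price_headroom(35)
print(target_price, increase_pct)

# The target still sits below the "Acceptable" median, as the argument requires.
print(target_price < MERITAGE_MEDIANS["Acceptable"])
```

As noted above, this is only a starting point; demand, costs, and competition would all need to be considered before changing prices.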
Lastly, although these analytical tools can provide very helpful insights when used in isolation, more profound insights can be drawn when they are used in conjunction with one another. I hope you enjoyed using this application as much as I enjoyed creating it. Thank you for your time, and check out my GitHub if you'd like to see the code behind this project.