Data Analytics on Liquor Retail

, and
Posted on Oct 14, 2020
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.


From a casual cocktail by the pool to a celebratory toast, alcohol frequently plays a part in our social lives, but it also plays an important role in our economy. In the US alone, data shows consumers spent $117.3 billion on beer, $72.2 billion on wine, and $64.3 billion on spirits, totalling $253 billion spent on alcohol in 2018 [1]. Beyond its size, the alcohol industry has also shown resilience through recessions, making it both a profitable and stable industry. 

Our purpose is to provide useful insights and recommendations for stores selling class ‘E’ liquors, aka spirits or hard liquors like gin and whiskey. We derive these insights with retail analytics, the process of providing analysis on sales trends, inventory levels, and consumer demand, which are crucial for making marketing and procurement decisions that can increase profit margins, minimize overstocking costs, and identify other processes in a business that need improvement. 

The Data

The data used in our study come from the Alcoholic Beverages Division of Iowa which maintains a record of alcohol sales in Iowa to establishments with class ‘E’ liquor licenses for off-premise liquor consumption. The data set contained 19.4 million transactions or observations, spanned eight and half years from 2012 to 2020, and included 24 features that detail sales date, sales volume, sale price, and descriptions for the products, stores, and vendors.

Preprocessing Data 

This dataset contained many inconsistent values including:

  • One liquor item mapped to different product categories or different product names producing over-fragmentation
  • A single storemapped to multiple store names or lat-long coordinates
  • A single vendor mapped to different vendor names

Relational Database

First, to make the data more manageable, we created a relational database using sqlite and divided up the features into smaller tables to clean up these inconsistencies. Below is the entity relationship (ER) diagram for the relational database.

Data Analytics on Liquor Retail

There are five tables in the database:

  • Transactions: Master dataframe, transactions ID as the primary key, record of each transactions’ detail.
  • Vendors: Vendor ID as the primary key, record of vendor information including vendor name.
  • Stores: Stores ID as the primary key, record of store information including store name and location.
  • Products: Product ID as the primary key, record of product information including product name and category.
  • Product Price: Product ID and Date as the primary key, record of product price with respect to time.

Dealing with inconsistency data

Then we used a majority vote to clean up the inconsistent values for each table. For example, for the vendor table, we counted the number of times an ID associated with each name and used the most frequent pairing as the unique mapping for that ID. This allowed us to use the vendor ID as our primary key for the vendor table in our relational database. We applied the same methods for the store, product, and product price tables as well.

Clustering into Urban and Rural using DBSCAN

It may not make sense to place a store in a rural location with fewer potential customers, but opening  a new store in a densely populated area could mean fierce competition from other nearby stores. Therefore, identifying a store’s location as rural or urban can be important as strategies for success can differ between the two. We wanted to ask:

  • If stores were more successful in urban areas or rural ones?
  • What types of stores were more successful in urban areas, in rural?
  • What types of stores are found in urban areas, in rural?

To classify the stores as urban or rural, we used DBscan to form clusters based on the stores’ proximity to each other. Stores closely packed together, were assigned to a cluster, while outliers were assigned as noise and not assigned to a cluster. Below is the resulting eight clusters (red) and their noise (blue). The county population is also shown in shades of green. All clusters overlaid major cities, so we labeled all clustered stores ‘urban’ and all non-clustered noise ‘rural’.

Data Analytics on Liquor Retail

Store Data Analysis

Which types of stores sell the most alcohol?

Data Analytics on Liquor Retail

The bar plot above shows the average monthly stores sales volume for the different store categories (aka supermarket, liquor/tobacco store, drug store, convenience store, and other).  Supermarkets sell the most liquor on average and convenience stores sell the least. This may be because supermarkets accommodate more customers and convenience stores less.

Which stores survive the longest?

The above plot shows a store’s longevity in days. On average, supermarkets last the longest and liquor/tobacco stores the shortest.

If we look at their survival curves, we can see supermarkets have a low closure rate and convenience stores and liquor/tobacco stores have a higher closure rate. Interestingly drug stores seem to have a low closure rate until year nine, but this is due to a large number of drugstores emerging a year after the start of the dataset and continuing on to 2020. This drop from 8 years to 9 is really a reflection of some drug stores being a year behind and continuing to survive to present times.

Data on the Stores Entering the Market

We also look at the number of new stores emerging each year. The majority of the stores emerging are convenience stores. It is interesting to note that there was a policy change in Iowa in 2011, a year before the start of this data. That year, convenience stores were granted the ability to sell liquor for off-site consumption which led to an increase in the number of stores that applied for class ‘E’ liquor licences. These stores were consequently included in the data and this growth trend in convenience stores may be due to this policy change.

Does location affect store success?

The above plot shows us the average monthly stores sales volume of the store types in urban (red) vs rural (blue) locations, with the 95% confidence interval represented by the error bar (black line). Supermarkets in urban areas sell significantly more than those in rural areas, nearly 2.5 times more.


Rural & Urban Data

For both rural and urban areas, convenience stores dominate, followed by others, and then liquor/tobacco stores. Urban and rural areas differ in their proportions of drug stores and supermarkets; urban areas have a larger proportion of drug stores and smaller proportion of supermarkets than rural areas.


How are stores growing over time and are they competing with each other?

Gross Value

Next we looked at how gross sales volume changed over time and how the different store types split up these sales. The blue line in figure above shows gross sales volume has grown over time (see left y-axis: total volume of alcohol sold in millions of liters), and the normalized stacked bars show how much each store category contributed each year and how those contributions changed over time.

For example, the proportion of convenience stores in red, has grown over the years, while the proportion of liquor and tobacco stores has decreased, particularly from 2012 to 2013. Supermarkets seem to have maintained a steady sales proportion, which could indicate that supermarkets have been growing at the same rate as total sales or that supermarkets may account for most if not all the growth we see in alcohol.

Below in Figure 1, we break down the information into three graphs and connect the general trends. Figure 1a shows the total sales volume for each category. For supermarkets, in purple, total sales are increasing. Figure 1b, shows the number of stores selling alcohol each quarter. Here the number of supermarkets stayed constant and started growing in the last couple of years.

Figure 1c shows the average quarterly sales on a per store basis; how much a single store sells on average. We wanted to see if the individual stores were doing better or worse over the years and supermarkets seem to be the only type of stores that are doing better on a per store basis.

If we look at convenience stores now in blue, we see again their overall sales are increasing, but really this trend seems to be driven by its massive expansion of stores (Figure 1b), because its average quarterly sales stay constant (Figure 1c). Recall from the stacked bar plot that the proportion of sales from convenience stores has been increasing, it looks like this trend is driven by an increase in number of stores.

Finally let’s look at liquor stores in green. From the stacked barplot, remember that the proportion of sales for liquor stores looked like it was decreasing. We see that it does look like quarterly sales are decreasing as well, but it does not seem like that is driven by the number of stores, as this is increasing slightly. On average liquor stores are faring worse over the years, selling less on a per store basis.

Market Basket Data Analysis

Is the growth of the convenience stores or supermarkets contributing to the downfall of liquor stores? To explore potential competition between store types, we used market basket analysis whose purpose is to find hidden patterns in transaction records: what products are most frequently bought and which of those are bought together? We applied a market basket analysis on each of the store types and compared their results to detect similarities. Stores that sold similar items and held similar patterns could be in direct competition with each other.

From the market basket analysis, we compared the top 20 most frequent items for each store type and found an interesting pattern. The top row of Venn diagrams below compares the supermarket top frequent items to those of the other categories and the bottom row compares convenience stores to those of other categories.

Interestingly the store types fell into two groups, where drug stores and supermarkets had similar products, and convenience stores and liquor stores were most alike. Liquor stores having similar products to convenience stores could have contributed to the decrease in liquor store sales. As the number of convenience stores expanded competition increased for liquor stores leaving them with fewer sales and customers.

KMeans Clustering on Store Demand

We also looked into the demand patterns of different store types by using K-Means to group all the individual stores into three clusters, based on their normalized annual sales volume in the past 10 years. We normalized the sales volume to focus on the trend regardless of the scale of the volume. Also, we filtered out stores that only present for a short period of time and kept those that were in business during the entirety of 2012 - 2020. This left us with 989 out of 2500 stores. 

The above graph shows the average annual sales volume trend of each cluster. 

  • Red Line: Stores with generally large sales volume but volume has been gradually decreasing over the years 
  • Green Line: Stores with medium sales volume and experienced a high volume growth rate over the years 
  • Blue Line: The stores with relatively small sales volume and first experienced a volume growth but then hit the plateau around 2017-2018.

After further investigation, we found that cluster 1 has the highest percentage of liquor tobacco stores and the lowest percentage of convenience stores which might account for the drop in sales volume.

Product Data Analysis 

The next step in our analysis was to look into the liquor products to understand the dynamics of different types of liquors; their popularity, sales channels, and seasonality of demand.

There were over 9000 unique liquor products sold in Iowa in the last 10 years which we categorized into 10 alcohol types based on their product descriptions. From Figure 2a, we can see whisky and vodka have the highest overall sales volume. The two combined occupied over 60% of the entire market. Following whisky and vodka were rum, liqueur, tequila, cocktails, brandy, and gin.

The stacked bar graph in Figure 2b shows what categories of liquor each store type carried. Convenience stores and drug stores had a higher percentage of vodka and whisky while liquor stores and supermarkets had a greater percentage of other types of alcohol, potentially representing a more diverse stock.

Seasonality of demand

Next, we examined the seasonality of demand for different types of alcohol. From the graph above, we can observe some clear seasonality patterns: 

  1. Summer peak: Sales volume peaks in May - July for tequila, gin, and cocktails indicating those drinks are popular in the summer.
  2. Winter peak: Sales volume peaks in December - January for brandy and liqueurs.
  3. Multi-peak: Sales volume peaked multiple times throughout the year for whisky, vodka and rum. This could be because these liquors generally are versatile and can be enjoyed in a variety of different beverages and for different occasions. 

Top 10 liquor products 

Finally we looked at the top 10 most popular products and their sales channels. The top 10 products were sold in 1840 stores out of the total 2500 stores. For comparison, a given product was sold in only 28 stores on average, leading us to believe that sales channels is playing a big role in the success of a product. 


The graph above shows the sales volume for each of the top 10 products broken down by store type. As the overall sales volume increases for those products, the proportion sold in supermarkets also increases. This makes sense because as we mentioned in our store analysis, supermarkets were the leading in both total sales volume and individual store sales volume.

Demand Forecast 

Demand forecast and inventory management are important tasks in e-commerce and retail industry to maintain profitability. We used the ARIMA model with seasonality parameters to forecast liquor sales volume. We used a top-down approach to forcast total demand and then focus on specific store types and product categories. The general process we took to build the ARIMA model are as follows:

  1. Created time series data by aggregating monthly sales volume 
  2. Conducted a seasonal decomposition to examine trend, seasonality, and residuals 
  3. Used Auto Arima in R to select the best ARIMA model based on Maximum Likelihood and AICc Score
  4. Checked model residuals and confirmed there is no additional patterns in the residuals 
  5. Predicted on test data and evaluated model performance 

From the seasonal decomposition above, we can see that the total volume time series data can be decomposed into a trend and a seasonality factor. The residual from the decomposition shows no significant patterns, which is a sign that the ARIMA model is appropriate.

We used the first 75% (January 2012 - June 2018) of the monthly data as training, and the latter 25% (July 2018 - August 2020) as test data. The MAPE (Mean Absolute Percentage Error) of the training and testing sets were 4% and 5% respectively. From the two graphs above we can see that the ARIMA model was able to capture the fluctuation patterns pretty well. 


Our analysis included many aspects of the liquor market in Iowa, including location, store competition, and products. To tie everything together, we’ve created a condensed version of recommendations for each store type, shown here is the supermarket example. This 360-Degree review aims to give supermarket owners or investors in the liquor business the most relevant information they need to know in order to open or maintain a store. 


  • From a store longevity perspective, supermarkets generally survive the longest amongst all store types, 75% of the stores can last at least 9 years; and with 5% overall liquor sales volume increase year over year and 25 new stores opening per year, it’s relatively fast growing in revenue and relatively less intense competition among its own type. 
  • Location wise, on average a supermarket in an urban area sells 2.5 times as much compared with a rural area, which is a significant difference that is not observed for other store types. So we’d recommend choosing a supermarket location at a more densely populated area.
  • In terms of competition with other store types, even though supermarkets are currently the leader in terms of liquor sales volume, we suggest paying some attention to drug stores since they carry similar products and show similar transaction patterns from our clustering analysis. 
  • As for what products to stock for supermarkets, the general breakdown is 30% whisky, 30% vodka, and 40% everything else. However from our time series forecast, we were able to use our model to give a more fine-grained number for a specific liquor type for supermarkets. Following the top down approach, we built an ARIMA for all Whisky products sold in supermarkets specifically, and predicted that the total demand for whisky in all supermarkets in Iowa is about 575 thousands liters for October 2020, with an error margin of 7%. Dividing that by the number of supermarkets, that’s roughly 1600 L of whisky for a single supermarket. 
  • Based off our market basket analysis our top suggestions to stock for Supermarkets are Black Velvet whiskey, Hawkeye vodka, Seagrams 7 whiskey, and Admiral nelson spiced rum.
  • The following combinations of alcohols also seem to be bought together frequently, Jagermeister and Seagrams 7 Whiskey, Kahlua and Seagrams, Five star whiskey and Five o’clock vodka, and old crow whiskey with either Seagrams or Five o’clock vodka. If stocking up on one of these products, make sure to order the others and maybe even display these products together or offer special discounts when bought together.

Future Work

  • Pull more demographic information, such as poverty rate, education level, and income for a more in depth study of store location analytics.
  • Use different ML probabilistic classifiers to estimate the fraction of each liquor category and use those to forecast the sales percentages of each liquor category.


[1]  “U.S. Beverage Alcohol Spending Hits $253.8 Billion in 2018, +5.1% versus 2017” bw166, 9/20/20,

Contact Us

If you have any questions or comments, please feel free to reach out to any of us on LinkedIn or GitHub.

About Authors

Judy Chung

Data Scientist with five years of lab manager experience in microbiology and ecology. Strong research professional with a Bachelor's degree in Integrative Biology from University of California, Berkeley. Excited to generate data-driven insights to drive business value.
View all posts by Judy Chung >

Qing Ying

Qing (Sophie) Ying graduated from UC Berkeley with a Master's degree in Industrial Engineering and Operations Research. She has been a product manager in a healthcare technology startup for 3 years, where she developed various data analytics products...
View all posts by Qing Ying >

Xiaogang Zhu

I am graduating from the University of Florida majoring in Ocean Engineering. When I am pursuing my master's degree I also earned a Statistics Minor degree. I got my Ocean Engineering bachelor's degree at the Ocean University of...
View all posts by Xiaogang Zhu >

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI