Using Data to Analyze: Capstone Project: Iowa Liquor Sale
The skills we demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Data Science Background
Data of Alcohol consumption by humans has been recorded as early as 5500 BC in Iran. Liquor products are ubiquitous, and according to the 2019 National Survey of Drug Use and Health, 85.6% of people 18 or older have reported that they have consumed alcohol at least once in their lives. We decided to analyze liquor sales in Iowa not only to try and predict trends in Liquor sales for Hy-Vee stores, but also try to understand the omnipresence of alcoholic products in today’s society.
The Dataset
The dataset was acquired from the Government of Iowa. It contains the spirits purchase information of Iowa Class “E” liquor licensees by product and date of purchase from January 1, 2012 to the present. Iowa is one of the 18 remaining Alcohol Beverage Control or “ABC” states. Being an ABC state essentially entails collecting a portion of the sales revenue and allocating it for public spending funds such as education initiatives, human services/prevention efforts or local municipality spending.
The Iowa Alcoholic Beverages Division (ABD) is responsible for the licensing and permitting of alcoholic beverages to retailers, manufacturers, importers and brokers. The dataset consists of over 24 features, which include (but are not limited to) date, store_number, store_name, store_location, county_name, category (of alcohol), state_bottle_cost, state_bottle_retail, sale($) as well as other details about the alcohol sold. The data consisted of over 24 million observations giving us a large dataset to work with. Our intentions were to provide valuable business insights to Hyvee through the following:
- Effective and efficient data cleaning to handle inconsistencies in missing data
- Logical feature engineering to create additional valuable and prominent features
- Time Series Analysis to understand seasonality trends and predict liquor sales over the next couple years
- Location analysis to understand and produce efficient and quintessential expansion models
Data Preprocessing and Feature Engineering
Our initial dataset consisted of over 24 million observations. Using Google Bigquery for data warehousing and executing SQL queries, we were able to subset the data frame for data pertaining only to Hyvee stores. The first step of our data preprocessing was to look for any missing data to cover any gaps.
Our dataset was fairly complete and did not have a scattered array of missing values. Store_location(longitude and latitude) was the only feature with significant missing inputs. Imputation of the missing store_locations was processed using address, zip_code and city. No other data imputation was necessary. We then split the store_location(longitude and latitude) into separate longitude and latitude features.
Date was also split into separate month, day, year and day_of_week features in order to ensure effective time series analysis. Day_of_week was added to understand the sale of liquor on Sundays and compare it to the sale of liquor on all the other days of the week. This feature was added due to the new law passed in July 2021 that permitted the sale of liquor on Sundays (effective July 1st, 2021) for additional revenue.
New features for profit_per_bottle and profit_per_sale were engineered to understand profit distributions per sale. The profit_per_bottle was created subtracting the state_bottle_cost from the state_bottle_retail. The profit_per_sale feature was created by dividing the bottles_sold by the profit_per_bottle. These new features were crucial in interpreting our location and time series analysis.
Hyvee is a diverse organization with many different types of stores and different types of alcohol. Hyvee has a chain of not only supermarkets but also drug stores, convenience stores, gas stores, as well as stand alone liquor stores. A feature for the store_type was also created in order to help us understand the wide array of sales the different types of Hyvee stores were concluding. This feature was engineered by creating a function which recognized keywords such as “gas” and “supermarket,” etc., and then placing the store under that particular store_type.
A secondary feature categorizing liquor_type was also composed by creating a similar function –described previously – that recognized key words pertaining to the liquor_type such as “vodka,” “gift,” “whiskey,” “moonshine,” etc. This feature was fundamental in comprehending consumer purchases. A bottle_size feature was also engineered to identify the type of liquor sold in various sizes in milliliters.
Exploratory Data Analysis
To start, we wanted to get an idea of how Iowa compares to the rest of the country in terms of liquor consumption to determine if the information we analyzed about Iowa would be relevant to the rest of the country, or if Iowa is an outlier with above or below average consumption. As shown in the graph above, Iowa is only slightly below the national average, at 2.4%, which makes it not an outlier and an optimal state to analyze.
The upper left boxplots show a general uptick trend in sales revenues over the years of our dataset, with earlier years that include outlier months with extremely high sales revenues. The upper right boxplots show high sales revenues in October and December over the ten years since 2012. It’s expected because October is a drinking month when alcohol festivals are held each year. December is the Christmas holiday month when families gather and drink for celebrations and often gift each other various liquors.
We also tried to gain more business insight by looking at store type versus sales for each type of alcohol to examine which alcohols generate more sales in the different stores. First we looked at Hy-Vee supermarkets, as shown below, and were able to observe that these stores had the highest average sales of gifts, tequila and whiskies.
This same observation was seen in liquor stores with these alcohols, followed by rum and mezcal. On the contrary, we can see that, while gas stations had the least amount of locations overall, they did seem to have more sales of liqueurs, tequila and vodka. This suggests that store type has an effect on alcohol sales for each different alcohol type and that certain store types should specialize in certain liquor types.
Location Data Analysis with Mapping
After creating the latitude and longitude columns for our dataframe, we proceeded to use this information to plot the Hy-Vee store locations to gain additional insights. One of the ways we visualized this information was through the use of Folium. Since Folium allows us to create interactive maps with lots of configuration options, we chose it to map our geospatial data so that we could easily explore different regions of Iowa for analysis.
Ultimately we created a time-lapse heatmap with folium to show the locations of Hy-Vee stores around Iowa in the ten year period to see where more stores were located and where new stores were being built. This was done by creating a new dataframe consisting of the year, latitude, and longitude of the individual stores. Then the year column became the index or key and the latitude and longitude columns became the values which were then sorted by year and plotted using the HeatMapWithTime plugin from Folium. Finally, to view it externally, it was saved to an html file.
What we saw initially was an abundance of stores in the bigger cities such as Des Moines, Cedar Rapids and Iowa City. Next, when looking at the map of the entirety of Iowa, we see that in recent years it appears that Hy-Vee is expanding locations and store types into the more rural areas of Iowa. Furthermore, when we ‘zoom in’ on the cities, such as Des Moines, we see that while there were more locations in the central, more densely populated region of the city in the earlier years, more locations have been built in the suburbs or outskirts of the city in recent years.
This could be a response to the recent “exodus” from the cities resulting from the pandemic where more people took up residence in the suburbs and rural areas for the cheaper cost of living and access to more space. This suggests that it could be beneficial for Hy-Vee to continue expanding to these areas where people are relocating and adding different store types such as supermarkets, convenience or gas locations to grow their profits as a company.
In our original dataset, we also have a column that contains descriptive information regarding the type of store and alcohol types that are sold. Using this description, we engineered a new column that generalizes store type and alcohol type. We generalized store types into 5-6 categories: gas station, convenience store, drug store, liquor store, supermarket and fast & fresh (Hy-Vee version of fresh ready-to-go food).
As mentioned earlier, some stores were missing geospatial information. As a solution, questmap API and geopandas were used to extract this information and fill the missing values. Moreover, we wanted to see how these store types fared in different cities/counties in Iowa based on population and demographics.
As analyzed in the above map, Des Moines, one of the largest cities in Iowa, has the most dense concentration of stores in the state. It can also be deduced that higher population areas (dark red), impact sales and profit. Major crossroads and highways impact sales, as well. Specifically, Highway 80 has an immense impact on sales. Cities with highest profit per bottle like Des Moines, Davenport, Cedar Rapids, all enjoy the accessibility offered by major highways.
Furthermore, to explore how different store types impact profit per bottle, we subset our data from 2019-2022 and explored profits by store type. Intuitively, liquor stores and supermarkets, which usually specialize in selling mid-high end priced liquors, consistently outperformed other categories. Direct linear correlation can also be observed between profit per bottle and cost per bottle.
To explore further, we wanted to gauge the difference in profits based on city and store type. For example, larger cities have a more diverse selection of store types in contrast to the more specialized stock in smaller cities. An investor should keep such differences in mind if a decision when planning to expand Hy-Vee franchises into smaller cities.
As mentioned earlier, the state of Iowa introduced a new law in 2021 that allowed the sale of alcohol on Sundays. Due to the applicability for our dataset, we wanted to see how the new law has impacted the sale of alcohol-based on store type.
As we can see for all store types, barring convenience stores, have seen an increase in average profit per bottle sold. Interestingly, liquor stores and supermarkets have seen the largest increase in average profits per bottle. This could be attributed to the fact that these store types sell higher priced alcohols thus have proportionally higher profit margins.
Time Series Data Analysis
Since the timeframe of our data is broad, spanning from 2012 to 2022, we need a time series analysis to figure out patterns and trends in liquor sales.
To start with, we need to specify our time-space. Since time series prediction models perform best when the timeframe is equally spaced and our dataset has ten years of data, we decided to aggregate the data into a monthly window to ensure sales data for each month over the ten years.
After getting monthly revenue for all the liquor sales, we decomposed data using multiplicative and additive decomposition methods. As shown below, both ways indicate an overall uptick trend and a clear seasonality every year in sales. Also, the residual plots in both methods show variability and an almost stationary time series. To prove our aforementioned findings, we will build models progressively.
Before we start building models, we first look at the stationary status of our data. The first metric we used - ADF Test - returns a p-value of 0.99. Since the p-value is greater than the threshold of 0.05, we accept the null hypothesis that the data is non-stationary. Another metric we evaluated was the KPSS Test, which returns a p-value of 0.01.
Since the p-value is less than the threshold of 0.05, we reject the null hypothesis and conclude the data is non-stationary (Please note the null hypothesis of the ADF Test and the KPSS Test are opposite). Therefore, we went ahead to difference the data to make it more stationary for a better performance of our models.
Unfortunately, the above plots show that after just one difference, the autocorrelation went from positive to negative at first lag, showing one difference will overly correct our data. This finding confirms our earlier conclusion from decompositions that the original monthly sales data is almost stationary after decomposing trend and seasonality. Therefore, we pick the d-value of all our ARIMA-related models as 0, i.e., we don’t difference our data.
To select the p and d values of our ARIMA-related models, we look at the ACF and PACF plots above. The moving average correlation drops below the significance level after eight lags, and the autoregression correlation drops below the threshold after two lags. Therefore, we pick our p-value as 2 and the q-value as 8. Both plots show significant spikes at 12th lags, indicating the seasonality of the dataset, further indicating seasonality plays a part in our model selection.
The upper left plot shows the ARIMA model (2,0,8), and the upper right plot shows the SARIMAX model (2,0,8) (2,0,8,12). We picked P, D, Q as the same as the p,d,q of the ARIMA model for comparison purposes. Then we ran three metrics – MAPE, CORR, and MINMAX - to compare the two models in addition to the visualizations above. We used the three metrics over others because they are all percentage-therefore-accuracy-based so that scales won’t affect the results. The result below shows that the SARIMAX model outperforms the ARIMA model for every metric.
Now that we know our model performs better when we include the seasonality element, we need to look at some model that considers trends in addition to seasonality. Therefore, the Exponential Smoothing model with the Halt-Winter’s method catches our eye. Such a model and method allow us to consider the seasonality of 12 months and trends.
Since we see smaller variability in the additive decomposition’s residual plot than in the multiplicative decomposition shown above, we use additive as our model’s trend element. Also, we set use_boxcox to be ‘True’ in the model parameter to reduce the residual heteroskedasticity. The plot and the metrics below prove this model performs better than our earlier ARIMA and SARIMAX models, and it catches the trend and seasonality of the original dataset very well.
Using the Exponential Smoothing model above, we are ready to predict future sales revenue. In the prediction below, gray shades representing a 95% confidence interval, our model expects the sales revenue to reach record highs at the end of this year and even higher for the following year.
However, the SARIMAX model prediction below expects the sales profit to stay at the same level as in 2020 and 2021. As a result, we would expect the cost to go up for the liquor business in Iowa, which is reasonable because of the high inflation caused by factors just as Quantitative Easing (QE) by central banks, geopolitical tensions, and conflicts that caused demand-pull inflation.
Especially in the oil and semiconductor sectors, cost-push inflation, especially for consumer goods, and various other factors. Some possible ways to curb the rising cost might be to sell the most popular liquor types and brands to avoid excessive inventory, resulting in lost revenue and opportunity costs, and expand into other regions where liquor demand is high.
Supervised and Unsupervised Machine Learning Using Data
We decided to use K-Prototypes for clustering analysis, since our dataset contains both categorical and numerical types. Starting by lowering all string data types to ensure we didn’t count the same thing twice for our analysis. Then we dropped some columns with too many unique values, making it hard for the clustering model to run and not valid for clustering since we only have 25 columns initially after preprocessing our data.
For instance, column ‘item_description’ has 3902 unique values. We also removed columns having the same meanings as other columns. We dropped ‘coordinates,’ ‘zip-code,’ ‘county’ columns because the remaining ‘address,’ ‘city’ columns could also be used to identify the store’s location. Another reason is that we did a separate location analysis above. Therefore, we don’t need this many features. One last step we worked on the categorical features was to make sure the values for each feature were balanced. Otherwise, we could do naïve based models.
VIF testing the Data
We did VIF testing to filter out features with high collinearity for numerical features. Some features such as state_bottle_retail, state_bottle_cost, and profit_per_bottle have extremely high VIF, which makes sense because sales revenue, cost, and profit are expected to be correlated. We took these features out as we analyzed them in separate sections. In the end, we standardized all numerical features to avoid the model putting weight on those with high variance.
After ensuring our final dataset has no null values, we trained our model and got the above segments table. We can tell the model clusters store_type into the supermarket for all segments and liquor_type into whisky for the majority of all segments, which makes sense because they are the most frequent types within our dataset.
In addition, all the day_of_week clusters are around the early days of a week (0 is Monday, 1 is Tuesday, …and 6 is Sunday). This agrees with the fact that stores buy liquors from vendors early in each week for sale at the end of each week, as the dataset is about stores buying from vendors. Other information in our clustering is also helpful to vendors and stores as they tell which type or value is popular.
Summaries and Conclusions
Based on the analysis above, we draw the following conclusions:
- Specialization should be key for anyone looking to enter the Iowa liquor market.
- In General, Whiskey tends to outperform all other liquor categories by profit margins and overall profit.
- Some cities with high populations, like Sioux City, Omaha, and Davenport, don’t have high average liquor sales profits.
- These cities are primed to be investment targets based on construction developments and highway constructions.
- Sales are expected to reach record highs in 2022 and 2023. However, profits are expected to be static due to retail cost increases driven by overall microeconomic/macroeconomic trends (logistic issues, inflation, geopolitics, etc.).
- Hy-Vee Supermarkets and Liquor Stores in Iowa should strive to increase profit margins, and Convenience Stores, Drug Stores, and Gas Stations should increase inventory types while maintaining current margins.
Future Works
Our analysis and understanding of liquor sales is fairly incomplete. We would need to paint a more explicit picture to fully understand the liquor sales in Iowa. Accumulating the income distribution data in Iowa would be a huge step in apprehending the sales distribution of different types of liquor.
We would also like to add consumer data to develop a deeper understanding of not only the product purchased but also gain intel on our targeted audience for the range of stores Hyvee owns. Gathering complete receipts of a consumer would be quintessential to complete a true market basket analysis of products being purchased along with alcoholic products. We can then create an app and place similarly purchased products together based on the market basket analysis.
Authors' LinkedIn
Visit Github for source code.