Data Visualization of CatskillProvisions.com
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
BACKGROUND:
CatskillProvisions.com
Catskill Provisions is a tale of two businesses. The data history shows the first to be a wholesale business that sells products such as honey, honey-based whiskey and NY maple syrup directly to restaurants and liquor stores. The business was first conceived through a love of beekeeping.
The bulk of the company's revenue comes from the wholesale business. The second business, the ecommerce website CatskillProvisions.com, generates little revenue by comparison. However, the website has served as a branding engine and online sales presence for the company since inception.
CatskillProvisions.com sells food specialty items such as honey, truffles, marinades, gift sets and more. Organic honey was the first item sold online in 2010. Honey continues to be the predominant product sold through the site.
Given its position, CatskillProvisions.com could generate a much larger share of revenue relative to the wholesale business. With a general understanding of the customer base and some underlying investment, CatskillProvisions.com is positioned for growth.
ECOMMERCE CUSTOMER PROFILE
Based on a review of the dataset and simple exploratory data analysis (EDA), the typical customer who purchases from CatskillProvisions.com has the following profile:
- Female, 76%
- Lives in the northeastern US, 67%
- One-time purchaser of products, 85%
- Shops on Tuesdays and Thursdays (from work)
- Purchases honey, truffles and gift sets
- Uses a Gmail, Yahoo or work email domain
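A profile like the one above can be derived by collapsing orders to one row per customer and computing shares. The following is a minimal sketch, not the author's actual code: the field names (`customer_id`, `gender`, `ship_state`), the sample rows and the definition of "northeastern" (here the US Census Northeast region) are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: the Census Bureau's Northeast region stands in for "northeastern US".
NORTHEAST = {"CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA"}

# Hypothetical order records; the real export's column names are not given in the post.
orders = [
    {"customer_id": "c1", "gender": "F", "ship_state": "NY"},
    {"customer_id": "c1", "gender": "F", "ship_state": "NY"},
    {"customer_id": "c2", "gender": "F", "ship_state": "NJ"},
    {"customer_id": "c3", "gender": "M", "ship_state": "CA"},
    {"customer_id": "c4", "gender": "F", "ship_state": "MA"},
]

def customer_profile(orders):
    """Collapse orders to one record per customer, then compute profile shares."""
    customers = {}
    for o in orders:
        # Keep the first-seen attributes for each customer and count their orders.
        c = customers.setdefault(
            o["customer_id"],
            {"gender": o["gender"], "ship_state": o["ship_state"], "n_orders": 0},
        )
        c["n_orders"] += 1

    n = len(customers)
    if n == 0:
        raise ValueError("no customers in the order data")

    genders = Counter(c["gender"] for c in customers.values())
    northeast = sum(1 for c in customers.values() if c["ship_state"] in NORTHEAST)
    one_time = sum(1 for c in customers.values() if c["n_orders"] == 1)
    return {
        "female_share": genders["F"] / n,
        "northeast_share": northeast / n,
        "one_time_share": one_time / n,
    }

profile = customer_profile(orders)
print(profile)
```

Counting customers rather than orders matters here: a repeat purchaser would otherwise be counted once per order and skew every share.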
DATA ON WEB STORE CUMULATIVE PRODUCT SALES
The following bar chart represents not only the products purchased through the ecommerce site but also the basket of goods consumers purchased in a single order. For example, if a consumer purchased honey, a secondary purchase in the same basket was truffles or marinade, and so on.
Roughly 20% of all ecommerce consumers are repeat purchasers. Such a high percentage of repeat purchasers suggests the quality of the products and of the customer service provided. However, the fundamental issue for the ecommerce site is web traffic. Traffic and sales go hand in hand, yet traffic has declined over time, as shown in the chart below.
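The basket view described here can be approximated by counting how often two products appear in the same order. This is a sketch under stated assumptions: the basket contents below are invented, and the post does not say how the original chart was computed.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each inner list is the set of products in one order.
baskets = [
    ["honey", "truffles"],
    ["honey", "marinade"],
    ["honey", "truffles", "gift set"],
    ["honey"],
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for items in baskets:
        # Deduplicate and sort so ("honey", "truffles") is always keyed the same way.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts

pairs = pair_counts(baskets)
for pair, n in pairs.most_common(3):
    print(pair, n)
```

Single-item baskets contribute no pairs, so the counts describe only orders where a secondary purchase actually happened.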
DATA ON TOP SEO SEARCH TERMS
Looking further into the traffic issue, the search engine optimization (SEO) search terms are generic at best and do not differentiate the ecommerce website from any other website. In addition, the website does not run an active search engine marketing (SEM) program to drive traffic or promote products. This lack of promotion is the root cause of the traffic problem and, in turn, of weak product sales.
Top SEO terms are represented in the following word cloud.
DATASET
The CatskillProvisions.com data is transactional data. Through feature engineering that combined traffic data with the transactional data, the dataset was expanded for machine learning purposes. Key features included:
- Transaction date
- Transaction day
- Customer information
- Shipping/Billing information
- Repeat purchase
- Traffic by day
- Purchase total
- Sales
- Sales to Web Visit Conversion
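Combining traffic with transactions amounts to aggregating orders by day and joining them to daily visit counts. The sketch below is illustrative only: the records are made up, and conversion is assumed to mean orders divided by visits, which the post does not define explicitly.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical inputs; the real data's field names are not given in the post.
transactions = [
    {"date": date(2019, 3, 5), "total": 40.0},
    {"date": date(2019, 3, 5), "total": 25.0},
    {"date": date(2019, 3, 7), "total": 60.0},
]
visits_by_day = {
    date(2019, 3, 5): 100,
    date(2019, 3, 6): 80,
    date(2019, 3, 7): 50,
}

def daily_features(transactions, visits_by_day):
    """Join per-day order counts and sales to per-day traffic."""
    orders = defaultdict(int)
    sales = defaultdict(float)
    for t in transactions:
        orders[t["date"]] += 1
        sales[t["date"]] += t["total"]

    rows = []
    for day in sorted(visits_by_day):
        visits = visits_by_day[day]
        rows.append({
            "date": day,
            "weekday": WEEKDAYS[day.weekday()],
            "visits": visits,
            "orders": orders[day],
            "sales": sales[day],
            # Assumed definition: share of visits that ended in an order.
            "conversion": orders[day] / visits if visits else 0.0,
        })
    return rows

rows = daily_features(transactions, visits_by_day)
for row in rows:
    print(row)
```

Days with traffic but no orders are kept with zero sales, so that low-conversion days remain visible to any downstream model.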
CORRELATION
Given the nature of the data and the small size of the dataset, the features remain highly correlated despite additional feature engineering. Even so, the machine learning models revealed the structure of the customer segmentation by surfacing the top features in their training output.
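One way to quantify how correlated two engineered features are is the Pearson correlation coefficient. The sketch below implements it directly; the two sample series (daily visits and daily orders) are invented for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily values for two engineered features.
visits = [100, 80, 50, 120]
orders = [4, 3, 2, 5]

r = pearson(visits, orders)
print(round(r, 3))
```

Values of `r` near 1 or -1 indicate that two features carry largely the same information, which is the situation the section describes.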
Predictive features from each model showed great promise. The features identified were shipping region, repeat purchasers and gender. Despite this progress, more data is needed to fully test the models, understand their predictive quality and put them to use for the website.
MACHINE LEARNING
CatskillProvisions.com FEATURE: SHIPPING REGION
The shipping region feature was tested using Dataiku. A logistic regression model showed promise in predicting the feature, with an ROC AUC score of 0.76 and accuracy of 0.69. This model scored higher than others, including SVM, random forest and XGBoost. Reviewing the values for shipping region, it was fairly easy to see why the model ranked as well as it did: the feature is dominated by a single region, which makes the regions easier to categorize. Following is a chart quantifying purchasers by shipping region.
Clearly, the eastern US predominates in the dataset as the region where purchases originate and ship, enabling the model to categorize this feature accurately. A density chart further illustrates the predictive quality of the model.
The ROC AUC chart shows solid prediction of the shipping region. However, with a larger dataset, the curve would likely be smoother, demonstrating the strength of its predictive accuracy.
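For readers unfamiliar with the two metrics quoted above, the sketch below computes ROC AUC and accuracy by hand for a binary case. It is not the Dataiku workflow: the labels (1 = ships to the eastern US, 0 = elsewhere), the scores and the 0.5 threshold are all assumptions chosen for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Share of cases where the thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical labels and model scores.
labels = [1, 1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.2]

auc = roc_auc(labels, scores)
acc = accuracy(labels, scores)
print(auc, acc)
```

With only a handful of rows, the ROC curve can change only at a few distinct thresholds, which is why a small dataset produces a stepped rather than smooth curve.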
CatskillProvisions.com FEATURE: REPEAT PURCHASES
For the repeat purchases feature, two models proved robust at accurately categorizing and predicting it: Lasso regression and XGBoost. Both had similar R² values while also reporting high correlation. However, the Lasso regression model showed better results when reviewing overall model output.
The model errors were reviewed for normality. The Lasso error distribution is illustrated as follows:
The errors for the Lasso model fall close to zero but are highly clustered with a non-normal distribution. Again, the distribution indicates correlation.
XGBoost
The XGBoost error distribution looks a little better than the Lasso distribution but is still non-normal, which again indicates correlation.
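A quick numeric companion to the error charts is to compute the residuals and their skewness, since a normal distribution has skewness near zero. The sketch below uses invented actual and predicted values for a repeat-purchase target; it is not the output of the Lasso or XGBoost models discussed here.

```python
def residuals(actual, predicted):
    """Per-row model errors: actual minus predicted."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [a - p for a, p in zip(actual, predicted)]

def skewness(values):
    """Population skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # all values identical: no asymmetry to measure
    return m3 / m2 ** 1.5

# Hypothetical target (1 = repeat purchaser) and regression predictions.
actual = [0, 0, 0, 0, 1]
predicted = [0.1, 0.1, 0.1, 0.1, 0.4]

errors = residuals(actual, predicted)
skew = skewness(errors)
print(errors, skew)
```

Errors that sit close to zero for most rows but far from zero for a few produce a large skewness, matching the clustered, non-normal shape described above.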
CatskillProvisions.com FEATURE: GENDER
The model that best categorized gender as a predominant feature is an SVM model with an ROC AUC score of 0.87 and lift of 2.10. As with the three previous models, the SVM model outranked all other candidate models for its feature. With an accuracy of 0.84 and a dataset that is primarily female, it is no surprise that the SVM model emerged as the strongest predictor and best classifier. A chart of ecommerce purchasers (primarily female) follows:
GENDER FEATURE
The density curve below illustrates the SVM model's ability to predict male versus female:
The SVM lift chart for gender further substantiates the gender prediction.
The SVM ROC AUC chart for gender demonstrates the model's ability to accurately predict gender. With a bigger dataset, the prediction curve would be smoother.
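Lift compares how concentrated the positive class is among the highest-scored cases with its overall rate. The sketch below uses one common definition (positive rate in the top-scored fraction divided by the base rate); Dataiku's exact calculation may differ, and the labels and scores are invented, with the minority class treated as positive.

```python
import math

def lift_at(labels, scores, fraction):
    """Positive rate among the top-scored fraction, divided by the overall positive rate."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and equal in length")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    # Rank cases from highest to lowest score and keep the top slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k
    return top_rate / base_rate

# Hypothetical data: 1 marks the minority class, scores are model outputs.
labels = [1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.3, 0.8, 0.1, 0.4, 0.35, 0.05]

top_quarter_lift = lift_at(labels, scores, 0.25)
print(top_quarter_lift)
```

Under this definition, lift cannot exceed one divided by the base rate, and it always equals 1.0 when the whole dataset is selected.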
BUSINESS RECOMMENDATIONS FOR CatskillProvisions.com
The machine learning models run on the CatskillProvisions.com dataset highlighted the strongest predictive features for the ecommerce website. With this information, CatskillProvisions.com should focus on these key features to create promotions that scale web traffic, sales conversion and revenue.
The data clearly indicates that focusing marketing efforts on the eastern US through stronger SEO campaigns, including content promotion, would improve traffic to the website. Adding SEM campaigns, both to drive traffic and to promote the top three selling products, would stimulate sales and add to the web traffic and sales mix.
However, the most important promotional effort for this website is staying in regular contact with its one-time purchasers to turn them into multi-buyers. Adding a simple CRM that uses transactional trigger data and targeted messaging will help with this effort. Email messages are simple to implement and would add to the website's traffic and sales, so long as the offers represent a strong value proposition and a reason for going back to the website. Discount promotions for repeat purchasers could also be used, or at least tested.
The upside for CatskillProvisions.com is considerable. The above recommendations represent a small start toward reversing the traffic decline and enabling growth.
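A transactional trigger of the kind suggested above can be as simple as selecting customers with exactly one order that is at least some number of days old. The sketch below is illustrative only: the field names, the sample orders and the 30-day waiting period are hypothetical, not taken from the site's data or from any particular CRM.

```python
from datetime import date

# Hypothetical order log; field names are illustrative.
orders = [
    {"email": "a@example.com", "date": date(2019, 1, 10)},
    {"email": "b@example.com", "date": date(2019, 1, 12)},
    {"email": "b@example.com", "date": date(2019, 2, 1)},
    {"email": "c@example.com", "date": date(2019, 3, 1)},
]

def one_time_followups(orders, today, wait_days=30):
    """Return emails of customers with exactly one order placed at least wait_days ago."""
    dates_by_email = {}
    for o in orders:
        dates_by_email.setdefault(o["email"], []).append(o["date"])

    due = []
    for email, dates in dates_by_email.items():
        is_one_time = len(dates) == 1
        if is_one_time and (today - dates[0]).days >= wait_days:
            due.append(email)
    return sorted(due)

due = one_time_followups(orders, today=date(2019, 3, 15))
print(due)
```

The waiting period is the parameter worth testing: too short and the message arrives before the first order has been used, too long and the customer may have forgotten the site.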