Extracting Revenue and Marketing Insights Using Customer Segmentation

Posted on Aug 29, 2022


A store can target campaigns at its customers to generate more revenue or maintain customer loyalty, for example. But which customers should it target with which campaigns? Knowing who the star customers are and which customers are disengaging also helps the store decide where to put its efforts, both in campaigns and in other areas such as the types of goods it carries and the quality of customers' in-store or online shopping experience. The purpose of this study is to examine the store's revenue and campaign effectiveness data in order to uncover such insights. The Kaggle data set of the store's customers and purchases spans 796 days (2012-2014) and consists of 2240 individual observations on

  • Customer characteristics (year of birth, education, income, number of children, etc.)
  • Products (aggregated amounts spent on wines, fruit, sweets, etc.)
  • Promotion acceptance for 5 campaigns
  • Number of purchases made on the web, through the catalog, or in store, as well as the total number of web visits

It contains about 16 missing income observations and a few extreme outliers in income and age. Using this data, I run a clustering analysis to learn more about the store's customers and offer suggestions targeted at each of the clusters. In the end, I identify two particularly important clusters, Cluster 1 and Cluster 5. Cluster 5 consists of recent high-income customers who value the in-store experience and a polished catalog. They are not interested in deals, but will respond to a well-executed campaign. When spending is normalized by length of time as a customer, this cluster brings the most revenue to the company.
Cluster 1 consists of high-income, long-standing customers who are responsible for most of the historic revenue brought to the company. Like Cluster 5, they value the in-store experience and a polished catalog, and they are not responsive to deals but will respond to a well-structured campaign.

Preliminary Visualizations

The following graph provides a glimpse at the store's main revenue drivers, namely wine and meat products.

From the graph below, it is clear that the greatest number of purchases occurs in-store, followed by web purchases. There are also many web visits that are not necessarily tied to a purchase.

While these visualizations provide some insights, a much richer analysis results from clustering analysis discussed in the following sections.

Machine Learning

Since I had no additional background on the store and its customers, I had no prior knowledge of an appropriate number of clusters for the store's customers. Machine learning can help here, as clusters can be discovered based on geometric closeness in the feature space. Since categorical features generally do not have a clear associated notion of distance, I used only numeric features for clustering. However, once cluster labels are added to the data, it is possible to look at the way categorical as well as numeric features vary among clusters. After performing missing value imputation, outlier removal, and feature scaling, I used the elbow method to choose 6 as the number of clusters.
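The preprocessing and elbow steps described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code used in this study: the column names, the median imputation, the z-score cutoff for outliers, and the range of k values are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric customer features (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(52_000, 20_000, 500),
    "total_spent": rng.gamma(2.0, 300.0, 500),
    "web_visits": rng.poisson(5, 500).astype(float),
})
df.loc[rng.choice(500, 5, replace=False), "income"] = np.nan  # a few missing incomes
df.loc[0, "income"] = 650_000                                  # one extreme outlier


def preprocess(frame, z_max=4.0):
    """Impute with the median, drop rows with any |z-score| above z_max, standardize."""
    imputed = pd.DataFrame(
        SimpleImputer(strategy="median").fit_transform(frame),
        columns=frame.columns,
    )
    z = (imputed - imputed.mean()) / imputed.std()
    kept = imputed[(z.abs() <= z_max).all(axis=1)]
    return StandardScaler().fit_transform(kept)


def elbow_inertias(X, k_values):
    """Fit KMeans for each k and return the within-cluster sum of squares."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in k_values
    ]


X = preprocess(df)
k_values = list(range(1, 10))
inertias = elbow_inertias(X, k_values)
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting the inertias against k and looking for the point where the curve flattens is the usual way to read off the elbow. On random synthetic data like this there is no meaningful elbow, so the printed values only illustrate the mechanics.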

One could also transform the long-tailed features with a log-transform before doing feature scaling. This yields results that are qualitatively similar to the ones in this analysis. In addition to KMeans, I tried DBSCAN, which did not yield useful results for this data. DBSCAN is based on the notion of cluster density, so it is good at separating high-density regions from low-density ones. It struggles with high-dimensional data, however, so perhaps reducing the data to its most important features and retrying DBSCAN could yield better results. As future work, it could also be useful to try KModes clustering to see if additional insights can be generated once categorical features are taken into account.
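As a rough sketch of the two alternatives mentioned above, the snippet below log-transforms long-tailed synthetic spending data and then runs DBSCAN on the scaled result. The data is made up, and the `eps` and `min_samples` values are illustrative guesses rather than tuned settings from this analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic long-tailed spending amounts (hypothetical stand-ins for product columns).
spend = rng.lognormal(mean=5.0, sigma=1.0, size=(300, 2))

# log1p compresses the long right tail and is defined at zero spend.
X = StandardScaler().fit_transform(np.log1p(spend))

# eps and min_samples are illustrative guesses, not tuned values.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN marks points that belong to no dense region with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

Unlike KMeans, DBSCAN does not take the number of clusters as an input, so the result depends heavily on `eps` relative to the scale of the features.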

Business Insights

Initial Observations and Revenue Analysis

From the graphs below, it is evident that while Clusters 2-4 are the largest by number of customers, Clusters 1, 5, and 0 bring in the most revenue.

These higher-spending clusters are also the highest income earners, as the following graph indicates.

Here is another look at cluster incomes:


A crucial pattern emerges:  If we look at total amounts spent by cluster, it appears that Cluster 1 is the most important, followed by Clusters 5 and 0.

However, if we normalize by time spent as a customer, Cluster 5 spends the most, followed by Clusters 1 and 0.
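The normalization can be sketched in pandas as shown below. The rows are toy values chosen only to show how the ranking can flip, and the column names and the reference date used to measure tenure are assumptions, not taken from the data set.

```python
import pandas as pd

# Toy rows with hypothetical column names; the real data set's columns may differ.
customers = pd.DataFrame({
    "cluster": [1, 1, 5, 5],
    "total_spent": [1400.0, 1600.0, 900.0, 1100.0],
    "enrolled": pd.to_datetime(
        ["2012-08-01", "2012-09-15", "2014-01-10", "2014-03-01"]
    ),
})
snapshot = pd.Timestamp("2014-07-01")  # assumed reference date for measuring tenure

# Days as a customer, then spending per day of tenure.
customers["tenure_days"] = (snapshot - customers["enrolled"]).dt.days
customers["spend_per_day"] = customers["total_spent"] / customers["tenure_days"]

summary = customers.groupby("cluster").agg(
    total_spent=("total_spent", "sum"),
    mean_spend_per_day=("spend_per_day", "mean"),
)
print(summary)
```

In this toy example the long-standing cluster has the larger total, while the more recent cluster has the larger spend per day, which is the same kind of reversal described above.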


I've explored the possible dimensions along which Clusters 1 and 5 differ (age, education, marital status, etc.), finding that the key distinguishing characteristic of Cluster 5 is length of time as customers: Cluster 5 customers are more recent. Please see the graph below to observe this difference.

A key objective for the store is to keep Cluster 5 engaged and re-engage Cluster 1. Cluster 0 is similar to Clusters 1 and 5, but has somewhat lower income and more children. I'll suggest a strategy to target it in the next section. For conciseness I'm not providing the graphs here, but Clusters 2-4 have lower income, are more likely to have children, and spend less. In what follows, I'll look at campaign effectiveness and other means of engaging the customers.


Campaign Effectiveness and Other Observations

From the graph below, it is evident that campaigns were only moderately successful: In every cluster, the majority of the customers did not accept any campaign.

Campaigns 1 and 5 were more successful with Clusters 1 and 5, indicating that the store perhaps tried to replicate parts of the first campaign with the last one.




Campaign 4 did better with Cluster 0, while other campaigns (not shown here) had a more uniform low acceptance rate.
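Per-cluster acceptance rates like the ones discussed above can be computed with a group-by, as in this minimal sketch. The rows are toy values and the column names are hypothetical; the snippet assumes one 0/1 acceptance column per campaign and one row per customer.

```python
import pandas as pd

# Toy data: one row per customer, 1 = accepted the campaign (hypothetical columns).
df = pd.DataFrame({
    "cluster":    [0, 0, 1, 1, 1, 5, 5, 5],
    "accepted_1": [0, 0, 1, 0, 0, 1, 0, 0],
    "accepted_4": [1, 0, 0, 0, 0, 0, 0, 0],
    "accepted_5": [0, 0, 1, 0, 0, 0, 1, 0],
})
campaign_cols = ["accepted_1", "accepted_4", "accepted_5"]

# The mean of a 0/1 column is the acceptance rate within each cluster.
acceptance_rate = df.groupby("cluster")[campaign_cols].mean()

# Share of customers in each cluster who accepted none of the campaigns.
accepted_none = (df[campaign_cols].sum(axis=1) == 0).groupby(df["cluster"]).mean()

print(acceptance_rate.round(2))
print(accepted_none.round(2))
```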



It would be desirable to obtain more detail on Campaigns 1, 4, and 5 in order to determine how to run improved versions of these three campaigns.

In terms of where customers make their purchases, Cluster 5 customers favor the store, followed by the catalog and the web.

Cluster 0, followed by Clusters 2-4, is the most responsive to deals, so the store should target deals at these customers. For example, if the store wishes to retire a product or create more inventory space, targeting deals at Cluster 0 would be a good strategy.

Conclusions and Future Work

The store should concentrate on running well-structured campaigns to even better engage Cluster 1 and monitor the engagement of Cluster 5. A pleasant in-store shopping experience and a polished catalog are critical, but the store should study this demographic in detail to earn even more of their business.
Cluster 0 customers have higher than average income but more children (hence probably less disposable income), and are responsive to deals and an occasional campaign. It's best to target deals at these customers.
Clusters 2-4 have lower than average income, accept deals at a higher rate, and visit the company's website. The store can target deals and website promotions at this demographic.
In the future, the store can use the suggestions above to perform A/B testing to target different deals to different customer clusters with a focus on profitability. Furthermore, it is crucial to keep Cluster 5 customers engaged (by targeted campaigns, positive in-store experience, and other solutions that are yet to be observed) and to boost future engagement of Cluster 1 customers by similar or even better targeted measures.
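One simple way to judge such an A/B test is a two-proportion z-test on acceptance rates, sketched below with the standard library only. This is an assumption about how the test might be evaluated, not something prescribed by this analysis; the counts are made up, and a test focused on profitability would need profit per customer rather than a plain acceptance rate.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Made-up counts: a control offer (A) vs. a targeted deal (B) within one cluster.
z, p = two_proportion_z_test(conv_a=30, n_a=400, conv_b=52, n_b=400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Running the comparison separately within each cluster keeps differences between clusters from being mistaken for an effect of the deal.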

For future work, it would be helpful to have more information about the data set and the store in order to answer the following questions and pursue the following next steps:

  • What do we know about the rationale behind each campaign? What distinguishes the campaigns?
  • What more can we learn about our customers? Specifically, are there factors that differentiate Cluster 5 that are not in the data?
  • What else can be learned about how the store makes customers' in-store and online shopping experience pleasant? Are there ways to improve?
  • Can the store learn to do profitable business with Clusters 2-4?
  • Do A/B testing to judge the effectiveness of recommendations
  • Get more granular data on store purchases
  • Get more data on profitability rather than just revenue, as profitability is the store's key objective

About Author

Dmitriy Popov-Velasco

I'm a recent NYC DSA/fastai graduate with a background in math, economics, and education, holding graduate degrees in these areas. I'm passionate about helping others and solving practical problems!