Market Basket Analysis - Instacart Dataset

Posted on Oct 26, 2020

Market basket analysis is one of the key techniques large retailers use to uncover associations between items. It is how companies like Instacart boost their sales by predicting the products their customers may purchase next. Instacart, a grocery ordering and delivery service, lets users place grocery orders through its website or app, which are then fulfilled by a personal shopper.

In 2017, Instacart open-sourced an anonymized dataset containing a sample of over 3 million grocery orders from more than 200,000 Instacart users. Instacart currently uses transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. The users are anonymized, and there is no demographic data such as gender or age. There is a field for the day of the week and the hour of day each order was placed, and a relative measure of time between orders. The dataset is a relational set of files describing customers’ orders over time.

The goal of this analysis is to examine the variables that describe customer buying patterns before moving on to inferential analysis. To keep the code fast and efficient to run, I’m using only 10% of the data.
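The post does not say how the 10% subset was drawn. One reasonable way, sketched here as an assumption, is to sample 10% of users and keep all of their orders so that each customer's history stays intact. The function name `sample_users`, the seed, and the toy frame are hypothetical; `user_id` and `order_id` are column names from the public Instacart release.

```python
import pandas as pd


def sample_users(orders: pd.DataFrame, frac: float = 0.10, seed: int = 42) -> pd.DataFrame:
    """Keep every order belonging to a random fraction of users (assumed sampling scheme)."""
    users = orders["user_id"].drop_duplicates()
    kept = users.sample(frac=frac, random_state=seed)
    return orders[orders["user_id"].isin(kept)]


# Made-up stand-in for orders.csv: 20 users with 2 orders each.
toy_orders = pd.DataFrame({
    "order_id": range(1, 41),
    "user_id": [i // 2 for i in range(40)],
})

sample = sample_users(toy_orders)
print(sample["user_id"].nunique(), "users,", len(sample), "orders")
```

Sampling by user rather than by row avoids cutting a customer's order sequence in half, which matters for anything that depends on reorders or time between orders.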

First, let’s understand the data

The dataset is a set of relational files. There are six data tables in total, all in CSV format:

aisles, departments, order_products_prior, order_products_train, orders, and products

The data types used in my exploratory analysis are numerical and string. They are as expected and do not need to be converted. Just over 5% of the values for the days since the last order are missing. However, as explained in the variable descriptions, NA corresponds to order_number 1 for that customer: a first order has no prior order to measure from.

[Tables: first five rows of each CSV file used in the analysis]
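The missing-value claim can be checked directly. The sketch below assumes the orders table has the columns `order_number` and `days_since_prior_order`, as in the public Instacart release; the helper names and the toy rows are made up for illustration.

```python
import pandas as pd


def missing_share(orders: pd.DataFrame) -> float:
    """Fraction of orders with no value for days since the prior order."""
    return float(orders["days_since_prior_order"].isna().mean())


def missing_only_on_first_order(orders: pd.DataFrame) -> bool:
    """True when a value is missing exactly on rows where order_number is 1."""
    is_missing = orders["days_since_prior_order"].isna()
    is_first = orders["order_number"] == 1
    return bool((is_missing == is_first).all())


# Made-up rows standing in for orders.csv.
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [float("nan"), 7.0, 14.0, float("nan"), 30.0],
})

print(missing_share(toy_orders))
print(missing_only_on_first_order(toy_orders))
```

If the second check holds on the real data, the missing values carry meaning (a first order) and should not be imputed or dropped blindly.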

Exploratory Data Analysis 

To understand the buying patterns of Instacart’s customers, we need to explore each variable, since together they give an overall view of the data.

On which day do people place the most orders?

There are significantly more orders on days 0 and 1. The dataset does not say which values represent which day of the week, so it is not clear whether day 0 is Saturday or Sunday. However, we can assume that days 0 and 1 are the weekend, when customers typically do their weekly grocery shopping. The remaining days differ little from one another.
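A count of orders per day value is enough to reproduce this kind of comparison. The sketch assumes the day-of-week column is named `order_dow` with values 0 to 6, as in the public release; the toy rows are made up and do not reflect the real distribution.

```python
import pandas as pd


def orders_per_day(orders: pd.DataFrame) -> pd.Series:
    """Number of orders for each day-of-week value 0..6, including days with none."""
    counts = orders["order_dow"].value_counts().reindex(range(7), fill_value=0)
    counts.index.name = "order_dow"
    return counts


# Made-up rows standing in for orders.csv.
toy_orders = pd.DataFrame({
    "order_id": range(8),
    "order_dow": [0, 0, 0, 1, 1, 3, 5, 6],
})

per_day = orders_per_day(toy_orders)
share = per_day / per_day.sum()  # proportion of all orders per day
print(share.round(3))
```

The same pattern applies to the hour-of-day column (`order_hour_of_day` in the public release) by swapping the column name and reindexing over `range(24)`.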

At what hour do people place the most orders?

The volume of orders increases from early morning until 4 pm. This insight can help the company have more shoppers available during this period. It also shows when it matters most that the website and the app are free of usability issues.

In which part of the day do people order the most?

There is a common pattern across all days for each part of the day: the distribution is mostly the same from one day to the next.
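The post does not define its parts of the day, so the sketch below assumes four six-hour bins (night, morning, afternoon, evening) and compares their shares across days. The bin edges, labels, and toy rows are assumptions; `order_dow` and `order_hour_of_day` are column names from the public release.

```python
import pandas as pd


def part_of_day(hours: pd.Series) -> pd.Series:
    """Bin hours 0..23 into four labelled parts of the day (assumed boundaries)."""
    return pd.cut(
        hours,
        bins=[0, 6, 12, 18, 24],
        right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
        labels=["night", "morning", "afternoon", "evening"],
    )


# Made-up rows standing in for orders.csv.
toy_orders = pd.DataFrame({
    "order_dow": [0, 0, 0, 0, 1, 1, 1, 1],
    "order_hour_of_day": [1, 9, 13, 20, 10, 15, 14, 21],
})

parts = part_of_day(toy_orders["order_hour_of_day"]).astype(str)
# Each row is a day; each cell is that part of the day's share of the day's orders.
table = pd.crosstab(toy_orders["order_dow"], parts, normalize="index")
print(table)
```

Normalizing by row makes days with different order volumes comparable, which is what "the distribution is mostly the same" is about.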

How many products do people usually order?

From the right-skewed distribution, we can observe that people usually order around 5-8 products. What could be the reason that customers order so few products at a time? Instacart can look into various ways to increase the number of products per order by fulfilling most of its customers’ grocery needs.
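Basket size is the number of product rows per order. A minimal sketch, assuming an order-products table with one row per (`order_id`, `product_id`) pair as in the public release; the toy rows are made up.

```python
import pandas as pd


def basket_sizes(order_products: pd.DataFrame) -> pd.Series:
    """Number of products in each order."""
    return order_products.groupby("order_id")["product_id"].count()


# Made-up rows standing in for the order-products table.
toy_order_products = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 3, 3, 3, 3, 3],
    "product_id": [10, 11, 12, 10, 10, 11, 12, 13, 14],
})

sizes = basket_sizes(toy_order_products)
print(sizes.describe())  # for a right-skewed distribution the mean exceeds the median
```

For a skewed distribution the median and the quartiles describe a "usual" basket better than the mean does.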

Reordered or not, and which items are repeated?

Looking at the above plot, we can observe that around 60% of ordered products are reorders and 40% are first-time orders. The table shows that dairy products are among the top 10 repeat products. This insight is important for Instacart in deciding what actions it can take to increase the percentage of reorders. It can be further leveraged to predict the next product in the customer’s cart.
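Both numbers come from the reorder flag. The sketch assumes a 0/1 column named `reordered` in the order-products table and a products table with `product_id` and `product_name`, as in the public release; the helper names and the toy rows are made up.

```python
import pandas as pd


def reorder_share(order_products: pd.DataFrame) -> float:
    """Fraction of ordered items flagged as reorders."""
    return float(order_products["reordered"].mean())


def top_reordered(order_products: pd.DataFrame, products: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Products with the most reorder rows, most reordered first."""
    reordered = order_products[order_products["reordered"] == 1]
    counts = reordered.groupby("product_id").size().rename("reorders").reset_index()
    named = counts.merge(products[["product_id", "product_name"]], on="product_id", how="left")
    return named.sort_values("reorders", ascending=False).head(n).reset_index(drop=True)


# Made-up rows standing in for the order-products and products tables.
toy_order_products = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 3, 3],
    "product_id": [1, 2, 1, 3, 1, 2],
    "reordered": [0, 0, 1, 0, 1, 1],
})
toy_products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Whole Milk", "Yogurt", "Bread"],
})

print(reorder_share(toy_order_products))
print(top_reordered(toy_order_products, toy_products))
```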

What are the best-selling products? Top 10

It is interesting to note that Instacart’s top-selling products are fresh fruits and vegetables. Products from other aisles and departments did not make it into the top 10, and customers showed a preference for organic produce. Almost 8,000 products were ordered only once. What could be the reason behind such a low count? Are they marked up well above in-store prices?
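The top sellers and the products ordered only once both fall out of a single frequency count. A minimal sketch, assuming the `product_id` and `product_name` columns of the public release; the helper name and the toy rows are made up.

```python
import pandas as pd


def product_counts(order_products: pd.DataFrame, products: pd.DataFrame) -> pd.Series:
    """Times each product was ordered, indexed by product name, most ordered first."""
    names = products.set_index("product_id")["product_name"]
    counts = order_products["product_id"].value_counts()
    counts.index = counts.index.map(names)
    return counts


# Made-up rows standing in for the order-products and products tables.
toy_order_products = pd.DataFrame({"product_id": [1, 1, 1, 2, 2, 3, 4]})
toy_products = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "product_name": ["Banana", "Organic Strawberries", "Saffron", "Truffle Oil"],
})

counts = product_counts(toy_order_products, toy_products)
print(counts.head(10))               # best sellers
print(int((counts == 1).sum()))      # number of products ordered exactly once
```

On a 10% sample, "ordered once" can partly be a sampling effect, so it is worth re-running this count on the full data before drawing conclusions about pricing.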

Which departments and aisles are popular?

From the above plot, we can observe that certain departments are clearly more popular. The produce department contributed far more sales than the other departments.
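Department popularity needs the relational joins: order rows to products, then products to departments. The sketch assumes the key columns `product_id` and `department_id` and a `department` name column, as in the public release, and it counts ordered items, since the tables named in this post are about orders rather than revenue. The toy rows are made up.

```python
import pandas as pd


def department_counts(
    order_products: pd.DataFrame,
    products: pd.DataFrame,
    departments: pd.DataFrame,
) -> pd.Series:
    """Number of ordered items per department, most popular first."""
    merged = (
        order_products
        .merge(products[["product_id", "department_id"]], on="product_id")
        .merge(departments, on="department_id")
    )
    return merged["department"].value_counts()


# Made-up rows standing in for the three tables.
toy_order_products = pd.DataFrame({"order_id": [1, 1, 2, 2, 3], "product_id": [1, 2, 1, 3, 2]})
toy_products = pd.DataFrame({"product_id": [1, 2, 3], "department_id": [4, 4, 16]})
toy_departments = pd.DataFrame({"department_id": [4, 16], "department": ["produce", "dairy eggs"]})

by_department = department_counts(toy_order_products, toy_products, toy_departments)
print(by_department)
```

Swapping the departments table for the aisles table (keyed on `aisle_id` in the public release) gives the aisle-level view in the same way.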


Exploring each variable in the dataset for the descriptive analysis has laid the foundation for an in-depth analysis of customers’ purchasing patterns. This process is crucial for understanding the business. From our Instacart analysis, we can summarize our insights and recommend further actions for better customer engagement and profitability.

Since the number of products per order mostly stays in the range of 4-8, there is a lot of room for improvement. To encourage customers to add more to their carts, Instacart can recommend products related to those already in their baskets or to items from their past orders. Instacart customers tend to purchase either weekly or monthly. How could Instacart improve customer loyalty? Ensuring the website is intuitive and easy to use is imperative to making sure customers complete their transactions and return to shop again. Extracting these insights and knowing which items are most frequently purchased is the first step for Instacart to optimize its software product and recommend items to customers while they shop.
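Recommending products related to what is already in the basket is what association rules quantify. This post stops at the descriptive stage, so the following is only a sketch of that next step, using the standard definitions of support, confidence, and lift for item pairs. The function name `pair_rules` and the toy baskets are made up and are not results from the dataset.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.0):
    """Support, confidence and lift for every rule 'if x then y' over item pairs.

    support(x, y)     = share of baskets containing both x and y
    confidence(x -> y) = baskets with both / baskets with x
    lift(x -> y)       = confidence(x -> y) / share of baskets containing y
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            rules.append({"if": x, "then": y, "support": support,
                          "confidence": confidence, "lift": lift})
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


# Made-up baskets; on the real data each basket would be the products of one order.
toy_baskets = [
    ["banana", "milk"],
    ["banana", "milk", "bread"],
    ["banana", "bread"],
    ["milk"],
]

for rule in pair_rules(toy_baskets, min_support=0.5):
    print(rule)
```

A lift above 1 means the two items appear together more often than their individual popularity would predict, which is the signal a "customers also bought" suggestion relies on. Counting all pairs grows quickly with basket size, so on the full dataset an algorithm such as Apriori or FP-Growth with a minimum support threshold is the usual choice.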
