Data Exploration and Analysis of eCommerce Transactions

Posted on May 23, 2021

GitHub RepoLinkedIn

Inspired by the rapid acceleration in the shift from traditional brick-and-mortar to eCommerce shopping brought on by the worldwide Covid-19 pandemic, I wanted to explore the transaction-level data of an eCommerce retailer to pull out useful insights into consumer behavior & what those insights suggest about the business at large.

For this project, I am approaching the analysis as an outsider would to the data; I do not have any background information on the retailer or its customer base, so I am aiming to answer the question of the overall health of the business and possible areas of risk or opportunity. 


The data that I am working with came from Kaggle, and is comprised of 541,909 observations of an eCommerce retailer's transactions from late-2010 through late-2011. Each observation has the following attributes & metrics: Invoice Number, Product Stock Code, Product Description, Product Quantity ordered [aggregated by product, e.g. six units of the same product in one order will show in one row, not six], TimeStamp of Order, Product Unit Price, Customer ID, and Shipping Country. 

Research Questions & Analysis

I had several main questions going into my analysis:

  • What can I infer about the high-level seasonality of the retailer?
  • What do the product lifecycles look like? 
    • How does markdown and promotional pricing affect consumer behavior?

    • What insight can I glean about product times on offer and launches of new product?

  • Is there a ‘drop-off’ at a certain point in a product’s lifetime? If so, can wasteful inventory cost be eliminated by buying to sell through by the start of the drop-off?
  • Is a disproportionate percent of the business coming from the top products?

Data Exploration & Cleaning

The dataset was fairly complete, with little missing data (the retailer likely has 'required fields' on their checkout page). The first action I took with the data was converting the Order Timestamp into more typical time attributes retailers use when looking at business. The first time grouping I used was converting to week numbers, using the National Retail Federation 454 Calendar for years 2010 and 2011. I also aggregated using the standard calendar month & year for bigger picture views of the business. 

To get the revenue associated with each order item, I multiplied the unit price and ordered quantity, and created a Boolean flag representing whether the transaction was a sale (positive sales units) or a return (negative sales units).


First, I wanted to see the Net Sales & Revenue by calendar month, which will be the best way to initially see high-level variations and trends in the business. 

X-axis: Month of Year
Y-axis: Net Sales

Net Sales by Month of Year

Charting the month of the year along the X-axis and Net Revenue (blue line) and Net Sales units (red line) on the Y-axis, I saw that sales are quite concentrated in the back half of the year. This is not unusual for any retailer, but a bit more pronounced here than the standard (Standard is 45% Front Half/55% Back half sp

lit, this retailer's is 37%/63%).

The delta between each month’s Revenue number and the Unit sales is the AUR or average unit retail price by month. This delta is also highest in November. 

The peak sales month is November, with a dip going into December and a significant debuild from December into January. A dip in sales from December into January is expected, but the debuild seen here is particularly pronounced. Going solely off of the product descriptions, the assortment does seem to be focused around gifting and party-related products, like decorations, so that likely explains the spike around the holidays. 

Can we dig into the debuild from December to January and learn anything? Is timing of when returns are actualizing a factor?

X-axis: Month of Year
Y-axis: Sales/Return Values

Gross Sales & Returns

Next, I wanted to look at the gross sales and returns separately to get a sense of how long after sales occur do return transactions get processed into the system--returns actually spike in December (from November sales) and the January returns are certainly higher than the yearly average, but have a more pronounced effect on net sales due to the drop in gross sales.

One surprise coming from this was that I had not expected to see demand drop from January to February. This can likely be attributed to New Years’ promotions and sales--you can see how close the Revenue and Unit numbers are to one another in January when comparing to later months, indicating a low average unit price, likely due to promotions.

Another item of note--gross sales in December barely debuilt from November, but the huge spike in returns from November sales hurt December net sales after returns actualize.

X-axis: Month of Year
Y-axis: Return Rate (% To Gross Sales)

Return Rate (as a percent to total sales)

When looking at the Return Rate (as a percent to total gross sales),  it is easy to see how significant an impact the returns from sales months November & December have on the months that come after them. 

Spikes in the return rate are occurring in months that typically have promotions--November & December holiday sales and Semi-Annual/End of Season Sales in June. 

X-axis: Month of Year
Y-axis: Sales Value

Regular Price and Markdown Product Performance

Next I wanted to get a better sense of the complexion of the products that made up the sales--namely, how much was being driven from regular versus markdown or promotional pricing. 

In order to arrive at the final result of a transaction-level product markdown flag, I isolated the Stock Code and Unit Price, removed duplicates and then sorted by stock code and then unit price, highest to lowest. If a certain stock code only had one unit price associated with it, that price was assumed to be the Full/Regular Price.

For stock codes with multiple unit prices across separate transactions, I treated the highest unit price as the Full/Regular Price, and each of the lower prices as Markdown or Promotional prices. From this I also noted if a product was ever marked down or promo’d in the year or not, for use in a later analysis.

Due to not actually having the product’s original retail price and needing to back into the Regular Price/Markdown flag, I am likely overestimating the Regular Price transactions--despite this, my calculated Regular Price sales are only 6% on the year. This likely comes from the retailer training their customer to not buy product at full price.

If customers know that they can eventually get a product at a low price from the retailer, or if the retailer tends to always have a promotion running, there is little incentive to buy at full price. This is a dangerous cycle for the long-term health and sustainability of a retailer, and has significant implications for gross margins.

X-axis: Week of Year
Y-axis: Sales Revenue (in Great British Pounds)

Top 25 Products Weekly Performance 

Next I looked at weekly performance for the Top 25 products (to determine the Top 25, I aggregated total year sales). Cyclical peaks and valleys emerged for most of the products month to month, indicating a monthly traffic driver for customers to the site of some sort--possibly a catalog, or email blast, with the highest volume spike in holiday months. There are a few non-product items, like shipping, included in the top revenue-driving line items, so to get a better sense of product performance, I would like to have a list of product-only line items from the retailer's assortment.

I saw two interesting takeaways from this view: First the fairly clear monthly cycle for most of the products in the top 25. Something is causing a large increase in traffic and sales to the site every month, causing a huge increase in sales during one week of the month, and then very quickly sales decrease to the normal weekly pattern. This same cycle does not hold for the last two months of the year, however, where the spike in sales lasts for longer than one week.

My second surprising takeaway from this graph was the fairly ‘evergreen’ nature of the assortment--this could be because I am only looking at the top 25 products on the year level, but I expected there to be seasonal products that had one or two big months of sales and then drop off completely, but that seems to not be the case. The assortment seems to be fairly consistent throughout the year, and drive fairly consistent sales. 

For reference, the Top 25 represents 16% of total revenue on the year, and 8% of sales units (.6% of Stock Codes). 

X-axis: Net Sales Revenue (both positive and negative)
Y-axis: Net Sales Units

Scatter Plot for all Products

Charting Net Revenue on the positive and negative X-axis and Sales Units on the Y-axis shows a cluster of products grouped in the first quadrant, where products seem to be grouping around an area that represents unit sales being higher than the net revenue, which indicates the business relies on unit drivers & a high unit count per transaction to drive the business.

Outliers on this graph are often non-inventory items & fees--the plot point all the way to the left on the X-axis is an Amazon Fee, and the plot point on the X-axis in the first quadrant is Postage Charge.

I’d love to get more data for these products to do a proper SKU rationalization, grouping into Low and High Volume Sale unit products, and Low and High volume margin products to get a sense of what sort of products and categories can be culled from the assortment.

X-axis: Week of Product's Life (1=Launch Week)
Y-axis: Avg. Sales per Week

Product Lifecycle

To find the performance of a product over the course of its product lifecycle, I found the first week that a stock code showed up in transactions--I ignored the first 8 weeks of transactions, assuming that I couldn’t count anything as being launched in the first 8 weeks of transactions since there was no way to differentiate between a product being launched and product that had been previously launched but I simply didn’t have the transactions yet. January product launches are also pretty small when comparing to the rest of the year, so I figured not much was lost.

Charting the lifecycle of all products, I expected there to be a much shorter average timeline for products. 8 to 12 weeks is fairly standard, followed by a steep decline where remaining inventory was marked down and slowly sold off after its peak seasonality. This retailer’s assortment is proving to be very steady--there is a spike after the product introduction, and sales stay steady until around week 31. 

The drop off that is seen in week 31 could very well be seasonality driven, and I believe that to be the case--I would like to have more data to either confirm or reject that hypothesis. 

Results and Conclusion

  • Finishing this analysis, I have a much better understanding of the eCommerce retailer’s business:
    • Sales are heavily concentrated in the back half of the year, with demand peaking in November, and high returns in both December and January
    • Customers are quite price-resistant, with only 6% of yearly sales coming from regular price product
    • Product lifecycle is quite consistent from launch, and based on the transaction data that I analyzed, sales do not drop off until around week 31 after the product launch--I would like to dig into this further with transaction data across multiple years to confirm, as the drop off could be seasonality-driven
    • Comparing all products sales performances, there doesn’t seem to be a trend as far as what products could be eliminated without major effect on the business, but a deeper analysis on that subject would require Inventory, Margin, and Buy data, as well as Categorical information for all products
    • Since the lifecycle is so consistent and flat week to week, inventory liability should not be much of an issue for the retailer, as most products tend to drive consistent sales regardless of the month of the year

Thank you for reading! Hopefully this post has been informative--feel free to reach out to me on LinkedIn!

The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

About Author

Alexander Pinkerton

Extensive background leading merchandise planning, allocations, and buying functions, excited about the intersection of Data Science and consumer retail to drive the future of smarter decision making. NYC Data Science Academy Student Cohort 025 WashU '12 Bachelors of...
View all posts by Alexander Pinkerton >

Related Articles

Leave a Comment

Data Science Training in Hyderabad May 26, 2021
It is truly a well-researched content and excellent wording. I got so engaged in this material that I couldn't wait reading. I am impressed with your work and skill. Thanks. Data Science Training in Hyderabad Data Science Course in Hyderabad

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI