Building an Automated Data Pipeline for Retail Trade Survey Data

Brian Kuo

Posted on Apr 21, 2023

Please see my Github repo for details on the code and pipeline set up.

Introduction

The objective of this project is to build an ETL pipeline that could support the analysis of the Monthly Retail Trade Survey (MRTS) data. The data is stored in an Excel file by year in sheets that shows sales data of different business types by month. Since there are additional titles and notes on top of the sheets, table field names are not located there.

To build the process, the data is first imported and cleaned to ensure accuracy before conducting a detailed analysis of specific types of business sales. To achieve this, various techniques, such as trend analysis, percentage change, and rolling time window, are implemented. Visual aids are also utilized to facilitate a deeper understanding of the findings. By implementing these techniques, the project aims to provide a more comprehensive understanding of the MRTS data and its implications for retail businesses.

Data

The Monthly Retail Trade Survey (MRTS) is a questionnaire designed to provide current insights into the sales and stock levels of retail and food service stores. The data collected from MRTS is used by a range of organizations, including the Bureau of Economic Analysis, Bureau of Labor Statistics, media outlets, trade and professional organizations, and private businesses. The statistics derived from MRTS are available on a national level only, and respondents' data is kept confidential. The survey specifically targets businesses that fall under the Retail Trade and Food Services sectors, as defined by the North American Industry Classification System (NAICS). NAICS is the standard system used by federal statistical agencies for business classification purposes.

Extract-Transform-Load

The following are the key steps to perform the ETL process:

From the US Census Bureau site, the latest excel file is downloaded
Excel file is loaded into Python and saved to MySQL Workbench for storage
- Database creation from MySQL Workbench
- The sqlalchemy module is used to write to MySQL Workbench
- Each tab (year) is imported as a table
From MySQL Workbench, the data is load back into python for data cleaning and transformation
- The mysql.connector module is used to read from MySQL Workbench
- Each table is read into a data frame transposed, and concatenated together
- Perform data cleaning on resulting data frame
A single data frame with all of the sales data since the first year of tracking is generated at the end of the process.

Data Exploration

Explore Trends

To break down the business categories a bit further, the following cells looks into the big categories of sales and plots out the sales vs. time line graph:

Picking out one specific retail business type to drill down in the following cells, the clothing business is broken down into subcategories:

Women’s clothing store
Men’s clothing store
Family clothing store
Other clothing store

From the trend line below, we can see the blue line is the overall clothing sell on top. The other findings:

The family clothing store is the primary subcategory that is driving sales, since the other subcategories are relatively steady in the past 30 years of tracking.
Obvious seasonality can be seen annually with a general increasing upward trend in sales overall.
An unusual dip in sales number for all categories occurs about March of 2020 that could be further researched. That was likely caused by the COVID-19 outbreak. However, the general trend is still consistent with yearly patterns that have a peak at the end of the year.

Explore Percentage Change

With respect to percentage changes observed in women's clothing and men's clothing businesses, these were the findings:

Women’s clothing sales have a similar trend and seasonality as men’s clothing sales, but the sale is a lot higher for women.
The general percentage of men’s and women’s clothing out of all clothing sales has been decreasing due to the increase in sales of family clothing.
All clothing sale contributions are less than 1% of overall (all business categories) sales.
Pre-COVID time (before beginning of 2020): men’s clothing sales made up roughly between 5%-20% of all clothing sales, while women were at 20-30% of clothing sales.
Post-COVID: men's clothing sales fell to almost less than 5% of clothing sales, though it has been gradually climbing back since the end of 2021.

Explore Rolling Time Windows

Exploring trends with rolling time window for the clothing and clothing accessory store business category:

The rolling() function is utilized to create a rolling window of 3 months’ average sales. I combined the actual sales data for comparison in a line graph.
The rolling average sales smoothes out the trend when compared to the actual sales data.

The resample() function is utilized to aggregate data to quarterly data. A similar 3 months’ rolling average sales is pulled in for comparison in side to side bar charts
The rolling time window is slightly lagging behind the actual sales data. It also stays within a smaller range of value comparable and so is more resistant to extreme values.

Conclusion

With the automated pipeline constructed for the initial excel sheet, the same analysis could be generated within minutes. The simple visualizations here demonstrates that the ETL pipeline provides an analysis-ready data frame for future analysis. Similar setup could be used in other excel and CSV files for quick automated analysis.

References

Moreh, Jack. "Financial Graph - Capital Markets." stockvault.net. Page Link
"Extracting Data from Excel Files." ehmatthes.github.io. Page Link
"Use openpyxl - Convert to DataFrame in Pandas." soudegesu.com. October 13, 2018. Page Link

About Author

Brian Kuo

A consultant and data engineering professional with 6+ years of experience in business intelligence, warehousing solutions, ETL, and project management. Enjoys collaboration in team projects to support multi-disciplinary stakeholders and generate valuable results that align with business goals.

View all posts by Brian Kuo >

Capstone

Catching Fraud in the Healthcare System

Capstone

The Convenience Factor: How Grocery Stores Impact Property Values

Capstone

Acquisition Due Dilligence Automation for Smaller Firms

Machine Learning

Pandemic Effects on the Ames Housing Market and Lifestyle

Machine Learning

The Ames Data Set: Sales Price Tackled With Diverse Models

No comments found.

Building an Automated Data Pipeline for Retail Trade Survey Data

Introduction

Data

Extract-Transform-Load