Building an Automated Data Pipeline for Retail Trade Survey Data
Please see my Github repo for details on the code and pipeline set up.
Introduction
The objective of this project is to build an ETL pipeline that could support the analysis of the Monthly Retail Trade Survey (MRTS) data. The data is stored in an Excel file by year in sheets that shows sales data of different business types by month. Since there are additional titles and notes on top of the sheets, table field names are not located there.
To build the process, the data is first imported and cleaned to ensure accuracy before conducting a detailed analysis of specific types of business sales. To achieve this, various techniques, such as trend analysis, percentage change, and rolling time window, are implemented. Visual aids are also utilized to facilitate a deeper understanding of the findings. By implementing these techniques, the project aims to provide a more comprehensive understanding of the MRTS data and its implications for retail businesses.
Data
The Monthly Retail Trade Survey (MRTS) is a questionnaire designed to provide current insights into the sales and stock levels of retail and food service stores. The data collected from MRTS is used by a range of organizations, including the Bureau of Economic Analysis, Bureau of Labor Statistics, media outlets, trade and professional organizations, and private businesses. The statistics derived from MRTS are available on a national level only, and respondents' data is kept confidential. The survey specifically targets businesses that fall under the Retail Trade and Food Services sectors, as defined by the North American Industry Classification System (NAICS). NAICS is the standard system used by federal statistical agencies for business classification purposes.
Extract-Transform-Load
The following are the key steps to perform the ETL process:
- From the US Census Bureau site, the latest excel file is downloaded
- Excel file is loaded into Python and saved to MySQL Workbench for storage
- Database creation from MySQL Workbench
- The sqlalchemy module is used to write to MySQL Workbench
- Each tab (year) is imported as a table
- From MySQL Workbench, the data is load back into python for data cleaning and transformation
- The mysql.connector module is used to read from MySQL Workbench
- Each table is read into a data frame transposed, and concatenated together
- Perform data cleaning on resulting data frame
- A single data frame with all of the sales data since the first year of tracking is generated at the end of the process.
Data Exploration
Explore Trends
To break down the business categories a bit further, the following cells looks into the big categories of sales and plots out the sales vs. time line graph:
Picking out one specific retail business type to drill down in the following cells, the clothing business is broken down into subcategories:
- Womenโs clothing store
- Menโs clothing store
- Family clothing store
- Other clothing store
From the trend line below, we can see the blue line is the overall clothing sell on top. The other findings:
- The family clothing store is the primary subcategory that is driving sales, since the other subcategories are relatively steady in the past 30 years of tracking.
- Obvious seasonality can be seen annually with a general increasing upward trend in sales overall.
- An unusual dip in sales number for all categories occurs about March of 2020 that could be further researched. That was likely caused by the COVID-19 outbreak. However, the general trend is still consistent with yearly patterns that have a peak at the end of the year.
Explore Percentage Change
With respect to percentage changes observed in women's clothing and men's clothing businesses, these were the findings:
- Womenโs clothing sales have a similar trend and seasonality as menโs clothing sales, but the sale is a lot higher for women.
- The general percentage of menโs and womenโs clothing out of all clothing sales has been decreasing due to the increase in sales of family clothing.
- All clothing sale contributions are less than 1% of overall (all business categories) sales.
- Pre-COVID time (before beginning of 2020): menโs clothing sales made up roughly between 5%-20% of all clothing sales, while women were at 20-30% of clothing sales.
- Post-COVID: men's clothing sales fell to almost less than 5% of clothing sales, though it has been gradually climbing back since the end of 2021.
Explore Rolling Time Windows
Exploring trends with rolling time window for the clothing and clothing accessory store business category:
- The rolling() function is utilized to create a rolling window of 3 monthsโ average sales. I combined the actual sales data for comparison in a line graph.
- The rolling average sales smoothes out the trend when compared to the actual sales data.
- The resample() function is utilized to aggregate data to quarterly data. A similar 3 monthsโ rolling average sales is pulled in for comparison in side to side bar charts
- The rolling time window is slightly lagging behind the actual sales data. It also stays within a smaller range of value comparable and so is more resistant to extreme values.
Conclusion
With the automated pipeline constructed for the initial excel sheet, the same analysis could be generated within minutes. The simple visualizations here demonstrates that the ETL pipeline provides an analysis-ready data frame for future analysis. Similar setup could be used in other excel and CSV files for quick automated analysis.
References
- Moreh, Jack. "Financial Graph - Capital Markets." stockvault.net. Page Link
- "Extracting Data from Excel Files." ehmatthes.github.io. Page Link
- "Use openpyxl - Convert to DataFrame in Pandas." soudegesu.com. October 13, 2018. Page Link