Building an Automated Data Pipeline for Retail Trade Survey Data

Posted on Apr 21, 2023

Please see my Github repo for details on the code and pipeline set up.

1. Introduction

The objective of this project is to build an ETL pipeline that could support the analysis of the Monthly Retail Trade Survey (MRTS) data. The data is stored in an Excel file by year in sheets that shows sales data on different business types by month. Since there are additional titles and notes on top of the sheets, table field names are not at the top of the sheet.

To build the process, the data is first imported and cleaned to ensure accuracy before conducting a detailed analysis of specific types of business sales. To achieve this, various techniques such as trend analysis, percentage change, and rolling time window are implemented. Visual aids are also utilized to facilitate a deeper understanding of the findings. By implementing these techniques, the project aims to provide a more comprehensive understanding of the MRTS data and its implications for retail businesses.

 

2. Data

The Monthly Retail Trade Survey (MRTS) is a questionnaire designed to provide current insights into the sales and stock levels of retail and food service stores. The data collected from MRTS is used by a range of organizations, including the Bureau of Economic Analysis, Bureau of Labor Statistics, media outlets, trade and professional organizations, and private businesses. The survey specifically targets businesses that fall under the Retail Trade and Food Services sectors, as defined by the North American Industry Classification System (NAICS). NAICS is the standard system used by federal statistical agencies for business classification purposes. The statistics derived from MRTS are available on a national level only, and respondents' data is kept confidential. Additionally, any gift certificates or cards are counted towards retail sales at the time they are redeemed.

 

3. Extract-Transform-Load

The following are the key steps to perform the ETL process:

  1. From the US Census Bureau site, the latest excel file is downloaded
  2. Excel file is loaded into Python and saved to MySQL Workbench for storage
    • Database creation from MySQL Workbench
    • The sqlalchemy module is used to write to MySQL Workbench
    • Each tab (year) is imported as a table
  3. From MySQL Workbench, the data is load back into python for data cleaning and transformation
    • The mysql.connector module is used to read from MySQL Workbench
    • Each table is read into a data frame transposed, and concatenated together
    • Perform data cleaning on resulting data frame
  4. A single data frame with all of the sales data since the first year of tracking is generated at the end of the process.

4. Data Exploration

Explore Trends

To break down the business categories a bit further, the following cells looks into the big categories of sales and plots out the sales vs. time line graph:

Picking out one specific retail business type to drill down in the following cells, the clothing business is broken down into subcategories:

  • Women’s clothing store
  • Men’s clothing store
  • Family clothing store
  • Other clothing store

From the trend line below, we can see the blue line is the overall clothing sell on top. The other findings:

  • The family clothing store is the primary subcategory that is driving sales, since the other subcategories are relatively steady in the past 30 years of tracking.
  • Obvious seasonality can be seen annually with a general increasing upward trend in sales overall.
  • There is an unusual dip in sales number for all categories in about March of 2020 that could be further researched, but it is potentially caused by the happening of COVID-19. However, the general trend is still the same with the peak appearing at the end of each year.

 

Explore Percentage Change

When considering the women's clothing and men's clothing businesses and their percentage change. The following are the findings:

  • Women’s clothing sales have a similar trend and seasonality as men’s clothing sales, but the sale is a lot higher for women.
  • The general percentage of men’s and women’s clothing out of all clothing sales has been decreasing due to the increase in sales of family clothing.
  • All clothing sale contributions are less than 1% of overall (all business categories) sales.
  • Pre-COVID time (before beginning of 2020): men’s clothing sales made up roughly between 5%-20% of all clothing sales, while women were at 20-30% of clothing sales.
  • Post-COVID: men's clothing sales fell to almost less than 5% of clothing sales, though it has been gradually climbing back since the end of 2021.

 

Explore Rolling Time Windows

Exploring trends with rolling time window for the clothing and clothing accessory store business category:

  • The rolling() function is utilized to create a rolling window of 3 months’ average sales. I combined the actual sales data for comparison in a line graph.
  • The rolling average sales smoothes out the trend when compared to the actual sales data.

 

  • The resample() function is utilized to aggregate data to quarterly data. A similar 3 months’ rolling average sales is pulled in for comparison in side to side bar charts
  • The rolling time window is slightly lagging behind the actual sales data. It also stays within a smaller range of value comparable and so is more resistant to extreme values.

5. Conclusion

With the automated pipeline constructed for the initial excel sheet, the same analysis could be generated within minutes. The simple visualizations here demonstrates that the ETL pipeline provides an analysis-ready data frame for future analysis. Similar setup could be used in other excel and CSV files for quick automated analysis.

6. References

  • Moreh, Jack. "Financial Graph - Capital Markets." stockvault.net. Page Link
  • "Extracting Data from Excel Files." ehmatthes.github.io. Page Link
  • "Use openpyxl - Convert to DataFrame in Pandas." soudegesu.com. October 13, 2018. Page Link

About Author

Brian Kuo

A consultant and data engineering professional with 6+ years of experience in business intelligence, warehousing solutions, ETL, and project management. Enjoys collaboration in team projects to support multi-disciplinary stakeholders and generate valuable results that align with business goals.
View all posts by Brian Kuo >

Related Articles

Leave a Comment

No comments found.

View Posts by Categories


Our Recent Popular Posts


View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI