Using Machine Learning to Forecast Sales

Denis Nguyen
Posted on Jun 30, 2016

Forecasting sales is an integral part of running a successful business. Sales forecasting allows a business to plan for the future, meet demand, and maximize profits. Knowing demand in advance means production and supply can be managed more effectively. Walmart has used precise forecasts to manage its 11,500 stores and generate $482.13 billion in revenue in 2016. Without models to guide the business, it would be looking at higher operating expenses and lower revenue.


The Data

This data comes from a past Kaggle competition that Walmart set up to recruit data scientists. Walmart was interested in forecasting future sales in individual departments within different stores, and particularly in sales around four major holidays: the Super Bowl, Labor Day, Thanksgiving, and Christmas. These are likely the weeks where sales are highest, so Walmart wants to be sure it has enough supply to meet demand.

The data contained 143 weeks of past sales for 45 stores and their 99 departments, along with whether each week contained a holiday, temperature, fuel prices, markdowns (discounts), the consumer price index, the unemployment rate, store type, and store size. Using this information, sales for the next 39 weeks would be forecasted and checked for accuracy.

Below is a summary of the data provided.

[Figure: Summary of the data provided]
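
As a rough sketch, the files can be loaded and summarized in R as follows. The file names and columns are the ones distributed for the Kaggle competition; the local paths are assumptions.

```r
# Load the competition files (local paths are assumed)
train    <- read.csv("train.csv",    stringsAsFactors = FALSE)  # Store, Dept, Date, Weekly_Sales, IsHoliday
features <- read.csv("features.csv", stringsAsFactors = FALSE)  # temperature, fuel price, markdowns, CPI, unemployment
stores   <- read.csv("stores.csv",   stringsAsFactors = FALSE)  # store type and size

train$Date <- as.Date(train$Date)

str(train)      # structure of the weekly sales table
summary(train)  # quick numeric summary
```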

Exploratory Data Analysis

Having an idea of how the data looks helps us decide how to approach the problem. Visualizing the sales of Department 1 across different stores shows spikes in sales at similar times throughout the year. These spikes correspond to certain holidays and also suggest that Department 1 refers to the same kind of department in every store. There are four spikes in sales throughout the year, but they do not appear to correspond to the four major holidays Walmart is interested in.

[Figure: Weekly sales of Department 1 across stores]
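
A plot along these lines can be produced with ggplot2, reusing the train data frame loaded above; the styling here is illustrative, not a reproduction of the original figure.

```r
library(ggplot2)

# Department 1 only, one line per store
dept1 <- subset(train, Dept == 1)

ggplot(dept1, aes(x = Date, y = Weekly_Sales, group = Store, colour = factor(Store))) +
  geom_line(alpha = 0.4, show.legend = FALSE) +
  labs(title = "Department 1 weekly sales across all 45 stores",
       x = "Week", y = "Weekly sales")
```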

The next step was to see whether all departments had spikes in sales around the same time of year. Plotting the sales of different departments within Store 1 makes the differences between departments evident: a holiday may affect Department 7 strongly while barely affecting Department 3, and vice versa. Consequently, the past sales of each individual department would be used to forecast its future sales, rather than the store's overall performance.

[Figure: Weekly sales of different departments within Store 1]

Now that we know that departments are comparable across stores and that different departments respond differently to different times of the year, I decided to look at the general trends in the data. Centered moving averages were computed to visualize the overall trend. Increasing the window size k averages more observations, which smooths out the spikes and leaves the underlying trend.

[Figure: Centered moving averages of weekly sales for increasing values of k]
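
A minimal sketch of this step uses the ma() function from the forecast package on a single store/department series; the choice of Store 1, Department 1 and the values of k are illustrative.

```r
library(forecast)

# Weekly sales of Store 1, Department 1 as a ts object (52 weeks per year)
sales <- ts(train$Weekly_Sales[train$Store == 1 & train$Dept == 1], frequency = 52)

plot(sales, col = "grey", main = "Centered moving averages")
lines(ma(sales, order = 5),  col = "blue")   # small k: spikes still visible
lines(ma(sales, order = 13), col = "red")    # larger k: smoother
lines(ma(sales, order = 53), col = "black")  # roughly a year: the overall trend
```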

Seasonal decomposition was performed to separate the sales into seasonal, trend, and remainder (noise) components. There still seems to be an underlying cyclical pattern, as noted by the periodic spikes in the remainder. These may be problematic, so further analysis is required.

[Figure: Seasonal decomposition of weekly sales]
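
The decomposition can be reproduced with base R's stl() on the same series; the remainder panel is where any leftover periodic spikes show up.

```r
# STL decomposition: seasonal, trend, and remainder components
decomposed <- stl(sales, s.window = "periodic")
plot(decomposed)  # panels: data, seasonal, trend, remainder
```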

Tackling The Problem

Because this is a Kaggle competition, we are able to submit our forecasts for the 39 weeks online and see how they fare against the forecasts of others. To get a gauge of the baseline and where to improve, a submission with projected sales of 0 for every week was made. This gave a rank in the 660-680 range, since others had also submitted all-zero forecasts.

[Leaderboard result: sample (all-zero) submission]
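
A minimal sketch of that baseline, assuming test.csv provides Store, Dept, and Date columns and that the submission format is an Id of the form Store_Dept_Date plus a Weekly_Sales column:

```r
test <- read.csv("test.csv", stringsAsFactors = FALSE)

# Every forecast set to zero
submission <- data.frame(
  Id = paste(test$Store, test$Dept, test$Date, sep = "_"),
  Weekly_Sales = 0
)

write.csv(submission, "zero_baseline.csv", row.names = FALSE)
```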

Although linear models are not the best way to forecast time series data, I wanted to see how the rank would change with one. Using the tslm() function from R's forecast package, the rank improved to around 450. tslm() fits a linear model to a time series by turning the trend and seasonality into explanatory variables, which are added together in the linear model.

[Leaderboard result: tslm.basic model]
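
On a single series this looks as follows; the full submission presumably repeats the fit for every store/department series.

```r
library(forecast)

fit <- tslm(sales ~ trend + season)  # linear trend plus seasonal dummy variables
fc  <- forecast(fit, h = 39)         # forecast the next 39 weeks

plot(fc)
```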

The next approach was to fit an ARIMA model, a popular method for modeling time series data. ARIMA combines autoregressive and moving-average terms with differencing; the differencing removes non-stationarity so that forecasts into the near future can be built from past values and past errors.

[Leaderboard result: seasonal.arima.svd model]
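
A per-series sketch with auto.arima() from the forecast package; the exact settings used for the submission are not spelled out above, so this is illustrative.

```r
library(forecast)

fit <- auto.arima(sales)      # automatically selects the ARIMA orders, differencing as needed
fc  <- forecast(fit, h = 39)  # project 39 weeks ahead

plot(fc)
```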

I became curious whether another model from an R package could yield better results. This search led me to STL decomposition and the stlf() function in the forecast package. stlf() applies an STL decomposition, models the seasonally adjusted series, reseasonalizes the forecasts, and returns them.

[Leaderboard result: stlf.svd model]
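
A minimal per-series sketch with stlf(); the method used for the seasonally adjusted series is an assumption here.

```r
library(forecast)

fc <- stlf(sales, h = 39, method = "ets")  # "arima" is another option for the adjusted series
plot(fc)
```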

Surprisingly, this yielded better results than ARIMA. Further investigation into STL and the models behind stlf() is planned. I also intend to run other models and combine them to see whether the predictions improve. Although the data from features.csv was not used in these three models, it will be considered in the future, as those variables may have an impact on the sales of stores and departments. This is a work in progress and will be updated.

About Author

Denis Nguyen

With a background in biomedical engineering and health sciences, Denis has a passion for finding patterns and optimizing processes. He developed his interest for data analysis while doing research on the effects of childhood obesity on bone development...


