Using Machine Learning to Forecast Sales
Forecasting sales is an integral part of running a successful business. It allows a business to plan ahead, prepare to meet demand, and maximize profit. When demand is known, production and supply can be managed more effectively. Walmart has used precise forecasts to manage its 11,500 stores and generate $482.13 billion in revenue in 2016. Without models to guide the business, it could have faced higher operating expenses and lower revenue.
The Data
The data comes from a past Kaggle competition that Walmart set up to recruit data scientists. Walmart was interested in forecasting future sales for individual departments within different stores, and particularly in sales around 4 major holidays: Super Bowl, Labor Day, Thanksgiving, and Christmas. These are probably the periods when sales are highest, so Walmart wants to make sure it has enough supply to meet demand.
The data contained 143 weeks of past sales for 45 stores and their 99 departments, along with whether each week was a holiday, temperature, fuel prices, markdowns (discounts), consumer price index, unemployment rate, store type, and store size. Using this information, sales for the next 39 weeks would be forecast and checked for accuracy.
Below is a summary of the data provided.
Exploratory Data Analysis
Having an idea of how the data looks helps decide how to approach the problem. Visualizing the sales of Department 1 across different stores shows spikes in sales at similar times of the year. These spikes likely correspond to certain holidays, and the shared pattern suggests that Department 1 is the same department in every store. There are 4 spikes in sales throughout the year, but they did not appear to correspond to the 4 major holidays that Walmart was interested in.
The next step was to see whether all departments had spikes in sales around the same time of year. Plotting the sales of different departments within Store 1 makes the differences between departments evident. A holiday may affect Department 7 but not affect Department 3 as much, and vice versa. Consequently, the past sales of each individual department would be used to forecast its future sales, and the store's overall performance would not be taken into consideration.
Now that we know that departments are similar across stores and that different departments respond differently to different times of the year, I decided to look at general trends in the data. A centered moving average was applied to the data to visualize the general trend. Increasing the value of k averages over more observations, which smooths out the spikes so that they no longer hide the trend.
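To make the smoothing step concrete, here is a minimal sketch in plain Python (the analysis itself was done in R). It assumes that k means the number of neighbours taken on each side of a point, so the window holds 2k + 1 observations; that reading of k, the function name, and the sales numbers are all assumptions made for illustration.

```python
def centered_moving_average(values, k):
    """Average each point with its k neighbours on either side (window of 2k + 1)."""
    n = len(values)
    smoothed = [None] * n  # the ends stay None because the full window does not fit
    for t in range(k, n - k):
        window = values[t - k : t + k + 1]
        smoothed[t] = sum(window) / len(window)
    return smoothed


# Made-up weekly sales with a single spike, purely for illustration.
weekly_sales = [10.0, 12.0, 11.0, 40.0, 12.0, 10.0, 11.0]

print(centered_moving_average(weekly_sales, 1))
print(centered_moving_average(weekly_sales, 2))
```

With the larger k the spike is spread over more weeks and flattened further, at the cost of losing more points at each end of the series.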
Seasonal decomposition was performed to understand the seasonal, trend, and noise components of the sales. There still seems to be an underlying cyclical pattern, as shown by the periodic spikes in the noise component. These may be problematic, so further analysis is required.
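The idea behind the decomposition can be sketched as a classical additive decomposition in plain Python. This is an illustration under assumptions, not necessarily the exact method behind the plots: the trend is a centered moving average over one full cycle, the seasonal component is the average detrended value at each position in the cycle, and the noise is whatever is left. The function name and the toy series are made up, and a period of 4 is used only to keep the example small (weekly data with a yearly cycle would use a period of about 52).

```python
def decompose_additive(values, period):
    """Split a series into trend, seasonal, and remainder (noise) components."""
    n = len(values)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        if period % 2 == 1:
            trend[t] = sum(values[t - half : t + half + 1]) / period
        else:
            # Even period: the two end points get half weight (a 2 x period average).
            inner = sum(values[t - half + 1 : t + half])
            trend[t] = (inner + 0.5 * (values[t - half] + values[t + half])) / period

    # Average the detrended values at each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(values[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # centre the seasonal effects around zero
    seasonal = [means[t % period] - offset for t in range(n)]

    remainder = [
        None if trend[t] is None else values[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder


# Toy series: a steady upward trend plus a repeating 4-step pattern.
pattern = [3, -1, -2, 0]
sales = [2 * t + pattern[t % 4] for t in range(12)]
trend, seasonal, remainder = decompose_additive(sales, period=4)
print(seasonal[:4])
print(remainder)
```

If periodic spikes still show up in the remainder, as they did here, the trend and seasonal components have not captured all of the structure in the series.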
Tackling The Problem
Because this is a Kaggle competition, we are able to submit our forecasts for the 39 weeks online and see how they fare against the forecasts of others. To get a baseline and a sense of where to improve, a submission with projected sales of 0 for every entry was made. This gave a rank in the 660-680 range, since others had also submitted forecasts with 0 sales.
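To see why an all-zero submission makes a useful floor, it helps to look at how a forecast is scored. As far as I understand the competition rules, submissions were ranked by weighted mean absolute error, with holiday weeks counting 5 times as much as other weeks. The sketch below assumes that metric; the function name and the numbers are made up, since the true sales for the 39 forecast weeks are not visible to competitors.

```python
def weighted_mae(actual, predicted, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error; holiday weeks get a larger weight."""
    weights = [holiday_weight if h else 1.0 for h in is_holiday]
    total_error = sum(
        w * abs(a - p) for w, a, p in zip(weights, actual, predicted)
    )
    return total_error / sum(weights)


# Made-up sales for four weeks, one of which is a holiday week.
actual = [100.0, 120.0, 300.0, 110.0]
is_holiday = [False, False, True, False]

zero_forecast = [0.0] * len(actual)
print(weighted_mae(actual, zero_forecast, is_holiday))
```

Under this kind of metric, any model that gets closer to the holiday weeks than a flat zero should move up the leaderboard quickly.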
Although a linear model is not the best method for forecasting time series data, I wanted to see how the rank would change by using one. Using the tslm function from the forecast package in R, the rank jumped to around 450. tslm fits linear models to time series by turning the trend and seasonality components into variables, which are then added together as a linear model.
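The "trend and seasonality as variables" idea can be sketched in plain Python. This is not tslm itself, only a small least-squares fit with an intercept, a linear trend, and one dummy variable per season (the first season is the baseline), which is roughly the kind of model a trend-plus-season formula describes. All function names and the toy data are assumptions for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def design_row(t, period):
    """Intercept, linear trend, and dummy variables for seasons 1..period-1."""
    season = t % period
    return [1.0, float(t)] + [1.0 if season == s else 0.0 for s in range(1, period)]


def fit_trend_season(values, period):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    rows = [design_row(t, period) for t in range(len(values))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def forecast_trend_season(coefs, n_observed, period, horizon):
    """Extend the fitted trend and repeat the seasonal effects into the future."""
    return [
        sum(c * x for c, x in zip(coefs, design_row(t, period)))
        for t in range(n_observed, n_observed + horizon)
    ]


# Toy series: level 10, slope 2, and a repeating 4-step seasonal pattern.
pattern = [0, 5, -3, 1]
sales = [10 + 2 * t + pattern[t % 4] for t in range(12)]
coefs = fit_trend_season(sales, period=4)
print(forecast_trend_season(coefs, len(sales), period=4, horizon=4))
```

The forecast is simply the trend line extended forward with the same seasonal offsets added back on, which is why this approach cannot react to anything that is not a straight-line trend or a fixed seasonal pattern.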
The next approach was to fit an ARIMA model, since it is a popular and well-established way to forecast a time series from its own past values. It differences the series where needed to make it stationary, then models the remaining fluctuation with autoregressive and moving-average terms in order to forecast the near future.
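A very small member of the ARIMA family shows the mechanics. The sketch below is an ARIMA(1,1,0) with no constant, written in plain Python: difference the series once, fit a single autoregressive coefficient on the differences by least squares, forecast the differences, and add them back up. Real ARIMA software chooses the orders and estimates the coefficients more carefully, so this is only an illustration under those simplifying assumptions; the function names and the toy data are made up.

```python
def difference(values):
    """First differences: the week-to-week changes in the series."""
    return [b - a for a, b in zip(values, values[1:])]


def fit_ar1(diffs):
    """Least-squares estimate of phi in d[t] = phi * d[t-1] + noise (no intercept)."""
    numerator = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    return numerator / denominator


def forecast_arima_110(values, horizon):
    """Forecast by projecting the differences forward and summing them back up."""
    diffs = difference(values)
    phi = fit_ar1(diffs)
    level, step = values[-1], diffs[-1]
    forecasts = []
    for _ in range(horizon):
        step = phi * step      # next predicted change
        level = level + step   # undo the differencing
        forecasts.append(level)
    return forecasts


# Toy series whose weekly increases shrink by half each time.
sales = [0.0, 8.0, 12.0, 14.0, 15.0]
print(forecast_arima_110(sales, horizon=3))
```

Note that this sketch has no seasonal terms, so on its own it would miss the holiday spikes seen in the sales data.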
I became curious and wanted to see if another model from an R package could yield better results. This search led me to the stl function and the stlf function from the forecast package. By applying an STL decomposition, stlf models the seasonally adjusted data, reseasonalizes the forecasts, and returns them.
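The deseasonalize, forecast, reseasonalize pipeline can be sketched in plain Python. Two stand-ins are used here and both are assumptions made to keep the example short: the seasonal component is estimated with simple per-position averages instead of an STL decomposition, and the seasonally adjusted series is forecast with simple exponential smoothing. The function names, the smoothing weight alpha, and the toy data are hypothetical.

```python
def seasonal_means(values, period):
    """Seasonal effect of each cycle position, relative to the overall mean."""
    overall = sum(values) / len(values)
    effects = []
    for s in range(period):
        bucket = values[s::period]
        effects.append(sum(bucket) / len(bucket) - overall)
    return effects


def simple_exp_smoothing(values, alpha):
    """Return the final smoothed level, used as a flat forecast."""
    level = values[0]
    for y in values[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


def deseasonalize_forecast_reseasonalize(values, period, horizon, alpha=0.3):
    effects = seasonal_means(values, period)
    # Step 1: remove the seasonal effect from every observation.
    adjusted = [y - effects[t % period] for t, y in enumerate(values)]
    # Step 2: forecast the seasonally adjusted series.
    level = simple_exp_smoothing(adjusted, alpha)
    # Step 3: add the matching seasonal effect back onto each forecast.
    n = len(values)
    return [level + effects[(n + h) % period] for h in range(horizon)]


# Toy series: a flat level of 100 with a repeating 4-step seasonal pattern.
pattern = [10, -5, -5, 0]
sales = [100.0 + pattern[t % 4] for t in range(8)]
print(deseasonalize_forecast_reseasonalize(sales, period=4, horizon=4))
```

Separating the seasonal pattern first means the forecasting step only has to handle a much smoother series, which may be part of why this approach did well on data with strong holiday spikes.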
Surprisingly, it yielded better results than ARIMA. I will investigate stl and the model behind it further. I also intend to run other models and combine them to see if the predictions improve. Although the data from features.csv was not used in these 3 models, it will be considered in the future, as it may have an impact on the sales of stores and departments. This is a work in progress and will be updated.