Using Machine Learning to Forecast Sales

Posted on Jun 30, 2016

Forecasting sales is an integral part of running a successful business. Sales forecasting allows businesses to plan for the future, meet demand, and maximize profits. When demand is known, production and supply can be managed more effectively. Walmart has used precise forecasts to manage its 11,500 stores and generate $482.13 billion in revenue in 2016. Without models to guide its business, it could have faced higher operating expenses and lower revenue.


The Data

The data comes from a past Kaggle competition that Walmart set up to recruit data scientists. Walmart was interested in forecasting future sales for individual departments within different stores, and particularly in sales around 4 major holidays: Super Bowl, Labor Day, Thanksgiving, and Christmas. These are probably the times when sales are highest, so Walmart wants to make sure it has enough supply to meet demand.

The data contained 143 weeks of past sales for 45 stores and their 99 departments, along with whether each week was a holiday, temperature, fuel prices, markdowns (discounts), consumer price index, unemployment rate, store type, and store size. Using this information, sales for the next 39 weeks would be forecast and checked for accuracy.

Below is a summary of the data provided.

[Image: screenshot of the data summary]

Exploratory Data Analysis

Having an idea of how the data looks helps us decide how to approach the problem. Visualizing the sales of Department 1 across different stores shows spikes in sales at similar times of the year. These spikes presumably correspond to particular holidays, and they also suggest that Department 1 is the same department across stores. There are 4 spikes in sales throughout the year, but they did not appear to correspond to the 4 major holidays that Walmart was interested in.


The next step was to see whether all the departments had spikes in sales around the same time in the year. By plotting the sales of different departments within Store 1, the difference in departments is made evident. A holiday may affect Department 7 but may not affect Department 3 as much and vice versa. Consequently, past sales of individual departments would be used to forecast future sales and the store's overall performance would not be taken into consideration.
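
To make the per-department idea concrete, here is a minimal sketch of splitting the sales records into one series per store and department. The analysis in this post was done in R; this is plain Python for illustration only. The column names (Store, Dept, Date, Weekly_Sales) are assumed, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def series_by_department(rows):
    """Group weekly sales into one time-ordered series per (store, dept)."""
    grouped = defaultdict(list)
    for row in rows:
        key = (int(row["Store"]), int(row["Dept"]))
        grouped[key].append((row["Date"], float(row["Weekly_Sales"])))
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return {key: [sales for _, sales in sorted(pairs)]
            for key, pairs in grouped.items()}


# Made-up rows in the assumed layout; these are not real competition data.
SAMPLE = """Store,Dept,Date,Weekly_Sales
1,1,2010-02-12,200.0
1,1,2010-02-05,100.0
1,3,2010-02-05,50.0
2,1,2010-02-05,70.0
"""

series = series_by_department(csv.DictReader(io.StringIO(SAMPLE)))
print(series[(1, 1)])
```

Each (store, department) key then holds its own weekly series, which can be modeled independently of the rest of the store.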


Now that we know that departments across stores are similar and that different departments respond differently to different times of the year, I decided to look at general trends in the data. A centered moving average was applied to the data to visualize the general trend. Increasing the value of k averages more observations, which gives a smoother graph in which the spikes no longer hide the trend.
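
A centered moving average can be sketched in a few lines. The analysis in this post was done in R; this is plain Python for illustration only. It assumes k is the number of neighbours taken on each side of a point (a window of 2k + 1 weeks), which may differ from the definition used for the plots, and the sales values are made up.

```python
def centered_moving_average(values, k):
    """Average each point with its k neighbours on either side (window 2k + 1).

    Positions without a full window are returned as None.
    """
    n = len(values)
    smoothed = [None] * n
    for i in range(k, n - k):
        window = values[i - k:i + k + 1]
        smoothed[i] = sum(window) / len(window)
    return smoothed


weekly_sales = [10, 12, 50, 11, 13, 12, 48, 10, 11]  # illustrative values only
print(centered_moving_average(weekly_sales, 1))
print(centered_moving_average(weekly_sales, 3))
```

With the larger k the two spikes are spread over more weeks, so the smoothed line is flatter, at the cost of losing more points at each end of the series.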


Seasonal decomposition was performed to understand the seasonal, trend, and noise components of the sales. There still seems to be an underlying cyclical pattern, as shown by periodic spikes in the noise component. These may be problematic, so further analysis is required.
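
The idea behind the decomposition can be sketched as a classical additive decomposition: estimate the trend with a moving average, average the detrended values by position in the cycle to get the seasonal component, and treat what is left as noise. The analysis in this post was done in R; this is plain Python for illustration, it is not necessarily the same method used for the plots, and the example series is made up.

```python
def decompose_additive(values, period):
    """Split a series into trend, seasonal, and remainder (additive model)."""
    n = len(values)
    if n < 2 * period:
        raise ValueError("need at least two full cycles")
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        if period % 2:  # odd period: plain centered mean
            trend[i] = sum(values[i - half:i + half + 1]) / period
        else:  # even period: half weight on the two end points
            inner = sum(values[i - half + 1:i + half])
            ends = 0.5 * (values[i - half] + values[i + half])
            trend[i] = (inner + ends) / period
    # Seasonal effect: mean detrended value for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for i, t in enumerate(trend):
        if t is not None:
            buckets[i % period].append(values[i] - t)
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period  # centre the effects so they sum to zero
    seasonal = [means[i % period] - centre for i in range(n)]
    remainder = [None if t is None else values[i] - t - seasonal[i]
                 for i, t in enumerate(trend)]
    return trend, seasonal, remainder


# Made-up series: linear trend plus a repeating pattern of length 4.
pattern = [3, -1, -2, 0]
example = [2 * i + pattern[i % 4] for i in range(12)]
trend, seasonal, remainder = decompose_additive(example, 4)
print(seasonal[:4])
```

If the remainder still shows regular spikes, as it did for the sales data, the seasonal component has not captured all of the repeating structure.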


Tackling The Problem

Because this is a Kaggle competition, we can submit our forecasts for the 39 weeks online and see how they fare against the forecasts of others. To gauge the baseline and see where to improve, a submission with projected sales of 0 for every week was made. This gave a rank in the 660-680 range, since others had also submitted forecasts with 0 sales.

[Image: 0 - sample]

Although linear models are not the best method for forecasting time series data, I wanted to see how the rank would change by using them. Using the tslm function (from the forecast package) in RStudio, the rank jumped up to around 450. tslm fits linear models to time series by turning the trend and seasonality components into variables, which are added together as a linear model.
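
The kind of model described here, a linear trend plus one dummy variable per season, can be sketched with ordinary least squares. This is plain Python for illustration, not the R code used for the submission; the function names are hypothetical and the series is made up.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def design_row(t, period):
    """Intercept, linear trend, and dummy variables for seasons 2..period."""
    season = (t - 1) % period
    return [1.0, float(t)] + [1.0 if season == s else 0.0
                              for s in range(1, period)]


def fit_trend_season(values, period):
    """Least-squares fit via the normal equations (X'X) beta = X'y."""
    rows = [design_row(t, period) for t in range(1, len(values) + 1)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def forecast_trend_season(coefs, n_observed, period, horizon):
    """Extend the fitted trend and seasonal pattern past the observed data."""
    return [sum(c * x for c, x in zip(coefs, design_row(t, period)))
            for t in range(n_observed + 1, n_observed + horizon + 1)]


# Made-up series: y = 10 + 2t plus a seasonal effect that repeats every 4 steps.
effects = [0, 5, -3, 1]
history = [10 + 2 * t + effects[(t - 1) % 4] for t in range(1, 13)]
coefs = fit_trend_season(history, 4)
print(forecast_trend_season(coefs, len(history), 4, 2))
```

The first season acts as the baseline, so each seasonal coefficient is read as the difference from that season.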

[Image: 2 - tslm.basic]

The next approach was to fit an ARIMA model, a popular method for modeling time series data because it has a long track record of forecasting future values from past ones. An ARIMA model differences the series until it is roughly stationary and then models the remaining fluctuation with autoregressive and moving-average terms in order to forecast the near future.
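
To show the mechanics on a small scale, here is a heavily simplified sketch in plain Python: difference the series once, fit a single autoregressive coefficient to the differences, forecast the differences, and add them back up. This corresponds roughly to an ARIMA(1,1,0) model with no constant. It is not the seasonal model fit in R for the submission; the function names are hypothetical and the series is made up.

```python
def difference(values):
    """First differences: the change from each observation to the next."""
    return [b - a for a, b in zip(values, values[1:])]


def fit_ar1(diffs):
    """Least-squares AR(1) coefficient with no constant: d[t] ~ phi * d[t-1]."""
    num = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    den = sum(prev * prev for prev in diffs[:-1])
    return num / den


def forecast_arima_110(values, horizon):
    """Forecast by projecting the differences forward and re-integrating."""
    diffs = difference(values)
    phi = fit_ar1(diffs)
    level, step = values[-1], diffs[-1]
    forecasts = []
    for _ in range(horizon):
        step = phi * step      # next predicted change
        level = level + step   # undo the differencing
        forecasts.append(level)
    return forecasts


history = [0, 8, 12, 14, 15, 15.5]  # made-up series whose changes halve each step
print(forecast_arima_110(history, 2))
```

A full ARIMA fit also chooses the orders, may include moving-average and seasonal terms, and estimates the parameters by maximum likelihood, none of which is attempted here.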

[Image: 3 - seasonal.arima.svd]

I became curious and wanted to see if another model from an R package could yield better results. This search led me to STL decomposition (the stl function) and the stlf function in the forecast package. By applying an STL decomposition, stlf models the seasonally adjusted data, reseasonalizes it, and returns the forecasts.
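
The three steps described here (seasonally adjust, forecast the adjusted series, add the seasonality back) can be sketched in plain Python. This is not the stlf implementation: the seasonal effects come from a simple moving-average decomposition rather than STL's LOESS smoothing, and simple exponential smoothing stands in for whatever model is fit to the adjusted data. The function names, the alpha value, and the series are all assumptions for illustration.

```python
def moving_average_trend(values, period):
    """Centered moving average over one full cycle (None at the edges)."""
    half = period // 2
    trend = [None] * len(values)
    for i in range(half, len(values) - half):
        if period % 2:
            trend[i] = sum(values[i - half:i + half + 1]) / period
        else:  # even period: half weight on the two end points
            inner = sum(values[i - half + 1:i + half])
            ends = 0.5 * (values[i - half] + values[i + half])
            trend[i] = (inner + ends) / period
    return trend


def seasonal_effects(values, period):
    """Mean detrended value per position in the cycle, centred on zero."""
    trend = moving_average_trend(values, period)
    buckets = [[] for _ in range(period)]
    for i, t in enumerate(trend):
        if t is not None:
            buckets[i % period].append(values[i] - t)
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period
    return [m - centre for m in means]


def simple_exp_smoothing(values, alpha):
    """Return the final smoothed level, used as a flat forecast."""
    level = values[0]
    for v in values[1:]:
        level = alpha * v + (1 - alpha) * level
    return level


def stl_style_forecast(values, period, horizon, alpha=0.5):
    effects = seasonal_effects(values, period)
    # 1. Seasonally adjust the history.
    adjusted = [v - effects[i % period] for i, v in enumerate(values)]
    # 2. Forecast the adjusted series (flat forecast from smoothing).
    level = simple_exp_smoothing(adjusted, alpha)
    # 3. Reseasonalize by adding back the effect for each future position.
    n = len(values)
    return [level + effects[(n + h) % period] for h in range(horizon)]


pattern = [3, -1, -2, 0]
history = [100 + pattern[i % 4] for i in range(12)]  # made-up seasonal series
print(stl_style_forecast(history, 4, 4))
```

Separating the seasonal pattern from the forecasting step means the forecasting model only has to track the level of the adjusted series, which may be one reason this approach did well on strongly seasonal departments.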

[Image: 1 - stlf.svd]

Surprisingly, it yielded better results than ARIMA. I will investigate the stl function and the STL model further. I also intend to run other models and combine them to see if the predictions improve. Although the data from features.csv was not used in these 3 models, it will be considered in the future since it may have an impact on the sales of stores and departments. This is a work in progress and will be updated.

About Author

Denis Nguyen

With a background in biomedical engineering and health sciences, Denis has a passion for finding patterns and optimizing processes. He developed his interest in data analysis while doing research on the effects of childhood obesity on bone development...

