Why forecast sales?
Humans have the magical ability to plan for future events, for future gain. It’s not quite a uniquely human trait, because apparently ravens can match a 4-year-old.
An abundance of data, and some very nice R packages, make our ability to plan all the more powerful.
A couple of months ago we looked at sales from a historical perspective in “Digital Marketplace. Six months later.” In this post, we’ll use the sales data to March 31st to build a time-series forecast for the next two years. The techniques apply to any time series with characteristics of trend, seasonality or longer-term cycles.
Why forecast sales? Business plans require a budget, e.g. for resources, marketing and office space. A good projection of revenue provides the foundation for the budget. And, for an established business with historical data, time-series forecasting is one way to deliver a robust projection.
The forecast assumes one continues to do what one’s doing, so it provides a good starting point. One might then, for example, add assumptions about new products or services.
The power of iteration
Businesses typically deal with many product / service lines. So the ability to iteratively forecast multiple time series is very powerful.
We’ll deal first with each government framework: G-Cloud (cloud services), and DOS (Digital Outcomes & Specialists). Then we’ll iterate through G-Cloud’s lot structure: Cloud Hosting, Cloud Software and Cloud Support. Only three child levels, but the principle is easily scaled up.
Cleaning data
G-Cloud suppliers are contractually obliged under the government’s framework to report monthly on their buyer invoicing. So, if some months were missed, there would be a one-time catch-up in a later month. This could result in the odd outlier in the Digital Marketplace sales data. However, as revealed in the more detailed analysis (with code), none was discovered.
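As a hedged illustration of that kind of check, here is a minimal sketch using tsoutliers() and tsclean() from the forecast package. The series is synthetic, with one inflated catch-up month planted on purpose; it is not the Digital Marketplace data, and the names sales, flagged and cleaned are ours, for illustration only.

```r
library(forecast)

# Synthetic monthly series (an assumption for illustration, not real sales)
set.seed(1)
sales <- ts(100 + 1:48 + rnorm(48, sd = 2), start = c(2015, 1), frequency = 12)

# Plant a catch-up month: position 30 reports roughly three months at once
sales[30] <- sales[30] * 3

# tsoutliers() returns the positions it flags and suggested replacements
flagged <- tsoutliers(sales)
flagged$index          # should include position 30

# tsclean() returns the series with flagged points replaced by interpolation
cleaned <- tsclean(sales)
```

An empty flagged$index on real data would be consistent with the "none was discovered" result described above.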
Seasonal decomposition
By decomposing the historical data we can tease out the underlying trend and seasonality:

 Trend: G-Cloud sales have grown over time as more suppliers have added their services to the government frameworks, and as more Public Sector organizations have found the benefits of purchasing Cloud services this way. Why? Because it’s a faster, simpler, more transparent and competitive contracting vehicle.
 Seasonality: Suppliers often manage their sales and financials based on a quarterly cycle. There’s a particular emphasis on a strong close to the financial year (often December 31st for commercial enterprises). And government buyers may want to make optimal use of their budgets at the close of their financial year (March 31st). Consequently, we see quarterly seasonality with an extra spike in March, and a secondary peak in December.
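The decomposition idea can be sketched on made-up data. This example uses base R’s stl() as a stand-in for the mstl() function listed in the toolkit, and the series is synthetic: we assume a linear trend plus quarter-end bumps, with the largest in March and the second largest in December, to mimic the pattern described above.

```r
# Synthetic monthly series: trend + assumed seasonal pattern + noise
set.seed(42)
n <- 60
pattern <- c(0, 0, 40, 0, 0, 10, 0, 0, 10, 0, 0, 25)  # Jan..Dec bumps (assumed)
sales <- ts(100 + 2 * seq_len(n) + rep(pattern, length.out = n) + rnorm(n, sd = 3),
            start = c(2014, 1), frequency = 12)

# Decompose into seasonal, trend and remainder components
fit <- stl(sales, s.window = "periodic")
components <- fit$time.series

# With a periodic window the seasonal component repeats each year,
# so the first twelve values describe the whole pattern
seasonal_by_month <- components[1:12, "seasonal"]
names(seasonal_by_month) <- month.abb
peak_month <- month.abb[which.max(seasonal_by_month)]
peak_month
```

The three components add back up to the original series, which makes it easy to inspect trend and seasonality separately.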
Forecasting sales for each framework
Using AutoRegressive Integrated Moving Average (ARIMA) modelling, we can select from close to 100 models to describe the autocorrelations in the data. Then we can use the generated model to forecast future sales.
In the plot below, we project two years ahead with 80% and 95% prediction intervals. The darker-shaded 80% range should contain the future sales value with 80% probability; adding the wider, lighter-shaded area raises that to 95%.
The DOS framework (for project-related services) was launched more recently, in June 2016. It exhibits different time-series characteristics, hence a different ARIMA model.
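The ARIMA step in this section can be sketched roughly as follows. R’s built-in AirPassengers series stands in for the framework sales (an assumption, purely for illustration), and auto.arima() is left on its defaults, which may differ from the settings used in the detailed analysis.

```r
library(forecast)

# Built-in monthly series used as a stand-in for sales data
fit <- auto.arima(AirPassengers)

# Project 24 months ahead with 80% and 95% prediction intervals
fc <- forecast(fit, h = 24, level = c(80, 95))

# Point forecasts and interval bounds (columns are named "80%" and "95%")
head(fc$mean)
head(fc$lower)
head(fc$upper)

# Plot object with the darker 80% band inside the lighter 95% band
p <- autoplot(fc)
```

The 95% band always encloses the 80% band, which is why the plot shows the lighter area wrapped around the darker one.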
Forecasting sales for the component lots
The G-Cloud framework comprises three lots. There are different ways of forecasting multiple time series. We will do so in one shot, with the best model tailored to each lot. The possible approaches, and code used, are detailed here.
Ravens aren’t yet ready for forecasting with R. But then neither are 4-year-olds, are they?
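One way to fit a separate model per lot in a single pass is to keep the series in a named list and map over it. The sketch below is one possible approach, not necessarily the one used in the detailed analysis. The lot names come from this post, but the numbers are synthetic and make_series() is a hypothetical helper.

```r
library(forecast)
library(purrr)

# Hypothetical helper producing a synthetic monthly series per lot
set.seed(7)
make_series <- function(level, slope) {
  ts(level + slope * 1:48 + rnorm(48, sd = 5), start = c(2015, 1), frequency = 12)
}

lots <- list(
  "Cloud Hosting"  = make_series(50, 1),
  "Cloud Software" = make_series(80, 2),
  "Cloud Support"  = make_series(30, 0.5)
)

# Fit the best ARIMA model for each lot, then forecast two years ahead
models    <- map(lots, auto.arima)
forecasts <- map(models, forecast, h = 24)

# Inspect the (p, d, q) order chosen for each lot
map(models, arimaorder)
```

Because the list is named, the lot names carry through to models and forecasts, so adding a fourth lot only means adding one more list element.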
R toolkit
R packages and functions (excluding base) used in this analysis.
Packages  Functions  

purrr  map[6]; map2_df[1]; possibly[1]; set_names[1]; simplify[1]; some[1]; when[1]  
readr  guess_encoding[3]; locale[2]; read_csv[2]; parse_number[1]  
dplyr  mutate[12]; filter[6]; group_by[6]; if_else[4]; summarise[4]; desc[3]; first[3]; select[3]; arrange[1]; as_tibble[1]; between[1]; bind_rows[1]; case_when[1]; collapse[1]; count[1]; data_frame[1]; n[1]; summarize[1]  
tibble  as_tibble[1]; data_frame[1]; enframe[1]  
stringr  str_c[7]; fixed[2]; str_remove[2]; str_count[1]; str_detect[1]; str_extract[1]; str_replace[1]  
rebus  or[4]; alpha[1]; literal[1]; whole_word[1]  
lubridate  month[11]; year[4]; ceiling_date[1]; date[1]; days_in_month[1]; myd[1]; parse_date_time[1]; tz[1]; ymd[1]  
timetk  tk_ts[3]  
sweep  sw_glance[4]; sw_sweep[1]  
tidyr  fill[5]; unnest[3]; nest[1]  
forecast  forecast[14]; auto.arima[8]; autoplot[7]; BoxCox[4]; mstl[2]; ndiffs[2]; nsdiffs[2]; tsclean[2]; BoxCox.lambda[1]; seasonal[1]  
ggplot2  autoplot[7]; xlab[6]; ylab[6]; aes[5]; theme[5]; labs[4]; element_rect[3]; geom_ribbon[2]; unit[2]; alpha[1]; element_line[1]; element_text[1]; facet_wrap[1]; geom_line[1]; geom_path[1]; geom_text[1]; ggplot[1]; margin[1]; scale_x_date[1]  
scales  or[4]; alpha[1]; literal[1]; whole_word[1]  
cowplot  draw_label[1]; plot_grid[1]  
ggthemes  theme_economist[1]  
kableExtra  kable[3]; kable_styling[2]  
knitr  kable[3]; opts_chunk[1] 
View the code here.
Citations
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Hyndman R, Athanasopoulos G, Bergmeir C, Caceres G, Chhay L, O’Hara-Wild M, Petropoulos F, Razbash S, Wang E, Yasmeen F (2018). forecast: Forecasting functions for time series and linear models. R package version 8.4, http://pkg.robjhyndman.com/forecast.
Hyndman RJ, Khandakar Y (2008). “Automatic time series forecasting: the forecast package for R.” Journal of Statistical Software, 27(3), 1–22. http://www.jstatsoft.org/article/view/v027i03.
Contains public sector information licensed under the Open Government Licence v3.0.
This is the first time I’ve run across thinkr. Always love when I find new great sources for modeling in R!
I wanted to suggest looking into using the case_when() function (from dplyr) in your tidy step in place of the nested if_else() statements to assign your “lot” variable. Just like all other dplyr functions, it allows the code to be more linear.
Thanks for taking the time to post such helpful and informative information!
Randi hi – Thank you for taking the time to make the suggestion. More compact and elegant with the change now applied.