Forecasting Revenue with fbprophet
In this post, we will explore the basic functionality of the fbprophet package and how it can help us quickly forecast data that is seasonal with an underlying trend. Specifically, we will forecast a company's revenue about two years into the "future" and then compare the forecast with the actual data collected over those two years. Everything is based on revenues from 2012 to October 2017, so we have plenty of data to work with.
Later, we will also explore some of the more advanced features of fbprophet, such as adding our own regressors (e.g. the weather or marketing spend), logistic growth, and the difference between additive (i.e. linear) and multiplicative seasonality.
Let us start by importing the basic necessities for our experiment.
import pandas as pd
from fbprophet import Prophet
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option("display.precision", 3)
We will choose a testing_cutoff_date before which we take the data into consideration for training our model and after which we use the data to evaluate the accuracy of our model.
# only use training data before this year, then compare with data from this year on
testing_cutoff_date = "2016"
# we can calculate everything based upon daily, weekly or monthly data
sampling_frequency = "W"
Additionally, there is a naming convention that fbprophet uses. It requires the time series columns to be labelled ds for the dates and y for the values to be forecast. I pre-formatted the data accordingly, but keep that in mind when using your own data.
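If your own data uses different column names, a simple rename is enough. The column names below (date, revenue) are just hypothetical examples, not the names from this post's dataset:

```python
import pandas as pd

# hypothetical raw data with arbitrary column names
raw = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"]),
    "revenue": [1200.0, 1350.5, 990.25],
})

# fbprophet expects the date column to be called "ds" and the value column "y"
df = raw.rename(columns={"date": "ds", "revenue": "y"})
```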
df = pd.read_csv("daily_revenue.csv", sep=",", decimal=".", encoding="utf-8")
df["ds"] = pd.to_datetime(df["ds"])
# split dataset into train and test timeframe
df_train = df[df["ds"] < testing_cutoff_date]
df_test = df[df["ds"] >= testing_cutoff_date]
This gives us dataframes containing the daily revenue, one for the training period and one for the testing period. But for now we will work on weekly data, not daily, as there is a lot of noise and there is really no need to forecast the revenue for a very specific date sometime next year. So we use the resampling function of pandas to aggregate the data on a weekly basis.
df_train_resampled = df_train.set_index("ds").resample(
    "1" + sampling_frequency).sum()["y"]
df_train_resampled = df_train_resampled.reset_index()
df_test_resampled = df_test.set_index("ds").resample(
    "1" + sampling_frequency).sum()["y"]
df_test_resampled = df_test_resampled.reset_index()
Now let's have a look at how the whole dataset looks:
fig, ax = plt.subplots()
df_train_resampled.set_index("ds").plot(ax=ax)
df_test_resampled.set_index("ds").plot(ax=ax)
ax.legend(["train", "test"])
plt.show()
There are some apparent challenges our algorithm will have to master:
- the trend seems to be approximately linear up until the end of 2015, but then changes
- especially in 2017, an uptick in the company's marketing spending increased the slope beyond the linear trend of the preceding years
- around Christmas and New Year's there is always a sharp decline in revenue
fbprophet can take holidays into consideration and will fit these as special dates. Later on we can evaluate how these holidays affect the revenue.
m = Prophet()
m.add_country_holidays(country_name="DE")
m.fit(df_train_resampled)
if sampling_frequency == "D":
    # two years of daily data
    future = m.make_future_dataframe(periods=730, freq=sampling_frequency)
elif sampling_frequency == "W":
    # two years of weekly data
    future = m.make_future_dataframe(periods=104, freq=sampling_frequency)
elif sampling_frequency == "M":
    # two years of monthly data
    future = m.make_future_dataframe(periods=24, freq=sampling_frequency)
else:
    print("No valid sampling frequency given!")
forecast = m.predict(future)
When we work on weekly or monthly aggregated data, however, fbprophet seems to ignore the holidays: the following line should produce all forecast rows that have a holiday effect, but on aggregated data it comes up empty. I haven't figured out yet how to mark "weeks that contain holidays".
forecast[forecast["holidays"] != 0]
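One possible workaround (my own sketch, not part of the original analysis) is to build a daily holiday indicator with pandas and resample it with max(), so that a week counts as a "holiday week" as soon as it contains at least one holiday. The dates below are purely illustrative:

```python
import pandas as pd

# illustrative daily range and holiday dates around the turn of the year
days = pd.date_range("2016-12-19", "2017-01-08", freq="D")
holiday_dates = pd.to_datetime(["2016-12-25", "2016-12-26", "2017-01-01"])

daily = pd.DataFrame({"ds": days})
daily["is_holiday"] = daily["ds"].isin(holiday_dates).astype(int)

# resampling with max() flags every week that contains at least one holiday
weekly = daily.set_index("ds").resample("1W").max()
```

Such a weekly indicator could then be fed to fbprophet as an extra regressor instead of relying on the built-in holiday handling.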
fig = m.plot(forecast)
fig = m.plot_components(forecast)
Now that we have a forecast for two years after the last date in the training data set, we can compare this with the actual data from that period of time:
fig, ax = plt.subplots(figsize=(10, 10))
forecast.set_index("ds")["yhat"].plot(ax=ax)
df_test_for_plotting = df_test.set_index("ds").resample(
    "1" + sampling_frequency).sum()
df_test_for_plotting.plot(ax=ax)
ax.legend(["forecast", "true values"])
Major differences are visible here, especially the massive revenue decrease during the last days of every year. Nevertheless, our model somewhat follows the shape of the actual revenue during the testing period. In order to quantify the forecasting error, we will relate the error to the actual revenue in the given years.
df_for_accuracy = df_test_for_plotting.join(forecast.set_index("ds"),
                                            how="left").reset_index()
# use .copy() so the column assignments below don't trigger a SettingWithCopyWarning
df_for_accuracy_2016 = df_for_accuracy[df_for_accuracy["ds"] < "2017"].copy()
df_for_accuracy_2017 = df_for_accuracy[df_for_accuracy["ds"] >= "2017"].copy()
df_for_accuracy_2016["abs. difference"] = (df_for_accuracy_2016["y"]
                                           - df_for_accuracy_2016["yhat"])
df_for_accuracy_2017["abs. difference"] = (df_for_accuracy_2017["y"]
                                           - df_for_accuracy_2017["yhat"])
error_2016 = df_for_accuracy_2016["abs. difference"].sum() / df_for_accuracy_2016["y"].sum()
print("The relative error in the forecast revenue in 2016 is %.3f." %
error_2016)
error_2017 = df_for_accuracy_2017["abs. difference"].sum() / df_for_accuracy_2017["y"].sum()
print("The relative error in the forecast revenue in 2017 is %.3f." %
error_2017)
It turns out we have a pretty solid estimate for the revenues in the two forecast years, with relative errors of 0.2% and 6%, respectively.
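Note that the summed differences above are signed, so over- and under-forecasts within a year can partially cancel out. If you want an error measure where that cannot happen, the mean absolute percentage error (MAPE) is a common alternative; the numbers below are made up purely for illustration:

```python
import pandas as pd

# made-up weekly actuals and forecasts for illustration
y = pd.Series([100.0, 120.0, 80.0, 110.0])
yhat = pd.Series([90.0, 130.0, 85.0, 105.0])

# mean absolute percentage error: average of |y - yhat| / y
mape = ((y - yhat).abs() / y).mean()
```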