
A Guide to Time Series Forecasting with ARIMA in Python 3

Introduction

Time series provide the opportunity to forecast future values. Based on previous values, time series can be used to forecast trends in economics, weather, and capacity planning, to name a few. The specific properties of time-series data mean that specialized statistical methods are usually required.

In this tutorial, we will aim to produce reliable forecasts of time series. We will begin by introducing and discussing the concepts of autocorrelation, stationarity, and seasonality, and proceed to apply one of the most commonly used methods for time-series forecasting, known as ARIMA.

One of the methods available in Python to model and predict future points of a time series is known as SARIMAX, which stands for Seasonal AutoRegressive Integrated Moving Averages with eXogenous regressors. Here, we will primarily focus on the ARIMA component, which is used to fit time-series data to better understand and forecast future points in the time series.

Prerequisites

This guide will cover how to do time-series analysis on either a local desktop or a remote server. Working with large datasets can be memory intensive, so in either case, the computer will need at least 2GB of memory to perform some of the calculations in this guide.

To make the most of this tutorial, some familiarity with time series and statistics can be useful.

For this tutorial, we'll be using Jupyter Notebook to work with the data. If you do not have it already, you should follow our tutorial to install and set up Jupyter Notebook for Python 3.

Step 1: Installing Packages

To set up our environment for time-series forecasting, let's first move into our local software environment or server-based software environment:

  • cd environments
  • . my_env/bin/activate

From here, let's create a new directory for our project. We will call it ARIMA and then move into the directory. If you call the project a different name, be sure to substitute your name for ARIMA throughout the guide.

  • mkdir ARIMA
  • cd ARIMA

This tutorial will require the warnings, itertools, pandas, numpy, matplotlib and statsmodels libraries. The warnings and itertools libraries come included with the standard Python library set, so you shouldn't need to install them.

Like with other Python packages, we can install these requirements with pip.
We can now install pandas, statsmodels, and the data plotting package matplotlib. Their dependencies will also be installed:

  • pip install pandas numpy statsmodels matplotlib

At this point, we're now set up to start working with the installed packages.
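If you'd like to verify that everything installed correctly, one optional sanity check (not part of the original steps) is to import each package and print its version from a Python shell; the exact version numbers will vary:

import pandas
import numpy
import statsmodels
import matplotlib

# Print the installed version of each package
print(pandas.__version__, numpy.__version__, statsmodels.__version__, matplotlib.__version__)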

Step 2: Importing Packages and Loading Data

To begin working with our data, we will start up Jupyter Notebook:

  • jupyter notebook

To create a new notebook file, select New > Python 3 from the top right pull-down menu:

Create a new Python 3 notebook

This will open a new notebook.

As is best practice, start by importing the libraries you will need at the top of your notebook:

import warnings
import itertools
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

We have also defined a matplotlib style of fivethirtyeight for our plots.

We'll be working with a dataset called "Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory, Hawaii, U.S.A.," which collected CO2 samples from March 1958 to December 2001. We can bring in this data as follows:

data = sm.datasets.co2.load_pandas()
y = data.data

Let's preprocess our data a little bit before moving forward. Weekly data can be tricky to work with since it's a briefer amount of time, so let's use monthly averages instead. We'll make the conversion with the resample function. For simplicity, we can also use the fillna() function to ensure that we have no missing values in our time series.

# The 'MS' string groups the data in buckets by start of the month
y = y['co2'].resample('MS').mean()

# The term bfill means that we backfill missing values with the next valid observation
y = y.fillna(y.bfill())

print(y)
Output
co2
1958-03-01    316.100000
1958-04-01    317.200000
1958-05-01    317.433333
...
2001-11-01    369.375000
2001-12-01    371.020000

Let's explore this time series as a data visualization:

y.plot(figsize=(15, 6))
plt.show()

Figure 1: CO2 Levels Time Series

Some distinguishable patterns appear when we plot the data. The time series has an obvious seasonality pattern, as well as an overall increasing trend.
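If you'd like a quick confirmation of these two patterns, one option (not part of the original workflow, and continuing in the same notebook session) is statsmodels' seasonal_decompose, which splits the series into trend, seasonal, and residual components:

from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the monthly series into trend, seasonal, and residual components
decomposition = seasonal_decompose(y, model='additive')
decomposition.plot()
plt.show()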

To learn more about time series pre-processing, please refer to "A Guide to Time Series Visualization with Python 3," where the steps above are described in much more detail.

Now that we've converted and explored our data, let's move on to time series forecasting with ARIMA.

Step 3: The ARIMA Time Series Model

One of the most common methods used in time series forecasting is known as the ARIMA model, which stands for AutoRegressive Integrated Moving Average. ARIMA is a model that can be fitted to time series data in order to better understand or predict future points in the series.

There are three distinct integers (p, d, q) that are used to parametrize ARIMA models. Because of that, ARIMA models are denoted with the notation ARIMA(p, d, q). Together these three parameters account for seasonality, trend, and noise in datasets:

  • p is the auto-regressive part of the model. It allows us to incorporate the effect of past values into our model. Intuitively, this would be similar to stating that it is likely to be warm tomorrow if it has been warm the past 3 days.
  • d is the integrated part of the model. This includes terms in the model that incorporate the amount of differencing (i.e. the number of past time points to subtract from the current value) to apply to the time series. Intuitively, this would be similar to stating that it is likely to be the same temperature tomorrow if the difference in temperature in the last three days has been very small (a short differencing sketch follows this list).
  • q is the moving average part of the model. This allows us to set the error of our model as a linear combination of the error values observed at previous time points in the past.
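As a small optional illustration of what the d parameter does (not part of the original tutorial), pandas' diff() computes the change between consecutive observations; applying it once corresponds to d=1:

# First-order differencing: subtract each observation from the one before it
first_diff = y.diff().dropna()
print(first_diff.head())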

When dealing with seasonal effects, we make use of the seasonal ARIMA, which is denoted as ARIMA(p,d,q)(P,D,Q)s. Here, (p, d, q) are the non-seasonal parameters described above, while (P, D, Q) follow the same definition but are applied to the seasonal component of the time series. The term s is the periodicity of the time series (4 for quarterly periods, 12 for yearly periods, etc.).
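For reference, the non-seasonal ARIMA(p, d, q) model can be written compactly in backshift (lag operator) notation, where $B y_t = y_{t-1}$ and $\varepsilon_t$ is white noise:

$$\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d \, y_t = \left(1 + \sum_{j=1}^{q} \theta_j B^j\right)\varepsilon_t$$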

The seasonal ARIMA method can be daunting because of the multiple tuning parameters involved. In the next section, we will describe how to automate the process of identifying the optimal set of parameters for the seasonal ARIMA time series model.

Step 4: Parameter Selection for the ARIMA Time Series Model

When looking to fit time series data with a seasonal ARIMA model, our first goal is to find the values of ARIMA(p,d,q)(P,D,Q)s that optimize a metric of interest. There are many guidelines and best practices to achieve this goal, yet the correct parametrization of ARIMA models can be a painstaking manual process that requires domain expertise and time. Other statistical programming languages such as R provide automated ways to solve this issue, but those have yet to be ported over to Python. In this section, we will resolve this issue by writing Python code to programmatically select the optimal parameter values for our ARIMA(p,d,q)(P,D,Q)s time series model.

We will use a "grid search" to iteratively explore different combinations of parameters. For each combination of parameters, we fit a new seasonal ARIMA model with the SARIMAX() function from the statsmodels module and assess its overall quality. Once we have explored the entire landscape of parameters, our optimal set of parameters will be the one that yields the best performance for our criteria of interest. Let's begin by generating the various combinations of parameters that we wish to assess:

# Define the p, d and q parameters to take any value between 0 and 2
p = d = q = range(0, 2)

# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))

# Generate all different combinations of seasonal p, d and q triplets
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
Output
Examples of parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 12)
SARIMAX: (0, 0, 1) x (0, 1, 0, 12)
SARIMAX: (0, 1, 0) x (0, 1, 1, 12)
SARIMAX: (0, 1, 0) x (1, 0, 0, 12)

We can now use the triplets of parameters defined above to automate the process of training and evaluating ARIMA models on different combinations. In Statistics and Machine Learning, this process is known as grid search (or hyperparameter optimization) for model selection.

When evaluating and comparing statistical models fitted with different parameters, each can be ranked against one another based on how well it fits the data or its ability to accurately predict future data points. We will use the AIC (Akaike Information Criterion) value, which is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value.
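For reference, the AIC trades goodness-of-fit off against complexity: with $k$ the number of fitted parameters and $\hat{L}$ the maximized likelihood of the model,

$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$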

The code chunk below iterates through combinations of parameters and uses the SARIMAX function from statsmodels to fit the corresponding Seasonal ARIMA model. Here, the order argument specifies the (p, d, q) parameters, while the seasonal_order argument specifies the (P, D, Q, S) seasonal component of the Seasonal ARIMA model. After fitting each SARIMAX() model, the code prints out its respective AIC score.

warnings.filterwarnings("ignore") # specify to ignore warning messages

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)

            results = mod.fit()

            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue

Because some parameter combinations may lead to numerical misspecifications, we explicitly disabled warning messages in order to avoid an overload of warning messages. These misspecifications can also lead to errors and throw an exception, so we make sure to catch these exceptions and ignore the parameter combinations that cause these issues.

The code above should yield the following results; this may take some time:

Output
SARIMAX(0, 0, 0)x(0, 0, 1, 12) - AIC:6787.3436240402125
SARIMAX(0, 0, 0)x(0, 1, 1, 12) - AIC:1596.711172764114
SARIMAX(0, 0, 0)x(1, 0, 0, 12) - AIC:1058.9388921320026
SARIMAX(0, 0, 0)x(1, 0, 1, 12) - AIC:1056.2878315690562
SARIMAX(0, 0, 0)x(1, 1, 0, 12) - AIC:1361.6578978064144
SARIMAX(0, 0, 0)x(1, 1, 1, 12) - AIC:1044.7647912940095
...
SARIMAX(1, 1, 1)x(1, 0, 0, 12) - AIC:576.8647112294245
SARIMAX(1, 1, 1)x(1, 0, 1, 12) - AIC:327.9049123596742
SARIMAX(1, 1, 1)x(1, 1, 0, 12) - AIC:444.12436865161305
SARIMAX(1, 1, 1)x(1, 1, 1, 12) - AIC:277.7801413828764

The output of our code suggests that SARIMAX(1, 1, 1)x(1, 1, 1, 12) yields the lowest AIC value of 277.78. We should therefore consider this to be the optimal option out of all the models we have considered.
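Rather than scanning the printed list by eye, one optional refinement (a minimal sketch, not part of the original code, reusing the y, pdq, and seasonal_pdq objects defined earlier) is to track the minimum AIC inside the same loop:

best_aic = float('inf')
best_params = None

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            res = mod.fit()
            # Keep the parameter combination with the lowest AIC seen so far
            if res.aic < best_aic:
                best_aic = res.aic
                best_params = (param, param_seasonal)
        except Exception:
            continue

print('Best model: SARIMAX{}x{} - AIC:{}'.format(best_params[0], best_params[1], best_aic))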

Step 5: Fitting an ARIMA Time Series Model

Using grid search, we have identified the set of parameters that produces the best fitting model to our time series data. We can proceed to analyze this particular model in more depth.

We'll start by plugging the optimal parameter values into a new SARIMAX model:

mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)

results = mod.fit()

print(results.summary().tables[1])
Output
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.3182      0.092      3.443      0.001       0.137       0.499
ma.L1         -0.6255      0.077     -8.165      0.000      -0.776      -0.475
ar.S.L12       0.0010      0.001      1.732      0.083      -0.000       0.002
ma.S.L12      -0.8769      0.026    -33.811      0.000      -0.928      -0.826
sigma2         0.0972      0.004     22.634      0.000       0.089       0.106
==============================================================================

The summary attribute that results from the output of SARIMAX returns a significant amount of information, but we'll focus our attention on the table of coefficients. The coef column shows the weight (i.e. importance) of each feature and how each one influences the time series. The P>|z| column informs us of the significance of each feature weight. Here, each weight has a p-value lower or close to 0.05, so it is reasonable to retain all of them in our model.

When fitting seasonal ARIMA models (and any other models for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics object allows us to quickly generate model diagnostics and investigate for any unusual behavior.

results.plot_diagnostics(figsize=(15, 12))
plt.show()

Figure 2: Model Diagnostics

Our primary concern is to ensure that the residuals of our model are uncorrelated and normally distributed with zero mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.

In this case, our model diagnostics suggest that the model residuals are normally distributed based on the following:

  • In the top right plot, we see that the red KDE line follows closely with the N(0,1) line (where N(0,1) is the standard notation for a normal distribution with mean 0 and standard deviation of 1). This is a good indication that the residuals are normally distributed.

  • The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution with N(0, 1). Again, this is a strong indication that the residuals are normally distributed.

  • The residuals over time (top left plot) don't display any obvious seasonality and appear to be white noise. This is confirmed by the autocorrelation (i.e. correlogram) plot on the bottom right, which shows that the time series residuals have low correlation with lagged versions of itself.

Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values.
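If you want a numerical complement to these visual checks, one option (an addition on our part, not used in the original tutorial) is the Ljung-Box test from statsmodels, which tests the residuals for autocorrelation; large p-values are consistent with white noise:

from statsmodels.stats.diagnostic import acorr_ljungbox

# Test the model residuals for autocorrelation up to lag 12;
# a large p-value is consistent with white-noise residuals
print(acorr_ljungbox(results.resid, lags=[12]))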

Although we have a satisfactory fit, some parameters of our seasonal ARIMA model could be changed to improve our model fit. For example, our grid search only considered a restricted set of parameter combinations, so we may find better models if we widened the grid search.

Step 6: Validating Forecasts

We have obtained a model for our time series that can now be used to produce forecasts. We start by comparing predicted values to real values of the time series, which will help us understand the accuracy of our forecasts. The get_prediction() and conf_int() attributes allow us to obtain the values and associated confidence intervals for forecasts of the time series.

pred = results.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=False)
pred_ci = pred.conf_int()

The code above requires the forecasts to start at January 1998.

The dynamic=False argument ensures that we produce one-step ahead forecasts, meaning that forecasts at each point are generated using the full history up to that point.

We can plot the real and forecasted values of the CO2 time series to assess how well we did. Notice how we zoomed in on the end of the time series by slicing the date index.

ax = y['1990':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7)

ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)

ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')
plt.legend()

plt.show()

Figure 3: CO2 Levels Static Forecast

Overall, our forecasts align with the true values very well, showing an overall increasing trend.

It is also useful to quantify the accuracy of our forecasts. We will use the MSE (Mean Squared Error), which summarizes the average error of our forecasts. For each predicted value, we compute its distance to the true value and square the result. The results need to be squared so that positive/negative differences do not cancel each other out when we compute the overall mean.
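In symbols, with $y_t$ the true value, $\hat{y}_t$ the forecast, and $n$ the number of forecasted points:

$$\mathrm{MSE} = \frac{1}{n} \sum_{t=1}^{n} \left(y_t - \hat{y}_t\right)^2$$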

y_forecasted = pred.predicted_mean
y_truth = y['1998-01-01':]

# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
Output
The Mean Squared Error of our forecasts is 0.07

The MSE of our one-step ahead forecasts yields a value of 0.07, which is very low as it is close to 0. An MSE of 0 would mean that the estimator is predicting observations of the parameter with perfect accuracy, which would be an ideal scenario but is not typically possible.
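Since the MSE is expressed in squared units, one optional follow-up (not in the original tutorial) is to take its square root, the RMSE, to read the error on the original CO2 scale:

# RMSE expresses the forecast error in the original units of the series
rmse = np.sqrt(mse)
print('The Root Mean Squared Error of our forecasts is {}'.format(round(rmse, 2)))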

However, a better representation of our true predictive power can be obtained using dynamic forecasts. In this case, we only use information from the time series up to a certain point, and after that, forecasts are generated using values from previous forecasted time points.

In the code chunk below, we specify to start computing the dynamic forecasts and confidence intervals from January 1998 onwards.

pred_dynamic = results.get_prediction(start=pd.to_datetime('1998-01-01'), dynamic=True, full_results=True)
pred_dynamic_ci = pred_dynamic.conf_int()

Plotting the observed and forecasted values of the time series, we see that the overall forecasts are accurate even when using dynamic forecasts. All forecasted values (red line) match pretty closely to the ground truth (blue line), and are well within the confidence intervals of our forecast.

ax = y['1990':].plot(label='observed', figsize=(20, 15))
pred_dynamic.predicted_mean.plot(label='Dynamic Forecast', ax=ax)

ax.fill_between(pred_dynamic_ci.index,
                pred_dynamic_ci.iloc[:, 0],
                pred_dynamic_ci.iloc[:, 1], color='k', alpha=.25)

ax.fill_betweenx(ax.get_ylim(), pd.to_datetime('1998-01-01'), y.index[-1],
                 alpha=.1, zorder=-1)

ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')

plt.legend()
plt.show()

Figure 4: CO2 Levels Dynamic Forecast

Once again, we quantify the predictive performance of our forecasts by computing the MSE:

# Extract the predicted and true values of our time series
y_forecasted = pred_dynamic.predicted_mean
y_truth = y['1998-01-01':]

# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
Output
The Mean Squared Error of our forecasts is 1.01

The predicted values obtained from the dynamic forecasts yield an MSE of 1.01. This is slightly higher than the one-step ahead, which is to be expected given that we are relying on less historical data from the time series.

Both the one-step ahead and dynamic forecasts confirm that this time series model is valid. However, much of the interest around time series forecasting is the ability to forecast future values way ahead in time.

Step 7: Producing and Visualizing Forecasts

In the final step of this tutorial, we describe how to leverage our seasonal ARIMA time series model to forecast future values. The get_forecast() attribute of our time series object can compute forecasted values for a specified number of steps ahead.

# Get forecast 500 steps ahead in future
pred_uc = results.get_forecast(steps=500)

# Get confidence intervals of forecasts
pred_ci = pred_uc.conf_int()

We can use the output of this code to plot the time series and forecasts of its future values.

ax = y.plot(label='observed', figsize=(20, 15))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('CO2 Levels')

plt.legend()
plt.show()

Figure 5: Time Series and Forecast of Future Values

Both the forecasts and associated confidence interval that we have generated can now be used to further understand the time series and foresee what to expect. Our forecasts show that the time series is expected to continue increasing at a steady pace.

As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.
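You can verify this widening numerically by comparing the interval width at the first and last forecasted steps; a minimal sketch, continuing from the pred_ci object computed above:

# Width of the confidence interval (upper bound minus lower bound)
ci_width = pred_ci.iloc[:, 1] - pred_ci.iloc[:, 0]
print('Interval width at the first forecast step: {:.2f}'.format(ci_width.iloc[0]))
print('Interval width at the last forecast step: {:.2f}'.format(ci_width.iloc[-1]))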

Conclusion

In this tutorial, we described how to implement a seasonal ARIMA model in Python. We made extensive use of the pandas and statsmodels libraries and showed how to run model diagnostics, as well as how to produce forecasts of the CO2 time series.

Here are a few other things you could try:

  • Change the start date of your dynamic forecasts to see how this affects the overall quality of your forecasts.
  • Try more combinations of parameters to see if you can improve the goodness-of-fit of your model.
  • Select a different metric to choose the best model. For example, we used the AIC measure to find the best model, but you could seek to optimize the out-of-sample mean squared error instead.

For more practice, you could also try to load another time series dataset to produce your own forecasts.

Reference: digitalocean