Linear Regression for Time Series Data
Evaluation Methods
ARX Models
Considerations for Different Horizons
Overfitting and Regularization
Prediction Intervals
Forecasting with Versioned Data
Modeling Multiple Time Series
Assume we observe a predictor \(x_i\) and an outcome \(y_i\) for \(i = 1, \dots, n\).
Linear regression seeks coefficients \(\beta_0\) and \(\beta_1\) such that
\[y_i \approx \beta_0 + \beta_1 x_i\]
is a good approximation for every \(i = 1, \dots, n\).
In R, this is done via `lm(y ~ x)`, where `y` is the vector of responses and `x` the vector of predictors.
With \(p\) predictors, linear regression analogously seeks coefficients \(\beta_0, \beta_1, \dots, \beta_p\) such that
\[y_i \approx \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\]
is a good approximation for every \(i = 1, \dots, n\).
In time series, outcomes and predictors are usually indexed by time \(t\).
Goal: predicting future \(y\), given present \(x\).
Model: linear regression with lagged predictor
\[\hat y_t = \hat \beta + \hat \beta_0 x_{t-k}\]
i.e. regress the outcome \(y\) at time \(t\) on the predictor \(x\) at time \(t-k\).
Equivalently, shifting the time index:
\[\hat y_{t+k} = \hat \beta + \hat \beta_0 x_t\]
During the pandemic, there was strong interest in predicting COVID deaths 7, 14, 21, and 28 days ahead.
Can we reasonably predict COVID deaths 28 days ahead by just using cases today?
If we let
\[y_{t+28} = \text{deaths at time } t+28 \quad\quad x_{t} = \text{cases at time } t\]
is the following a good model?
\[\hat y_{t+28} = \hat\beta + \hat\beta_0 x_{t}\]
Let’s focus on California.
Cases seem highly correlated with deaths several weeks later.
An `epi_df` object, 6 x 4 with metadata:
* geo_type = state
* time_type = day
* as_of = 2024-10-23 20:07:29.405142
# A tibble: 6 × 4
geo_value time_value cases deaths
* <chr> <date> <dbl> <dbl>
1 ca 2020-04-01 3.17 0.0734
2 ca 2020-04-02 3.48 0.0835
3 ca 2020-04-03 3.44 0.0894
4 ca 2020-04-04 3.05 0.0778
5 ca 2020-04-05 3.28 0.0876
6 ca 2020-04-06 3.37 0.0848
Let’s split the data into a training and a test set (before/after 2021-03-01).
On training set: large correlation between cases and deaths 28 days ahead (> 0.95).
\[\hat y_{t+28} = \hat\beta + \hat\beta_0 x_{t}\]
On the training set, `deaths` is approximately linear in `lagged_cases`.
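As a minimal sketch (assuming the `epi_df` `ca` shown above and the `dplyr` package; names match those used later), the lagged predictor and the train/test split can be built as:

library(dplyr)
k <- 28 # lag (in days) between cases and deaths
ca <- ca %>% mutate(lagged_cases = lag(cases, k)) # predictor x_{t-k}
train <- ca %>% filter(time_value < as.Date("2021-03-01"))
test <- ca %>% filter(time_value >= as.Date("2021-03-01"))
cor(train$lagged_cases, train$deaths, use = "complete.obs") # > 0.95
reg <- lm(deaths ~ lagged_cases, data = train) # fit on training data only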
Assume we have predictions \(\hat y_{new, t}\) for the unseen observations \(y_{new,t}\) over times \(t = 1, \dots, N\).
Four commonly used error metrics are:
mean squared error (MSE)
mean absolute error (MAE)
mean absolute percentage error (MAPE)
mean absolute scaled error (MASE)
\[MSE = \frac{1}{N} \sum_{t=1}^N (y_{new, t}- \hat y_{new, t})^2\] \[MAE = \frac{1}{N} \sum_{t=1}^N |y_{new, t}- \hat y_{new, t}|\]
MAE gives less importance to extreme errors than MSE.
Drawback: both metrics are scale-dependent, so they are not universally interpretable. (For example, if \(y\) captures height, MSE and MAE will vary depending on whether we measure in feet or meters.)
\[MAPE = 100 \times \frac{1}{N} \sum_{t=1}^N \left|\frac{y_{new, t}- \hat y_{new, t}}{y_{new, t}}\right|\]
Drawbacks:
Erratic behavior when \(y_{new, t}\) is close to zero
It assumes the unit of measurement has a meaningful zero (e.g. using Fahrenheit or Celsius to measure temperature will lead to different MAPE)
Note
There are situations when MAPE is problematic!
MAE MAPE
yhat1 2.873328 43.14008
yhat2 5.382247 36.08279
Here yhat1 has the smaller MAE but the larger MAPE: the two metrics rank the forecasters differently, because MAPE inflates errors made when the truth is small.
\[MASE = 100 \times \frac{\frac{1}{N} \sum_{t=1}^N |y_{new, t}- \hat y_{new, t}|} {\frac{1}{N-1} \sum_{t=2}^N |y_{new, t}- y_{new, t-1}|}\]
Advantages:
universally interpretable (not scale-dependent)
avoids the zero pitfall
MASE in words: we normalize the error of our forecasts by that of a naive method which always predicts the last observation.
MAE MAPE MASE
yhat1 2.873328 43.14008 66.10004
yhat2 5.382247 36.08279 123.81696
MSE <- function(truth, prediction) {
  mean((truth - prediction)^2)
}
MAE <- function(truth, prediction) {
  mean(abs(truth - prediction))
}
MAPE <- function(truth, prediction) {
  100 * mean(abs(truth - prediction) / truth)
}
MASE <- function(truth, prediction) {
  100 * MAE(truth, prediction) / mean(abs(diff(truth)))
}
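For example, on toy vectors (values are illustrative only):

truth <- c(1.2, 0.8, 1.5, 2.1)
pred <- c(1.0, 1.1, 1.4, 1.8)
MSE(truth, pred); MAE(truth, pred); MAPE(truth, pred); MASE(truth, pred)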
Given an error metric, we want to estimate the prediction error under that metric.
This can be accomplished in different ways, using the
Training error
Split-sample error
Time series cross-validation error (using all past data or a trailing window)
The easiest but worst approach to estimate the prediction error is to use the training error, i.e. the average error on the training set that was used to fit the model.
The training error is
generally too optimistic as an estimate of prediction error
more optimistic the more complex the model!
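As a sketch, the training MAE of the lagged-cases model is just the mean absolute residual of the fit (assuming the `reg` object from before):

mean(abs(residuals(reg)))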
MAE MASE
training 0.05985631 351.4848
To compute the split-sample error
Split data into training (up to time \(t_0\)), and test set (after \(t_0\))
Fit the model to the training data only
Make predictions for the test set
Compute the selected error metric on the test set only
Formally, the split-sample MSE is
\[\text{SplitMSE} = \frac{1}{n-t_0} \sum_{t = t_0 +1}^n (\hat y_t - y_t)^2\]
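A minimal sketch of this computation, assuming the `reg`, `train`, and `test` objects from before and the MAE function above:

pred_test <- predict(reg, newdata = test) # predictions on held-out data only
MAE(test$deaths, pred_test)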
MAE MASE
training 0.05985631 351.4848
split-sample 0.10005007 659.8971
If we refit in the future once new data are available, a more appropriate way to estimate the prediction error is time-series cross-validation.
To get 1-step ahead predictions (i.e. at time \(t\) we forecast for \(t+1\)) we proceed as follows, for \(t = t_0, t_0+1, \dots\)
Fit the model using data up to time \(t\)
Make a prediction for \(t+1\)
Record the prediction error
The cross-validation MSE is then
\[CVMSE = \frac{1}{n-t_0} \sum_{t = t_0}^{n-1} (\hat y_{t+1|t} - y_{t+1})^2\]
where \(\hat y_{t+1|t}\) indicates a prediction for \(y\) at time \(t+1\) that was made with data available up to time \(t\).
In general, if we want to make \(h\)-step ahead predictions (i.e. at time \(t\) we forecast for \(t+h\)), we proceed as follows for \(t = t_0, t_0+1, \dots\)
Fit the model using data up to time \(t\)
Make a prediction for \(t+h\)
Record the prediction error
The cross-validation MSE is then
\[CVMSE = \frac{1}{n-t_0} \sum_{t = t_0}^{n-h} (\hat y_{t+h|t} - y_{t+h})^2\]
where \(\hat y_{t+h|t}\) indicates a prediction for \(y\) at time \(t+h\) that was made with data available up to time \(t\).
Getting the predictions requires slightly more code:
n <- nrow(ca) #length of time series
h <- k #number of days ahead for which prediction is wanted
pred_all_past <- rep(NA, length = n-h-t0+1) #initialize vector of predictions
for (t in t0:(n-h)) {
# fit to all past data and make an h-step ahead prediction
reg_all_past = lm(deaths ~ lagged_cases, data = ca, subset = (1:n) <= t)
pred_all_past[t-t0+1] = predict(reg_all_past, newdata = data.frame(ca[t+h, ]))
}
Note
With the current model, we can only predict \(k\) days ahead (where \(k\) = number of days by which predictor is lagged)!
MAE MASE
training 0.05985631 351.4848
split-sample 0.10005007 659.8971
time series CV 0.08973951 732.5769
So far, to get \(h\)-step ahead predictions for time \(t+h\), we have fitted the model on all data available up to time \(t\). We can instead use a trailing window, i.e. fit the model on a window of data of length \(w\), starting at time \(t-w\) and ending at \(t\).
Advantage: if the predictors-outcome relation changes over time, training the forecaster on a window of recent data can better capture the current relation, which is likely more relevant for predicting the outcome in the near future.
Window length \(w\) considerations:
if \(w\) is too big, the model can’t adapt to the recent predictors-outcome relation
if \(w\) is too small, the fitted model may be too volatile (trained on too little data)
# Getting the predictions through CV with trailing window
w <- 200 #trailing window size
h <- k #number of days ahead for which prediction is wanted
pred_trailing <- rep(NA, length = n-h-t0+1) #initialize vector of predictions
for (t in t0:(n-h)) {
# fit to a trailing window of size w and make an h-step ahead prediction
reg_trailing = lm(deaths ~ lagged_cases, data = ca,
subset = (1:n) <= t & (1:n) > (t-w))
pred_trailing[t-t0+1] = predict(reg_trailing, newdata = data.frame(ca[t+h, ]))
}
MAE MASE
training 0.05985631 351.4848
split-sample 0.10005007 659.8971
time series CV 0.08973951 732.5769
time series CV + trailing 0.10270064 838.3834
\[\hat y_{t+h} = \hat \phi + \hat\phi_0 y_{t} + \hat\phi_1 y_{t-1} + \dots + \hat\phi_p y_{t-p}\]
Notice: we don’t need to include all contiguous lags.
For example, we could fit
\[\hat y_{t+h} = \hat \phi + \hat\phi_0 y_{t} + \hat\phi_1 y_{t-7} + \hat\phi_2 y_{t-14}\]
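A hedged sketch of fitting this model (assuming `ca` and `dplyr` from above; column names are illustrative):

h <- 28 # forecast horizon
ca_ar <- ca %>%
  mutate(deaths_lag0 = lag(deaths, h),       # y_t when the target is y_{t+h}
         deaths_lag7 = lag(deaths, h + 7),   # y_{t-7}
         deaths_lag14 = lag(deaths, h + 14)) # y_{t-14}
lm(deaths ~ deaths_lag0 + deaths_lag7 + deaths_lag14, data = ca_ar)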
Let’s disregard cases, and only use COVID deaths to predict deaths 28 days ahead.
We will fit the model:
\[\hat y_{t+28} = \hat\phi + \hat\phi_0 y_{t}\]
(Intercept) lagged_deaths
0.1023887 0.9523213
Note
The intercept is close to 0, and the coefficient is close to 1. This means that we are naively predicting the number of deaths in 28 days with (approximately) the number of deaths observed today.
MAE MASE
training 0.1469711 902.8794
split-sample 0.1899489 1252.8397
time series CV 0.1242872 1014.6024
time series CV + trailing 0.1646065 1343.7439
Idea: predicting the outcome via a linear combination of its lags and a set of exogenous (i.e. external) input variables
Example:
\[\hat y_{t+h} = \hat\phi + \sum_{i=0}^p \hat\phi_i y_{t-i} + \sum_{j=0}^q \hat\beta_j x_{t-j}\]
To improve our predictions for COVID deaths 28 days ahead, we could merge the two models considered so far.
This leads us to the ARX model
\[\hat y_{t+28} = \hat\phi + \hat\phi_0 y_{t} + \hat\beta_0 x_{t}\]
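A minimal sketch of this fit, assuming `lagged_deaths` and `lagged_cases` columns built as before (both lagged by 28 days):

arx <- lm(deaths ~ lagged_deaths + lagged_cases, data = train)
coef(arx)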
MAE MASE
training 0.06268428 368.0910
split-sample 0.08227714 542.6727
time series CV 0.06739289 550.1531
time series CV + trailing 0.04757330 388.3585
So far we focused on COVID deaths prediction 28 days ahead.
The ARX model we fitted had much better performance than the AR model.
We will next compare the AR model
\[\hat y_{t+h} = \hat\phi + \hat\phi_0 y_t\]
to the ARX model
\[\hat y_{t+h} = \hat\phi + \hat\phi_0 y_t + \hat\beta_0 x_t\]
for horizons \(h = 7, 14, 21, 28\) (using a trailing window of size 200).
Horizon h = 7:
MAE MASE
AR 0.12989192 934.5696
ARX 0.09713393 698.8765
Horizon h = 14:
MAE MASE
AR 0.13037989 956.0440
ARX 0.06348795 465.5417
Horizon h = 21:
MAE MASE
AR 0.1702452 1312.0747
ARX 0.0488983 376.8577
Horizon h = 28:
MAE MASE
AR 0.1646065 1343.7439
ARX 0.0475733 388.3585
For each horizon \(h\), the ARX model has smaller error than the AR model.
The benefit of including cases as a predictor increases with \(h\).
The error of the AR model tends to increase with \(h\), while the error of the ARX model tends to decrease with \(h\).
What if we try to incorporate past information extensively by fitting a model with a very large number of predictors?
The estimated coefficients will be chosen to mimic the observed data very closely on the training set, leading to small training error
The predictive performance on the test set might be very poor, producing large split-sample and CV error
Issue
Overfitting!
\[\hat y_{t+28} = \hat\phi + \hat\phi_0 y_{t} + \hat\phi_1 y_{t-1} + \dots + \hat\phi_{59} y_{t-59} + \hat\beta_0 x_{t} + \dots + \hat\beta_{59} x_{t-59}\]
y <- ca$deaths #outcome
lags <- 28:87 #lags used for predictors (deaths and cases)
# Build predictor matrix with 120 columns (60 death lags + 60 case lags)
X <- data.frame(matrix(NA, nrow = length(y), ncol = 2*length(lags)))
colnames(X) <- paste('X', 1:ncol(X), sep = '')
for (j in 1:length(lags)) {
# first 60 columns contain deaths lagged by 28, 29, ..., 87
X[, j] = dplyr::lag(ca$deaths, lags[j])
# last 60 columns contain cases lagged by 28, 29, ..., 87
X[, length(lags) + j] = dplyr::lag(ca$cases, lags[j])
}
# Train/test split
y_train <- y[1:t0]
X_train <- X[1:t0, ]
y_test <- y[(t0+1):length(y)]
X_test <- X[(t0+1):length(y), ]
# Fitting the ARX model
reg = lm(y_train ~ ., data = X_train)
coef(reg)
(Intercept) X1 X2 X3 X4
0.2758283711 0.2457314327 0.2225988768 -0.4559613879 -0.4731427507
X5 X6 X7 X8 X9
0.4570192882 -0.0829030158 0.0087610368 -0.1277561418 -0.0930270533
X10 X11 X12 X13 X14
0.2880726732 -0.8396082213 0.5525681932 -0.6656058745 -0.1372663639
X15 X16 X17 X18 X19
0.2441111445 -0.5549887598 -0.1984910609 -0.4381802967 0.9748005886
X20 X21 X22 X23 X24
-0.2252872009 -0.3603010052 -0.1484425645 -0.2629114744 0.6481906782
X25 X26 X27 X28 X29
-0.4386397760 0.2287881141 -0.4141878941 0.1538387231 0.0016914281
X30 X31 X32 X33 X34
-0.1675549452 0.1862035471 -0.7604163647 0.7661301939 -0.3567175136
X35 X36 X37 X38 X39
-0.6461899753 0.2820547532 0.3658805915 -0.1036095025 0.3168098103
X40 X41 X42 X43 X44
0.4270049404 -0.3861653324 0.1887767107 0.2446997478 0.0406226039
X45 X46 X47 X48 X49
0.3290951846 -0.4657697117 0.1271299593 0.3931515677 0.2836116580
X50 X51 X52 X53 X54
-0.3674019667 -0.4491442210 0.4524243063 -0.3428030503 0.4292019510
X55 X56 X57 X58 X59
0.4096633980 -0.5462527325 0.4501163940 -0.3929646040 -0.2112561496
X60 X61 X62 X63 X64
-0.3326657892 0.0175419874 -0.0077855053 -0.0024976868 0.0014263572
X65 X66 X67 X68 X69
-0.0048785301 0.0013524131 -0.0022148536 0.0065944994 0.0023437318
X70 X71 X72 X73 X74
-0.0040292424 0.0136591034 -0.0086165459 -0.0059687437 -0.0069756592
X75 X76 X77 X78 X79
0.0145125287 -0.0021987656 -0.0018740932 0.0043662687 0.0019855230
X80 X81 X82 X83 X84
0.0005040760 -0.0021147396 0.0030569084 -0.0043404977 -0.0081888972
X85 X86 X87 X88 X89
0.0124142487 -0.0019007422 -0.0091802229 0.0090002190 0.0096226370
X90 X91 X92 X93 X94
-0.0033216356 -0.0040088540 0.0006876339 -0.0152565542 -0.0112099615
X95 X96 X97 X98 X99
0.0289012311 0.0084111577 -0.0134474728 0.0083765726 -0.0089290862
X100 X101 X102 X103 X104
0.0069254713 -0.0178761659 0.0191826013 0.0011505708 0.0071752544
X105 X106 X107 X108 X109
0.0127445467 -0.0133315137 -0.0022960536 -0.0031951107 0.0102816246
X110 X111 X112 X113 X114
-0.0028098154 0.0092476436 -0.0120237253 -0.0079725070 0.0194764427
X115 X116 X117 X118 X119
-0.0170893902 0.0232396558 -0.0175601245 0.0002431107 0.0129597132
X120
-0.0196106522
MAE MASE
training 0.02888549 146.9367
split-sample 0.35664868 2352.3364
If we want to consider a large number of predictors, how can we avoid overfitting?
Idea: introduce a regularization parameter \(\lambda\) that shrinks or sets some of the estimated coefficients to zero, i.e. some predictors are estimated to have limited or no predictive power
Most common regularization methods:
Ridge: shrinks coefficients toward zero (but not exactly to zero)
Lasso: shrinks coefficients and sets some of them exactly to zero
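Concretely, both methods minimize the least-squares criterion plus a penalty; with tuning parameter \(\lambda \ge 0\),
\[\hat\beta^{\text{ridge}} = \arg\min_{\beta} \sum_{t} \Big(y_t - \beta_0 - \sum_{j} \beta_j x_{tj}\Big)^2 + \lambda \sum_{j} \beta_j^2\]
\[\hat\beta^{\text{lasso}} = \arg\min_{\beta} \sum_{t} \Big(y_t - \beta_0 - \sum_{j} \beta_j x_{tj}\Big)^2 + \lambda \sum_{j} |\beta_j|\]
Larger \(\lambda\) yields stronger shrinkage; \(\lambda = 0\) recovers ordinary least squares.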
The regularization parameter \(\lambda\) can be selected by cross-validation:
Select a sequence of \(\lambda\)’s
Fit and predict for each such \(\lambda\)
Select the \(\lambda\) that leads to the smallest error
The R library `glmnet` implements ridge and lasso regression, and can perform step 1 automatically.
library(glmnet) # Implements ridge and lasso
# We'll need to omit NA values explicitly, as otherwise glmnet will complain
na_obs <- 1:max(lags)
X_train <- X_train[-na_obs, ]
y_train <- y_train[-na_obs]
# Ridge regression: set alpha = 0, lambda sequence will be chosen automatically
ridge <- glmnet(X_train, y_train, alpha = 0)
beta_ridge <- coef(ridge) # matrix of estimated coefficients
lambda_ridge <- ridge$lambda # sequence of lambdas used to fit ridge
# Lasso regression: set alpha = 1, lambda sequence will be chosen automatically
lasso <- glmnet(X_train, y_train, alpha = 1)
beta_lasso <- coef(lasso) # matrix of estimated coefficients
lambda_lasso <- lasso$lambda # sequence of lambdas used to fit lasso
dim(beta_lasso) # One row per coefficient, one column per lambda value
[1] 121 100
# Predict values for second half of the time series
yhat_ridge <- predict(ridge, newx = as.matrix(X_test))
yhat_lasso <- predict(lasso, newx = as.matrix(X_test))
# Compute MAE
mae_ridge <- colMeans(abs(yhat_ridge - y_test))
mae_lasso <- colMeans(abs(yhat_lasso - y_test))
# Select index of lambda vector which gives lowest MAE
min_ridge <- which.min(mae_ridge)
min_lasso <- which.min(mae_lasso)
paste('Best MAE ridge:', round(min(mae_ridge), 3),
'; Best MAE lasso:', round(min(mae_lasso), 3))
[1] "Best MAE ridge: 0.196 ; Best MAE lasso: 0.092"
ridge lasso
(Intercept) 4.112256e-01 0.1027530670
X1 5.497463e-03 0.0000000000
X2 5.443749e-03 0.0000000000
X3 5.385490e-03 0.0000000000
X4 5.313227e-03 0.0000000000
X5 5.258979e-03 0.0000000000
X6 5.188803e-03 0.0000000000
X7 5.139054e-03 0.0000000000
X8 5.076501e-03 0.0000000000
X9 4.981253e-03 0.0000000000
X10 4.886178e-03 0.0000000000
X11 4.809706e-03 0.0000000000
X12 4.729025e-03 0.0000000000
X13 4.667905e-03 0.0000000000
X14 4.583550e-03 0.0000000000
X15 4.499529e-03 0.0000000000
X16 4.448547e-03 0.0000000000
X17 4.397032e-03 0.0000000000
X18 4.368272e-03 0.0000000000
X19 4.316661e-03 0.0000000000
X20 4.249886e-03 0.0000000000
X21 4.183664e-03 0.0000000000
X22 4.127680e-03 0.0000000000
X23 4.050602e-03 0.0000000000
X24 3.988909e-03 0.0000000000
X25 3.822127e-03 0.0000000000
X26 3.584918e-03 0.0000000000
X27 3.283302e-03 0.0000000000
X28 2.888911e-03 0.0000000000
X29 2.415738e-03 0.0000000000
X30 1.824392e-03 0.0000000000
X31 1.152044e-03 0.0000000000
X32 3.881764e-04 0.0000000000
X33 -4.546563e-04 0.0000000000
X34 -1.293969e-03 0.0000000000
X35 -2.117525e-03 0.0000000000
X36 -2.801391e-03 0.0000000000
X37 -3.394630e-03 0.0000000000
X38 -4.006729e-03 0.0000000000
X39 -4.662078e-03 0.0000000000
X40 -5.354854e-03 0.0000000000
X41 -6.265590e-03 0.0000000000
X42 -7.230297e-03 0.0000000000
X43 -8.508468e-03 0.0000000000
X44 -9.918750e-03 0.0000000000
X45 -1.149602e-02 0.0000000000
X46 -1.319458e-02 0.0000000000
X47 -1.481247e-02 0.0000000000
X48 -1.630315e-02 0.0000000000
X49 -1.753316e-02 0.0000000000
X50 -1.813929e-02 0.0000000000
X51 -1.857362e-02 0.0000000000
X52 -1.893093e-02 0.0000000000
X53 -1.933053e-02 0.0000000000
X54 -1.957672e-02 0.0000000000
X55 -1.975165e-02 0.0000000000
X56 -2.012127e-02 0.0000000000
X57 -2.030953e-02 0.0000000000
X58 -2.037081e-02 0.0000000000
X59 -2.021378e-02 0.0000000000
X60 -1.976317e-02 -0.0345707637
X61 8.242683e-05 0.0095304559
X62 8.132113e-05 0.0000000000
X63 8.018020e-05 0.0000000000
X64 7.901220e-05 0.0000000000
X65 7.779561e-05 0.0000000000
X66 7.664052e-05 0.0000000000
X67 7.551905e-05 0.0000000000
X68 7.444242e-05 0.0000000000
X69 7.340449e-05 0.0000000000
X70 7.236918e-05 0.0000000000
X71 7.136972e-05 0.0000000000
X72 7.037117e-05 0.0000000000
X73 6.943229e-05 0.0000000000
X74 6.860455e-05 0.0000000000
X75 6.787924e-05 0.0000000000
X76 6.722652e-05 0.0015360023
X77 6.661688e-05 0.0000000000
X78 6.601493e-05 0.0000000000
X79 6.545565e-05 0.0000000000
X80 6.496923e-05 0.0000000000
X81 6.451692e-05 0.0000000000
X82 6.407003e-05 0.0000000000
X83 6.345530e-05 0.0000000000
X84 6.276881e-05 0.0000000000
X85 6.192521e-05 0.0000000000
X86 6.101792e-05 0.0000000000
X87 6.003796e-05 0.0000000000
X88 5.900856e-05 0.0000000000
X89 5.766256e-05 0.0000000000
X90 5.640154e-05 0.0000000000
X91 5.521221e-05 0.0000000000
X92 5.410753e-05 0.0000000000
X93 5.310748e-05 0.0000000000
X94 5.191454e-05 0.0000000000
X95 5.078166e-05 0.0000000000
X96 4.995013e-05 0.0000000000
X97 4.934901e-05 0.0000000000
X98 4.892197e-05 0.0000000000
X99 4.863106e-05 0.0000000000
X100 4.829392e-05 0.0000000000
X101 4.816553e-05 0.0000000000
X102 4.810093e-05 0.0000000000
X103 4.841180e-05 0.0000000000
X104 4.912355e-05 0.0005895771
X105 4.985320e-05 0.0000000000
X106 5.066950e-05 0.0000000000
X107 5.150530e-05 0.0000000000
X108 5.258211e-05 0.0000000000
X109 5.387059e-05 0.0000000000
X110 5.451414e-05 0.0000000000
X111 5.443739e-05 0.0000000000
X112 5.410652e-05 0.0000000000
X113 5.355103e-05 0.0000000000
X114 5.275985e-05 0.0000000000
X115 5.111364e-05 0.0000000000
X116 4.828218e-05 0.0000000000
X117 4.455698e-05 0.0000000000
X118 4.029228e-05 0.0000000000
X119 3.539927e-05 0.0000000000
X120 3.020232e-05 0.0000000000
MAE MASE
ridge training 0.25271764 1285.5412
ridge split-sample 0.19616116 1293.8139
lasso training 0.06984086 355.2712
lasso split-sample 0.09163079 604.3663
h <- 28 # number of days ahead
w <- 200 # window length
# Initialize matrices for predictions (one column per lambda value)
yhat_ridge <- matrix(NA, ncol = length(lambda_ridge), nrow = n-h-t0+1)
yhat_lasso <- matrix(NA, ncol = length(lambda_lasso), nrow = n-h-t0+1)
for (t in t0:(n-h)) {
# Indices of data within window
inds = t-w < 1:n & 1:n <= t
# Fit ARX + ridge/lasso
ridge_trail = glmnet(X[inds, ], y[inds], alpha = 0, lambda = lambda_ridge)
lasso_trail = glmnet(X[inds, ], y[inds], alpha = 1, lambda = lambda_lasso)
# Predict
yhat_ridge[t-t0+1, ] = predict(ridge_trail, newx = as.matrix(X[(t+h), ]))
yhat_lasso[t-t0+1, ] = predict(lasso_trail, newx = as.matrix(X[(t+h), ]))
}
# MAE values for each lambda
mae_ridge <- colMeans(abs(yhat_ridge - y_test[-c(1:(k-1))]))
mae_lasso <- colMeans(abs(yhat_lasso - y_test[-c(1:(k-1))]))
# Select lambda that minimizes MAE and save corresponding predictions
min_ridge <- which.min(mae_ridge)
min_lasso <- which.min(mae_lasso)
pred_cv_ridge <- yhat_ridge[, min_ridge]
pred_cv_lasso <- yhat_lasso[, min_lasso]
paste('Best MAE ridge:', round(min(mae_ridge), 3))
[1] "Best MAE ridge: 0.12"
paste('Best MAE lasso:', round(min(mae_lasso), 3))
[1] "Best MAE lasso: 0.062"
MAE MASE
ridge CV + trailing 0.11964733 976.7255
lasso CV + trailing 0.06243937 509.7157
Important
What if we want to provide a measure of uncertainty around the point prediction or a likely range of values for the outcome at time \(t+h\)?
Prediction intervals for `lm` fits
To get prediction intervals for the models we previously fitted, we only need to tweak our call to `predict` by adding as an input:
interval = "prediction", level = p
where \(p \in (0, 1)\) is the desired coverage.
The output from `predict` will then be a matrix with:
first column: the point estimate
second column: the lower limit of the interval
third column: the upper limit of the interval
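For instance, a sketch using the earlier ARX fit (the `arx` and `test` names from above), producing a matrix of the form shown below:

head(predict(arx, newdata = test, interval = "prediction", level = 0.95))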
fit lwr upr
1 0.7093207 0.5286159 0.8900255
2 0.6908382 0.5104118 0.8712647
3 0.6768019 0.4955346 0.8580692
4 0.6527938 0.4714708 0.8341169
5 0.6276180 0.4468436 0.8083924
6 0.6062234 0.4255984 0.7868483
So far we only considered different ways to apply linear regression.
Quantile regression is a different estimation method, and it directly targets conditional quantiles of the outcome over time.
Definition
Conditional quantile = value below which a given percentage (e.g. 25%, 50%, 75%) of observations fall, given specific values of the predictor variables.
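Formally, for quantile level \(\tau \in (0,1)\), quantile regression estimates coefficients by minimizing the pinball (tilted absolute) loss,
\[\hat\beta(\tau) = \arg\min_{\beta} \sum_{t} \rho_\tau\Big(y_t - \beta_0 - \sum_{j} \beta_j x_{tj}\Big), \qquad \rho_\tau(u) = u\,\big(\tau - \mathbb{1}\{u < 0\}\big)\]
Setting \(\tau = 0.5\) recovers median regression (least absolute deviations).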
#install.packages("quantreg")
library(quantreg) #library to perform quantile regression
# Set quantiles of interest: we will focus on 2.5%, 50% (i.e. median), and 97.5% quantiles
quantiles <- c(0.025, 0.5, 0.975)
# Fit quantile regression on training set
q_reg <- rq(deaths ~ lagged_deaths + lagged_cases, data = train, tau = quantiles)
# Estimated coefficients
coef(q_reg)
tau= 0.025 tau= 0.500 tau= 0.975
(Intercept) 0.021903320 0.10599167 0.12247490
lagged_deaths 0.037881788 -0.08804903 0.26866227
lagged_cases 0.009539598 0.01167695 0.01301036
# Initialize matrix to store predictions
# 3 columns: lower limit, median, and upper limit
pred_trailing <- matrix(NA, nrow = n-h-t0+1, ncol = 3)
colnames(pred_trailing) <- c('lower', 'median', 'upper')
for (t in t0:(n-h)) {
# Fit quantile regression
rq_trailing = rq(deaths ~ lagged_deaths + lagged_cases, tau = quantiles,
data = ca, subset = (1:n) <= t & (1:n) > (t-w))
# Predict
pred_trailing[t-t0+1, ] = predict(rq_trailing, newdata = data.frame(ca[t+h, ]))
}
We would expect the ARX model, whether fitted via `lm` or via `rq`, to cover the truth about 95% of the time. Is this actually true in practice?
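Empirical coverage can be computed with a one-line sketch (names are illustrative; `lwr` and `upr` follow the columns of the `predict` output above):

coverage <- function(truth, lwr, upr) mean(truth >= lwr & truth <= upr)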
The actual coverage of each predictive interval is
lm.trailing rq.trailing
Coverage 0.9749104 0.8566308
The coverage of `lm` is close to 95%, while `rq` has lower coverage.
Prediction intervals are “good” if they
cover the truth most of the time
are not too wide
Error metric that captures both desiderata: Weighted Interval Score (WIS)
Let \(F\) be a forecast composed of predicted quantiles \(q_{\tau}\) for a set of quantile levels \(\tau\). The WIS for target variable \(Y\) is (McDonald et al., 2021):
\[WIS(F, Y) = 2\sum_{\tau} \phi_{\tau} (Y - q_{\tau})\]
where \(\phi_{\tau}(x) = \tau |x|\) for \(x \geq 0\) and \(\phi_{\tau}(x) = (1-\tau) |x|\) for \(x < 0\).
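A direct implementation of this formula as a sketch (names are illustrative: `y` is the realized value, `q` the predicted quantiles at levels `tau`):

wis <- function(y, q, tau) {
  err <- y - q
  2 * sum(ifelse(err >= 0, tau * abs(err), (1 - tau) * abs(err)))
}
wis(1.0, q = c(0.6, 0.9, 1.5), tau = c(0.025, 0.5, 0.975))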
Note
WIS tends to prioritize sharpness (how wide the interval is) relative to coverage (if the interval contains the truth).
Comparing `lm` and `rq`: the lowest mean WIS is attained by quantile regression.
Notice: quantile regression has coverage below 95% but is still preferred under WIS because its intervals are narrower than those from linear regression.
Mean WIS lm Mean WIS rq
1 0.06534014 0.05733309
So far we implicitly trained on a single snapshot of the data (the version `as_of` today).
Important
How can we train forecasters when dealing with versioned data?
→ An `epi_archive` object, with metadata:
ℹ Min/max time values: 2020-04-01 / 2021-12-31
ℹ First/last version with update: 2020-04-02 / 2022-01-01
ℹ Versions end: 2022-01-01
ℹ A preview of the table (93566 rows x 5 columns):
geo_value time_value version case_rate death_rate
1: ak 2020-04-01 2020-04-02 1.797489 0
2: ak 2020-04-01 2020-05-07 1.777061 0
3: ak 2020-04-01 2020-10-28 1.106147 0
4: ak 2020-04-01 2020-10-29 1.797489 0
5: ak 2020-04-01 2020-10-30 1.797489 0
---
93562: wy 2021-12-27 2021-12-28 65.598769 0
93563: wy 2021-12-28 2021-12-29 50.315286 0
93564: wy 2021-12-29 2021-12-30 55.810471 0
93565: wy 2021-12-30 2021-12-31 68.002912 0
93566: wy 2021-12-31 2022-01-01 0.000000 0
# initialize dataframe for predictions
# 5 columns: forecast date, target date, 2.5%, 50%, and 97.5% quantiles
pred_trailing <- data.frame(matrix(NA, ncol = 5, nrow = 0))
colnames(pred_trailing) <- c("forecast_date", "target_date", 'tau..0.025', 'tau..0.500', 'tau..0.975')
w <- 200 #trailing window size
h <- 28 #number of days ahead
# dates when predictions are made (set to be 1 month apart)
fc_time_values <- seq(from = t0_date, to = as.Date("2021-12-31"), by = "1 month")
for (fc_date in fc_time_values) {
# get data version as_of forecast date
data <- epix_as_of(ca_archive, max_version = as.Date(fc_date))
# create lagged predictors
data$lagged_deaths <- dplyr::lag(data$deaths, h)
data$lagged_cases <- dplyr::lag(data$cases, h)
# perform quantile regression
rq_trailing <- rq(deaths ~ lagged_deaths + lagged_cases, tau = quantiles,
# only consider window of data
data = data %>% filter(time_value > (max(time_value) - w)))
# construct data.frame with the right predictors for the target date
predictors <- data.frame(lagged_deaths = tail(data$deaths, 1),
lagged_cases = tail(data$cases, 1))
# make predictions for target date and add them to matrix of predictions
pred_trailing <- rbind(pred_trailing,
data.frame('forecast_date' = max(data$time_value),
'target_date' = max(data$time_value) + h,
predict(rq_trailing, newdata = predictors)))
}
Assume we observe data over time from multiple locations (e.g. states or counties).
We could
Estimate coefficients separately for each location (as we have done so far).
Fit one model using all locations together at each time point (geo-pooling). Estimated coefficients will not be location specific.
Estimate coefficients separately for each location, but include predictors capturing averages across locations (partial geo-pooling).
usa_archive <- data_archive$DT %>%
as_epi_archive()
# initialize dataframe for predictions
# 6 columns: geo value, forecast date, target date, 2.5%, 50%, and 97.5% quantiles
pred_trailing <- data.frame(matrix(NA, ncol = 6, nrow = 0))
colnames(pred_trailing) <- c('geo_value', 'forecast_date', 'target_date',
'tau..0.025', 'tau..0.500', 'tau..0.975')
w <- 200 #trailing window size
h <- 28 #number of days ahead
for (fc_date in fc_time_values) {
# get data version as_of forecast date
data <- epix_as_of(usa_archive, max_version = as.Date(fc_date))
# create lagged predictors for each state
data <- data %>%
arrange(geo_value, time_value) %>%
group_by(geo_value) %>%
mutate(lagged_deaths = dplyr::lag(deaths, h),
lagged_cases = dplyr::lag(cases, h)) %>%
ungroup()
# perform quantile regression
rq_trailing <- rq(deaths ~ lagged_deaths + lagged_cases, tau = quantiles,
# only consider window of data
data = data %>% filter(time_value > (max(time_value) - w)))
# construct dataframe with the right predictors for the target date
new_lagged_deaths <- data %>%
filter(time_value == max(time_value)) %>%
select(geo_value, deaths)
new_lagged_cases <- data %>%
filter(time_value == max(time_value)) %>%
select(geo_value, cases)
predictors <- new_lagged_deaths %>%
inner_join(new_lagged_cases, join_by(geo_value)) %>%
rename(lagged_deaths = deaths,
lagged_cases = cases)
# make predictions for target date and add them to matrix of predictions
pred_trailing <- rbind(pred_trailing,
data.frame(
'geo_value' = predictors$geo_value,
'forecast_date' = max(data$time_value),
'target_date' = max(data$time_value) + h,
predict(rq_trailing, newdata = predictors)))
}
# geo-pooled predictions for California
pred_ca <- pred_trailing %>%
filter(geo_value == 'ca') %>%
rename(median = `tau..0.500`,
lower = `tau..0.025`,
upper = `tau..0.975`) %>%
full_join(ca %>% select(time_value, deaths), join_by(target_date == time_value)) %>%
arrange(target_date)
# initialize dataframe for predictions
# 6 columns: geo value, forecast date, target date, 2.5%, 50%, and 97.5% quantiles
pred_trailing <- data.frame(matrix(NA, ncol = 6, nrow = 0))
colnames(pred_trailing) <- c('geo_value', 'forecast_date', 'target_date',
'tau..0.025', 'tau..0.500', 'tau..0.975')
w <- 200 #trailing window size
h <- 28 #number of days ahead
for (fc_date in fc_time_values) {
# get data version as_of forecast date
data <- epix_as_of(usa_archive, max_version = as.Date(fc_date))
# create lagged predictors
data <- data %>%
arrange(geo_value, time_value) %>%
group_by(geo_value) %>%
mutate(lagged_deaths = dplyr::lag(deaths, h),
lagged_cases = dplyr::lag(cases, h)) %>%
ungroup() %>%
group_by(time_value) %>%
mutate(avg_lagged_deaths = mean(lagged_deaths, na.rm = T),
avg_lagged_cases = mean(lagged_cases, na.rm = T)) %>%
ungroup()
# perform quantile regression on California only, within the trailing window
rq_trailing <- rq(deaths ~ lagged_deaths + lagged_cases + avg_lagged_deaths +
avg_lagged_cases, tau = quantiles,
data = data %>% filter(geo_value == 'ca',
time_value > (max(time_value) - w)))
# construct data.frame with the right predictors for the target date
new_lagged_deaths <- data %>%
filter(time_value == max(time_value)) %>%
select(geo_value, deaths) %>%
mutate(avg_lagged_deaths = mean(deaths, na.rm = T)) %>%
filter(geo_value == 'ca')
new_lagged_cases <- data %>%
filter(time_value == max(time_value)) %>%
select(geo_value, cases) %>%
mutate(avg_lagged_cases = mean(cases, na.rm = T)) %>%
filter(geo_value == 'ca')
predictors <- new_lagged_deaths %>%
inner_join(new_lagged_cases, join_by(geo_value)) %>%
rename(lagged_deaths = deaths,
lagged_cases = cases)
# make predictions for target date and add them to matrix of predictions
pred_trailing <- rbind(pred_trailing,
data.frame(
'geo_value' = predictors$geo_value,
'forecast_date' = max(data$time_value),
'target_date' = max(data$time_value) + h,
predict(rq_trailing, newdata = predictors)))
}
# partially geo-pooled predictions for California
pred_ca <- pred_trailing %>%
filter(geo_value == 'ca') %>%
rename(median = `tau..0.500`,
lower = `tau..0.025`,
upper = `tau..0.975`) %>%
full_join(ca %>% select(time_value, deaths), join_by(target_date == time_value)) %>%
arrange(target_date)
Forecasting and Time-Series Models — cmu-delphi/insightnet-workshop-2024