8  Overview

At a high level, our goal with {epipredict} is to make it easy to run simple machine learning and statistical forecasters for epidemiology. At the same time, the package is highly extensible, and that is a large part of its utility. Our hope is that users with epidemiology training and some statistics background find it easy to fit baseline models, while those with more nuanced statistical understanding can create complicated specializations within the same framework.

Serving both populations is the main motivation for our efforts, but at the same time, we have tried hard to make the package broadly useful.

8.1 Baseline models

We provide a set of basic, easy-to-use forecasters that work out of the box. You should be able to do a limited amount of customization on them; any serious customization happens within the framework discussed below.

For the basic forecasters, we provide:

  • Flatline (basic) forecaster
  • Autoregressive forecaster
  • Autoregressive classifier
  • Smooth AR forecaster

All the forecasters we provide are built on our framework, so we will use these basic models to illustrate its flexibility.

8.2 Forecasting framework

At its core, {epipredict} is a framework for creating custom forecasters, by which we mean that we view the process of creating a custom forecaster as assembling a collection of modular components. Each component should be easy to swap out for another, so that a wide variety of forecasters becomes available through fairly simple modifications, such as adding steps or layers. There are four types of components (a sketch of how they fit together follows the list):

  1. Preprocessor: make transformations to the data before model training
  2. Trainer: train a model on data, resulting in a fitted model object
  3. Predictor: make predictions, using a fitted model object and processed test data
  4. Postprocessor: manipulate or transform the predictions before returning
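
To make this concrete, here is a minimal sketch of how the four components map onto {epipredict} objects, using the jhu data constructed in Section 8.4 below. The particular steps and layers are illustrative choices, not the only options, and all of this machinery is explained in later chapters.

library(epipredict)

# 1. Preprocessor: an epi_recipe of data transformations
rec <- epi_recipe(jhu) %>%
  step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_epi_naomit()

# 4. Postprocessor: "frosting" layers that transform the predictions
f <- frosting() %>%
  layer_predict() %>%
  layer_residual_quantiles() %>%
  layer_threshold(.pred) # constrain predictions to be nonnegative

# 2. Trainer: any {parsnip} model specification, estimated with fit()
wf <- epi_workflow(rec, linear_reg()) %>%
  add_frosting(f) %>%
  fit(jhu)

# 3. Predictor: predict() applies the fitted workflow to test data
preds <- predict(wf, new_data = get_test_data(rec, jhu))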

Users familiar with {tidymodels}, and especially the {workflows} package, will notice a lot of overlap. This is by design, and is in fact a feature: {epipredict} is a wrapper around much that is contained in these packages. Therefore, if you want something from the tidymodels-verse, it should “just work” (we hope).

The reason for the overlap is that {workflows} already implements the first three steps. And it does this very well. However, it is missing the postprocessing stage and currently has no plans for such an implementation. And this feature is important. The baseline forecaster we provide requires postprocessing. Anything more complicated (which is nearly everything) needs this as well.

The second omission from {tidymodels} is support for panel data. Besides epidemiology, fields such as economics, psychology, and sociology frequently deal with data of this type. So the framework behind {epipredict} implements it. In principle, this has nothing to do with epidemiology, and one could simply use this package as a solution for the missing functionality in {tidymodels}. Again, this should “just work” (we hope).

All of the panel data functionality is implemented through the epi_df data type described in the previous part. If you have different panel data, just force it into an epi_df as described in Section 2.2.
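
As a minimal sketch (the employ tibble and its columns are hypothetical), coercion only requires geo_value and time_value columns, plus an as_of date recording when the data snapshot was taken:

library(epiprocess)
library(tibble)

# hypothetical panel data: monthly employment rate by state
employ <- tibble(
  geo_value = rep(c("ca", "ny"), each = 3),
  time_value = rep(seq(as.Date("2020-01-01"), by = "month", length.out = 3), 2),
  employment_rate = c(0.61, 0.60, 0.55, 0.59, 0.58, 0.52)
)

employ_epi <- as_epi_df(employ, as_of = as.Date("2020-04-01"))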

8.3 Why doesn’t this package already exist?

  • Parts of it actually DO exist. There’s a universe called tidymodels. It handles pre-processing, training, and prediction, bound together through a package called workflows. We built epipredict on top of that setup. In this way, you CAN use almost everything they provide.
  • However, workflows doesn’t do post-processing to the extent envisioned here. And nothing in tidymodels handles panel data.
  • The tidy-team doesn’t have plans to do either of these things. (We checked).
  • There are two packages built on tidymodels that do time series, but it’s “basic” time series: one-step-ahead AR models, exponential smoothing, STL decomposition, etc.1

8.4 Show me the basics

For now, we’ll just demonstrate one of the “canned” forecasters we provide: an autoregressive forecaster with (or without) covariates that directly trains on the response. This is in contrast to a typical “iterative” AR model, which trains to predict one step ahead and then plugs in its own predictions to “leverage up” to longer horizons. You saw this function in Section 3.5, but now we’ll explain the arguments a bit more thoroughly. Below, you’ll see how to make a number of modifications to this forecaster; understanding the inner workings, why you would want something like this, and how to do elaborate customizations will be the subject of the rest of this book.

We’ll use some of the same data we’ve examined earlier and estimate a model jointly across all locations, using only the most recent 30 days of data from the built-in case_death_rate_subset data frame.

library(epipredict)
library(dplyr)
jhu <- case_death_rate_subset %>%
  filter(time_value >= max(time_value) - 30)

out <- arx_forecaster(
  jhu,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate")
)
#> Warning: The forecast_date is less than the most recent update date of the
#> data: forecast_date = 2021-12-31 while data is from 2022-05-31.

This call produces a warning, which we’ll ignore for now. But essentially, it’s telling us that our data comes from May 2022 but we’re trying to do a forecast for January 2022. The result is likely not an accurate measure of real-time forecast performance, because the data have been revised over time.

out
#> 
#> ══ A basic forecaster of type ARX Forecaster ════════════════════════════════
#> 
#> This forecaster was fit on 2023-12-15 04:53:13
#> 
#> Training data was an `epi_df` with
#> • Geography: state,
#> • Time type: day,
#> • Using data up-to-date as of: 2022-05-31 12:08:25.
#> 
#> ── Predictions ──────────────────────────────────────────────────────────────
#> 
#> A total of 56 predictions are available for
#> • 56 unique geographic regions,
#> • At forecast dates: 2021-12-31,
#> • For target dates: 2022-01-07.

Printing the S3 object provides a bunch of summary information describing the original training data used to estimate the model, as well as some information about what the predictions are for. It contains three main components:

  1. Metadata about the training data and when the forecast was created
str(out$metadata)
#> List of 2
#>  $ training        :List of 3
#>   ..$ geo_type : chr "state"
#>   ..$ time_type: chr "day"
#>   ..$ as_of    : POSIXct[1:1], format: "2022-05-31 12:08:25"
#>  $ forecast_created: POSIXct[1:1], format: "2023-12-15 04:53:13"
  2. The predictions in a tibble. The columns give the point predictions for each location, along with additional columns: by default, a 90% predictive interval, the forecast_date (the date on which the forecast was putatively made), and the target_date (the date for which the forecast is being made).
out$predictions
#> # A tibble: 56 × 5
#>   geo_value  .pred        .pred_distn forecast_date target_date
#>   <chr>      <dbl>             <dist> <date>        <date>     
#> 1 ak        0.355  quantiles(0.36)[2] 2021-12-31    2022-01-07 
#> 2 al        0.325  quantiles(0.32)[2] 2021-12-31    2022-01-07 
#> 3 ar        0.496   quantiles(0.5)[2] 2021-12-31    2022-01-07 
#> 4 as        0.0836  quantiles(0.2)[2] 2021-12-31    2022-01-07 
#> 5 az        0.614  quantiles(0.61)[2] 2021-12-31    2022-01-07 
#> 6 ca        0.327  quantiles(0.33)[2] 2021-12-31    2022-01-07 
#> # ℹ 50 more rows
  3. An S3 object of class epi_workflow. This object encapsulates all the instructions necessary to create the prediction. More details on this below.
out$epi_workflow
#> ══ Epi Workflow [trained] ═══════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> Postprocessor: Frosting
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────
#> 6 Recipe Steps
#> 
#> 
#> ── Model ────────────────────────────────────────────────────────────────────
#> 
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#> 
#> Coefficients:
#>       (Intercept)    lag_0_case_rate    lag_7_case_rate   lag_14_case_rate  
#>         0.0829475          0.0009830          0.0027035         -0.0005651  
#>  lag_0_death_rate   lag_7_death_rate  lag_14_death_rate  
#>         0.2466110          0.1964921          0.0752998  
#> 
#> ── Postprocessor ────────────────────────────────────────────────────────────
#> 5 Frosting Layers

By default, the forecaster predicts the outcome (death_rate) 1-week ahead, using 3 lags of each predictor (case_rate and death_rate) at 0 (today), 1 week back and 2 weeks back. The predictors and outcome can be changed directly. The rest of the defaults are encapsulated into a list of arguments. This list is produced by arx_args_list().
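
Concretely, the call above is equivalent to spelling out those defaults by hand; as a sketch (out_explicit is an arbitrary name):

out_explicit <- arx_forecaster(
  jhu,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate"),
  args_list = arx_args_list(lags = c(0L, 7L, 14L), ahead = 7L)
)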

8.5 Simple adjustments

Basic adjustments can be made through the args_list argument.

out2week <- arx_forecaster(
  epi_data = jhu,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate"),
  args_list = arx_args_list(
    lags = list(case_rate = c(0, 1, 2, 3, 7, 14), death_rate = c(0, 7, 14)),
    ahead = 14
  )
)

Here, we’ve used different lags on the case_rate and are now predicting 2 weeks ahead. Note that lags and aheads are in the same units as the time_value of the epi_df used for training (same as the epi_slide() arguments discussed in Chapter 3). This example also illustrates a major difficulty with the “iterative” versions of AR models. This model doesn’t produce forecasts for case_rate, and so, would not have data to “plug in” for the necessary lags.2
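
A quick sanity check (output omitted) confirms the horizon: each target_date is 14 days after the forecast_date.

out2week$predictions %>%
  select(geo_value, forecast_date, target_date) %>%
  head()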

Another property of the basic model is the predictive interval. We describe this in more detail in a coming chapter, but it is easy to request multiple quantiles.

out_q <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
  args_list = arx_args_list(
    quantile_levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99)
  )
)

The column .pred_distn in the predictions object is actually a “distribution”, parameterized here by its quantiles. For this default forecaster, these are created using the quantiles of the residuals of the predictive model (possibly symmetrized). Here, we used 23 quantile levels, but one can grab a particular quantile,

head(quantile(out_q$predictions$.pred_distn, p = .4))
#>        40%        40%        40%        40%        40%        40% 
#> 0.30277798 0.27213225 0.44345734 0.03120647 0.56121844 0.27492711

or extract the entire distribution into a “long” tibble, with quantile_levels being the probability and values being the quantile associated with that probability.

out_q$predictions %>%
  # first create a "nested" list-column
  mutate(.pred_distn = nested_quantiles(.pred_distn)) %>%
  unnest(.pred_distn) # then unnest it
#> # A tibble: 1,288 × 6
#>   geo_value .pred values quantile_levels forecast_date target_date
#>   <chr>     <dbl>  <dbl>           <dbl> <date>        <date>     
#> 1 ak        0.355 0                0.01  2021-12-31    2022-01-07 
#> 2 ak        0.355 0                0.025 2021-12-31    2022-01-07 
#> 3 ak        0.355 0.0371           0.05  2021-12-31    2022-01-07 
#> 4 ak        0.355 0.123            0.1   2021-12-31    2022-01-07 
#> 5 ak        0.355 0.174            0.15  2021-12-31    2022-01-07 
#> 6 ak        0.355 0.211            0.2   2021-12-31    2022-01-07 
#> # ℹ 1,282 more rows

Additional simple adjustments to the basic forecaster can be made using the arguments of arx_args_list(), whose defaults are shown below; an example of overriding a few of them follows.

arx_args_list(
  lags = c(0L, 7L, 14L), ahead = 7L, n_training = Inf,
  forecast_date = NULL, target_date = NULL, quantile_levels = c(0.05, 0.95),
  symmetrize = TRUE, nonneg = TRUE, quantile_by_key = "geo_value"
)
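
For instance, here is a sketch that shortens the training window and disables the symmetrization and nonnegativity adjustments (out_adj is an arbitrary name):

out_adj <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  args_list = arx_args_list(
    n_training = 25,    # use only the 25 most recent observations per key
    symmetrize = FALSE, # use the raw residual quantiles
    nonneg = FALSE      # allow negative predictions
  )
)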

8.6 Changing the engine

So far, our forecasts have been produced using simple linear regression. But this is not the only way to estimate such a model. The trainer argument determines the type of model we want. This takes a {parsnip} model. The default is linear regression, but we could instead use a random forest with the {ranger} package:

out_rf <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  rand_forest(mode = "regression")
)
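
Additional arguments to the model function pass through as usual, and engine-specific options can be set with {parsnip}’s set_engine(). A sketch (out_rf2 is an arbitrary name, and seed is a {ranger} argument):

out_rf2 <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  rand_forest(mode = "regression", trees = 500) %>%
    set_engine("ranger", seed = 12345) # fix the RNG for reproducibility
)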

Or boosted regression trees with {xgboost}:

out_gb <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  boost_tree(mode = "regression", trees = 20)
)

Or quantile regression, using our custom forecasting engine quantile_reg():

out_qr <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  quantile_reg()
)

FWIW, this last case (using quantile regression) is not far from what the Delphi production forecast team used for its COVID-19 forecasts over the past few years.


  1. Our group has not prioritized these sorts of models for epidemic forecasting, but one could also integrate these methods into our framework.↩︎

  2. An obvious fix is to instead use a VAR and predict both, but this would likely increase the variance of the model, and therefore, may lead to less accurate forecasts for the variable of interest.↩︎