Goals for the package

At a high level, our goal with epipredict is to make it easy to run simple machine learning and statistical forecasters for epidemiology. At the same time, the package is highly extensible, and that extensibility is part of its utility. Our hope is that users with epidemiological training and some statistics background can easily fit baseline models, while those with more nuanced statistical understanding can use the same framework to create complicated specializations.

Serving both populations is the main motivation for our efforts, and at the same time we have tried hard to make the package genuinely useful out of the box.

Baseline models

We provide a set of basic, easy-to-use forecasters that work out of the box. They allow a limited amount of customization; any serious customization happens within the framework discussed below.

For the basic forecasters, we provide:

  • Baseline flat-line forecaster
  • Autoregressive forecaster
  • Autoregressive classifier

All the forecasters we provide are built on our framework, so we will use these basic models to illustrate its flexibility.

Forecasting framework

Our framework for creating custom forecasters views the prediction task as a set of modular components. There are four types of components:

  1. Preprocessor: make transformations to the data before model training
  2. Trainer: train a model on data, resulting in a fitted model object
  3. Predictor: make predictions, using a fitted model object and processed test data
  4. Postprocessor: manipulate or transform the predictions before returning

Users familiar with {tidymodels} and especially the {workflows} package will notice a lot of overlap. This is by design, and is in fact a feature. The truth is that epipredict is a wrapper around much that is contained in these packages. Therefore, if you want something from this -verse, it should “just work” (we hope).

The reason for the overlap is that workflows already implements the first three steps, and does so very well. However, it is missing the postprocessing stage and currently has no plans to add one. This feature is important: the baseline forecaster we provide requires postprocessing, and anything more complicated needs it as well.

The second omission from tidymodels is support for panel data. Besides epidemiology, fields like economics, psychology, and sociology frequently deal with data of this type. So the framework behind epipredict implements this. In principle, this has nothing to do with epidemiology, and one could simply use this package as a solution for the missing functionality in tidymodels. Again, this should “just work”.

All of the panel data functionality is implemented through the epi_df data type in the companion {epiprocess} package. There is much more to see there, but for the moment, it’s enough to look at a simple one:

jhu <- case_death_rate_subset
jhu
#> An `epi_df` object, 20,496 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 19:08:25.791826
#> 
#> # A tibble: 20,496 × 4
#>    geo_value time_value case_rate death_rate
#>  * <chr>     <date>         <dbl>      <dbl>
#>  1 ak        2020-12-31      35.9      0.158
#>  2 al        2020-12-31      65.1      0.438
#>  3 ar        2020-12-31      66.0      1.27 
#>  4 as        2020-12-31       0        0    
#>  5 az        2020-12-31      76.8      1.10 
#>  6 ca        2020-12-31      96.0      0.751
#>  7 co        2020-12-31      35.8      0.649
#>  8 ct        2020-12-31      52.1      0.819
#>  9 dc        2020-12-31      31.0      0.601
#> 10 de        2020-12-31      65.2      0.807
#> # ℹ 20,486 more rows

This data is built into the package and contains the measured variables case_rate and death_rate for COVID-19, at the daily level, for each US state from 31 December 2020 through the end of 2021. The “panel” aspect comes from the repeated measurements across a number of locations.

The epi_df encodes the time stamp as time_value and the key as geo_value. While these two names are required, the values don’t actually need to represent geographies or times. Additional keys are also supported (such as age group, ethnicity, or taxonomy).

The epi_df also contains some metadata that describes the keys as well as the vintage of the data. It’s possible that data collected at different times for the same set of geo_value and time_value pairs could actually differ. For more details, see {epiprocess}.
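
To make the structure concrete, here is a small sketch (not from the package) of building a toy epi_df by hand with {epiprocess}; the exact as_epi_df() arguments may vary slightly across versions:

library(epiprocess)
library(tibble)

toy <- tibble(
  geo_value  = rep(c("ca", "pa"), each = 3),
  time_value = rep(as.Date("2021-06-01") + 0:2, times = 2),
  case_rate  = c(10, 12, 11, 7, 8, 9)
)
toy_edf <- as_epi_df(toy, as_of = as.Date("2021-06-04"))  # record the vintage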

Why doesn’t this package already exist?

As described above:

  • Parts actually DO exist. There’s a universe called tidymodels. It handles preprocessing, training, and prediction, bound together through a package called workflows. We built epipredict on top of that setup. In this way, you CAN use almost everything they provide.

  • However, workflows doesn’t do postprocessing. And nothing in the -verse handles panel data.

  • The tidy-team doesn’t have plans to do either of these things. (We checked).

  • There are two packages that do time series built on tidymodels, but it’s “basic” time series: one-step-ahead AR models, exponential smoothing, STL decomposition, etc. Our group has not prioritized these sorts of models for epidemic forecasting, but one could also integrate these methods into our framework.

Show me the basics

We start with the jhu data displayed above. One of the “canned” forecasters we provide is an autoregressive forecaster with (or without) covariates that directly trains on the response. This is in contrast to a typical “iterative” AR model that trains to predict one-step-ahead, and then plugs in the predictions to “leverage up” to longer horizons.
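
Concretely, with the defaults used below, the direct version fits a single regression of the (schematic) form

death_rate[t + 7] ~ death_rate[t] + death_rate[t - 7] + death_rate[t - 14] +
  case_rate[t] + case_rate[t - 7] + case_rate[t - 14]

rather than iterating a one-step-ahead model forward seven times; the fitted coefficients printed later (lag_0_death_rate, lag_7_death_rate, and so on) correspond exactly to these terms.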

We’ll estimate the model jointly across all locations using only the most recent 30 days.

jhu <- jhu %>% filter(time_value >= max(time_value) - 30)
out <- arx_forecaster(
  jhu,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate")
)

This call produces a warning, which we’ll ignore for now. Essentially, it’s telling us that our data was accessed in May 2022 (the as_of above), but we’re trying to make a forecast for January 2022. The result is likely not an accurate measure of real-time forecast performance, because the data would have been revised over time.

The out object has two components:

  1. The predictions: a data frame with one row per location, containing the point forecasts along with additional columns. By default, these include a 90% predictive interval, the forecast_date (the date on which the forecast was putatively made), and the target_date (the date for which the forecast is being made).
out$predictions
#> # A tibble: 56 × 5
#>    geo_value  .pred        .pred_distn forecast_date target_date
#>    <chr>      <dbl>             <dist> <date>        <date>     
#>  1 ak        0.355  quantiles(0.36)[2] 2021-12-31    2022-01-07 
#>  2 al        0.325  quantiles(0.32)[2] 2021-12-31    2022-01-07 
#>  3 ar        0.496   quantiles(0.5)[2] 2021-12-31    2022-01-07 
#>  4 as        0.0836  quantiles(0.2)[2] 2021-12-31    2022-01-07 
#>  5 az        0.614  quantiles(0.61)[2] 2021-12-31    2022-01-07 
#>  6 ca        0.327  quantiles(0.33)[2] 2021-12-31    2022-01-07 
#>  7 co        0.567  quantiles(0.57)[2] 2021-12-31    2022-01-07 
#>  8 ct        0.544  quantiles(0.54)[2] 2021-12-31    2022-01-07 
#>  9 dc        0.831  quantiles(0.83)[2] 2021-12-31    2022-01-07 
#> 10 de        0.607  quantiles(0.61)[2] 2021-12-31    2022-01-07 
#> # ℹ 46 more rows
  2. A list object of class epi_workflow. This object encapsulates all the instructions necessary to create the prediction. More details on this below.
out$epi_workflow
#> 
#> ══ Epi Workflow [trained] ══════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> Postprocessor: Frosting
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 
#> 6 Recipe steps.
#> 1. step_epi_lag()
#> 2. step_epi_lag()
#> 3. step_epi_ahead()
#> 4. step_naomit()
#> 5. step_naomit()
#> 6. step_training_window()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> 
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#> 
#> Coefficients:
#>       (Intercept)    lag_0_case_rate    lag_7_case_rate   lag_14_case_rate  
#>         0.0829475          0.0009830          0.0027035         -0.0005651  
#>  lag_0_death_rate   lag_7_death_rate  lag_14_death_rate  
#>         0.2466110          0.1964921          0.0752998
#> 
#> ── Postprocessor ───────────────────────────────────────────────────────────────
#> 
#> 5 Frosting layers.
#> 1. layer_predict()
#> 2. layer_residual_quantiles()
#> 3. layer_add_forecast_date()
#> 4. layer_add_target_date()
#> 5. layer_threshold()
#> 

Note that when predictions are returned as an epi_df (as in the raw predict() output shown later), they include a time_value column. That column is not necessarily meaningful, but it is required in an epi_df, so it remains.

By default, the forecaster predicts the outcome (death_rate) one week ahead, using three lags of each predictor (case_rate and death_rate): at 0 days (today), one week back, and two weeks back. The predictors and outcome can be changed directly. The rest of the defaults are encapsulated in a list of arguments produced by arx_args_list().
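
For example, the call above should be equivalent to writing the defaults out explicitly (a sketch; the full set of defaults appears below):

out_default <- arx_forecaster(
  jhu,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate"),
  args_list = arx_args_list(lags = c(0L, 7L, 14L), ahead = 7L)
)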

Simple adjustments

Basic adjustments can be made through the args_list.

out2week <- arx_forecaster(
  jhu,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate"),
  args_list = arx_args_list(
    lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
    ahead = 14
  )
)

Here, we’ve used different lags on the case_rate, and we are now predicting two weeks ahead. This example also illustrates a major difficulty with the “iterative” versions of AR models: this model doesn’t produce forecasts for case_rate, and so would not have data to “plug in” for the necessary lags.

Another property of the basic model is the predictive interval. We describe this in more detail in a different vignette, but it is easy to request multiple quantiles.

out_q <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
  args_list = arx_args_list(
    quantile_levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99)
  )
)

The column .pred_distn in the predictions object is actually a “distribution”, here parameterized by its quantiles. For this default forecaster, these are created using the quantiles of the residuals of the predictive model (possibly symmetrized). Here, we used 23 quantiles, but one can grab a particular quantile,

head(quantile(out_q$predictions$.pred_distn, p = .4))
#>        40%        40%        40%        40%        40%        40% 
#> 0.30277798 0.27213225 0.44345734 0.03120647 0.56121844 0.27492711

or extract the entire distribution into a “long” epi_df, with quantile_levels giving the probability and values giving the value associated with that quantile.

out_q$predictions %>%
  # first create a "nested" list-column
  mutate(.pred_distn = nested_quantiles(.pred_distn)) %>%
  unnest(.pred_distn) # then unnest it
#> # A tibble: 1,288 × 6
#>    geo_value .pred values quantile_levels forecast_date target_date
#>    <chr>     <dbl>  <dbl>           <dbl> <date>        <date>     
#>  1 ak        0.355 0                0.01  2021-12-31    2022-01-07 
#>  2 ak        0.355 0                0.025 2021-12-31    2022-01-07 
#>  3 ak        0.355 0.0371           0.05  2021-12-31    2022-01-07 
#>  4 ak        0.355 0.123            0.1   2021-12-31    2022-01-07 
#>  5 ak        0.355 0.174            0.15  2021-12-31    2022-01-07 
#>  6 ak        0.355 0.211            0.2   2021-12-31    2022-01-07 
#>  7 ak        0.355 0.237            0.25  2021-12-31    2022-01-07 
#>  8 ak        0.355 0.260            0.3   2021-12-31    2022-01-07 
#>  9 ak        0.355 0.282            0.35  2021-12-31    2022-01-07 
#> 10 ak        0.355 0.303            0.4   2021-12-31    2022-01-07 
#> # ℹ 1,278 more rows

Additional simple adjustments to the basic forecaster can be made through arx_args_list(), shown here with its default values:

arx_args_list(
  lags = c(0L, 7L, 14L), ahead = 7L, n_training = Inf,
  forecast_date = NULL, target_date = NULL, quantile_levels = c(0.05, 0.95),
  symmetrize = TRUE, nonneg = TRUE, quantile_by_key = character(0L),
  nafill_buffer = Inf
)
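
For instance, rather than filtering jhu by hand as we did above, a similar training window can be requested through n_training, which keeps only the most recent n_training time values per key (a sketch):

out30 <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  args_list = arx_args_list(n_training = 30)
)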

Changing the engine

So far, our forecasts have been produced using simple linear regression. But this is not the only way to estimate such a model. The trainer argument determines the type of model we want; it takes a {parsnip} model specification. The default is linear regression, but we could instead use a random forest with the ranger package:

out_rf <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  rand_forest(mode = "regression")
)

Or boosted regression trees with xgboost:

out_gb <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  boost_tree(mode = "regression", trees = 20)
)

Or quantile regression, using our custom forecasting engine quantile_reg():

out_qr <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  quantile_reg()
)

FWIW, this last case (using quantile regression) is not far from what the Delphi production forecast team used for its COVID-19 forecasts over the past few years.
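
Engine arguments pass through parsnip’s set_engine() as usual. A sketch, assuming the ranger package is installed (importance = "impurity" is a ranger argument, not part of epipredict):

out_rf2 <- arx_forecaster(
  jhu, "death_rate", c("case_rate", "death_rate"),
  rand_forest(mode = "regression", trees = 500) %>%
    set_engine("ranger", importance = "impurity")
)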

Inner workings

Under the hood, this forecaster creates (and returns) an epi_workflow. Essentially, this is a big S3 object that wraps up the four modular steps (preprocessing through postprocessing) described above.

Preprocessing

Preprocessing is accomplished through a recipe (imagine baking a cake) as provided in the {recipes} package. We’ve made a few modifications to handle panel data, and added some additional options. The recipe gives a specification of how to handle training data. Think of it like a fancified version of the formula you would pass to lm(): y ~ x1 + log(x2). In general, recipes handles two extensions to this formula interface:

  1. Doing transformations of both training and test data that can always be applied. These are things like taking the log of a variable, leading or lagging, filtering out rows, handling dummy variables, etc.
  2. Using statistics from the training data to eventually process test data. This is a major benefit of recipes. It prevents what the tidy team calls “data leakage”. A simple example is centering a predictor by its mean. We need to store the mean of the predictor from the training data and use that value on the test data rather than accidentally calculating the mean of the test predictor for centering.

A recipe is processed in two steps. First, it is “prepped”, which calculates and stores any intermediate statistics necessary for use on the test data. Then it is “baked”, resulting in training data ready to pass into a statistical model (like lm).
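
As a sketch of the second point using plain {recipes} (not epipredict-specific code), step_center() stores the training mean when the recipe is prepped and reuses it when the recipe is baked, so the statistic never comes from the test data:

library(recipes)
rec <- recipe(death_rate ~ case_rate, data = jhu) %>%
  step_center(case_rate)                  # statistic comes from training data
rec_prepped <- prep(rec, training = jhu)  # compute and store the mean
baked_train <- bake(rec_prepped, new_data = NULL)  # centered training data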

We have introduced an epi_recipe. It’s just a recipe that knows how to handle the time_value, geo_value, and any additional keys so that these are available when necessary.

The epi_recipe from out_gb can be extracted from the result:

extract_recipe(out_gb$epi_workflow)

The “Inputs” are the original epi_df and the “roles” these columns are assigned. None of these are predictors or outcomes; those will be created by the recipe when it is prepped. The “Operations” are the sequence of instructions to create the cake (the baked training data). Here we create lagged predictors, lead the outcome, and then remove NAs. Some models, like lm, handle NAs internally, but not all do, so we deal with them explicitly. The code to do this (inside the forecaster) is

er <- epi_recipe(jhu) %>%
  step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_epi_naomit()
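
If you want to inspect the baked training data yourself, the same two steps apply to an epi_recipe (a sketch):

er %>%
  prep(jhu) %>%           # "prep": estimate anything the steps need
  bake(new_data = NULL)   # "bake": the lagged, led, NA-free training data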

While recipes provides a function step_lag(), it assumes that the data have no breaks in the sequence of time_values. This is a bit dangerous, so we avoid that behaviour. Our lag/ahead functions also appropriately adjust the amount of data to avoid accidentally dropping recent predictors from the test data.

The model specification

Users familiar with the parsnip package will have no trouble here. Basically, parsnip unifies the function signature across statistical models. For example, lm() “likes” to work with formulas, while glmnet::glmnet() uses x and y for predictors and response. parsnip is agnostic; both of these do “linear regression”. Above, we switched from lm() to xgboost() without any issue, despite the fact that these functions couldn’t be more different:

lm(formula, data, subset, weights, na.action,
  method = "qr",
  model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
  contrasts = NULL, offset, ...
)

xgboost(
  data = NULL, label = NULL, missing = NA, weight = NULL,
  params = list(), nrounds, verbose = 1, print_every_n = 1L,
  early_stopping_rounds = NULL, maximize = NULL, save_period = NULL,
  save_name = "xgboost.model", xgb_model = NULL, callbacks = list(),
  ...
)

epipredict provides a few engines/modules (the flatline forecaster and quantile regression), but you should be able to use any available parsnip model.

To estimate (fit) a preprocessed model, one calls fit() on the epi_workflow.

ewf <- epi_workflow(er, linear_reg()) %>% fit(jhu)
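
The fitted model inside can be pulled back out with the usual workflows accessors. A sketch (extract_fit_engine() comes from {workflows} and should return the underlying stats::lm object here):

fitted_lm <- extract_fit_engine(ewf)  # the raw lm fit
coef(fitted_lm)                       # its coefficients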

Postprocessing

To stretch the metaphor of preparing a cake to its natural limits, we have created postprocessing functionality called “frosting”. Much like the recipe, each postprocessing operation is a “layer” and we “slather” these onto our baked cake. To fix ideas, below is the postprocessing frosting for arx_forecaster()

extract_frosting(out_q$epi_workflow)

Here we have 5 layers of frosting. The first generates the forecasts from the test data. The second uses quantiles of the residuals to create distributional forecasts. The next two add columns for the date the forecast was made and the date for which it is intended to occur. Because we are predicting rates, they should be non-negative, so the last layer thresholds both predicted values and intervals at 0. The code to do this (inside the forecaster) is

f <- frosting() %>%
  layer_predict() %>%
  layer_residual_quantiles(
    quantile_levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99),
    symmetrize = TRUE
  ) %>%
  layer_add_forecast_date() %>%
  layer_add_target_date() %>%
  layer_threshold(starts_with(".pred"))

At predict time, we add this object onto the epi_workflow and call predict()

test_data <- get_test_data(er, jhu)
ewf %>%
  add_frosting(f) %>%
  predict(test_data)
#> An `epi_df` object, 56 x 6 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 19:08:25.791826
#> 
#> # A tibble: 56 × 6
#>    geo_value time_value  .pred         .pred_distn forecast_date target_date
#>  * <chr>     <date>      <dbl>              <dist> <date>        <date>     
#>  1 ak        2021-12-31 0.355  quantiles(0.36)[23] 2021-12-31    2022-01-07 
#>  2 al        2021-12-31 0.325  quantiles(0.32)[23] 2021-12-31    2022-01-07 
#>  3 ar        2021-12-31 0.496   quantiles(0.5)[23] 2021-12-31    2022-01-07 
#>  4 as        2021-12-31 0.0836 quantiles(0.08)[23] 2021-12-31    2022-01-07 
#>  5 az        2021-12-31 0.614  quantiles(0.61)[23] 2021-12-31    2022-01-07 
#>  6 ca        2021-12-31 0.327  quantiles(0.33)[23] 2021-12-31    2022-01-07 
#>  7 co        2021-12-31 0.567  quantiles(0.57)[23] 2021-12-31    2022-01-07 
#>  8 ct        2021-12-31 0.544  quantiles(0.54)[23] 2021-12-31    2022-01-07 
#>  9 dc        2021-12-31 0.831  quantiles(0.83)[23] 2021-12-31    2022-01-07 
#> 10 de        2021-12-31 0.607  quantiles(0.61)[23] 2021-12-31    2022-01-07 
#> # ℹ 46 more rows

The above get_test_data() function examines the recipe and ensures that enough test data is available to create the necessary lags and produce a prediction for the desired future time point (after the end of the training data). This mimics what would happen if jhu contained the most recent available historical data and we wanted to actually predict the future. We could have instead used any test data that contained the necessary predictors.
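
Since the largest lag in the recipe is 14 days, each geo_value needs at least the most recent 15 days of predictors. A quick sanity check on what get_test_data() returned (a sketch using dplyr verbs):

test_data %>%
  group_by(geo_value) %>%
  summarise(n_days = n(), latest = max(time_value))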

Conclusion

Internally, we provide some simple functions to create reasonable forecasts. But ideally, a user could create their own forecasters by building up the components we provide. In other vignettes, we try to walk through some of these customizations.

To illustrate everything above, here is (roughly) the code for the flatline_forecaster() applied to the case_rate.

r <- epi_recipe(jhu) %>%
  step_epi_ahead(case_rate, ahead = 7, skip = TRUE) %>%
  update_role(case_rate, new_role = "predictor") %>%
  add_role(all_of(epi_keys(jhu)), new_role = "predictor")

# bit of a weird hack to get the latest values per key
latest <- get_test_data(epi_recipe(jhu), jhu)

f <- frosting() %>%
  layer_predict() %>%
  layer_residual_quantiles() %>%
  layer_add_forecast_date() %>%
  layer_add_target_date() %>%
  layer_threshold(starts_with(".pred"))

eng <- linear_reg() %>% set_engine("flatline")
wf <- epi_workflow(r, eng, f) %>% fit(jhu)
preds <- predict(wf, latest)

All that really differs from the arx_forecaster() is the recipe, the test data, and the engine. The frosting is identical, as is the fitting and predicting procedure.

preds
#> An `epi_df` object, 56 x 6 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 19:08:25.791826
#> 
#> # A tibble: 56 × 6
#>    geo_value time_value .pred          .pred_distn forecast_date target_date
#>  * <chr>     <date>     <dbl>               <dist> <date>        <date>     
#>  1 ak        2021-12-31  36.4  quantiles(39.58)[2] 2021-12-31    2022-01-07 
#>  2 al        2021-12-31  89.9  quantiles(89.92)[2] 2021-12-31    2022-01-07 
#>  3 ar        2021-12-31  82.6  quantiles(82.58)[2] 2021-12-31    2022-01-07 
#>  4 as        2021-12-31   0    quantiles(21.39)[2] 2021-12-31    2022-01-07 
#>  5 az        2021-12-31  58.3  quantiles(58.28)[2] 2021-12-31    2022-01-07 
#>  6 ca        2021-12-31  84.4  quantiles(84.37)[2] 2021-12-31    2022-01-07 
#>  7 co        2021-12-31 106.  quantiles(105.83)[2] 2021-12-31    2022-01-07 
#>  8 ct        2021-12-31 143.   quantiles(143.1)[2] 2021-12-31    2022-01-07 
#>  9 dc        2021-12-31 295.  quantiles(295.03)[2] 2021-12-31    2022-01-07 
#> 10 de        2021-12-31 150.  quantiles(149.93)[2] 2021-12-31    2022-01-07 
#> # ℹ 46 more rows